XGBoost is a supervised machine learning method for classification and regression and is used by the Train Using AutoML tool. XGBoost is short for extreme gradient boosting. This method is based on decision trees and improves on other methods such as random forest and gradient boosting. It works well with large, complicated datasets by using several optimization techniques.
To fit a training dataset using XGBoost, an initial prediction is made. Residuals are computed from the predicted and observed values. A decision tree is then built on the residuals, using a similarity score to evaluate candidate splits: the similarity of the residuals in each leaf is calculated, along with the gain in similarity produced by a split. The gains are compared to choose a feature and a threshold for each node. An output value for each leaf is also calculated from the residuals. For classification, these values are typically calculated using the log of the odds and probabilities. The output of the tree updates the predictions, and the new residuals for the dataset are used to construct the next tree. This process is repeated until the residuals stop decreasing or for a specified number of iterations. Each subsequent tree learns from the trees before it and is not assigned equal weight, unlike the trees in random forest.
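For the regression case, the similarity score, gain, and leaf output value described above can be sketched in a few lines of Python. This is only an illustration of the calculations, not the tool's implementation; the function names are made up for this example, and lambda_ stands for the regularization parameter described in the list below.

```python
import numpy as np

# Illustrative sketch of the split evaluation used when building one tree on
# the residuals (regression case). Function names are not part of any API;
# lambda_ is the regularization parameter described later in this topic.

def similarity_score(residuals, lambda_=1.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return np.sum(residuals) ** 2 / (len(residuals) + lambda_)

def gain(left_residuals, right_residuals, lambda_=1.0):
    # Gain of a candidate split: similarity of the two child leaves
    # minus the similarity of the parent node.
    parent = np.concatenate([left_residuals, right_residuals])
    return (similarity_score(left_residuals, lambda_)
            + similarity_score(right_residuals, lambda_)
            - similarity_score(parent, lambda_))

def leaf_output(residuals, lambda_=1.0):
    # Output value of a leaf: sum of residuals / (count + lambda)
    return np.sum(residuals) / (len(residuals) + lambda_)

# One candidate split of the residuals at some threshold of a feature.
left, right = np.array([-2.0, -1.5]), np.array([1.0, 2.5, 3.0])
print(gain(left, right))          # compared against other thresholds and features
print(leaf_output(left), leaf_output(right))
```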
To use this model for prediction, the output of each tree, multiplied by a learning rate, is added to the initial prediction to arrive at a final value or classification.
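The overall fitting and prediction process can be sketched as a simple boosting loop. The sketch below uses scikit-learn decision trees as a stand-in for XGBoost's regularized trees, so it only demonstrates the residual-fitting and additive-prediction idea; the variable names, the learning rate of 0.3, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.3
n_trees = 50

initial_prediction = y.mean()                 # initial prediction
prediction = np.full_like(y, initial_prediction)
trees = []

for _ in range(n_trees):
    residuals = y - prediction                # residuals of the current model
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    # Each tree's output, scaled by the learning rate, updates the prediction.
    prediction += learning_rate * tree.predict(X)

def predict(X_new):
    # Final value = initial prediction + learning rate * sum of tree outputs.
    return initial_prediction + learning_rate * sum(t.predict(X_new) for t in trees)

print(predict(X[:5]))
print(y[:5])
```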
XGBoost uses the following parameters and methods to optimize the algorithm and provide better results and performance:
- Regularization—A regularization parameter (lambda) is used while calculating the similarity scores to reduce the sensitivity of the model to individual data points and avoid overfitting (see the example after this list).
- Pruning—A tree complexity parameter (gamma) is used to compare the gains. A branch where the gain is smaller than the gamma value is removed. This prevents overfitting by trimming unnecessary branches and reducing the depth of the trees.
- Weighted quantile sketch—Instead of testing every possible value as the threshold for splitting the data, only weighted quantiles are used as candidates. The quantiles are selected using a sketch algorithm, which approximates the distribution of the data even when it is spread across multiple systems on a network.
- Parallel learning—The data is divided into blocks that can be processed in parallel to create the trees or to perform other computations.
- Sparsity-aware split finding—XGBoost handles sparse data, such as missing values, by sending those observations in each of the two directions of a split, calculating the gain for each, and using the direction with the higher gain as the default.
- Cache-aware access—This method uses the system's cache memory to calculate the similarity scores and output values. Cache memory is faster to access than main memory, which improves the overall performance of the model.
- Blocks for out-of-core computation—This method works with large datasets that cannot fit in the cache or main memory and must be kept on hard drives. The dataset is divided into blocks and compressed, because decompressing a block in main memory is faster than reading the equivalent uncompressed data from the drive. Another technique, called sharding, is used when the data must be split across multiple hard drives.
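Several of these ideas surface as parameters of the open-source xgboost Python package, such as reg_lambda for the regularization parameter and gamma for the tree complexity parameter. The sketch below illustrates those parameters on synthetic data; the values chosen are assumptions for demonstration, not the configuration used by the Train Using AutoML tool or recommended defaults.

```python
import numpy as np
from xgboost import XGBRegressor

# Illustrative mapping from the optimizations above to xgboost parameters.
model = XGBRegressor(
    n_estimators=200,       # number of trees (boosting rounds)
    learning_rate=0.1,      # scales each tree's contribution
    reg_lambda=1.0,         # lambda used in the similarity scores (regularization)
    gamma=0.5,              # tree complexity parameter used for pruning
    tree_method="approx",   # approximate, quantile-based split finding
    n_jobs=4,               # parallel learning across CPU threads
    missing=np.nan,         # sparsity-aware handling of missing values
)

# Synthetic data with some missing values to exercise the sparsity handling.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 4))
X[rng.uniform(size=X.shape) < 0.1] = np.nan
y = np.nansum(X, axis=1) + rng.normal(0, 0.2, size=500)

model.fit(X, y)
print(model.predict(X[:5]))
```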
Additional Resources
Chen, Tianqi, and Carlos Guestrin. 2016. "XGBoost: A Scalable Tree Boosting System." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.