
eXtreme Gradient Boosting Algorithm (XGBoost)

The XGBoost algorithm belongs to the family of boosting algorithms, which often give more accurate results than a single traditional classification or regression model and, in the case of XGBoost, train quickly. They are widely used as go-to algorithms for many machine learning tasks.


The idea of boosting grew out of the question of whether a weak learner can be modified to become better. A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance.

The first realization of boosting that saw great success in application was Adaptive Boosting, or AdaBoost for short. The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness. AdaBoost works by weighting the observations, putting more weight on difficult-to-classify instances and less on those already handled well. New weak learners are added sequentially, each focusing its training on the more difficult patterns. Predictions are made by majority vote of the weak learners’ predictions, weighted by their individual accuracy. The most successful form of the AdaBoost algorithm was for binary classification problems and was called AdaBoost.M1.
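The reweighting loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of one common formulation of discrete AdaBoost with labels in {-1, +1}, not a reference implementation of AdaBoost.M1; the function names and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def stump_predict(X, j, t, s):
    return np.where(X[:, j] <= t, s, -s)

def adaboost_fit(X, y, n_rounds=10):
    w = np.full(len(y), 1.0 / len(y))  # start with uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        j, t, s, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate stumps get a larger vote
        pred = stump_predict(X, j, t, s)
        w = w * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
        w = w / w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def adaboost_predict(X, ensemble):
    # Weighted vote of all stumps; the sign of the total is the predicted class.
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.sign(score)

# Toy 1-D data that no single stump can separate (hypothetical example).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
ensemble = adaboost_fit(X, y, n_rounds=3)
pred = adaboost_predict(X, ensemble)
```

Each round refits a stump on the same data, but the weights shift attention toward the points the earlier stumps got wrong.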

Gradient Boosting - A later statistical framework recast boosting as a numerical optimization problem, where the objective is to minimize the loss of the model by adding weak learners using a gradient-descent-like procedure. This class of algorithms was described as a stage-wise additive model, because one new weak learner is added at a time while the existing weak learners in the model are frozen and left unchanged. The generalization allowed arbitrary differentiable loss functions to be used, expanding the technique beyond binary classification problems to support regression, multiclass classification and more.
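Written out, the stage-wise procedure looks roughly as follows. The symbols are assumptions introduced here for illustration: L is the differentiable loss, F_m the model after m stages, h_m the weak learner added at stage m, and ν a fixed learning rate (a simplification that stands in for a per-stage step size).

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n \\
h_m &\approx \text{weak learner fit to } \{(x_i, r_{im})\}_{i=1}^{n} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M
\end{aligned}
```

For the squared loss L(y, F) = ½(y − F)², the negative gradient r_{im} is simply y_i − F_{m-1}(x_i), the ordinary residual, which is why gradient boosting is often described as fitting residuals.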

How Gradient Boosting Works

Gradient boosting involves three elements:

  1. A loss function to be optimized.
  2. A weak learner to make predictions. Decision trees are used as weak learners.
  3. An additive model to add weak learners to minimize the loss function.

What algorithm does Boosting use?

The gradient boosting decision tree algorithm goes by many names, such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm, which gives more weight to data points that are hard to predict.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and are then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
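To make the "predict the residuals, then add" idea concrete, here is a minimal sketch for regression with squared loss, where the residual is exactly the negative gradient. It uses one-feature regression stumps written in NumPy; the function names, the learning rate and the toy sine data are assumptions for the example, not part of any library.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on one feature: (threshold, left_mean, right_mean)."""
    best, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:  # drop the max so the right side is never empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, left.mean(), right.mean()), sse
    return best

def stump_predict(x, stump):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                  # initial constant model
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred         # errors of the model built so far
        stump = fit_stump(x, residuals)
        pred = pred + learning_rate * stump_predict(x, stump)
        stumps.append(stump)         # earlier stumps are never modified
    return base, stumps, learning_rate

def boosted_predict(x, model):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(x, s) for s in stumps)

# Toy data (hypothetical): a noiseless sine curve.
x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)
model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - boosted_predict(x, model)) ** 2)
```

Comparing `mse_boosted` with `mse_baseline` shows how much the sequence of small corrections improves on simply predicting the mean.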

This approach supports both regression and classification predictive modeling problems.

For more on boosting and gradient boosting, see Trevor Hastie’s talk on Gradient Boosting Machine Learning.

XGBoost (eXtreme Gradient Boosting)

The name XGBoost refers to the engineering goal of pushing the limits of computational resources for boosted tree algorithms.

It is an implementation of gradient boosting machines created by Tianqi Chen.

XGBoost Features: The library is laser-focused on computational speed and model performance; as such, there are few frills. Nevertheless, it does offer a number of advanced features.

Model Features : The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported:

  • Gradient Boosting algorithm, also called gradient boosting machine, including the learning rate.
  • Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
  • Regularized Gradient Boosting with both L1 and L2 regularization.
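As a rough guide, the three forms above map onto keyword arguments of the scikit-learn style `XGBClassifier`. The dictionary below is only a sketch: the values are illustrative assumptions, not recommendations, and `colsample_bynode` assumes a reasonably recent XGBoost release (older releases exposed per-split column sampling under a different name).

```python
# Illustrative values only; none of these numbers come from the text above.
params = {
    "n_estimators": 200,       # number of boosted trees
    "learning_rate": 0.1,      # shrinkage applied to each new tree
    "subsample": 0.8,          # stochastic boosting: fraction of rows per tree
    "colsample_bytree": 0.8,   # fraction of columns sampled per tree
    "colsample_bynode": 0.8,   # fraction of columns sampled per split
    "reg_alpha": 0.1,          # L1 regularization on leaf weights
    "reg_lambda": 1.0,         # L2 regularization on leaf weights
}
```

With XGBoost installed, the dictionary can be unpacked into the constructor as `XGBClassifier(**params)`.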

System Features : The library provides a system for use in a range of computing environments, not least:

  • Parallelization of tree construction using all of your CPU cores during training.
  • Distributed Computing for training very large models using a cluster of machines.
  • Out-of-Core Computing for very large datasets that don’t fit into memory.
  • Cache Optimization of data structures and algorithms to make the best use of hardware.

Algorithm Features : The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:

  • Sparse Aware implementation with automatic handling of missing data values.
  • Block Structure to support the parallelization of tree construction.
  • Continued Training so that you can further boost an already fitted model on new data.

Key things to remember

  • XGBoost works only with numerical data, so categorical features must be encoded first
  • Handles missing values automatically
  • Works for both classification and regression
  • Fast and Accurate algorithm

Parameter Tuning

Gradient Boosting Models

XGBoost Models

Sample XGBoost Code

Breast Cancer Dataset, with Label Encoding and One Hot Encoding of categorical Features

# binary classification, breast cancer dataset, label and one hot encoded

from numpy import column_stack
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# load data
data = read_csv('datasets-uci-breast-cancer.csv', header=None)
dataset = data.values

# split data into X and y
X = dataset[:,0:9]
X = X.astype(str)
Y = dataset[:,9]

# encode string input values as integers, then one hot encode each column
columns = []
for i in range(0, X.shape[1]):
  label_encoder = LabelEncoder()
  feature = label_encoder.fit_transform(X[:,i])
  feature = feature.reshape(X.shape[0], 1)
  # the encoder returns a sparse matrix by default; .toarray() makes it dense
  onehot_encoder = OneHotEncoder()
  feature = onehot_encoder.fit_transform(feature).toarray()
  # keep the encoded column so it can be stacked below
  columns.append(feature)
# collapse columns into array
encoded_x = column_stack(columns)
print("X shape: ", encoded_x.shape)

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(encoded_x, label_encoded_y,
test_size=test_size, random_state=seed)

# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Written on January 23, 2018