Q&A 15 How do you train a gradient boosting model using XGBoost?

15.1 Explanation

Gradient Boosting builds models sequentially, where each new model corrects the errors of the previous ones. XGBoost is an optimized and scalable implementation of gradient boosting that supports regularization, handling of missing values, and parallel computation.

It’s a go-to choice when you need both accuracy and performance.

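To make the sequential error-correction idea concrete, here is a minimal sketch using plain scikit-learn decision trees on synthetic data (not XGBoost itself): each new shallow tree is fit to the residuals of the current ensemble, and its scaled prediction is added back in. The toy dataset, number of rounds, and learning rate are illustrative assumptions.

# Boosting by hand: fit each new tree to the residuals of the ensemble so far
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.2, size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())   # start from a constant prediction
for _ in range(50):                              # boosting rounds
    residuals = y_toy - prediction               # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # correct those errors

print("Final training MSE:", np.mean((y_toy - prediction) ** 2))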

15.2 Python Code

# Train an XGBoost model in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load and preprocess
df = pd.read_csv("data/titanic.csv")
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Features and target
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost model (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here)
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = xgb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.8100558659217877
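
Once the model is trained, it is often worth checking which features drive its predictions. A short follow-up, reusing pd, xgb_model, and X from the code above (the exact values will vary with the data and the split):

# Inspect which features the trained model relies on most
importances = pd.Series(xgb_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))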

15.3 R Code

# Train an XGBoost model in R
library(readr)
library(dplyr)
library(fastDummies)
library(caret)
library(xgboost)

# Load and preprocess
df <- read_csv("data/titanic.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked
df$Sex <- ifelse(df$Sex == "male", 0, 1)
df <- fastDummies::dummy_cols(df, select_columns = "Embarked", remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Prepare data
features <- df %>% select(Pclass, Sex, Age, Fare, Embarked_Q, Embarked_S)
target <- df$Survived
set.seed(42)
split_index <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- as.matrix(features[split_index, ])
X_test <- as.matrix(features[-split_index, ])
y_train <- target[split_index]
y_test <- target[-split_index]

# Train model
xgb_model <- xgboost(data = X_train, label = y_train, objective = "binary:logistic", nrounds = 100, verbose = 0)

# Predict and evaluate
y_pred <- predict(xgb_model, X_test)
y_pred_class <- ifelse(y_pred > 0.5, 1, 0)
confusionMatrix(as.factor(y_pred_class), as.factor(y_test))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 100  15
         1  14  49
                                          
               Accuracy : 0.8371          
                 95% CI : (0.7745, 0.8881)
    No Information Rate : 0.6404          
    P-Value [Acc > NIR] : 5.365e-09       
                                          
                  Kappa : 0.645           
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8772          
            Specificity : 0.7656          
         Pos Pred Value : 0.8696          
         Neg Pred Value : 0.7778          
             Prevalence : 0.6404          
         Detection Rate : 0.5618          
   Detection Prevalence : 0.6461          
      Balanced Accuracy : 0.8214          
                                          
       'Positive' Class : 0               
                                          

✅ Takeaway: XGBoost delivers state-of-the-art performance for classification and regression tasks. It’s highly tunable and robust to noisy data, making it a top performer in ML competitions and real-world deployments.
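
Because XGBoost exposes many hyperparameters (number of boosting rounds, tree depth, learning rate, regularization strength), a small cross-validated grid search is a common next step. A minimal sketch, reusing X_train and y_train from the Python example above; the grid values are illustrative choices, not recommendations.

# Tune a few key XGBoost hyperparameters with cross-validation
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300],     # boosting rounds
    "max_depth": [3, 5],            # tree depth (model complexity)
    "learning_rate": [0.05, 0.1],   # shrinkage applied to each tree
}
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)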