Q&A 15 How do you train a gradient boosting model using XGBoost?
15.1 Explanation
Gradient Boosting builds models sequentially, where each new model corrects the errors of the previous ones. XGBoost is an optimized and scalable implementation of gradient boosting that supports regularization, handling of missing values, and parallel computation.
It’s a go-to choice when you need both predictive accuracy and computational efficiency.
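The boosting process is controlled by a few core hyperparameters. As a minimal sketch (the values below are illustrative defaults, not recommendations tuned to any dataset), these are the settings you will adjust most often:
# Illustrative (untuned) hyperparameters that shape the boosting sequence
from xgboost import XGBClassifier
model = XGBClassifier(
    n_estimators=200,      # number of boosting rounds, i.e. trees added sequentially
    learning_rate=0.1,     # shrinks each tree's correction; smaller values need more rounds
    max_depth=4,           # depth of each tree, controlling individual model complexity
    subsample=0.8,         # fraction of rows sampled per tree to reduce overfitting
    colsample_bytree=0.8,  # fraction of features sampled per tree
    reg_lambda=1.0,        # L2 regularization on leaf weights
    random_state=42
)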
15.2 Python Code
# Train an XGBoost model in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# Load and preprocess
df = pd.read_csv("data/titanic.csv")
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
# Features and target
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train XGBoost model
# Note: use_label_encoder is deprecated in recent XGBoost releases and can be omitted there
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = xgb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))Accuracy: 0.8100558659217877
15.3 R Code
# Train an XGBoost model in R
library(readr)
library(dplyr)
library(fastDummies)
library(caret)
library(xgboost)
# Load and preprocess
df <- read_csv("data/titanic.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked
df$Sex <- ifelse(df$Sex == "male", 0, 1)
df <- fastDummies::dummy_cols(df, select_columns = "Embarked", remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Prepare data
features <- df %>% select(Pclass, Sex, Age, Fare, Embarked_Q, Embarked_S)
target <- df$Survived
set.seed(42)
split_index <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- as.matrix(features[split_index, ])
X_test <- as.matrix(features[-split_index, ])
y_train <- target[split_index]
y_test <- target[-split_index]
# Train model
xgb_model <- xgboost(data = X_train, label = y_train, objective = "binary:logistic", nrounds = 100, verbose = 0)
# Predict and evaluate
y_pred <- predict(xgb_model, X_test)
y_pred_class <- ifelse(y_pred > 0.5, 1, 0)
confusionMatrix(as.factor(y_pred_class), as.factor(y_test))

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 100  15
         1  14  49

               Accuracy : 0.8371
                 95% CI : (0.7745, 0.8881)
    No Information Rate : 0.6404
    P-Value [Acc > NIR] : 5.365e-09

                  Kappa : 0.645

 Mcnemar's Test P-Value : 1

            Sensitivity : 0.8772
            Specificity : 0.7656
         Pos Pred Value : 0.8696
         Neg Pred Value : 0.7778
             Prevalence : 0.6404
         Detection Rate : 0.5618
   Detection Prevalence : 0.6461
      Balanced Accuracy : 0.8214

       'Positive' Class : 0
✅ Takeaway: XGBoost delivers state-of-the-art performance for classification and regression tasks. It’s highly tunable and robust to noisy data, making it a top performer in ML competitions and real-world deployments.
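Because XGBClassifier follows the scikit-learn estimator interface, that tuning can be done with standard tools such as GridSearchCV. A minimal sketch, reusing X_train and y_train from the Python example above; the grid values are illustrative, not tuned recommendations:
# Hedged sketch: small grid search over a few key XGBoost hyperparameters
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}
grid = GridSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)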