Q&A 27 How do you tune hyperparameters to improve model performance?
27.1 Explanation
Hyperparameters are the knobs you set before training a model—like tree depth, number of neighbors, or regularization strength.
Tuning these hyperparameters helps:

- Maximize performance (e.g., accuracy or AUC)
- Prevent overfitting
- Balance bias vs. variance

We use grid search with cross-validation to systematically test parameter combinations and select the best one. The search runs on the training data only, so a held-out test set still gives an unbiased estimate of the tuned model's performance.
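Conceptually, grid search with cross-validation is just an exhaustive loop: for each combination of settings, compute the k-fold cross-validated score and keep the combination with the best average. The sketch below illustrates that idea; the synthetic dataset and the decision tree are placeholders for illustration, not the Titanic example used in the next section.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Synthetic data stands in for any training set
X, y = make_classification(n_samples=500, random_state=42)
search_space = {"max_depth": [2, 4, 6], "min_samples_split": [2, 5]}
best_score, best_params = -1.0, None
for depth, min_split in product(search_space["max_depth"], search_space["min_samples_split"]):
    model = DecisionTreeClassifier(max_depth=depth, min_samples_split=min_split, random_state=42)
    # Mean 5-fold cross-validated accuracy for this combination
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, {"max_depth": depth, "min_samples_split": min_split}
print("Best:", best_params, "CV accuracy:", round(best_score, 3))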
27.2 Python Code
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
# Load and prepare data
df = pd.read_csv("data/titanic.csv").dropna(subset=["Age", "Fare", "Embarked", "Sex", "Survived"])
X = df[["Pclass", "Age", "Fare"]].copy()
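# Encode the categorical variables (Sex, Embarked) as integers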
X["Sex"] = LabelEncoder().fit_transform(df["Sex"])
X["Embarked"] = LabelEncoder().fit_transform(df["Embarked"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define model and grid
model = RandomForestClassifier(random_state=42)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 5]
}
# Grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)
# Best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid_search.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))Best Parameters: {'max_depth': 6, 'min_samples_split': 5, 'n_estimators': 50}
Test Accuracy: 0.7616822429906542
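GridSearchCV also records the cross-validated score of every combination it tried. To see more than just the winner, the standard cv_results_ attribute can be loaded into a DataFrame (continuing from the example above):

# Inspect all tried combinations, best-ranked first
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[["params", "mean_test_score", "std_test_score", "rank_test_score"]].sort_values("rank_test_score"))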
27.3 R Code
library(tidyverse)
library(caret)
library(readr)
# Load and clean data
df <- read_csv("data/titanic.csv") %>%
  drop_na(Age, Fare, Embarked, Sex, Survived) %>%
  mutate(
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    Survived = as.factor(Survived)
  )
# Split data
set.seed(42)
train_index <- createDataPartition(df$Survived, p = 0.7, list = FALSE)
train <- df[train_index, ]
test <- df[-train_index, ]
# Define tuning grid
grid <- expand.grid(
  mtry = c(1, 2, 3),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)
# Train with cross-validation
ctrl <- trainControl(method = "cv", number = 5)
rf_model <- train(
  Survived ~ Pclass + Age + Fare + Sex + Embarked,
  data = train,
  method = "ranger",
  trControl = ctrl,
  tuneGrid = grid,
  metric = "Accuracy"
)
# Results
print(rf_model$bestTune)

# Evaluate the tuned model on the held-out test set
pred <- predict(rf_model, test)
confusionMatrix(pred, test$Survived)

  mtry splitrule min.node.size
6    2      gini            10
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 118  33
         1   9  53
Accuracy : 0.8028
95% CI : (0.743, 0.854)
No Information Rate : 0.5962
P-Value [Acc > NIR] : 1.016e-10
Kappa : 0.5711
Mcnemar's Test P-Value : 0.0003867
Sensitivity : 0.9291
Specificity : 0.6163
Pos Pred Value : 0.7815
Neg Pred Value : 0.8548
Prevalence : 0.5962
Detection Rate : 0.5540
Detection Prevalence : 0.7089
Balanced Accuracy : 0.7727
'Positive' Class : 0
✅ Takeaway: Hyperparameter tuning helps you unlock your model’s full potential. Use GridSearchCV in Python or caret::train() with tuneGrid in R to systematically test different settings.