Q&A 27 How do you tune hyperparameters to improve model performance?

27.1 Explanation

Hyperparameters are the knobs you set before training a model, such as tree depth, number of neighbors, or regularization strength. Unlike model parameters (tree splits, regression coefficients), they are not learned from the data, so you have to choose them yourself.
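
In scikit-learn, for instance, hyperparameters are passed to an estimator's constructor before it ever sees the data. The values below are purely illustrative, not recommendations:

# Hyperparameters are fixed at construction time, before fit() sees any data.
# The specific values here are only for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rf = RandomForestClassifier(max_depth=4, n_estimators=100)  # tree depth, forest size
knn = KNeighborsClassifier(n_neighbors=7)                   # number of neighbors
lr = LogisticRegression(C=0.1)                              # inverse regularization strength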

Tuning these hyperparameters helps:

- Maximize performance (e.g., accuracy or AUC)
- Prevent overfitting
- Balance bias vs. variance

We use grid search with cross-validation to systematically test combinations and select the best ones.
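
Conceptually, grid search just loops over every combination in the grid and scores each one with cross-validation, keeping the best. Here is a hand-rolled sketch of that idea using scikit-learn's ParameterGrid and cross_val_score; the grid values are illustrative:

# A minimal sketch of what GridSearchCV automates: score every combination
# with cross-validation and keep the best one.
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.ensemble import RandomForestClassifier

def manual_grid_search(X, y, param_grid, cv=5):
    best_score, best_params = -1.0, None
    for params in ParameterGrid(param_grid):  # every combination in the grid
        model = RandomForestClassifier(random_state=42, **params)
        score = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score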


27.2 Python Code

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Load and prepare data
df = pd.read_csv("data/titanic.csv").dropna(subset=["Age", "Fare", "Embarked", "Sex", "Survived"])
X = df[["Pclass", "Age", "Fare"]].copy()
# Encode the categorical text columns as integers
X["Sex"] = LabelEncoder().fit_transform(df["Sex"])
X["Embarked"] = LabelEncoder().fit_transform(df["Embarked"])
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model and grid
model = RandomForestClassifier(random_state=42)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 5]
}

# Grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid_search.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
Best Parameters: {'max_depth': 6, 'min_samples_split': 5, 'n_estimators': 50}
Test Accuracy: 0.7616822429906542
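
Beyond the single best setting, grid_search.cv_results_ stores the mean and standard deviation of the cross-validated score for every combination tried. A quick look (reusing the fitted grid_search object above) shows how close the runners-up were:

# Rank all tried combinations by mean cross-validated accuracy (sketch,
# reusing the fitted grid_search object from above)
results = pd.DataFrame(grid_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())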

27.3 R Code

library(tidyverse)
library(caret)
library(readr)

# Load and clean data
df <- read_csv("data/titanic.csv") %>%
  drop_na(Age, Fare, Embarked, Sex, Survived) %>%
  mutate(
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked),
    Survived = as.factor(Survived)
  )

# Split data
set.seed(42)
train_index <- createDataPartition(df$Survived, p = 0.7, list = FALSE)
train <- df[train_index, ]
test  <- df[-train_index, ]

# Define tuning grid
grid <- expand.grid(
  mtry = c(1, 2, 3),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)

# Train with cross-validation
ctrl <- trainControl(method = "cv", number = 5)
rf_model <- train(
  Survived ~ Pclass + Age + Fare + Sex + Embarked,
  data = train,
  method = "ranger",
  trControl = ctrl,
  tuneGrid = grid,
  metric = "Accuracy"
)

# Results
print(rf_model$bestTune)
  mtry splitrule min.node.size
6    2      gini            10
pred <- predict(rf_model, newdata = test)
confusionMatrix(pred, test$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 118  33
         1   9  53
                                        
               Accuracy : 0.8028        
                 95% CI : (0.743, 0.854)
    No Information Rate : 0.5962        
    P-Value [Acc > NIR] : 1.016e-10     
                                        
                  Kappa : 0.5711        
                                        
 Mcnemar's Test P-Value : 0.0003867     
                                        
            Sensitivity : 0.9291        
            Specificity : 0.6163        
         Pos Pred Value : 0.7815        
         Neg Pred Value : 0.8548        
             Prevalence : 0.5962        
         Detection Rate : 0.5540        
   Detection Prevalence : 0.7089        
      Balanced Accuracy : 0.7727        
                                        
       'Positive' Class : 0             
                                        

✅ Takeaway: Hyperparameter tuning helps you unlock your model’s full potential. Use GridSearchCV in Python or caret::train() with tuneGrid in R to systematically test different settings.