Q&A 10 How do you train a logistic regression model?
10.1 Explanation
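Logistic regression is one of the most widely used models for binary classification. Instead of predicting a continuous value, it models the probability that an observation belongs to a class (e.g., survival: yes or no) by passing a weighted sum of the features through the logistic (sigmoid) function.
It is fast to train, its coefficients can be read as changes in log-odds, and it works well as a baseline.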
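To make the "probability of a class label" idea concrete, here is a minimal sketch of how a logistic regression model turns feature values into a probability and then into a label. The weights, bias, and feature values are hypothetical and chosen only for illustration; a fitted model would learn them from data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, bias, features):
    """Probability of the positive class for a single example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical parameters and features (not taken from any fitted model)
weights = [-1.0, 2.0]
bias = 0.0
features = [1.0, 1.0]

p = predict_probability(weights, bias, features)
label = 1 if p > 0.5 else 0  # 0.5 is the conventional default threshold
print(f"probability={p:.3f}, predicted label={label}")
```

The 0.5 cutoff is a convention rather than a requirement; a different threshold trades off false positives against false negatives.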
10.2 Python Code
# Train a logistic regression model in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load and preprocess
df = pd.read_csv("data/titanic.csv")
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
# Features and target
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.7988826815642458
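For intuition about what fitting involves, the following is a from-scratch sketch that trains a logistic regression with plain batch gradient descent on the log loss. This is not how scikit-learn does it: `LogisticRegression` defaults to the `lbfgs` solver with L2 regularization, while this sketch uses no regularization. The function name `train_logistic_regression` and the toy data are made up for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - target  # gradient of the log loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += error * features[j]
            grad_b += error
        # Step against the average gradient
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias

# Toy data (made up): one feature, larger values tend to belong to class 1
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights, bias = train_logistic_regression(X, y)
preds = [1 if sigmoid(bias + weights[0] * row[0]) > 0.5 else 0 for row in X]
accuracy = sum(int(p == t) for p, t in zip(preds, y)) / len(y)
print("weights:", weights, "bias:", bias, "training accuracy:", accuracy)
```

The toy classes overlap on purpose: on perfectly separable data, unregularized logistic regression has no finite optimum and the weights keep growing, which is one reason library implementations regularize by default.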
10.3 R Code
# Train a logistic regression model in R
library(readr)
library(dplyr)
library(fastDummies)
library(caret)
# Load and preprocess
df <- read_csv("data/titanic.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked
df$Sex <- ifelse(df$Sex == "male", 0, 1)
df <- fastDummies::dummy_cols(df, select_columns = "Embarked", remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Features and target
features <- df %>% select(Pclass, Sex, Age, Fare, Embarked_Q, Embarked_S)
target <- df$Survived
# Split data
set.seed(42)
split_index <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- features[split_index, ]
X_test <- features[-split_index, ]
y_train <- target[split_index]
y_test <- target[-split_index]
# Train model
train_data <- cbind(X_train, Survived = y_train)
log_model <- glm(Survived ~ ., data = train_data, family = "binomial")
# Predict and evaluate
y_pred <- predict(log_model, X_test, type = "response")
y_pred_class <- ifelse(y_pred > 0.5, 1, 0)
confusionMatrix(as.factor(y_pred_class), as.factor(y_test))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 95 23
         1 19 41

               Accuracy : 0.764
                 95% CI : (0.6947, 0.8243)
    No Information Rate : 0.6404
    P-Value [Acc > NIR] : 0.0002718

                  Kappa : 0.4805

 Mcnemar's Test P-Value : 0.6434288

            Sensitivity : 0.8333
            Specificity : 0.6406
         Pos Pred Value : 0.8051
         Neg Pred Value : 0.6833
             Prevalence : 0.6404
         Detection Rate : 0.5337
   Detection Prevalence : 0.6629
      Balanced Accuracy : 0.7370

       'Positive' Class : 0
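The headline statistics in the caret output can be recomputed by hand from the four confusion-matrix counts. The sketch below does this in Python using the counts reported above. Because caret treats class 0 as the "positive" class here, sensitivity is the recall for non-survivors. The variable names are illustrative.

```python
# Counts from the confusion matrix (rows = prediction, columns = reference)
pred0_ref0, pred0_ref1 = 95, 23
pred1_ref0, pred1_ref1 = 19, 41

total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

# Share of all test passengers classified correctly
accuracy = (pred0_ref0 + pred1_ref1) / total

# Class 0 is the 'Positive' class in this output
sensitivity = pred0_ref0 / (pred0_ref0 + pred1_ref0)     # true 0s predicted as 0
specificity = pred1_ref1 / (pred1_ref1 + pred0_ref1)     # true 1s predicted as 1
pos_pred_value = pred0_ref0 / (pred0_ref0 + pred0_ref1)  # predicted 0s that are 0
neg_pred_value = pred1_ref1 / (pred1_ref1 + pred1_ref0)  # predicted 1s that are 1
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [
    ("Accuracy", accuracy),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Pos Pred Value", pos_pred_value),
    ("Neg Pred Value", neg_pred_value),
    ("Balanced Accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.4f}")
```

Note that the Python and R accuracies differ (about 0.80 versus 0.76) because the two scripts draw different train/test splits, so the numbers are not directly comparable.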
✅ Takeaway: Logistic regression is a fast and interpretable baseline model. It estimates probabilities and provides meaningful coefficients, making it a good fit for many real-world classification tasks.
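One common way to read the coefficients is to exponentiate them into odds ratios: a coefficient `b` means a one-unit increase in that feature multiplies the odds of the positive class by `exp(b)`, holding the other features fixed. The sketch below uses hypothetical coefficient values, not ones taken from the models above. In practice they would come from `model.coef_` in scikit-learn or `coef(log_model)` in R.

```python
import math

def odds_ratios(coefficients):
    """Convert log-odds coefficients into odds ratios."""
    return {name: math.exp(value) for name, value in coefficients.items()}

# Hypothetical coefficients on the log-odds scale (illustrative values only)
coefficients = {"Pclass": -1.0, "Sex": 2.5, "Age": -0.03}

ratios = odds_ratios(coefficients)
for name, ratio in ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of the positive class)")
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and below 1 pushes them away. The magnitudes are only comparable across features measured on similar scales.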