Q&A 3 How do you encode categorical variables for machine learning?
3.1 Explanation
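Most machine learning models require numerical input. Categorical variables stored as text, such as Sex and Embarked, must be converted to a numeric format before modeling. Pclass is also categorical, but in the usual Titanic file it is already stored as an integer (1, 2, 3).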
There are two common encoding strategies:
- Label Encoding: Assigns an integer to each category (e.g., male = 0, female = 1)
- One-Hot Encoding: Creates a new binary column for each category (e.g., Embarked_C, Embarked_Q, Embarked_S)
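To make the difference concrete, here is a minimal sketch on a three-row toy table. The toy values and the names `toy`, `sex_codes`, and `embarked_onehot` are made up for illustration and are not part of the Titanic file.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Label encoding: each category becomes one integer in a single column
sex_codes = {"male": 0, "female": 1}
toy["Sex_label"] = toy["Sex"].map(sex_codes)

# One-hot encoding: each category gets its own 0/1 column
embarked_onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Show the original columns next to both encodings
toy_encoded = pd.concat([toy, embarked_onehot], axis=1)
print(toy_encoded)
```

Label encoding keeps the table narrow but implies an order between the integers, which is harmless for a two-level variable. One-hot encoding avoids that implied order at the cost of one extra column per category.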
We’ll use both techniques to prepare the Titanic dataset.
3.2 Python Code
# Encoding categorical variables in Python
import pandas as pd
# Load dataset
df = pd.read_csv("data/titanic.csv")
# Fill missing values first (as done previously)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Label encode Sex (binary category)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# One-hot encode Embarked (multi-class); drop_first=True drops Embarked_C as the baseline
df_encoded = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
# Preview the result
print(df_encoded[['Sex', 'Embarked_Q', 'Embarked_S']].head())
   Sex  Embarked_Q  Embarked_S
0    0       False        True
1    1       False       False
2    1       False        True
3    1       False        True
4    0       False        True
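In the preview above, row 1 is False in both Embarked_Q and Embarked_S. With drop_first=True, that pattern stands for the dropped category, Embarked_C. The sketch below uses a made-up four-value series (`ports` is a hypothetical name) to show why the dropped column carries no extra information.

```python
import pandas as pd

# Hypothetical toy values standing in for the Embarked column
ports = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# Full one-hot encoding versus the drop_first variant
full = pd.get_dummies(ports, prefix="Embarked", dtype=int)
reduced = pd.get_dummies(ports, prefix="Embarked", dtype=int, drop_first=True)

# Every row of the full encoding sums to 1, so any one column
# can be derived from the others (the source of multicollinearity)
print(full.sum(axis=1).tolist())

# In the reduced encoding, an all-zero row marks the dropped level "C"
is_c = reduced.sum(axis=1) == 0
print(ports[is_c].tolist())
```

Dropping one level matters most for linear models, where redundant columns make the coefficients ambiguous. Tree-based models are generally less sensitive to it.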
3.3 R Code
# Encoding categorical variables in R
library(readr)
library(dplyr)
library(fastDummies)
# Load dataset
df <- read_csv("data/titanic.csv")
# Impute missing values
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked
# Label encode Sex
df$Sex <- ifelse(df$Sex == "male", 0, 1)
# One-hot encode Embarked (drop first to avoid multicollinearity)
df_encoded <- fastDummies::dummy_cols(
  df,
  select_columns = "Embarked",
  remove_first_dummy = TRUE,
  remove_selected_columns = TRUE
)
# Preview the result
head(df_encoded[, c("Sex", "Embarked_Q", "Embarked_S")])
# A tibble: 6 × 3
    Sex Embarked_Q Embarked_S
  <dbl>      <int>      <int>
1     0          0          1
2     1          0          0
3     1          0          1
4     1          0          1
5     0          0          1
6     0          1          0
✅ Takeaway: Encoding categorical variables ensures your ML models can interpret the input data. Use label encoding for binary categories and one-hot encoding for variables with multiple levels.