Q&A 3 How do you encode categorical variables for machine learning?

3.1 Explanation

Most machine learning models require numerical input. Categorical variables like Sex, Embarked, or Pclass must be transformed into a numeric format before modeling.

There are two common encoding strategies:

  • Label Encoding: Assigns an integer to each category (e.g., male = 0, female = 1)
  • One-Hot Encoding: Creates a new binary column for each category (e.g., Embarked_C, Embarked_Q, Embarked_S)

We’ll use both techniques to prepare the Titanic dataset.
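
To make the difference concrete before touching the dataset, here is a minimal sketch of what each strategy computes, written in plain Python with no libraries. The `ports` list and the variable names are hypothetical toy values chosen for illustration, not rows from the Titanic file.

```python
# Toy column of embarkation ports (hypothetical values, not the real dataset)
ports = ["S", "C", "S", "Q"]

# Label encoding: map each distinct category to an integer
categories = sorted(set(ports))                      # ['C', 'Q', 'S']
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[p] for p in ports]

# One-hot encoding: one binary indicator per category, in the same order
one_hot = [[int(p == cat) for cat in categories] for p in ports]

print(label_map)   # {'C': 0, 'Q': 1, 'S': 2}
print(labels)      # [2, 0, 2, 1]
for row in one_hot:
    print(row)     # [0, 0, 1], then [1, 0, 0], [0, 0, 1], [0, 1, 0]
```

Note that the integer codes suggest an ordering (C < Q < S) that the ports do not actually have. That is the usual reason to prefer one-hot encoding for a variable like Embarked, while a two-level variable like Sex can safely be mapped to 0 and 1.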


3.2 Python Code

# Encoding categorical variables in Python
import pandas as pd

# Load dataset
df = pd.read_csv("data/titanic.csv")

# Fill missing values first (as done previously)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Label encode Sex (binary category)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# One-hot encode Embarked (multi-class)
# drop_first=True drops the first level (Embarked_C) as the baseline to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Preview the result
print(df_encoded[['Sex', 'Embarked_Q', 'Embarked_S']].head())
   Sex  Embarked_Q  Embarked_S
0    0       False        True
1    1       False       False
2    1       False        True
3    1       False        True
4    0       False        True
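
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a hypothetical four-row frame (`toy`), not the Titanic data, and passes `dtype=int` only so the indicators print as 0/1 instead of booleans. It compares the full one-hot encoding with the reduced one.

```python
import pandas as pd

# Toy frame (hypothetical values) with one three-level categorical column
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# Full one-hot encoding: one indicator column per level
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# Reduced encoding: the first level (C) is dropped and becomes the baseline
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))  # ['Embarked_Q', 'Embarked_S']

# Every row of the full encoding sums to 1, so any one column is fully
# determined by the other two; drop_first removes that redundancy.
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

In the reduced encoding, a row with both Embarked_Q and Embarked_S equal to 0 stands for the dropped level, C. This matters mainly for linear models with an intercept; tree-based models are generally unaffected by the extra column.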

3.3 R Code

# Encoding categorical variables in R
library(readr)
library(dplyr)
library(fastDummies)

# Load dataset
df <- read_csv("data/titanic.csv")

# Impute missing values
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked

# Label encode Sex
df$Sex <- ifelse(df$Sex == "male", 0, 1)

# One-hot encode Embarked (drop first to avoid multicollinearity)
df_encoded <- fastDummies::dummy_cols(
  df,
  select_columns = "Embarked",
  remove_first_dummy = TRUE,
  remove_selected_columns = TRUE
)

# Preview the result
head(df_encoded[, c("Sex", "Embarked_Q", "Embarked_S")])
# A tibble: 6 × 3
    Sex Embarked_Q Embarked_S
  <dbl>      <int>      <int>
1     0          0          1
2     1          0          0
3     1          0          1
4     1          0          1
5     0          0          1
6     0          1          0

✅ Takeaway: Encoding categorical variables ensures your ML models can interpret the input data. Use label encoding for binary categories and one-hot encoding for variables with multiple levels.