Q&A 2 How do you handle missing values in a machine learning dataset?

2.1 Explanation

Missing data is a common issue in real-world datasets. Before feeding data into a machine learning model, you need to decide how to deal with these gaps. The two most common approaches are:

  • Removal: Drop rows or columns with missing values (useful when only a small fraction of the data is missing).
  • Imputation: Fill in missing values using a strategy such as mean, median, or mode.
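
One rough way to choose between the two is to look at the share of missing values in each column. The sketch below uses a small made-up DataFrame (not the Titanic data), and the 50% cutoff is purely an illustrative assumption rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, np.nan, 29.0],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10, 8.46, 13.00],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Assumed rule of thumb: drop columns that are more than half empty,
# and impute columns that have some (but not too many) gaps
DROP_THRESHOLD = 0.5
cols_to_drop = missing_share[missing_share > DROP_THRESHOLD].index.tolist()
cols_to_impute = missing_share[
    (missing_share > 0) & (missing_share <= DROP_THRESHOLD)
].index.tolist()

print(missing_share.round(2))
print("drop:", cols_to_drop)
print("impute:", cols_to_impute)
```

The right threshold depends on how important the column is and how much data you can afford to lose, so treat the number above as a starting point.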

In more advanced workflows, you may also use predictive models to impute missing values, but simple strategies are often sufficient to begin with.
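
As a minimal sketch of model-based imputation, the example below fits a straight line on the rows where a value is known and uses it to predict the missing ones. The toy data, the column names, and the choice of a simple linear fit via `np.polyfit` are all assumptions made for illustration; real workflows often use richer models.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: age is missing for two rows, fare is fully observed
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "age": [20.0, 25.0, np.nan, 35.0, np.nan, 45.0],
})

observed = toy["age"].notna()

# Fit age = slope * fare + intercept using only rows where age is known
slope, intercept = np.polyfit(
    toy.loc[observed, "fare"].to_numpy(),
    toy.loc[observed, "age"].to_numpy(),
    deg=1,
)

# Predict age for the rows where it is missing and fill them in
toy_imputed = toy.copy()
toy_imputed.loc[~observed, "age"] = (
    slope * toy.loc[~observed, "fare"] + intercept
)

print(toy_imputed)
```

Unlike a single median, each filled value here depends on that row's other features, which is the main appeal of predictive imputation.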

For this example, we’ll continue using the Titanic dataset, which has missing values in Age, Cabin, and Embarked.


2.2 Python Code

# Handling missing values in Python (Titanic dataset)
import pandas as pd

# Load the dataset
df = pd.read_csv("data/titanic.csv")

# Check how many missing values per column
print(df.isnull().sum())

# Drop rows with any missing values (this discards every row lacking Cabin,
# so it is not recommended unless necessary)
df_dropped = df.dropna()

# Impute missing Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing Embarked with mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Confirm Age and Embarked are filled (Cabin is left untouched)
print(df.isnull().sum())
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

2.3 R Code

# Handling missing values in R (Titanic dataset)
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/titanic.csv")

# Check number of missing values per column
sapply(df, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0         687           2 
# Drop rows with any missing values (this discards every row lacking Cabin,
# so it is not ideal when most rows would be lost)
df_dropped <- na.omit(df)

# Impute missing Age with median
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)

# Impute missing Embarked with mode
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked

# Confirm Age and Embarked are filled (Cabin is left untouched)
sapply(df, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0           0 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0         687           0 

✅ Takeaway: Handling missing data is essential for building reliable models. Simple imputation methods like median or mode work well for many problems, but they reduce the variable's spread and can distort relationships between features, so always check how they affect your analysis.
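
To make that caution concrete, here is a small sketch comparing the spread of a variable before and after median imputation. The values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; three of the eight entries are missing
ages = pd.Series([18.0, 22.0, np.nan, 30.0, 41.0, np.nan, 58.0, np.nan])

# Fill every gap with the median of the observed values
ages_imputed = ages.fillna(ages.median())

# pandas skips NaN by default, so std_before uses only observed values
std_before = ages.std()
std_after = ages_imputed.std()

print(f"median used for imputation: {ages.median():.1f}")
print(f"std of observed values: {std_before:.2f}")
print(f"std after median imputation: {std_after:.2f}")
```

Because every gap receives the same central value, the imputed column looks less variable than the observed data, which can make downstream estimates appear more certain than they are.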