Q&A 4 How do you split a dataset into training and testing sets?

4.1 Explanation

To evaluate how well a machine learning model performs on new, unseen data, we divide the dataset into two parts:

Training set: Used to train the model
Test set: Used to assess how well the model generalizes

A common split is 80% for training and 20% for testing. Because the test rows are never seen during training, the test score estimates performance on new data instead of rewarding the model for memorizing examples it has already seen.
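The idea behind a hold-out split can be sketched with the standard library alone: shuffle the row indices, then cut them at the chosen fraction. This is a minimal illustration, not how any particular library implements it; the helper name `split_indices` and its parameters are hypothetical, and the row count of 891 is taken from the outputs shown later in this section.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled hold-out split (hypothetical helper)."""
    indices = list(range(n_rows))
    # Shuffle so the split does not depend on the original row order
    random.Random(seed).shuffle(indices)
    # Round the test size up so a fractional row goes to the test set
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(891)
print("Training rows:", len(train_idx))
print("Test rows:", len(test_idx))
# No row may appear in both sets
print("Overlap:", len(set(train_idx) & set(test_idx)))
```

Fixing the seed makes the split reproducible, which is the same role `random_state` and `set.seed` play in the code below.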

4.2 Python Code

# Splitting the dataset in Python
import pandas as pd
from sklearn.model_selection import train_test_split

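# Load and preprocess the dataset
# (imputation here uses the full dataset for simplicity; to avoid leaking
# test-set information, fit imputers on the training rows only)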
df = pd.read_csv("data/titanic.csv")
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Define features and target
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show dimensions
print("Training set:", X_train.shape)
print("Test set:", X_test.shape)

Training set: (712, 6)
Test set: (179, 6)
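The call above splits rows at random without looking at the class balance of `y`. If keeping the same proportion of classes in both sets matters, `train_test_split` accepts a `stratify` argument. The sketch below uses small synthetic data (an assumption made so the example runs without the Titanic file), not the dataset from this section.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 30% positive labels
X = [[i] for i in range(100)]
y = [1] * 30 + [0] * 70

# stratify=y asks for the class proportions of y to be kept in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Positive rate (train):", sum(y_train) / len(y_train))
print("Positive rate (test):", sum(y_test) / len(y_test))
```

Stratifying is most useful when one class is rare, since a purely random split can leave the test set with too few examples of it.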

4.3 R Code

# Splitting the dataset in R
library(readr)
library(dplyr)
library(fastDummies)
library(caret)

# Load and preprocess the dataset
df <- read_csv("data/titanic.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
mode_embarked <- names(sort(table(df$Embarked), decreasing = TRUE))[1]
df$Embarked[is.na(df$Embarked)] <- mode_embarked
df$Sex <- ifelse(df$Sex == "male", 0, 1)
df <- fastDummies::dummy_cols(df, select_columns = "Embarked", remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Define features and target
features <- df %>% select(Pclass, Sex, Age, Fare, Embarked_Q, Embarked_S)
target <- df$Survived

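# Split using caret (80/20)
# createDataPartition samples within groups of the target, so the row counts
# can differ slightly from the Python split above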
set.seed(42)
split_index <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- features[split_index, ]
X_test <- features[-split_index, ]
y_train <- target[split_index]
y_test <- target[-split_index]

# Show dimensions
dim(X_train)

[1] 713   6

dim(X_test)

[1] 178   6

✅ Takeaway: Always split your data before modeling. Evaluating on held-out data gives a fair, realistic estimate of model performance and helps you detect overfitting.