Q&A 1 How do you load and inspect a dataset for modeling?
1.1 Recommended Dataset: Titanic Survival (Classification)
- Name: titanic.csv
- Source: Kaggle Titanic Competition
- Task: Predict passenger survival (0/1)
- Features: Pclass, Sex, Age, Fare, Embarked, etc.
Why it’s great:
- ✅ Common in job interviews and tutorials
- ✅ Includes missing values and categorical variables for preprocessing
- ✅ Simple enough for beginners, rich enough for deeper ML tasks
1.2 Explanation
Before building a model, it’s important to load the dataset, inspect its structure, and get a feel for the variables. This helps identify potential issues like missing values, incorrect data types, or outliers.
The Titanic dataset is a classic classification problem where we predict whether a passenger survived based on features like class, age, and sex.
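The checks described above (data types, missing values, outliers) can be sketched in a few lines of pandas. This is a minimal illustration, not a full audit. It uses a small sample built from the first five rows of the dataset preview shown later in this section, so it runs without the CSV file. The helper name `iqr_outlier_count` and the 1.5 × IQR rule are assumptions chosen for illustration.

```python
import pandas as pd

# Small sample copied from the first five rows of the Titanic preview;
# the real file has 891 rows.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
})


def iqr_outlier_count(series):
    """Count values more than 1.5 * IQR outside the quartiles (a rule of thumb)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(outside.sum())


# Data type and missing-value count for each column
report = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
})
print(report)

# Outlier counts for the numeric columns only
numeric = sample.select_dtypes(include="number")
outliers = {col: iqr_outlier_count(numeric[col]) for col in numeric.columns}
print(outliers)
```

On the full dataset the same calls apply to `df` unchanged. With only five rows the outlier counts are not meaningful; they are there to show the mechanics.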
1.3 Python Code
# Load and inspect Titanic dataset in Python
import pandas as pd
# Load the dataset
df = pd.read_csv("data/titanic.csv")
# View the shape and column names
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
# Preview the first few rows
print(df.head())
# Summary statistics
print(df.describe(include='all'))

Shape: (891, 12)
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
PassengerId Survived Pclass Name Sex \
count 891.000000 891.000000 891.000000 891 891
unique NaN NaN NaN 891 2
top NaN NaN NaN Braund, Mr. Owen Harris male
freq NaN NaN NaN 1 577
mean 446.000000 0.383838 2.308642 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN
Age SibSp Parch Ticket Fare Cabin \
count 714.000000 891.000000 891.000000 891 891.000000 204
unique NaN NaN NaN 681 NaN 147
top NaN NaN NaN 347082 NaN B96 B98
freq NaN NaN NaN 7 NaN 4
mean 29.699118 0.523008 0.381594 NaN 32.204208 NaN
std 14.526497 1.102743 0.806057 NaN 49.693429 NaN
min 0.420000 0.000000 0.000000 NaN 0.000000 NaN
25% 20.125000 0.000000 0.000000 NaN 7.910400 NaN
50% 28.000000 0.000000 0.000000 NaN 14.454200 NaN
75% 38.000000 1.000000 0.000000 NaN 31.000000 NaN
max 80.000000 8.000000 6.000000 NaN 512.329200 NaN
Embarked
count 889
unique 3
top S
freq 644
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
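One way to read the summary above: the `count` row gives the number of non-null values, so subtracting it from the 891 rows gives the number of missing values per column. The short sketch below does that arithmetic using figures copied from the output. It also converts the mean of `Survived` into a survivor count, which works because the column is coded 0/1.

```python
# Figures copied from the describe() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing values = total rows minus non-null count
missing = {}
for column, count in non_null.items():
    n_missing = total_rows - count
    missing[column] = n_missing
    print(f"{column}: {n_missing} missing ({n_missing / total_rows:.1%})")

# Survived is coded 0/1, so its mean is the survival rate.
survival_rate = 0.383838
survivors = round(survival_rate * total_rows)
print(f"Survivors: {survivors} of {total_rows}")
```

With the file loaded, `df.isna().sum()` should give the missing counts directly, without the manual subtraction.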
1.4 R Code
# Load and inspect Titanic dataset in R
library(readr)
library(dplyr)
# Load the dataset
df <- read_csv("data/titanic.csv")
# View dimensions and column names
dim(df)
names(df)
# Preview the first few rows
head(df)
# Summary statistics
summary(df)

[1] 891 12
[1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
[6] "Age" "SibSp" "Parch" "Ticket" "Fare"
[11] "Cabin" "Embarked"
# A tibble: 6 × 12
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
6 6 0 3 Moran,… male NA 0 0 330877 8.46 <NA>
# ℹ 1 more variable: Embarked <chr>
PassengerId Survived Pclass Name
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
Median :446.0 Median :0.0000 Median :3.000 Mode :character
Mean :446.0 Mean :0.3838 Mean :2.309
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :891.0 Max. :1.0000 Max. :3.000
Sex Age SibSp Parch
Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
Mode :character Median :28.00 Median :0.000 Median :0.0000
Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Ticket Fare Cabin Embarked
Length:891 Min. : 0.00 Length:891 Length:891
Class :character 1st Qu.: 7.91 Class :character Class :character
Mode :character Median : 14.45 Mode :character Mode :character
Mean : 32.20
3rd Qu.: 31.00
Max. :512.33
✅ Takeaway: Always start your ML workflow by loading and inspecting the dataset. Understanding the structure, variable types, and summary statistics helps guide all downstream decisions, from preprocessing to model selection.
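As one example of inspection feeding into preprocessing, the sketch below groups columns by data type, a common first step before choosing how to impute or encode each one. It is an illustration under assumptions rather than a prescribed workflow: the sample again comes from the first five preview rows, and the variable names `numeric_cols` and `categorical_cols` are hypothetical.

```python
import pandas as pd

# Sample copied from the first five rows of the dataset preview.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Group columns by dtype as a starting point for preprocessing.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

# Note: Pclass is stored as a number but is a class code (1, 2, 3),
# so dtype alone does not settle how a column should be treated.
```

The split is only a first pass. Columns such as Pclass still need a judgment call that the summary statistics can inform.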