Q&A 21 How do you cluster data using hierarchical clustering or DBSCAN?
21.1 Explanation
While K-means assumes clusters are spherical and requires a preset k, other clustering methods offer different advantages:
- Hierarchical Clustering: Builds nested clusters based on distance — great for dendrograms and discovering cluster structure
- DBSCAN: Groups points based on density — ideal for detecting clusters of arbitrary shape and identifying outliers
We’ll demonstrate both using the gene_expression_with_clusters.csv dataset.
21.2 Python Code
# Hierarchical Clustering and DBSCAN in Python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Load dataset
df = pd.read_csv("data/gene_expression_with_clusters.csv")
# Drop non-numeric column
X = df.drop(columns=['SampleID'])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
# Hierarchical clustering
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linked, no_labels=True)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.show()
# DBSCAN
db = DBSCAN(eps=2, min_samples=5).fit(X_scaled)
df['DBSCAN_Cluster'] = db.labels_
# Visualize DBSCAN clusters (optional: with PCA)
from sklearn.decomposition import PCA
X_pca = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['DBSCAN_Cluster'], cmap='viridis', s=50)
plt.title("DBSCAN Clustering (PCA Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()


21.3 R Code
# Hierarchical Clustering and DBSCAN in R
library(readr)
library(dplyr)
library(ggplot2)
library(cluster)
library(dbscan)
library(factoextra)
# Load dataset
df <- read_csv("data/gene_expression_with_clusters.csv")
# Keep numeric columns
X <- df %>% select(-SampleID)
# Scale data
X_scaled <- scale(X)
# Hierarchical clustering
hc <- hclust(dist(X_scaled), method = "ward.D2")
plot(hc, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")
# DBSCAN
set.seed(42)
db <- dbscan(X_scaled, eps = 2, minPts = 5)
# PCA for visualization
pca <- prcomp(X_scaled)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$cluster <- factor(db$cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(size = 2) +
labs(title = "DBSCAN Clustering (PCA Projection)\n") +
theme_minimal()
✅ Takeaway: Hierarchical clustering helps uncover structure without predefining clusters, while DBSCAN excels at finding complex shapes and outliers. When K-means isn’t enough, these methods reveal deeper patterns in your data.