Overview

Unsupervised learning discovers patterns in data without labeled examples. Unlike Supervised Learning, there’s no “correct answer” to learn from. The algorithm must find structure on its own. The core challenge is defining what “good structure” means without ground truth.

Key Idea

Data is not uniformly distributed. Real-world data clusters, has underlying dimensions, and follows patterns. Unsupervised learning exploits this non-uniformity.

Types of Unsupervised Learning

1. Clustering

Group data points such that points within a cluster are more similar to each other than to those in other clusters.

2. Dimensionality Reduction

Find a lower-dimensional representation that preserves important structure (variance, distances, neighborhoods).

3. Anomaly Detection

Identify data points that don’t fit the learned pattern of “normal.”

4. Association Rule Learning

Discover relationships between variables.


Mathematical Foundation

The Unsupervised Setup

We have:

  • Input space $\mathcal{X}$: the domain of possible inputs
  • Unlabeled dataset $D = \{x_1, x_2, \dots, x_n\}$ where $x_i \in \mathcal{X}$

No labels $y_i$. The goal varies by task:

  • Clustering: Learn a mapping $f: \mathcal{X} \to \{1, \dots, K\}$
  • Dimensionality Reduction: Learn a mapping $f: \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$
  • Density Estimation: Learn the probability distribution $p(x)$

Clustering Algorithms

K-Means Clustering

The most widely used clustering algorithm. Partitions data into $K$ clusters by minimizing within-cluster variance.

Objective Function (Inertia):

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is the centroid (mean) of cluster $k$.

Algorithm:

  1. Initialize $K$ centroids randomly
  2. Assignment Step: Assign each point to its nearest centroid
  3. Update Step: Recompute each centroid as the mean of its assigned points
  4. Repeat until convergence

Why it works: Each step monotonically decreases $J$. The assignment step minimizes $J$ for fixed centroids; the update step minimizes $J$ for fixed assignments. Since $J$ is bounded below, the algorithm converges.
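As a concrete illustration, here is a minimal NumPy sketch of the alternating assignment/update loop (Lloyd's algorithm), using plain random initialization rather than K-Means++ and hypothetical two-blob toy data:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Plain random init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments stable: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

With well-separated blobs the loop recovers the true grouping from almost any initialization; on harder data, K-Means++ seeding reduces the risk of a poor local minimum.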

Limitations:

  • Must specify $K$ in advance
  • Sensitive to initialization (use K-Means++ for better initialization)
  • Assumes spherical, equally-sized clusters
  • Converges to local minimum

Hierarchical Clustering

Builds a tree (dendrogram) of clusters. Two approaches:

Agglomerative (Bottom-Up):

  1. Start with each point as its own cluster
  2. Repeatedly merge the two closest clusters
  3. Stop when desired number of clusters reached

Divisive (Top-Down):

  1. Start with all points in one cluster
  2. Recursively split clusters
  3. Stop when desired granularity reached

Linkage Methods (how to measure cluster distance):

  • Single Linkage: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
  • Complete Linkage: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
  • Average Linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
  • Ward’s Method: Minimize increase in total within-cluster variance
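A naive agglomerative sketch in NumPy using single linkage (hypothetical toy points; the triple loop is roughly O(n^3), so it only suits small n — library implementations are far more efficient):

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    """Bottom-up: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(X))]              # each point alone
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > n_clusters:
        # Find the pair of clusters with smallest single-linkage distance
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0]])
labels = agglomerative_single_linkage(X, n_clusters=2)
```

Swapping `min` for `max` or a mean in the inner comparison turns this into complete or average linkage, which is exactly the difference the list above describes.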

DBSCAN (Density-Based Spatial Clustering)

Clusters are dense regions separated by sparse regions. Handles arbitrary cluster shapes.

Key Parameters:

  • $\varepsilon$ (eps): Neighborhood radius
  • MinPts: Minimum points to form a dense region

Point Classification:

  • Core Point: Has at least MinPts points within radius $\varepsilon$
  • Border Point: Within $\varepsilon$ of a core point but not itself core
  • Noise Point: Neither core nor border

Algorithm:

  1. Find all core points
  2. Connect core points that are within $\varepsilon$ of each other
  3. Assign border points to nearby clusters
  4. Label remaining points as noise
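The steps above can be sketched as follows (toy data; a naive O(n^2) neighborhood computation rather than the spatial index a real implementation would use):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return cluster ids 0..k-1, or -1 for noise points."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # incl. self
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core point
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster       # border or core point joins
                    if core[nb]:
                        frontier.append(nb)    # only core points expand further
        cluster += 1
    return labels

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [3, 3], [3.1, 3], [3, 3.1], [3.1, 3.1],
              [10, 10]])                        # lone outlier
labels = dbscan(X, eps=0.5, min_pts=3)
```

Note how the isolated point at (10, 10) never reaches MinPts neighbors and keeps the noise label -1, while both dense groups become separate clusters.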

Advantages: No need to specify $K$, finds arbitrary shapes, robust to outliers.

Limitations: Struggles with clusters of varying density, sensitive to $\varepsilon$ and MinPts.


Dimensionality Reduction

PCA (Principal Component Analysis)

Find orthogonal directions (principal components) that maximize variance in the data.

Mathematical Formulation: Given a centered data matrix $X \in \mathbb{R}^{n \times d}$, find projection directions.

The first principal component maximizes:

$$w_1 = \arg\max_{\|w\| = 1} w^\top \Sigma w$$

where $\Sigma = \frac{1}{n} X^\top X$ is the covariance matrix.

Solution: The principal components are the eigenvectors of $\Sigma$, ordered by eigenvalue magnitude:

$$\Sigma = W \Lambda W^\top$$

where $W$ contains eigenvectors as columns and $\Lambda$ is diagonal with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.

Variance Explained: The proportion of variance captured by the first $k$ components:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$

Projection: To reduce to $k$ dimensions:

$$Z = X W_k$$

where $W_k \in \mathbb{R}^{d \times k}$ contains the top $k$ eigenvectors.
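The whole pipeline (centering, eigendecomposition, projection, variance explained) fits in a short NumPy sketch, here run on synthetic data stretched mostly along one direction:

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center the data first
    cov = Xc.T @ Xc / len(X)                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W_k = eigvecs[:, :k]                     # top-k principal directions
    Z = Xc @ W_k                             # project to k dimensions
    explained = eigvals[:k].sum() / eigvals.sum()
    return Z, W_k, explained

# Synthetic data: strong variance along one direction, small noise off it
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])
Z, W, explained = pca(X, k=1)
```

Because the data is nearly one-dimensional, a single component captures almost all the variance, which is exactly what the variance-explained ratio measures.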

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear dimensionality reduction optimized for visualization. Preserves local neighborhood structure.

Intuition: Points that are close in high-dimensional space should be close in low-dimensional space.

High-dimensional similarities (Gaussian kernel):

$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$

Symmetrize:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Low-dimensional similarities (t-distribution with 1 degree of freedom):

$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$

The t-distribution has heavier tails than the Gaussian, allowing moderate distances in high dimensions to become larger distances in low dimensions (alleviating the crowding problem).

Objective: Minimize the KL divergence between $P$ and $Q$:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Key Parameter: Perplexity (roughly, effective number of neighbors to consider). Typical range: 5-50.
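To make the objective concrete, here is a sketch that computes the high-dimensional similarities, the low-dimensional t-distribution similarities, and the KL divergence for a given embedding. It simplifies by using one fixed bandwidth for all points instead of the per-point bandwidth found via perplexity search, so it is an illustration of the objective, not a full t-SNE:

```python
import numpy as np

def kl_tsne(X, Y, sigma=1.0):
    """KL(P || Q) between Gaussian high-d similarities (single fixed sigma,
    a simplification) and Student-t low-d similarities for embedding Y."""
    n = len(X)
    # High-dimensional conditional similarities p_{j|i}
    D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    P_cond = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    P = (P_cond + P_cond.T) / (2 * n)           # symmetrize
    # Low-dimensional similarities q_ij (t-distribution, 1 dof)
    E2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)
    Q = 1.0 / (1.0 + E2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # KL divergence over the off-diagonal pairs
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
good = kl_tsne(X, X[:, :2])                     # embedding echoes the data
bad = kl_tsne(X, rng.normal(size=(20, 2)))      # unrelated random embedding
```

A full t-SNE implementation descends the gradient of this cost with respect to the embedding coordinates `Y`.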

Limitations:

  • Non-parametric (can’t project new points directly)
  • Computationally expensive: $O(n^2)$ in the number of points
  • Results depend on random initialization
  • Cluster sizes in visualization don’t reflect true cluster sizes

UMAP (Uniform Manifold Approximation and Projection)

Modern alternative to t-SNE. Based on Riemannian geometry and algebraic topology. Generally faster and better preserves global structure.

Key Differences from t-SNE:

  • Constructs a weighted graph based on local distances
  • Optimizes cross-entropy rather than KL divergence
  • Can embed new points after training

Anomaly Detection

Statistical Methods

Z-Score: Flag points where $\left| \frac{x_i - \mu}{\sigma} \right| > 3$

Mahalanobis Distance: Accounts for correlations between features:

$$D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$
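Both detectors fit in a few lines of NumPy on toy data with one planted outlier (the threshold of 3 used here is a common convention, not a universal rule):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X = np.vstack([X, [[8.0, 8.0]]])               # plant one obvious outlier

# Z-score per feature: flag points with any |z| > 3
z = (X - X.mean(axis=0)) / X.std(axis=0)
z_flags = np.any(np.abs(z) > 3, axis=1)

# Mahalanobis distance: quadratic form with the inverse covariance,
# so correlated features are accounted for
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
maha_flags = maha > 3
```

On this isotropic toy data the two methods agree; on strongly correlated features the Mahalanobis distance catches outliers that per-feature z-scores miss.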

Isolation Forest

Based on the principle that anomalies are “few and different”.

Algorithm:

  1. Build trees by randomly selecting features and split values
  2. Anomalies require fewer splits to isolate
  3. Anomaly score based on average path length across trees

Path Length Interpretation:

  • Short path length → likely anomaly
  • Long path length → likely normal
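The path-length idea can be sketched with a single recursive random cut per level (a toy version: a real Isolation Forest also subsamples the data and normalizes path lengths into a score, which this sketch omits):

```python
import random

def isolation_depth(point, data, depth=0, max_depth=20, rng=None):
    """Splits needed to isolate `point` via random axis-aligned cuts."""
    rng = rng or random.Random(0)
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(x[dim] for x in data)
    hi = max(x[dim] for x in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains our point
    side = [x for x in data if (x[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, side, depth + 1, max_depth, rng)

def avg_depth(point, data, n_trees=50):
    """Average path length over many random trees."""
    rngs = [random.Random(s) for s in range(n_trees)]
    return sum(isolation_depth(point, data, rng=r) for r in rngs) / n_trees

# Tight cluster near the origin plus one far-away anomaly
cluster = [(random.Random(i).gauss(0, 1), random.Random(i + 999).gauss(0, 1))
           for i in range(100)]
data = cluster + [(10.0, 10.0)]
normal_d = avg_depth(cluster[0], data)
anomaly_d = avg_depth((10.0, 10.0), data)
```

The far-away point is typically cut off after only a couple of random splits, while a point inside the cluster survives many more, which is exactly the "short path → anomaly" signal.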

One-Class SVM

Learn a boundary around “normal” data. Uses Support Vector Machine principles with only one class; anything outside the boundary is flagged as anomalous.


Association Rule Learning

Discover rules of the form $A \Rightarrow B$ between itemsets.

Key Metrics:

  • Support: $\mathrm{supp}(A) = \frac{|\{t \in T : A \subseteq t\}|}{|T|}$ (how frequently the itemset appears)
  • Confidence: $\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}$ (reliability of the rule)
  • Lift: $\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}$ (if $> 1$, positive correlation)

Apriori Algorithm: Prune itemsets with support below threshold, build rules from frequent itemsets.
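The three metrics are easy to compute directly on a small hypothetical basket of transactions:

```python
# Each transaction is the set of items bought together (toy data)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """> 1 means the rule beats chance co-occurrence."""
    return confidence(antecedent, consequent) / support(consequent)

s = support({"bread", "butter"})          # 3 of 5 transactions
c = confidence({"bread"}, {"butter"})     # 3 of the 4 bread transactions
l = lift({"bread"}, {"butter"})           # 0.75 / 0.8, slightly below 1
```

Apriori's pruning exploits the fact that `support` is monotone: no superset of an infrequent itemset can be frequent, so whole branches of the search can be skipped.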


Practical Application

When to Use Each Algorithm

| Scenario | Recommended Approach |
| --- | --- |
| Unknown number of clusters | Hierarchical, DBSCAN |
| Large dataset, known $K$ | K-Means, Mini-Batch K-Means |
| Arbitrary cluster shapes | DBSCAN, Spectral Clustering |
| Visualization (2D/3D) | t-SNE, UMAP |
| Feature extraction / compression | PCA |
| Noise/outlier detection | DBSCAN, Isolation Forest |
| Market basket analysis | Apriori, FP-Growth |

Choosing Number of Clusters (K)

  • Elbow Method: Plot inertia vs $K$, look for the “elbow”
  • Silhouette Score: Measures how similar points are to their own cluster vs others. Range $[-1, 1]$, higher is better
  • Gap Statistic: Compare within-cluster dispersion to null reference
  • Domain Knowledge: Often the most reliable
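A minimal NumPy sketch of the silhouette score, compared on correct vs. shuffled labels for two synthetic blobs:

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean of (b - a) / max(a, b) over points, where a is the mean
    within-cluster distance and b the mean distance to the nearest
    other cluster."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():
            continue                      # singleton cluster: undefined, skip
        a = D[i, same].mean()             # cohesion: own-cluster distances
        b = min(D[i, labels == c].mean()  # separation: nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])
good = silhouette_score(X, np.repeat([0, 1], 30))   # true grouping
bad = silhouette_score(X, np.tile([0, 1], 30))      # labels scrambled
```

The true grouping scores near 1 while scrambled labels hover near 0, which is why sweeping $K$ and picking the highest silhouette is a common model-selection heuristic.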

Common Pitfalls

  • Not scaling features: K-Means and PCA are sensitive to feature scales. Standardize first.
  • Ignoring cluster validation: Always validate with multiple metrics and visual inspection.
  • Over-interpreting t-SNE: Distances between clusters are not meaningful; cluster sizes are misleading.
  • Using PCA for non-linear data: Consider kernel PCA or t-SNE/UMAP instead.
  • Wrong distance metric: Euclidean isn’t always appropriate (e.g., text data → cosine similarity).
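The first pitfall is cheap to avoid: standardize each feature to zero mean and unit variance before clustering or PCA (the numbers below are illustrative):

```python
import numpy as np

# Features on wildly different scales: e.g. income vs. age
X = np.array([[50_000.0, 25.0],
              [120_000.0, 40.0],
              [80_000.0, 33.0]])

# Standardize per feature so no single feature dominates Euclidean
# distances in K-Means or the variance directions in PCA
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the income column alone would determine every distance and every principal component.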

Comparison Table

| Algorithm | Cluster Shape | Scalability | Needs K? | Handles Noise |
| --- | --- | --- | --- | --- |
| K-Means | Spherical | Very Good | Yes | No |
| Hierarchical | Any | Poor ($O(n^2)$ or worse) | No | No |
| DBSCAN | Arbitrary | Good | No | Yes |
| Gaussian Mixture | Elliptical | Good | Yes | No |
| Spectral | Arbitrary | Poor | Yes | No |

| Dim. Reduction | Linear? | Preserves | Speed | New Points |
| --- | --- | --- | --- | --- |
| PCA | Yes | Global variance | Fast | Yes |
| t-SNE | No | Local structure | Slow | No |
| UMAP | No | Local + Global | Medium | Yes |
| Autoencoders | No | Learned features | Varies | Yes |


Resources

Papers

Others


Back to: 01 - Core Fundamentals Index