Overview

Unsupervised learning discovers patterns in data without labeled examples. Unlike Supervised Learning, there’s no “correct answer” to learn from. The algorithm must find structure on its own. The core challenge is defining what “good structure” means without ground truth.

Key Idea

Data is not uniformly distributed. Real-world data clusters, has underlying dimensions, and follows patterns. Unsupervised learning exploits this non-uniformity.

Types of Unsupervised Learning

1. Clustering

Group data points such that points within a cluster are more similar to each other than to those in other clusters.

2. Dimensionality Reduction

Find a lower-dimensional representation that preserves important structure (variance, distances, neighborhoods).

3. Anomaly Detection

Identify data points that don’t fit the learned pattern of “normal.”

4. Association Rule Learning

Discover relationships between variables.


Mathematical Foundation

The Unsupervised Setup

We have:

  • Input space $\mathcal{X}$: the domain of possible inputs
  • Unlabeled dataset $D = \{x_1, x_2, \dots, x_n\}$ where $x_i \in \mathcal{X}$

No labels $y_i$. The goal varies by task:

  • Clustering: Learn a mapping $f: \mathcal{X} \to \{1, \dots, K\}$
  • Dimensionality Reduction: Learn a mapping $f: \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$
  • Density Estimation: Learn the probability distribution $p(x)$

Clustering Algorithms

K-Means Clustering

The most widely used clustering algorithm. Partitions data into $K$ clusters by minimizing within-cluster variance.

Objective Function (Inertia):

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is the centroid (mean) of cluster $k$.

Algorithm:

  1. Initialize $K$ centroids randomly
  2. Assignment Step: Assign each point to its nearest centroid
  3. Update Step: Recompute each centroid as the mean of its assigned points
  4. Repeat until convergence

Why it works: Each step monotonically decreases $J$. The assignment step minimizes $J$ for fixed centroids; the update step minimizes $J$ for fixed assignments. Since $J$ is bounded below, the algorithm converges.
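As a concrete illustration, here is a minimal NumPy sketch of the alternating assignment/update loop (Lloyd's algorithm), using plain random initialization rather than K-Means++ and hypothetical two-blob toy data:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Plain random init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments stable: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

With well-separated blobs the loop recovers the true grouping from almost any initialization; on harder data, K-Means++ seeding reduces the risk of a poor local minimum.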

Limitations:

  • Must specify $K$ in advance
  • Sensitive to initialization (use K-Means++ for better initialization)
  • Assumes spherical, equally-sized clusters
  • Converges to local minimum

Hierarchical Clustering

Builds a tree (dendrogram) of clusters. Two approaches:

Agglomerative (Bottom-Up):

  1. Start with each point as its own cluster
  2. Repeatedly merge the two closest clusters
  3. Stop when desired number of clusters reached

Divisive (Top-Down):

  1. Start with all points in one cluster
  2. Recursively split clusters
  3. Stop when desired granularity reached

Linkage Methods (how to measure cluster distance):

  • Single Linkage: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
  • Complete Linkage: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
  • Average Linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
  • Ward’s Method: Minimize increase in total within-cluster variance
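A naive agglomerative sketch in NumPy using single linkage (hypothetical toy points; the triple loop is roughly O(n^3), so it only suits small n — library implementations are far more efficient):

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    """Bottom-up: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(X))]              # each point alone
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > n_clusters:
        # Find the pair of clusters with smallest single-linkage distance
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0]])
labels = agglomerative_single_linkage(X, n_clusters=2)
```

Swapping `min` for `max` or a mean in the inner comparison turns this into complete or average linkage, which is exactly the difference the list above describes.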

DBSCAN (Density-Based Spatial Clustering)

Clusters are dense regions separated by sparse regions. Handles arbitrary cluster shapes.

Key Parameters:

  • $\varepsilon$ (eps): Neighborhood radius
  • MinPts: Minimum points to form a dense region

Point Classification:

  • Core Point: Has at least MinPts points within radius $\varepsilon$
  • Border Point: Within $\varepsilon$ of a core point but not itself core
  • Noise Point: Neither core nor border

Algorithm:

  1. Find all core points
  2. Connect core points that are within $\varepsilon$ of each other
  3. Assign border points to nearby clusters
  4. Label remaining points as noise
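The steps above can be sketched as follows (toy data; a naive O(n^2) neighborhood computation rather than the spatial index a real implementation would use):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return cluster ids 0..k-1, or -1 for noise points."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # incl. self
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core point
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster       # border or core point joins
                    if core[nb]:
                        frontier.append(nb)    # only core points expand further
        cluster += 1
    return labels

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [3, 3], [3.1, 3], [3, 3.1], [3.1, 3.1],
              [10, 10]])                        # lone outlier
labels = dbscan(X, eps=0.5, min_pts=3)
```

Note how the isolated point at (10, 10) never reaches MinPts neighbors and keeps the noise label -1, while both dense groups become separate clusters.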

Advantages: No need to specify $K$, finds arbitrary shapes, robust to outliers.

Limitations: Struggles with clusters of varying density, sensitive to $\varepsilon$ and MinPts.


Dimensionality Reduction

PCA (Principal Component Analysis)

Find orthogonal directions (principal components) that maximize variance in the data.

Mathematical Formulation: Given a centered data matrix $X \in \mathbb{R}^{n \times d}$, find projection directions.

The first principal component maximizes:

$$w_1 = \arg\max_{\|w\| = 1} w^\top \Sigma w$$

where $\Sigma = \frac{1}{n} X^\top X$ is the covariance matrix.

Solution: The principal components are the eigenvectors of $\Sigma$, ordered by eigenvalue magnitude:

$$\Sigma = W \Lambda W^\top$$

where $W$ contains eigenvectors as columns and $\Lambda$ is diagonal with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.

Variance Explained: The proportion of variance captured by the first $k$ components:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$

Projection: To reduce to $k$ dimensions:

$$Z = X W_k$$

where $W_k \in \mathbb{R}^{d \times k}$ contains the top $k$ eigenvectors.
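The whole pipeline (centering, eigendecomposition, projection, variance explained) fits in a short NumPy sketch, here run on synthetic data stretched mostly along one direction:

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center the data first
    cov = Xc.T @ Xc / len(X)                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W_k = eigvecs[:, :k]                     # top-k principal directions
    Z = Xc @ W_k                             # project to k dimensions
    explained = eigvals[:k].sum() / eigvals.sum()
    return Z, W_k, explained

# Synthetic data: strong variance along one direction, small noise off it
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])
Z, W, explained = pca(X, k=1)
```

Because the data is nearly one-dimensional, a single component captures almost all the variance, which is exactly what the variance-explained ratio measures.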

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear dimensionality reduction optimized for visualization. Preserves local neighborhood structure.

Intuition: Points that are close in high-dimensional space should be close in low-dimensional space.

High-dimensional similarities (Gaussian kernel):

$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$

Symmetrize:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Low-dimensional similarities (t-distribution with 1 degree of freedom):

$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$

The t-distribution has heavier tails than the Gaussian, allowing moderate distances in high dimensions to become larger distances in low dimensions (alleviating the crowding problem).

Objective: Minimize the KL divergence between $P$ and $Q$:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Key Parameter: Perplexity (roughly, effective number of neighbors to consider). Typical range: 5-50.
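To make the objective concrete, here is a sketch that computes the high-dimensional similarities, the low-dimensional t-distribution similarities, and the KL divergence for a given embedding. It simplifies by using one fixed bandwidth for all points instead of the per-point bandwidth found via perplexity search, so it is an illustration of the objective, not a full t-SNE:

```python
import numpy as np

def kl_tsne(X, Y, sigma=1.0):
    """KL(P || Q) between Gaussian high-d similarities (single fixed sigma,
    a simplification) and Student-t low-d similarities for embedding Y."""
    n = len(X)
    # High-dimensional conditional similarities p_{j|i}
    D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    P_cond = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    P = (P_cond + P_cond.T) / (2 * n)           # symmetrize
    # Low-dimensional similarities q_ij (t-distribution, 1 dof)
    E2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)
    Q = 1.0 / (1.0 + E2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # KL divergence over the off-diagonal pairs
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
good = kl_tsne(X, X[:, :2])                     # embedding echoes the data
bad = kl_tsne(X, rng.normal(size=(20, 2)))      # unrelated random embedding
```

A full t-SNE implementation descends the gradient of this cost with respect to the embedding coordinates `Y`.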

Limitations:

  • Non-parametric (can’t project new points directly)
  • Computationally expensive: $O(n^2)$ in the number of points
  • Results depend on random initialization
  • Cluster sizes in visualization don’t reflect true cluster sizes

UMAP (Uniform Manifold Approximation and Projection)

Modern alternative to t-SNE. Based on Riemannian geometry and algebraic topology. Generally faster and better preserves global structure.

Key Differences from t-SNE:

  • Constructs a weighted graph based on local distances
  • Optimizes cross-entropy rather than KL divergence
  • Can embed new points after training

Anomaly Detection

Statistical Methods

Z-Score: Flag points where $\left| \frac{x_i - \mu}{\sigma} \right| > 3$

Mahalanobis Distance: Accounts for correlations between features:

$$D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$
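Both detectors fit in a few lines of NumPy on toy data with one planted outlier (the threshold of 3 used here is a common convention, not a universal rule):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X = np.vstack([X, [[8.0, 8.0]]])               # plant one obvious outlier

# Z-score per feature: flag points with any |z| > 3
z = (X - X.mean(axis=0)) / X.std(axis=0)
z_flags = np.any(np.abs(z) > 3, axis=1)

# Mahalanobis distance: quadratic form with the inverse covariance,
# so correlated features are accounted for
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
maha_flags = maha > 3
```

On this isotropic toy data the two methods agree; on strongly correlated features the Mahalanobis distance catches outliers that per-feature z-scores miss.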

Isolation Forest

Based on the principle that anomalies are “few and different”.

Algorithm:

  1. Build trees by randomly selecting features and split values
  2. Anomalies require fewer splits to isolate
  3. Anomaly score based on average path length across trees

Path Length Interpretation:

  • Short path length → likely anomaly
  • Long path length → likely normal
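The path-length idea can be sketched with a single recursive random cut per level (a toy version: a real Isolation Forest also subsamples the data and normalizes path lengths into a score, which this sketch omits):

```python
import random

def isolation_depth(point, data, depth=0, max_depth=20, rng=None):
    """Splits needed to isolate `point` via random axis-aligned cuts."""
    rng = rng or random.Random(0)
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(x[dim] for x in data)
    hi = max(x[dim] for x in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains our point
    side = [x for x in data if (x[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, side, depth + 1, max_depth, rng)

def avg_depth(point, data, n_trees=50):
    """Average path length over many random trees."""
    rngs = [random.Random(s) for s in range(n_trees)]
    return sum(isolation_depth(point, data, rng=r) for r in rngs) / n_trees

# Tight cluster near the origin plus one far-away anomaly
cluster = [(random.Random(i).gauss(0, 1), random.Random(i + 999).gauss(0, 1))
           for i in range(100)]
data = cluster + [(10.0, 10.0)]
normal_d = avg_depth(cluster[0], data)
anomaly_d = avg_depth((10.0, 10.0), data)
```

The far-away point is typically cut off after only a couple of random splits, while a point inside the cluster survives many more, which is exactly the "short path → anomaly" signal.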

One-Class SVM

Learn a boundary around “normal” data. Uses Support Vector Machine principles with only one class; anything outside the boundary is flagged as anomalous.


Association Rule Learning

Discover rules of the form $A \Rightarrow B$ between itemsets.

Key Metrics:

  • Support: $\mathrm{supp}(A) = \frac{|\{t \in T : A \subseteq t\}|}{|T|}$ (how frequently the itemset appears)
  • Confidence: $\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}$ (reliability of the rule)
  • Lift: $\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}$ (if $> 1$, positive correlation)

Apriori Algorithm: Prune itemsets with support below threshold, build rules from frequent itemsets.
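The three metrics are easy to compute directly on a small hypothetical basket of transactions:

```python
# Each transaction is the set of items bought together (toy data)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """> 1 means the rule beats chance co-occurrence."""
    return confidence(antecedent, consequent) / support(consequent)

s = support({"bread", "butter"})          # 3 of 5 transactions
c = confidence({"bread"}, {"butter"})     # 3 of the 4 bread transactions
l = lift({"bread"}, {"butter"})           # 0.75 / 0.8, slightly below 1
```

Apriori's pruning exploits the fact that `support` is monotone: no superset of an infrequent itemset can be frequent, so whole branches of the search can be skipped.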


Practical Application

When to Use Each Algorithm

| Scenario | Recommended Approach |
| --- | --- |
| Unknown number of clusters | Hierarchical, DBSCAN |
| Large dataset, known $K$ | K-Means, Mini-Batch K-Means |
| Arbitrary cluster shapes | DBSCAN, Spectral Clustering |
| Visualization (2D/3D) | t-SNE, UMAP |
| Feature extraction / compression | PCA |
| Noise/outlier detection | DBSCAN, Isolation Forest |
| Market basket analysis | Apriori, FP-Growth |

Choosing Number of Clusters (K)

  • Elbow Method: Plot inertia vs $K$, look for the “elbow”
  • Silhouette Score: Measures how similar points are to their own cluster vs others. Range $[-1, 1]$, higher is better
  • Gap Statistic: Compare within-cluster dispersion to null reference
  • Domain Knowledge: Often the most reliable
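A minimal NumPy sketch of the silhouette score, compared on correct vs. shuffled labels for two synthetic blobs:

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean of (b - a) / max(a, b) over points, where a is the mean
    within-cluster distance and b the mean distance to the nearest
    other cluster."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():
            continue                      # singleton cluster: undefined, skip
        a = D[i, same].mean()             # cohesion: own-cluster distances
        b = min(D[i, labels == c].mean()  # separation: nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])
good = silhouette_score(X, np.repeat([0, 1], 30))   # true grouping
bad = silhouette_score(X, np.tile([0, 1], 30))      # labels scrambled
```

The true grouping scores near 1 while scrambled labels hover near 0, which is why sweeping $K$ and picking the highest silhouette is a common model-selection heuristic.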

Common Pitfalls

  • Not scaling features: K-Means and PCA are sensitive to feature scales. Standardize first.
  • Ignoring cluster validation: Always validate with multiple metrics and visual inspection.
  • Over-interpreting t-SNE: Distances between clusters are not meaningful; cluster sizes are misleading.
  • Using PCA for non-linear data: Consider kernel PCA or t-SNE/UMAP instead.
  • Wrong distance metric: Euclidean isn’t always appropriate (e.g., text data → cosine similarity).
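The first pitfall is cheap to avoid: standardize each feature to zero mean and unit variance before clustering or PCA (the numbers below are illustrative):

```python
import numpy as np

# Features on wildly different scales: e.g. income vs. age
X = np.array([[50_000.0, 25.0],
              [120_000.0, 40.0],
              [80_000.0, 33.0]])

# Standardize per feature so no single feature dominates Euclidean
# distances in K-Means or the variance directions in PCA
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the income column alone would determine every distance and every principal component.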

Comparison Table

| Algorithm | Cluster Shape | Scalability | Needs K? | Handles Noise |
| --- | --- | --- | --- | --- |
| K-Means | Spherical | Very Good | Yes | No |
| Hierarchical | Any | Poor ($O(n^2)$ or worse) | No | No |
| DBSCAN | Arbitrary | Good | No | Yes |
| Gaussian Mixture | Elliptical | Good | Yes | No |
| Spectral | Arbitrary | Poor | Yes | No |

| Dim. Reduction | Linear? | Preserves | Speed | New Points |
| --- | --- | --- | --- | --- |
| PCA | Yes | Global variance | Fast | Yes |
| t-SNE | No | Local structure | Slow | No |
| UMAP | No | Local + Global | Medium | Yes |
| Autoencoders | No | Learned features | Varies | Yes |


Resources

Papers

Others


Back to: 01 - Core Fundamentals Index