Overview
Unsupervised learning discovers patterns in data without labeled examples. Unlike Supervised Learning, there’s no “correct answer” to learn from. The algorithm must find structure on its own. The core challenge is defining what “good structure” means without ground truth.
Key Idea
Data is not uniformly distributed. Real-world data clusters, has underlying dimensions, and follows patterns. Unsupervised learning exploits this non-uniformity.
Types of Unsupervised Learning
1. Clustering
Group data points such that points within a cluster are more similar to each other than to those in other clusters.
2. Dimensionality Reduction
Find a lower-dimensional representation that preserves important structure (variance, distances, neighborhoods).
3. Anomaly Detection
Identify data points that don’t fit the learned pattern of “normal.”
4. Association Rule Learning
Discover relationships between variables.
Mathematical Foundation
The Unsupervised Setup
We have:
- Input space $\mathcal{X}$: the domain of possible inputs
- Unlabeled dataset $D = \{x_1, x_2, \dots, x_n\}$ where $x_i \in \mathcal{X}$
No labels $y_i$. The goal varies by task:
- Clustering: Learn a mapping $f: \mathcal{X} \to \{1, \dots, K\}$
- Dimensionality Reduction: Learn a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ where $d' \ll d$
- Density Estimation: Learn the probability distribution $p(x)$
Clustering Algorithms
K-Means Clustering
The most widely used clustering algorithm. Partitions data into $K$ clusters by minimizing within-cluster variance.
Objective Function (Inertia):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is the centroid of cluster $k$.
Algorithm:
- Initialize centroids randomly
- Assignment Step: Assign each point to its nearest centroid: $c_i = \arg\min_k \|x_i - \mu_k\|^2$
- Update Step: Recompute each centroid as the mean of its assigned points: $\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$
- Repeat until convergence
Why it works: Each step monotonically decreases $J$. The assignment step minimizes $J$ for fixed centroids; the update step minimizes $J$ for fixed assignments.
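The two alternating steps can be sketched in a few lines. A minimal pure-Python toy (1-D points, fixed iteration count, no convergence check) — illustrative only, not a production implementation:

```python
# Minimal K-Means sketch on 1-D data (pure Python, illustrative only).
def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: (p - centroids[k]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.3, 8.7],
                             centroids=[0.0, 5.0])
print([round(m, 1) for m in centroids])  # → [1.0, 9.0]
```

Even from the deliberately bad initialization `[0.0, 5.0]`, the alternation recovers the two true group means — but with unluckier initializations it can get stuck in a local minimum, which is why K-Means++ initialization is preferred in practice.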
Limitations:
- Must specify $K$ in advance
- Sensitive to initialization (use K-Means++ for better initialization)
- Assumes spherical, equally-sized clusters
- Converges to local minimum
Hierarchical Clustering
Builds a tree (dendrogram) of clusters. Two approaches:
Agglomerative (Bottom-Up):
- Start with each point as its own cluster
- Repeatedly merge the two closest clusters
- Stop when desired number of clusters reached
Divisive (Top-Down):
- Start with all points in one cluster
- Recursively split clusters
- Stop when desired granularity reached
Linkage Methods (how to measure cluster distance):
- Single Linkage: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
- Complete Linkage: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
- Average Linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
- Ward’s Method: Minimize increase in total within-cluster variance
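The agglomerative loop with single linkage, sketched on 1-D toy data (pure Python for illustration; real use would rely on a library such as `scipy.cluster.hierarchy`):

```python
# Agglomerative clustering sketch, single linkage (pure Python, 1-D data).
# Repeatedly merges the two closest clusters until `k` clusters remain.
def agglomerative(points, k):
    clusters = [[p] for p in points]       # start: every point is its own cluster
    def single_link(a, b):                 # min pairwise distance between clusters
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > k:
        # find the closest pair of clusters ...
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)     # ... and merge them
    return clusters

print(agglomerative([0.0, 0.4, 5.0, 5.3, 9.9], k=3))
# → [[0.0, 0.4], [5.0, 5.3], [9.9]]
```

Swapping `min` for `max` inside `single_link` turns this into complete linkage; averaging all pairwise distances gives average linkage.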
DBSCAN (Density-Based Spatial Clustering)
Clusters are dense regions separated by sparse regions. Handles arbitrary cluster shapes.
Key Parameters:
- $\varepsilon$ (eps): Neighborhood radius
- MinPts: Minimum points to form a dense region
Point Classification:
- Core Point: Has at least MinPts points within radius $\varepsilon$
- Border Point: Within $\varepsilon$ of a core point, but not itself a core point
- Noise Point: Neither core nor border
Algorithm:
- Find all core points
- Connect core points that are within $\varepsilon$ of each other
- Assign border points to nearby clusters
- Label remaining points as noise
Advantages: No need to specify $K$, finds arbitrary shapes, robust to outliers.
Limitations: Struggles with clusters of varying density; sensitive to $\varepsilon$ and MinPts.
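The core/border/noise logic above can be sketched in pure Python on 1-D toy data (a real implementation would use a spatial index for the neighborhood queries):

```python
# DBSCAN sketch (pure Python, 1-D data): returns a cluster id per point, -1 = noise.
def dbscan(points, eps, min_pts):
    n = len(points)
    # eps-neighborhood of each point (includes the point itself)
    neighbors = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or len(neighbors[i]) < min_pts:
            continue                          # already visited, or not a core point
        queue = [i]                           # grow a new cluster from core point i
        while queue:
            j = queue.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return [-1 if l is None else l for l in labels]

print(dbscan([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 99.0], eps=0.5, min_pts=3))
# → [0, 0, 0, 1, 1, 1, -1]
```

Border points get absorbed into the cluster of the core point that reaches them first, but do not expand it further; the isolated point at 99.0 is labeled noise.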
Dimensionality Reduction
PCA (Principal Component Analysis)
Find orthogonal directions (principal components) that maximize variance in the data.
Mathematical Formulation: Given a centered data matrix $X \in \mathbb{R}^{n \times d}$, find projection directions.
The first principal component maximizes:
$$w_1 = \arg\max_{\|w\| = 1} w^\top \Sigma w$$
where $\Sigma = \frac{1}{n} X^\top X$ is the covariance matrix.
Solution: The principal components are the eigenvectors of $\Sigma$, ordered by eigenvalue magnitude:
$$\Sigma = W \Lambda W^\top$$
where $W$ contains the eigenvectors as columns and $\Lambda$ is diagonal with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.
Variance Explained: The proportion of variance captured by the first $k$ components:
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$
Projection: To reduce to $k$ dimensions:
$$Z = X W_k$$
where $W_k$ contains the top $k$ eigenvectors.
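The center → covariance → eigendecompose → project pipeline, sketched with NumPy on toy data (a noisy rank-1 cloud in 2-D; variable names are illustrative):

```python
# PCA sketch via eigendecomposition of the covariance matrix (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0]])   # data along direction (2, 1)
X += 0.1 * rng.normal(size=(200, 2))                     # plus a little noise

Xc = X - X.mean(axis=0)                  # 1. center the data
S = Xc.T @ Xc / len(Xc)                  # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # 3. eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]        # 4. reorder by decreasing eigenvalue
eigvals, W = eigvals[order], eigvecs[:, order]

Z = Xc @ W[:, :1]                        # project onto the top component
explained = eigvals[0] / eigvals.sum()   # proportion of variance explained
print(round(float(explained), 3))        # close to 1: one direction dominates
```

Because the data is essentially one-dimensional, the first component captures almost all the variance, and the 2-D cloud compresses to one coordinate with little loss.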
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Non-linear dimensionality reduction optimized for visualization. Preserves local neighborhood structure.
Intuition: Points that are close in high-dimensional space should be close in low-dimensional space.
High-dimensional similarities (Gaussian kernel):
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
Symmetrize:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
Low-dimensional similarities (Student-t distribution with 1 degree of freedom):
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$
The t-distribution has heavier tails than Gaussian, allowing moderate distances in high-d to become larger in low-d (alleviates crowding problem).
Objective: Minimize the KL divergence between $P$ and $Q$:
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Key Parameter: Perplexity (roughly, effective number of neighbors to consider). Typical range: 5-50.
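The heavier tails are easy to see numerically: comparing the unnormalized kernel shapes, the Student-t similarity stays orders of magnitude larger than the Gaussian one at moderate distances, which is what lets embedded clusters spread apart:

```python
# Compare the (unnormalized) Gaussian kernel with the Student-t kernel (1 df)
# used by t-SNE, at increasing pairwise distances d.
import math

for d in [0.5, 1.0, 2.0, 4.0]:
    gauss = math.exp(-d ** 2)        # shape of the high-dimensional similarity
    student_t = 1 / (1 + d ** 2)     # shape of the low-dimensional similarity
    print(f"d={d}: gaussian={gauss:.6f}  t={student_t:.6f}")
```

At $d = 4$ the Gaussian similarity is about $10^{-7}$ while the t similarity is still about $0.06$: to match a given Gaussian similarity, the embedding must place the points much farther apart, relieving the crowding in the center of the map.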
Limitations:
- Non-parametric (can’t project new points directly)
- Computationally expensive: $O(n^2)$ in the number of points
- Results depend on random initialization
- Cluster sizes in visualization don’t reflect true cluster sizes
UMAP (Uniform Manifold Approximation and Projection)
Modern alternative to t-SNE. Based on Riemannian geometry and algebraic topology. Generally faster and better preserves global structure.
Key Differences from t-SNE:
- Constructs a weighted graph based on local distances
- Optimizes cross-entropy rather than KL divergence
- Can embed new points after training
Anomaly Detection
Statistical Methods
Z-Score: Flag points where $\left|\frac{x_i - \mu}{\sigma}\right|$ exceeds a threshold (commonly 3)
Mahalanobis Distance: Accounts for correlations between features:
$$D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$
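The z-score rule in a few lines of pure Python (the data and the threshold of 3 are illustrative conventions, not fixed choices):

```python
# Z-score anomaly flagging sketch: flag points more than 3 std devs from the mean.
import statistics

# 30 points near 10.0 plus one obvious outlier
data = [10.0 + 0.1 * ((i % 5) - 2) for i in range(30)] + [20.0]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
anomalies = [x for x in data if abs(x - mu) / sigma > 3]
print(anomalies)  # → [20.0]
```

One caveat: the outlier itself inflates $\mu$ and $\sigma$, so with very few points or very extreme values this test can mask the anomaly it is meant to find; robust estimates (median, MAD) help.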
Isolation Forest
Based on the principle that anomalies are “few and different”.
Algorithm:
- Build trees by randomly selecting features and split values
- Anomalies require fewer splits to isolate
- Anomaly score based on average path length across trees
Path Length Interpretation:
- Short path length → likely anomaly
- Long path length → likely normal
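A single-feature sketch of the idea (pure Python, 1-D data; a real isolation forest also samples random features and normalizes path lengths into a score):

```python
# Isolation-tree sketch: an anomaly needs fewer random splits to end up alone,
# so its average path length across many random trees is short.
import random

def path_length(points, x, rng, depth=0):
    if len(points) <= 1 or min(points) == max(points):
        return depth                      # x is isolated (or no further split possible)
    split = rng.uniform(min(points), max(points))
    # keep only the side of the split that x falls on
    side = [p for p in points if (p < split) == (x < split)]
    return path_length(side, x, rng, depth + 1)

def avg_path(points, x, trees=200, seed=1):
    rng = random.Random(seed)
    return sum(path_length(points, x, rng) for _ in range(trees)) / trees

r = random.Random(0)
data = [r.gauss(0, 1) for _ in range(100)] + [10.0]   # 10.0 is the anomaly
inlier = sorted(data)[len(data) // 2]                 # a central, "normal" point
print(round(avg_path(data, 10.0), 1), round(avg_path(data, inlier), 1))
```

The anomaly at 10.0 averages only a couple of splits before it is alone, while the central point needs many, matching the short-path → anomaly interpretation above.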
One-Class SVM
Learn a boundary around “normal” data, applying Support Vector Machine principles with only one class.
Association Rule Learning
Discover rules of the form $X \Rightarrow Y$ between itemsets (e.g., market-basket rules such as “customers who buy bread also buy butter”).
Key Metrics:
- Support: $\mathrm{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$ (how frequently the itemset appears)
- Confidence: $\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$ (reliability of the rule)
- Lift: $\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}$ (if lift $> 1$, positive correlation)
Apriori Algorithm: Prune itemsets with support below threshold, build rules from frequent itemsets.
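The three metrics computed on a toy transaction list (pure Python; the item names and transactions are made up for illustration):

```python
# Association-rule metrics sketch: support, confidence, lift
# for the rule {bread} → {butter} over a toy transaction list.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
conf = support(X | Y) / support(X)   # 0.6 / 0.8 = 0.75
lift = conf / support(Y)             # 0.75 / 0.6 = 1.25 (> 1: positive association)
print(conf, lift)
```

Apriori's pruning relies on the fact that `support` is anti-monotone: if an itemset falls below the support threshold, every superset of it does too, so whole branches of the search can be skipped.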
Practical Application
When to Use Each Algorithm
| Scenario | Recommended Approach |
|---|---|
| Unknown number of clusters | Hierarchical, DBSCAN |
| Large dataset, known $K$ | K-Means, Mini-Batch K-Means |
| Arbitrary cluster shapes | DBSCAN, Spectral Clustering |
| Visualization (2D/3D) | t-SNE, UMAP |
| Feature extraction / compression | PCA |
| Noise/outlier detection | DBSCAN, Isolation Forest |
| Market basket analysis | Apriori, FP-Growth |
Choosing Number of Clusters (K)
- Elbow Method: Plot inertia vs. $K$ and look for the “elbow”
- Silhouette Score: Measures how similar points are to their own cluster vs. others. Range $[-1, 1]$, higher is better
- Gap Statistic: Compare within-cluster dispersion to null reference
- Domain Knowledge: Often the most reliable
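A self-contained elbow-method sketch on 1-D toy data with three true clusters (pure Python with a naive deterministic init for reproducibility; in practice you would plot scikit-learn's `inertia_` across values of $K$):

```python
# Elbow-method sketch: inertia drops sharply until K matches the true
# cluster count (3 here), then flattens.
def inertia(points, centroids):
    return sum(min((p - m) ** 2 for m in centroids) for p in points)

def kmeans_centroids(points, k, iters=20):
    centroids = points[::len(points) // k][:k]   # naive spread-out init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids

data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]   # 3 true clusters
for k in [1, 2, 3, 4]:
    print(k, round(inertia(data, kmeans_centroids(data, k)), 2))
```

The drop from $K=2$ to $K=3$ is large, while $K=3$ to $K=4$ barely improves: the elbow sits at the true cluster count.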
Common Pitfalls
- Not scaling features: K-Means and PCA are sensitive to feature scales. Standardize first.
- Ignoring cluster validation: Always validate with multiple metrics and visual inspection.
- Over-interpreting t-SNE: Distances between clusters are not meaningful; cluster sizes are misleading.
- Using PCA for non-linear data: Consider kernel PCA or t-SNE/UMAP instead.
- Wrong distance metric: Euclidean isn’t always appropriate (e.g., text data → cosine similarity).
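The first pitfall in action: standardization puts features measured on wildly different scales onto comparable ranges before any distance computation (pure Python sketch; the feature values are made up):

```python
# Feature-scaling sketch: standardize each feature to zero mean, unit variance
# so that no single feature dominates Euclidean distances in K-Means or PCA.
import statistics

def standardize(column):
    mu, sigma = statistics.mean(column), statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

incomes = [30_000.0, 45_000.0, 60_000.0]   # large-scale feature
ages = [25.0, 35.0, 45.0]                  # small-scale feature
print(standardize(incomes))
print(standardize(ages))
```

Before scaling, a 15,000-unit income gap dwarfs a 10-year age gap in any Euclidean distance; after scaling, both features contribute equally.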
Comparison Table
| Algorithm | Cluster Shape | Scalability | Needs K? | Handles Noise |
|---|---|---|---|---|
| K-Means | Spherical | Very Good | Yes | No |
| Hierarchical | Any | Poor ($O(n^2)$ or worse) | No | No |
| DBSCAN | Arbitrary | Good | No | Yes |
| Gaussian Mixture | Elliptical | Good | Yes | No |
| Spectral | Arbitrary | Poor | Yes | No |

| Dim. Reduction | Linear? | Preserves | Speed | New Points |
|---|---|---|---|---|
| PCA | Yes | Global variance | Fast | Yes |
| t-SNE | No | Local structure | Slow | No |
| UMAP | No | Local + Global | Medium | Yes |
| Autoencoders | No | Learned features | Varies | Yes |
Related Concepts
- Supervised Learning
- PCA (Principal Component Analysis)
- Dimensionality Reduction
- Support Vector Machines
- Feature Engineering Principles
Resources
Papers
- A Tutorial on Spectral Clustering
- Visualizing Data using t-SNE
- UMAP: Uniform Manifold Approximation and Projection
Others
- Clustering with Scikit-Learn
- Understanding UMAP
- StatQuest: K-Means Clustering
- StatQuest: Hierarchical Clustering
- StatQuest: PCA
- StatQuest: t-SNE
Back to: 01 - Core Fundamentals Index