## Overview
A supervised learning model is trained on labeled data: each training example pairs an input with a known output, and the model learns to predict outputs for new, unseen inputs.
## Types
- Classification: predicting discrete categories (e.g., spam vs. not spam)
- Regression: predicting continuous values (e.g., house prices)
## Mathematical Foundation

### Formal Setup
- Input space $\mathcal{X}$: the domain of possible inputs
- Output space $\mathcal{Y}$: the domain of possible outputs/labels
- Training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$
- Hypothesis function $h: \mathcal{X} \to \mathcal{Y}$ (our model)

The goal is to find a hypothesis $h$ from some hypothesis space $\mathcal{H}$ that generalizes well to unseen data.
### The Learning Objective
We assume there exists some true (but unknown) function $f: \mathcal{X} \to \mathcal{Y}$ that generates our labels, possibly with noise. Our training data comes from some joint distribution $P(X, Y)$. Goal: find $h$ that minimizes the expected risk (generalization error):

$$R(h) = \mathbb{E}_{(x, y) \sim P}\left[\ell(h(x), y)\right]$$

where $\ell$ is a loss function measuring prediction error. Since we don't know $P$, we minimize the empirical risk instead:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
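As a concrete sketch, the empirical risk is just the average of a chosen loss over the training set. The NumPy snippet below uses squared error as the example loss; the function names are illustrative, not from any library:

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    """Average loss of hypothesis h over training pairs (x_i, y_i)."""
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(predictions, y))

# Squared error as one concrete choice of the loss function
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
h = lambda x: 2.0 * x[0]  # the hypothesis y = 2x fits this data exactly
print(empirical_risk(h, squared_error, X, y))  # → 0.0
```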
### Common Loss Functions

For Regression ($\mathcal{Y} = \mathbb{R}$):
- Mean Squared Error (MSE): $\ell(\hat{y}, y) = (\hat{y} - y)^2$
- Mean Absolute Error: $\ell(\hat{y}, y) = |\hat{y} - y|$

For Binary Classification ($\mathcal{Y} = \{0, 1\}$):
- 0-1 Loss: $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$ (not differentiable)
- Cross-Entropy Loss: $\ell(\hat{p}, y) = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$, where $\hat{p}$ represents the predicted probability of the positive class.

For Multi-class Classification ($\mathcal{Y} = \{1, \dots, K\}$):
- Categorical Cross-Entropy: $\ell(\hat{p}, y) = -\sum_{k=1}^{K} y_k \log \hat{p}_k$, where $y$ is one-hot encoded and $\hat{p}$ is a probability distribution over the $K$ classes.
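The losses above can be implemented in a few lines of NumPy, vectorized over a batch. The `eps` clipping is an added practical assumption to keep `log` finite, not part of the formulas:

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def zero_one(y_hat, y):
    return np.mean(y_hat != y)  # fraction of misclassified examples

def binary_cross_entropy(p_hat, y, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    # p_hat, y_onehot: shape (batch, K); sum over classes, average over batch
    p_hat = np.clip(p_hat, eps, 1.0)
    return np.mean(-np.sum(y_onehot * np.log(p_hat), axis=1))
```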
## Gradient Descent and Optimization

To minimize the empirical risk, we use gradient descent. For a parameterized model $h_\theta$ with parameters $\theta$:

$$\theta \leftarrow \theta - \eta \nabla_\theta \hat{R}(\theta)$$

where $\eta$ is the learning rate.

If $\theta = (\theta_1, \dots, \theta_d)$, then:

$$\nabla_\theta \hat{R} = \left(\frac{\partial \hat{R}}{\partial \theta_1}, \dots, \frac{\partial \hat{R}}{\partial \theta_d}\right)$$

Each component shows how much the loss changes when adjusting that specific parameter.

SGD (Stochastic Gradient Descent) approximates this gradient using mini-batches:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(h_\theta(x_i), y_i)$$

where $B$ is a randomly sampled batch.
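One epoch of mini-batch SGD can be sketched generically; the function and parameter names below are illustrative, not from a particular library:

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_fn, lr=0.01, batch_size=32, rng=None):
    """One shuffled pass over the data in mini-batches.

    grad_fn(theta, X_batch, y_batch) must return the gradient of the
    batch-averaged loss with respect to theta.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    indices = rng.permutation(len(X))          # shuffle example order
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```

For MSE linear regression, `grad_fn` would return $\frac{2}{|B|} X_B^\top (X_B \theta - y_B)$.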
## Example: Linear Regression

Hypothesis: $h_\theta(x) = \theta^\top x$ (where $x_0 = 1$ for the bias term)

Loss (MSE): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\theta^\top x_i - y_i)^2$

Gradient for one example: $\nabla_\theta \ell = 2(\theta^\top x_i - y_i)\, x_i$

Update rule: $\theta \leftarrow \theta - \eta \cdot 2(\theta^\top x_i - y_i)\, x_i$

A closed-form solution also exists for linear regression (the normal equation):

$$\theta = (X^\top X)^{-1} X^\top y$$

where $X$ is the $n \times d$ design matrix and $y$ is the vector of labels.
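The normal equation can be verified directly on noiseless synthetic data (a sketch; in practice `np.linalg.lstsq` is numerically preferable to an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
y = 4.0 + 2.5 * x                         # true relationship: y = 4 + 2.5x, no noise

X = np.column_stack([np.ones(n), x])      # design matrix; first column is x_0 = 1 (bias)
theta = np.linalg.inv(X.T @ X) @ X.T @ y  # normal equation: (X^T X)^{-1} X^T y
print(theta)  # ≈ [4.0, 2.5]
```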
## Bias-Variance Tradeoff

The expected error under MSE loss decomposes as:

$$\mathbb{E}\left[(y - \hat{h}(x))^2\right] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
- Bias: Error from wrong assumptions (under-fitting)
- Variance: Error from sensitivity to training data fluctuations (overfitting)
- Irreducible Error: Noise in the data itself
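A small simulation makes the tradeoff visible: fit polynomials of low and high degree to many noisy samples of the same underlying function, then compare the bias and spread of their predictions at a single point. All choices here (the sine target, noise level, degrees) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, np.pi, 50)
x0 = np.pi / 2                      # evaluate predictions at one fixed point

stats = {}
for degree in (1, 9):
    preds = []
    for _ in range(200):            # 200 independent noisy training sets
        y_noisy = np.sin(x_grid) + rng.normal(0.0, 0.3, size=x_grid.size)
        coeffs = np.polyfit(x_grid, y_noisy, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    stats[degree] = (preds.mean() - np.sin(x0), preds.var())  # (bias, variance)

# Degree 1 underfits (large bias, small variance);
# degree 9 tracks the noise (small bias, larger variance).
```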
Regularization techniques help balance this tradeoff by adding a penalty term to the empirical risk:

$$\hat{R}_{\text{reg}}(\theta) = \hat{R}(\theta) + \lambda \, \Omega(\theta)$$

Common choices: $\Omega(\theta) = \|\theta\|_2^2$ (L2/Ridge) or $\Omega(\theta) = \|\theta\|_1$ (L1/Lasso).
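The L2 penalty keeps a closed form, $\theta = (X^\top X + \lambda I)^{-1} X^\top y$, while the L1 penalty has no closed form in general. A minimal ridge sketch (note that in practice the bias term is usually excluded from the penalty):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) theta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])      # consistent with y = x1 + 2*x2
print(ridge_fit(X, y, lam=0.0))    # → OLS solution [1.0, 2.0]
print(ridge_fit(X, y, lam=100.0))  # heavily shrunk toward zero
```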
## Comparison Table
| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk | Best For |
|---|---|---|---|---|---|
| Logistic Regression | High | Fast | Medium | Low | Baseline, linear problems |
| Decision Trees | High | Fast | Medium | High | Interpretability |
| Random Forest | Medium | Medium | High | Low | General purpose |
| Gradient Boosting | Low | Slow | Very High | Medium | Competitions, tabular data |
| Support Vector Machines | Low | Slow | High | Medium | High-dimensional |
| Naive Bayes | High | Very Fast | Medium | Low | Text, categorical data |
| KNN | High | Slow | Medium | High | Small datasets |
| Neural Networks | Very Low | Slow | Very High | High | Complex patterns, images |
## Related Concepts
Back to: ML & AI Index