Overview

A supervised learning model is trained on labeled data: each training example pairs an input with a known output (label).

Types

  • Classification: predicting discrete categories
  • Regression: predicting continuous values

Mathematical Foundation

Formal Setup

  • Input space $\mathcal{X}$: the domain of possible inputs
  • Output space $\mathcal{Y}$: the domain of possible outputs/labels
  • Training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$
  • Hypothesis function $h: \mathcal{X} \to \mathcal{Y}$ (our model)

The goal is to find a hypothesis $h$ from some hypothesis space $\mathcal{H}$ that generalizes well to unseen data.

The Learning Objective

We assume there exists some true (but unknown) function $f: \mathcal{X} \to \mathcal{Y}$ that generates our labels, possibly with noise. Our training data comes from some joint distribution $P(X, Y)$. Goal: Find $h$ that minimizes the expected risk (generalization error):

$$R(h) = \mathbb{E}_{(x, y) \sim P}\left[L(h(x), y)\right]$$

where $L$ is a loss function measuring prediction error. Since we don't know $P$, we minimize the empirical risk instead:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)$$
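
The empirical risk is just the average loss over the training set. A minimal NumPy sketch (the function and variable names here are illustrative, not from any library):

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    """Average loss of hypothesis h over the n training examples."""
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])

# Example: squared-error loss with a trivial constant predictor.
squared_loss = lambda y_hat, y: (y_hat - y) ** 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
h = lambda x: 2.0  # always predicts 2, regardless of input
print(empirical_risk(h, squared_loss, X, y))  # (1 + 0 + 1) / 3
```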

Common Loss Functions

For Regression ($\mathcal{Y} = \mathbb{R}$):

  • Mean Squared Error (MSE): $L(h(x), y) = (h(x) - y)^2$
  • Mean Absolute Error (MAE): $L(h(x), y) = |h(x) - y|$

For Binary Classification ($\mathcal{Y} = \{0, 1\}$):

  • 0-1 Loss: $L(h(x), y) = \mathbb{1}[h(x) \neq y]$ (not differentiable)
  • Cross-Entropy Loss: $L(\hat{p}, y) = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$ where $\hat{p}$ represents the predicted probability of class 1.

For Multi-class Classification ($\mathcal{Y} = \{1, \dots, K\}$):

  • Categorical Cross-Entropy: $L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$ where $y$ is one-hot encoded and $\hat{y}$ is a probability distribution over the $K$ classes.
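
The loss functions above can be written directly in NumPy. This is a sketch under my own naming, with small epsilon clipping added so the log terms never see an exact 0 or 1:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error over a batch of predictions."""
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    """Mean absolute error over a batch of predictions."""
    return np.mean(np.abs(y_hat - y))

def binary_cross_entropy(p_hat, y, eps=1e-12):
    """Cross-entropy for binary labels y in {0, 1}, p_hat = P(class 1)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    """Cross-entropy for one-hot labels; p_hat rows are class distributions."""
    return -np.mean(np.sum(y_onehot * np.log(np.clip(p_hat, eps, 1.0)), axis=1))
```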

Gradient Descent and Optimization

To minimize the empirical risk, we use gradient descent. For a parameterized model $h_\theta$ with parameters $\theta$:

$$\theta \leftarrow \theta - \eta \nabla_\theta \hat{R}(\theta)$$

where $\eta$ is the learning rate.

If $\theta = (\theta_1, \dots, \theta_d)$, then:

$$\nabla_\theta \hat{R}(\theta) = \left( \frac{\partial \hat{R}}{\partial \theta_1}, \dots, \frac{\partial \hat{R}}{\partial \theta_d} \right)$$

Each component shows how much the loss changes when adjusting that specific parameter.
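
The per-component interpretation can be checked numerically: nudge one parameter at a time and watch how the loss moves. A small sketch using a centered finite difference (all names here are mine):

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Approximate each component dL/dtheta_j with a centered finite difference."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# f(theta) = theta_0^2 + 3*theta_1 has exact gradient (2*theta_0, 3).
f = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(f, np.array([2.0, 5.0])))  # ~ [4, 3]
```

This trick is also a standard sanity check for hand-derived analytic gradients.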

SGD (Stochastic Gradient Descent) approximates this using mini-batches:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta L(h_\theta(x_i), y_i)$$

where $B \subset \{1, \dots, n\}$ is a randomly sampled batch.
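
A minimal mini-batch SGD loop, sketched in NumPy and demonstrated on noiseless linear data (the `sgd` and `mse_grad` names are illustrative, not from any library):

```python
import numpy as np

def sgd(grad_fn, theta, X, y, lr=0.1, batch_size=2, epochs=1000, seed=0):
    """Minimize the empirical risk with mini-batch SGD.

    grad_fn(theta, X_batch, y_batch) must return the gradient
    averaged over the batch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]  # indices of this mini-batch
            theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta

# Gradient of the squared error for a linear model h_theta(x) = theta^T x.
def mse_grad(theta, Xb, yb):
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # column of 1s = bias
y = np.array([1.0, 3.0, 5.0, 7.0])                              # y = 1 + 2x, no noise
theta = sgd(mse_grad, np.zeros(2), X, y)
print(theta)  # ~ [1, 2]
```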

Example: Linear Regression

Hypothesis: $h_\theta(x) = \theta^\top x$ (where $x_0 = 1$ for the bias term)

Loss (MSE): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\theta^\top x_i - y_i)^2$

Gradient for one example: $\nabla_\theta L = 2(\theta^\top x_i - y_i)\, x_i$

Update rule: $\theta \leftarrow \theta - \eta \cdot 2(\theta^\top x_i - y_i)\, x_i$

A closed-form solution exists for linear regression (the normal equations):

$$\theta = (X^\top X)^{-1} X^\top y$$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix and $y \in \mathbb{R}^n$ is the vector of labels.
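
The normal equations are a one-liner in NumPy. A sketch on synthetic noiseless data generated from $\theta_{\text{true}} = (1, 2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([np.ones(50), x])  # design matrix with x_0 = 1 for the bias
y = 1.0 + 2.0 * x                      # noiseless labels: y = 1 + 2x

# Normal equations: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # ~ [1, 2]
```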

Bias-Variance Tradeoff

The expected error under MSE loss decomposes as:

$$\mathbb{E}\left[(y - \hat{h}(x))^2\right] = \text{Bias}[\hat{h}(x)]^2 + \text{Var}[\hat{h}(x)] + \sigma^2$$

  • Bias: error from wrong assumptions (underfitting)
  • Variance: error from sensitivity to training-data fluctuations (overfitting)
  • Irreducible Error: noise $\sigma^2$ in the data itself

Regularization techniques help balance this tradeoff by adding a penalty term to the empirical risk:

$$\hat{R}_{\text{reg}}(\theta) = \hat{R}(\theta) + \lambda \, \Omega(\theta)$$

Common choices: $\Omega(\theta) = \|\theta\|_2^2$ (L2/Ridge) or $\Omega(\theta) = \|\theta\|_1$ (L1/Lasso).
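
For the L2 penalty, the linear-regression closed form extends directly: the penalty adds $\lambda I$ to $X^\top X$. A sketch showing how larger $\lambda$ shrinks the coefficients (the `ridge` helper is my own naming):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: argmin ||X theta - y||^2 + lam * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=30)

# Larger lambda shrinks the coefficient vector toward zero.
for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge(X, y, lam)))
```

Shrinking the coefficients increases bias but reduces variance, which is exactly the tradeoff the decomposition above describes.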

Comparison Table

| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk | Best For |
|---|---|---|---|---|---|
| Logistic Regression | High | Fast | Medium | Low | Baseline, linear problems |
| Decision Trees | High | Fast | Medium | High | Interpretability |
| Random Forest | Medium | Medium | High | Low | General purpose |
| Gradient Boosting | Low | Slow | Very High | Medium | Competitions, tabular data |
| Support Vector Machines | Low | Slow | High | Medium | High-dimensional data |
| Naive Bayes | High | Very Fast | Medium | Low | Text, categorical data |
| KNN | High | Slow | Medium | High | Small datasets |
| Neural Networks | Very Low | Slow | Very High | High | Complex patterns, images |

Back to: ML & AI Index