## Overview
A supervised learning model is trained on labeled data: each training example pairs an input with a known output, and the model learns to predict outputs for new, unseen inputs.
## Types
- Classification: predicting discrete categories (e.g., spam vs. not spam)
- Regression: predicting continuous values (e.g., house prices)
## Mathematical Foundation

### Formal Setup
- Input space $\mathcal{X}$: the domain of possible inputs
- Output space $\mathcal{Y}$: the domain of possible outputs/labels
- Training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$
- Hypothesis function $h: \mathcal{X} \to \mathcal{Y}$ (our model)

The goal is to find a hypothesis $h$ from some hypothesis space $\mathcal{H}$ that generalizes well to unseen data.
### The Learning Objective
We assume there exists some true (but unknown) function $f: \mathcal{X} \to \mathcal{Y}$ that generates our labels, possibly with noise. Our training data comes from some joint distribution $P(X, Y)$. Goal: find $h$ that minimizes the expected risk (generalization error):

$$R(h) = \mathbb{E}_{(x, y) \sim P}\left[\ell(h(x), y)\right]$$

where $\ell$ is a loss function measuring prediction error. Since we don't know $P$, we minimize the empirical risk instead:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
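As a concrete sketch, the empirical risk is just the average of a chosen loss over the training set. The NumPy snippet below uses squared error as the example loss; the function names are illustrative, not from any library:

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    """Average loss of hypothesis h over training pairs (x_i, y_i)."""
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(predictions, y))

# Squared error as one concrete choice of the loss function
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
h = lambda x: 2.0 * x[0]  # the hypothesis y = 2x fits this data exactly
print(empirical_risk(h, squared_error, X, y))  # → 0.0
```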
### Common Loss Functions

For Regression ($\mathcal{Y} = \mathbb{R}$):
- Mean Squared Error (MSE): $\ell(\hat{y}, y) = (\hat{y} - y)^2$
- Mean Absolute Error: $\ell(\hat{y}, y) = |\hat{y} - y|$

For Binary Classification ($\mathcal{Y} = \{0, 1\}$):
- 0-1 Loss: $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$ (not differentiable)
- Cross-Entropy Loss: $\ell(\hat{p}, y) = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$, where $\hat{p}$ represents the predicted probability of the positive class.

For Multi-class Classification ($\mathcal{Y} = \{1, \dots, K\}$):
- Categorical Cross-Entropy: $\ell(\hat{p}, y) = -\sum_{k=1}^{K} y_k \log \hat{p}_k$, where $y$ is one-hot encoded and $\hat{p}$ is a probability distribution over the $K$ classes.
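The losses above can be implemented in a few lines of NumPy, vectorized over a batch. The `eps` clipping is an added practical assumption to keep `log` finite, not part of the formulas:

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def zero_one(y_hat, y):
    return np.mean(y_hat != y)  # fraction of misclassified examples

def binary_cross_entropy(p_hat, y, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    # p_hat, y_onehot: shape (batch, K); sum over classes, average over batch
    p_hat = np.clip(p_hat, eps, 1.0)
    return np.mean(-np.sum(y_onehot * np.log(p_hat), axis=1))
```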
## Gradient Descent and Optimization

To minimize the empirical risk, we use gradient descent. For a parameterized model $h_\theta$ with parameters $\theta$:

$$\theta \leftarrow \theta - \eta \nabla_\theta \hat{R}(\theta)$$

where $\eta$ is the learning rate.

If $\theta = (\theta_1, \dots, \theta_d)$, then:

$$\nabla_\theta \hat{R} = \left(\frac{\partial \hat{R}}{\partial \theta_1}, \dots, \frac{\partial \hat{R}}{\partial \theta_d}\right)$$

Each component shows how much the loss changes when adjusting that specific parameter.

SGD (Stochastic Gradient Descent) approximates this gradient using mini-batches:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(h_\theta(x_i), y_i)$$

where $B$ is a randomly sampled batch.
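One epoch of mini-batch SGD can be sketched generically; the function and parameter names below are illustrative, not from a particular library:

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_fn, lr=0.01, batch_size=32, rng=None):
    """One shuffled pass over the data in mini-batches.

    grad_fn(theta, X_batch, y_batch) must return the gradient of the
    batch-averaged loss with respect to theta.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    indices = rng.permutation(len(X))          # shuffle example order
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```

For MSE linear regression, `grad_fn` would return $\frac{2}{|B|} X_B^\top (X_B \theta - y_B)$.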
## Example: Linear Regression

Hypothesis: $h_\theta(x) = \theta^\top x$ (where $x_0 = 1$ for the bias term)

Loss (MSE): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\theta^\top x_i - y_i)^2$

Gradient for one example: $\nabla_\theta \ell = 2(\theta^\top x_i - y_i)\, x_i$

Update rule: $\theta \leftarrow \theta - \eta \cdot 2(\theta^\top x_i - y_i)\, x_i$

A closed-form solution also exists for linear regression (the normal equation):

$$\theta = (X^\top X)^{-1} X^\top y$$

where $X$ is the $n \times d$ design matrix and $y$ is the vector of labels.
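The normal equation can be verified directly on noiseless synthetic data (a sketch; in practice `np.linalg.lstsq` is numerically preferable to an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
y = 4.0 + 2.5 * x                         # true relationship: y = 4 + 2.5x, no noise

X = np.column_stack([np.ones(n), x])      # design matrix; first column is x_0 = 1 (bias)
theta = np.linalg.inv(X.T @ X) @ X.T @ y  # normal equation: (X^T X)^{-1} X^T y
print(theta)  # ≈ [4.0, 2.5]
```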
## Bias-Variance Tradeoff

The expected error under MSE loss decomposes as:

$$\mathbb{E}\left[(y - \hat{h}(x))^2\right] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
- Bias: Error from wrong assumptions (under-fitting)
- Variance: Error from sensitivity to training data fluctuations (overfitting)
- Irreducible Error: Noise in the data itself
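A small simulation makes the tradeoff visible: fit polynomials of low and high degree to many noisy samples of the same underlying function, then compare the bias and spread of their predictions at a single point. All choices here (the sine target, noise level, degrees) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, np.pi, 50)
x0 = np.pi / 2                      # evaluate predictions at one fixed point

stats = {}
for degree in (1, 9):
    preds = []
    for _ in range(200):            # 200 independent noisy training sets
        y_noisy = np.sin(x_grid) + rng.normal(0.0, 0.3, size=x_grid.size)
        coeffs = np.polyfit(x_grid, y_noisy, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    stats[degree] = (preds.mean() - np.sin(x0), preds.var())  # (bias, variance)

# Degree 1 underfits (large bias, small variance);
# degree 9 tracks the noise (small bias, larger variance).
```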
Regularization techniques help balance this tradeoff by adding a penalty term to the empirical risk:

$$\hat{R}_{\text{reg}}(\theta) = \hat{R}(\theta) + \lambda \, \Omega(\theta)$$

Common choices: $\Omega(\theta) = \|\theta\|_2^2$ (L2/Ridge) or $\Omega(\theta) = \|\theta\|_1$ (L1/Lasso).
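The L2 penalty keeps a closed form, $\theta = (X^\top X + \lambda I)^{-1} X^\top y$, while the L1 penalty has no closed form in general. A minimal ridge sketch (note that in practice the bias term is usually excluded from the penalty):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) theta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])      # consistent with y = x1 + 2*x2
print(ridge_fit(X, y, lam=0.0))    # → OLS solution [1.0, 2.0]
print(ridge_fit(X, y, lam=100.0))  # heavily shrunk toward zero
```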
## Comparison Table
| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk | Best For |
|---|---|---|---|---|---|
| Logistic Regression | High | Fast | Medium | Low | Baseline, linear problems |
| Decision Trees | High | Fast | Medium | High | Interpretability |
| Random Forest | Medium | Medium | High | Low | General purpose |
| Gradient Boosting | Low | Slow | Very High | Medium | Competitions, tabular data |
| Support Vector Machines | Low | Slow | High | Medium | High-dimensional |
| Naive Bayes | High | Very Fast | Medium | Low | Text, categorical data |
| KNN | High | Slow | Medium | High | Small datasets |
| Neural Networks | Very Low | Slow | Very High | High | Complex patterns, images |
## Related Concepts
Back to: ML & AI Index