Decision Trees
Overview
Decision Trees are a fundamental Supervised Learning algorithm that learns a hierarchy of if-then rules to partition the feature space into regions, each assigned a prediction. They can handle both classification and regression tasks. The model structure resembles an inverted tree, where internal nodes represent feature tests, branches represent outcomes, and leaf nodes contain predictions.
The key appeal of decision trees lies in their interpretability: you can trace exactly why a prediction was made by following the path from root to leaf. This makes them invaluable when model explainability is a requirement.
Core Mental Model
Decision Trees are like a flowchart for making predictions. At each step, you ask a yes/no question about a feature, and based on the answer, you proceed down the appropriate branch until you reach a final decision.
Example: predicting whether someone will buy a product (a small illustrative tree; the features here are hypothetical):

    Income > $50k?
    ├── No  → Will not buy
    └── Yes → Visited site before?
              ├── No  → Will not buy
              └── Yes → Will buy
Recursive Binary Splitting
The algorithm builds the tree top-down through recursive partitioning:
1. Start with all training data at the root
2. Find the best feature and threshold to split the data (the split that minimizes impurity)
3. Partition the data into left and right child nodes
4. Recursively repeat steps 2-3 for each child
5. Stop when a stopping criterion is met (max depth, min samples, pure node)
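The procedure above can be sketched in plain Python. This is a minimal, illustrative implementation for a single numeric feature and binary 0/1 labels using Gini impurity; all names are invented for the sketch, not a library API:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(xs, ys, depth=0, max_depth=3):
    """Recursively split (xs, ys) on the threshold that minimizes
    weighted child impurity; the tree is returned as nested dicts."""
    # Stopping criteria: pure node or depth limit reached
    if len(set(ys)) == 1 or depth == max_depth:
        return {"leaf": round(sum(ys) / len(ys))}  # majority class
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        # Weighted impurity of the two children
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or w < best[0]:
            best = (w, t, left, right)
    # No split improves on the parent's impurity -> make a leaf
    if best is None or best[0] >= gini(ys):
        return {"leaf": round(sum(ys) / len(ys))}
    _, t, left, right = best
    lx = [x for x in xs if x <= t]
    rx = [x for x in xs if x > t]
    return {"threshold": t,
            "left": build_tree(lx, left, depth + 1, max_depth),
            "right": build_tree(rx, right, depth + 1, max_depth)}

tree = build_tree([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
# A single split at x <= 3 perfectly separates the two classes
```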
Feature Space Partitioning
Decision trees create axis-aligned decision boundaries. Each split divides the feature space with a line perpendicular to one axis.
This axis-aligned nature is both a strength (easy to interpret) and a limitation (cannot capture diagonal boundaries efficiently).
Mathematical Foundation
Measuring Impurity
The core question at each node is: “Which split best separates the classes?” We need a metric to quantify how “mixed” or “impure” a node is.
Let $p_k$ denote the proportion of class $k$ samples at a node. For a binary classification problem, let $p$ be the proportion of the positive class.
Gini Impurity
For binary classification:

$$G = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$$
Interpretation: Gini impurity measures the probability of misclassifying a randomly chosen sample if it were labeled randomly according to the class distribution at that node.
- Gini = 0: Pure node (all samples belong to one class)
- Gini = 0.5 (binary): Maximum impurity (50-50 split)
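A quick numeric check of this definition (a minimal sketch):

```python
def gini(p):
    """Gini impurity for a binary node with positive-class proportion p."""
    return 1 - p**2 - (1 - p)**2  # algebraically equal to 2*p*(1-p)

print(gini(0.0))  # pure node -> 0.0
print(gini(0.5))  # 50-50 split -> 0.5, the binary maximum
```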
Entropy (Information Gain)
For binary classification:

$$H = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
Interpretation: Entropy measures the uncertainty or disorder in the class distribution. It originates from information theory, representing the expected number of bits needed to encode class membership.
- Entropy = 0: Pure node
- Entropy = 1 (binary): Maximum uncertainty
Information Gain
When evaluating a split on feature $f$ at threshold $\tau$, we compute the reduction in impurity:

$$\Delta I = I(P) - \frac{n_L}{n_P} I(L) - \frac{n_R}{n_P} I(R)$$

where:
- $I$ is the impurity function (Gini or Entropy)
- $P$ is the parent node
- $L$, $R$ are the left and right child nodes
- $n_m$ denotes the number of samples at node $m$

The algorithm greedily selects the split that maximizes this gain:

$$(f^*, \tau^*) = \arg\max_{f,\, \tau} \Delta I(f, \tau)$$
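The gain computation follows directly from the weighted-impurity formula. A small sketch using entropy as the impurity measure (the helper names are illustrative, not a library function):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` + `right`."""
    n = len(parent)
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(information_gain(parent, [0, 0, 0], [1, 1, 1]))  # perfect split -> 1.0
print(information_gain(parent, [0, 0, 1], [0, 1, 1]))  # weak split -> ~0.082
```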
Gini vs Entropy: Does it Matter?
In practice, both criteria yield very similar trees. Gini is slightly faster to compute (no logarithm). The choice rarely affects model performance significantly.
Regression Trees: Variance Reduction
For regression tasks, we cannot use class proportions. Instead, we minimize the mean squared error within each node:

$$\text{MSE}(m) = \frac{1}{n_m} \sum_{i \in m} (y_i - \bar{y}_m)^2$$

where $\bar{y}_m$ is the mean target value at node $m$.

The prediction for a leaf node is simply the mean of the target values: $\hat{y} = \bar{y}_m$.

The split criterion becomes variance reduction:

$$\Delta = \text{MSE}(P) - \frac{n_L}{n_P} \text{MSE}(L) - \frac{n_R}{n_P} \text{MSE}(R)$$
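The same idea in a few lines of Python (an illustrative sketch; the data is made up to show a split that removes almost all within-node variance):

```python
def mse(values):
    """Mean squared error around the node mean (i.e. the node's variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Drop in weighted MSE achieved by splitting `parent` into two children."""
    n = len(parent)
    return mse(parent) - (len(left) * mse(left) + len(right) * mse(right)) / n

parent = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
# Separating the low cluster from the high cluster removes nearly all variance
print(variance_reduction(parent, [1.0, 1.2, 0.8], [5.0, 5.2, 4.8]))
```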
Algorithm Details
CART (Classification and Regression Trees)
The most common algorithm, used by scikit-learn:
function BuildTree(D, depth):
    if stopping_condition(D, depth):
        return LeafNode(majority_class(D) or mean(D))
    best_gain = 0
    for each feature f in features:
        for each threshold τ in unique_values(D[f]):
            gain = compute_gain(D, f, τ)
            if gain > best_gain:
                best_gain = gain
                best_f, best_τ = f, τ
    if best_gain == 0:  # no split improves impurity
        return LeafNode(majority_class(D) or mean(D))
    D_left  = {x ∈ D : x[best_f] <= best_τ}
    D_right = {x ∈ D : x[best_f] > best_τ}
    return InternalNode(
        feature = best_f,
        threshold = best_τ,
        left = BuildTree(D_left, depth + 1),
        right = BuildTree(D_right, depth + 1)
    )
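Once built, prediction is a walk from root to leaf. A sketch matching the pseudocode's node structure (the dict layout and feature names are illustrative):

```python
def predict(node, x):
    """Follow feature tests from the root until a leaf is reached."""
    while "leaf" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# A hand-built two-level tree: income > 50, then age > 30 (hypothetical)
tree = {
    "feature": "income", "threshold": 50,
    "left": {"leaf": "no_buy"},
    "right": {"feature": "age", "threshold": 30,
              "left": {"leaf": "no_buy"},
              "right": {"leaf": "buy"}},
}
print(predict(tree, {"income": 80, "age": 45}))  # buy
print(predict(tree, {"income": 30, "age": 45}))  # no_buy
```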
Handling Categorical Features
Two approaches:
- One-hot encoding: Convert to binary features (standard approach)
- Native categorical splits: Some implementations (LightGBM) support direct categorical splits
Missing Values
Different strategies:
- Surrogate splits: Find alternative features that produce similar splits
- Separate branch: Create a third branch for missing values
- Imputation: Fill missing values before training
Practical Application
When to Use Decision Trees
- Interpretability is critical: Medical diagnosis, loan approval, legal decisions
- Feature interactions matter: Trees naturally capture interactions without explicit feature engineering
- Mixed feature types: Handles numerical and categorical features naturally
- Quick baseline: Fast to train, provides a reasonable starting point
- Non-linear relationships: Can model complex non-linear patterns
- Exploratory analysis: Reveals important features and their thresholds
When NOT to Use
- Smooth decision boundaries needed: Trees create jagged, axis-aligned boundaries
- High-dimensional sparse data: Struggle with many irrelevant features (consider Random Forests)
- Extrapolation required: Trees cannot predict outside the range of training data
- Stability is important: Small data changes can produce very different trees
- Maximum accuracy needed: Ensemble methods like Gradient Boosting or Random Forests almost always outperform single trees
Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Overfitting | Deep trees memorize training data | Limit depth, prune, use min_samples_leaf |
| High variance | Small data changes yield different trees | Use Ensemble Methods |
| Assuming feature scaling is needed | Trees are invariant to monotonic feature scaling | No scaling required (unlike Support Vector Machines) |
| Imbalanced classes | Trees bias toward majority class | Use class_weight, balanced sampling |
| Ignoring feature importance | Missing insights from the model | Always check feature_importances_ |
Key Hyperparameters
| Parameter | Effect | Typical Range |
|---|---|---|
| max_depth | Controls tree complexity | 3-20 (None for unlimited) |
| min_samples_split | Minimum samples to split a node | 2-20 |
| min_samples_leaf | Minimum samples in a leaf | 1-10 |
| max_features | Features considered per split | sqrt, log2, or fraction |
| criterion | Impurity measure | gini, entropy (classification) |
| ccp_alpha | Cost-complexity pruning parameter | 0.0-0.1 |
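As a quick illustration with scikit-learn (assuming scikit-learn is installed; the hyperparameter values below are arbitrary examples, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A deliberately constrained tree: shallow, with a minimum leaf size
clf = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    criterion="gini",
    random_state=0,
)
clf.fit(X, y)

# Even this small tree fits the iris training set well
print(clf.get_depth(), clf.get_n_leaves(), clf.score(X, y))
```

In practice these values would be chosen by cross-validation rather than fixed up front.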
Pruning
Pruning prevents overfitting by removing branches that provide little predictive power:
Pre-pruning (early stopping):
- Set max_depth, min_samples_split, min_samples_leaf before training
Post-pruning (cost-complexity pruning):
- Grow full tree, then prune branches that minimally increase training error
- Controlled by ccp_alpha: higher values = more pruning
The cost-complexity criterion for the subtree $T_t$ rooted at node $t$:

$$R_\alpha(T_t) = R(T_t) + \alpha |T_t|$$

where $R(T_t)$ is the subtree's error and $|T_t|$ is the number of leaves. We prune if:

$$R_\alpha(t) \le R_\alpha(T_t)$$

i.e. if collapsing the subtree to a single leaf does not increase the penalized error.
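In scikit-learn, the effective alphas can be inspected before choosing a value (assuming scikit-learn is installed; the ccp_alpha value here is an arbitrary example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# cost_complexity_pruning_path lists the effective alphas at which
# successive weakest subtrees would be pruned away
path = full.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# Higher alpha -> fewer leaves
print(full.get_n_leaves(), pruned.get_n_leaves())
```

Typically one cross-validates over `path.ccp_alphas` to pick the alpha with the best held-out score.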
Computational Complexity
- Training: $O(n \cdot d \cdot \log n)$, where $n$ = samples, $d$ = features
- Prediction: $O(\log n)$ per sample (proportional to tree depth)
- Memory: $O(\text{number of nodes})$, typically much smaller than the dataset
Feature Importance
Decision trees provide built-in feature importance scores based on the total impurity decrease:

$$\text{Importance}(f) = \sum_{t \,:\, t \text{ splits on } f} \frac{n_t}{n} \, \Delta I_t$$

where the sum is over all nodes $t$ that split on feature $f$, and $n$ is the total number of training samples.
Caveat: This can be biased toward high-cardinality features. Consider permutation importance for more reliable estimates.
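In scikit-learn these scores are exposed as `feature_importances_` (a sketch assuming scikit-learn is installed; `sklearn.inspection.permutation_importance` provides the permutation-based alternative mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Importances sum to 1; for iris the petal measurements dominate
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```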
Comparisons
| Aspect | Decision Tree | Random Forests | Gradient Boosting |
|---|---|---|---|
| Bias | Low | Low | Low |
| Variance | High | Low | Low |
| Interpretability | High | Medium | Low |
| Training Speed | Fast | Medium | Slow |
| Overfitting Risk | High | Low | Medium |
| Hyperparameter Sensitivity | Medium | Low | High |
| Handles Missing Values | Some implementations | Some implementations | Yes (LightGBM, XGBoost) |
Decision Tree vs Logistic Regression
| Criteria | Decision Tree | Logistic Regression |
|---|---|---|
| Decision boundary | Non-linear, axis-aligned | Linear |
| Interpretability | Rule-based | Coefficient-based |
| Feature interactions | Automatic | Must specify manually |
| Probability calibration | Often poor | Generally good |
| Overfitting tendency | High | Low |
Related Concepts
- Random Forests: Ensemble of decision trees with bagging
- Gradient Boosting: Sequential ensemble that corrects errors
- Ensemble Methods: General framework for combining models
- Overfitting and Underfitting: Key concern for decision trees
- Bias-Variance Tradeoff: Trees have low bias but high variance
- Feature Engineering Principles: Trees reduce need for feature engineering
Resources
Papers
- Classification and Regression Trees (Breiman et al., 1984)
- C4.5: Programs for Machine Learning (Quinlan, 1993)