Decision Trees

Overview

Decision Trees are a fundamental Supervised Learning algorithm that learns a hierarchy of if-then rules to partition the feature space into regions, each assigned a prediction. They can handle both classification and regression tasks. The model structure resembles an inverted tree, where internal nodes represent feature tests, branches represent outcomes, and leaf nodes contain predictions.

The key appeal of decision trees lies in their interpretability: you can trace exactly why a prediction was made by following the path from root to leaf. This makes them invaluable when model explainability is a requirement.

Core Mental Model

Decision Trees are like a flowchart for making predictions. At each step, you ask a yes/no question about a feature, and based on the answer, you proceed down the appropriate branch until you reach a final decision.

Example: predicting whether someone will buy a product (illustrative features and thresholds):

    Income > $50k?
    ├── No  → Predict: won't buy
    └── Yes → Age < 40?
              ├── Yes → Predict: buys
              └── No  → Predict: won't buy

Recursive Binary Splitting

The algorithm builds the tree top-down through recursive partitioning:

  1. Start with all training data at the root
  2. Find the best feature and threshold to split the data (minimizes impurity)
  3. Partition data into left and right child nodes
  4. Recursively repeat steps 2-3 for each child
  5. Stop when a stopping criterion is met (max depth, min samples, pure node)

Feature Space Partitioning

Decision trees create axis-aligned decision boundaries. Each split divides the feature space with a line perpendicular to one axis.

This axis-aligned nature is both a strength (easy to interpret) and a limitation (cannot capture diagonal boundaries efficiently).


Mathematical Foundation

Measuring Impurity

The core question at each node is: “Which split best separates the classes?” We need a metric to quantify how “mixed” or “impure” a node is.

Let $p_k$ denote the proportion of class-$k$ samples at a node. For a binary classification problem, let $p$ be the proportion of the positive class.

Gini Impurity

For binary classification:

$$\text{Gini} = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$$

Interpretation: Gini impurity measures the probability of misclassifying a randomly chosen sample if it were labeled randomly according to the class distribution at that node.

  • Gini = 0: Pure node (all samples belong to one class)
  • Gini = 0.5 (binary): Maximum impurity (50-50 split)
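These two boundary values can be checked with a minimal Gini computation (plain Python; the `gini` helper is illustrative, not a library function):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["buy", "buy", "buy"]))       # pure node -> 0.0
print(gini(["buy", "no", "buy", "no"]))  # 50-50 split -> 0.5
```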

Entropy (Information Gain)

For binary classification:

$$H = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

Interpretation: Entropy measures the uncertainty or disorder in the class distribution. It originates from information theory, representing the expected number of bits needed to encode class membership.

  • Entropy = 0: Pure node
  • Entropy = 1 (binary): Maximum uncertainty
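The same two boundary values hold for entropy; a small self-contained sketch (the `entropy` helper is illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy([1, 1, 1, 1]))  # pure node
print(entropy([1, 0, 1, 0]))  # maximum uncertainty -> 1.0
```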

Information Gain

When evaluating a split on feature $f$ at threshold $\tau$, we compute the reduction in impurity:

$$\Delta I = I(P) - \frac{|L|}{|P|}\, I(L) - \frac{|R|}{|P|}\, I(R)$$

where:

  • $I$ is the impurity function (Gini or Entropy)
  • $P$ is the parent node
  • $L$, $R$ are the left and right child nodes
  • $|\cdot|$ denotes the number of samples at a node

The algorithm greedily selects the split that maximizes this gain:

$$(f^*, \tau^*) = \arg\max_{f,\, \tau} \Delta I(f, \tau)$$
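The gain computation can be checked numerically with a short sketch (plain Python, Gini as the impurity; the data is illustrative):

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, left, right):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

parent = [1, 1, 0, 0]
print(information_gain(parent, [1, 1], [0, 0]))  # perfect split -> 0.5
print(information_gain(parent, [1, 0], [1, 0]))  # useless split -> 0.0
```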

Gini vs Entropy: Does it Matter?

In practice, both criteria yield very similar trees. Gini is slightly faster to compute (no logarithm). The choice rarely affects model performance significantly.

Regression Trees: Variance Reduction

For regression tasks, we cannot use class proportions. Instead, we minimize the mean squared error within each node:

$$\text{MSE}(N) = \frac{1}{|N|} \sum_{i \in N} (y_i - \bar{y}_N)^2$$

where $\bar{y}_N$ is the mean target value at node $N$.

The prediction for a leaf node is simply the mean of the target values: $\hat{y} = \bar{y}_N$.

The split criterion becomes variance reduction:

$$\Delta = \text{MSE}(P) - \frac{|L|}{|P|}\,\text{MSE}(P_L) - \frac{|R|}{|P|}\,\text{MSE}(P_R)$$
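The regression criterion works the same way as information gain, with node variance in place of impurity; a minimal sketch on illustrative data:

```python
def node_mse(values):
    """Mean squared error around the node mean (i.e. the node's variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    n = len(parent)
    return (node_mse(parent)
            - len(left) / n * node_mse(left)
            - len(right) / n * node_mse(right))

# Splitting [1, 1, 5, 5] into [1, 1] and [5, 5] removes all variance:
print(variance_reduction([1, 1, 5, 5], [1, 1], [5, 5]))  # -> 4.0
```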


Algorithm Details

CART (Classification and Regression Trees)

The most common algorithm, used by scikit-learn:

function BuildTree(D, depth):
    if stopping_condition(D, depth):
        return LeafNode(majority_class(D) or mean(D))

    best_gain = 0
    for each feature f in features:
        for each threshold τ in unique_values(D[f]):
            gain = compute_gain(D, f, τ)
            if gain > best_gain:
                best_gain = gain
                best_f, best_τ = f, τ

    if best_gain == 0:  # no split reduces impurity -> make a leaf
        return LeafNode(majority_class(D) or mean(D))

    D_left = {x ∈ D : x[best_f] <= best_τ}
    D_right = {x ∈ D : x[best_f] > best_τ}

    return InternalNode(
        feature = best_f,
        threshold = best_τ,
        left = BuildTree(D_left, depth + 1),
        right = BuildTree(D_right, depth + 1)
    )
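The pseudocode above translates almost line for line into runnable Python. This is a deliberately minimal sketch (classification only, majority-class leaves, nested dicts for nodes), not how scikit-learn implements CART internally:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    # Stopping condition: depth limit, too few samples, or pure node
    if depth >= max_depth or len(y) < min_samples or len(set(y)) == 1:
        return {"leaf": max(set(y), key=y.count)}

    best, best_gain = None, 0.0
    parent_impurity = gini(y)
    for f in range(len(X[0])):                      # every feature...
        for tau in sorted({row[f] for row in X}):   # ...and every threshold
            left = [i for i, row in enumerate(X) if row[f] <= tau]
            right = [i for i, row in enumerate(X) if row[f] > tau]
            if not left or not right:
                continue
            gain = (parent_impurity
                    - len(left) / len(y) * gini([y[i] for i in left])
                    - len(right) / len(y) * gini([y[i] for i in right]))
            if gain > best_gain:
                best_gain, best = gain, (f, tau, left, right)

    if best is None:  # no split reduces impurity -> leaf
        return {"leaf": max(set(y), key=y.count)}

    f, tau, left, right = best
    return {
        "feature": f,
        "threshold": tau,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }

def predict(node, x):
    # Walk from root to leaf, following the threshold tests
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

tree = build_tree([[1], [2], [8], [9]], [0, 0, 1, 1])
print(predict(tree, [1.5]), predict(tree, [8.5]))  # -> 0 1
```

On this toy data the root split lands at threshold 2 (the split with maximum gain), leaving two pure leaves.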

Handling Categorical Features

Two approaches:

  1. One-hot encoding: Convert to binary features (standard approach)
  2. Native categorical splits: Some implementations (LightGBM) support direct categorical splits

Missing Values

Different strategies:

  • Surrogate splits: Find alternative features that produce similar splits
  • Separate branch: Create a third branch for missing values
  • Imputation: Fill missing values before training
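The imputation route can be sketched with scikit-learn (assuming scikit-learn is available; mean imputation shown, and note that newer scikit-learn releases also let the tree estimators accept NaN directly):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Fill each missing entry with its column mean before training
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
clf = DecisionTreeClassifier(random_state=0).fit(X_filled, y)
print(X_filled[1, 0])  # column-0 mean of the observed values: (1 + 7 + 8) / 3
```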

Practical Application

When to Use Decision Trees

  • Interpretability is critical: Medical diagnosis, loan approval, legal decisions
  • Feature interactions matter: Trees naturally capture interactions without explicit feature engineering
  • Mixed feature types: Handles numerical and categorical features naturally
  • Quick baseline: Fast to train, provides a reasonable starting point
  • Non-linear relationships: Can model complex non-linear patterns
  • Exploratory analysis: Reveals important features and their thresholds

When NOT to Use

  • Smooth decision boundaries needed: Trees create jagged, axis-aligned boundaries
  • High-dimensional sparse data: Struggle with many irrelevant features (consider Random Forests)
  • Extrapolation required: Trees cannot predict outside the range of training data
  • Stability is important: Small data changes can produce very different trees
  • Maximum accuracy needed: Ensemble methods like Gradient Boosting or Random Forests almost always outperform single trees

Common Pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Overfitting | Deep trees memorize training data | Limit depth, prune, use min_samples_leaf |
| High variance | Small data changes yield different trees | Use Ensemble Methods |
| Feature scale sensitivity | Not actually a problem: trees are invariant to feature scaling | No scaling needed (unlike Support Vector Machines) |
| Imbalanced classes | Trees bias toward the majority class | Use class_weight, balanced sampling |
| Ignoring feature importance | Missing insights from the model | Always check feature_importances_ |

Key Hyperparameters

| Parameter | Effect | Typical Range |
| --- | --- | --- |
| max_depth | Controls tree complexity | 3-20 (None for unlimited) |
| min_samples_split | Minimum samples to split a node | 2-20 |
| min_samples_leaf | Minimum samples in a leaf | 1-10 |
| max_features | Features considered per split | sqrt, log2, or fraction |
| criterion | Impurity measure | gini, entropy (classification) |
| ccp_alpha | Cost-complexity pruning parameter | 0.0-0.1 |
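A hedged scikit-learn sketch of setting these knobs (synthetic data; the specific values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,           # cap tree complexity
    min_samples_split=10,  # need at least 10 samples to consider a split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    criterion="gini",      # impurity measure
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.get_depth(), clf.get_n_leaves(), clf.score(X_test, y_test))
```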

Pruning

Pruning prevents overfitting by removing branches that provide little predictive power:

Pre-pruning (early stopping):

  • Set max_depth, min_samples_split, min_samples_leaf before training

Post-pruning (cost-complexity pruning):

  • Grow full tree, then prune branches that minimally increase training error
  • Controlled by ccp_alpha: higher values = more pruning
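The post-pruning path can be inspected with scikit-learn's cost_complexity_pruning_path (assuming scikit-learn is available; synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Effective alphas at which successive subtrees get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit at each alpha: larger ccp_alpha -> smaller (more heavily pruned) tree
leaves = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
          for a in path.ccp_alphas]
print(leaves[0], leaves[-1])  # unpruned leaf count, then 1 (pruned to the root)
```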

The cost-complexity criterion for the subtree $T_t$ rooted at node $t$:

$$R_\alpha(T_t) = R(T_t) + \alpha\,|T_t|$$

where $R(T_t)$ is the subtree's error and $|T_t|$ is its number of leaves. We prune $T_t$, collapsing it to the single node $t$, if the node alone scores no worse:

$$R(t) + \alpha \le R(T_t) + \alpha\,|T_t|$$
Computational Complexity

  • Training: $O(n \cdot d \cdot \log n)$, where $n$ = number of samples, $d$ = number of features
  • Prediction: $O(\text{depth})$ per sample (tree depth)
  • Memory: $O(\text{number of nodes})$, typically much smaller than the dataset

Feature Importance

Decision trees provide built-in feature importance scores based on the total impurity decrease:

$$\text{Importance}(f) = \sum_{t \,:\, t \text{ splits on } f} \frac{n_t}{n}\, \Delta I_t$$

where the sum is over all nodes $t$ that split on feature $f$, $n_t$ is the number of samples reaching node $t$, and $n$ is the total number of training samples.

Caveat: This can be biased toward high-cardinality features. Consider permutation importance for more reliable estimates.
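Both scores are one call away in scikit-learn (assuming it is available; synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances: normalized to sum to 1 across features
print(clf.feature_importances_)

# Permutation importance: accuracy drop when one feature's column is shuffled
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```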

Comparisons

| Aspect | Decision Tree | Random Forests | Gradient Boosting |
| --- | --- | --- | --- |
| Bias | Low | Low | Low |
| Variance | High | Low | Low |
| Interpretability | High | Medium | Low |
| Training Speed | Fast | Medium | Slow |
| Overfitting Risk | High | Low | Medium |
| Hyperparameter Sensitivity | Medium | Low | High |
| Handles Missing Values | Some implementations | Yes | Yes |

Decision Tree vs Logistic Regression

| Criteria | Decision Tree | Logistic Regression |
| --- | --- | --- |
| Decision boundary | Non-linear, axis-aligned | Linear |
| Interpretability | Rule-based | Coefficient-based |
| Feature interactions | Automatic | Must specify manually |
| Probability calibration | Often poor | Generally good |
| Overfitting tendency | High | Low |



Back to: 01 - Core Fundamentals Index