Overview
Proximal Policy Optimization (PPO) is a policy gradient algorithm that has become one of the most popular and practical deep RL algorithms. Developed by OpenAI in 2017, PPO addresses a fundamental challenge: policy gradient methods can be unstable when updates are too large, but being overly conservative wastes samples.
PPO strikes a balance by using a “clipped” objective function that prevents the policy from changing too drastically in any single update. It achieves competitive performance with algorithms like TRPO (Trust Region Policy Optimization) while being significantly simpler to implement and tune.
PPO is the core RL algorithm used in RLHF for LLMs.
The Problem PPO Solves
Standard policy gradient methods (like REINFORCE) update the policy by taking gradient steps proportional to the return:

$$\nabla_\theta J(\theta) = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{R}_t \right]$$
The problem: how big should the step be?
- Too small steps → Slow learning, wasted samples
- Too large steps → Policy collapses (performance tanks and may never recover)
Policy performance is extremely sensitive to parameter changes. A seemingly small update can dramatically shift action probabilities, leading to:
- Collecting bad data with the broken policy
- Using that bad data to make further bad updates
- Catastrophic performance spiral
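As a toy illustration of this sensitivity (the two-action softmax policy and logit values here are invented for the example):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A modest logit shift on a near-deterministic policy multiplies the
# probability of the rarer action several times over, which changes
# what data the policy collects next.
before = softmax([4.0, 0.0])   # second action taken ~1.8% of the time
after = softmax([2.0, 0.0])    # second action now taken ~11.9% of the time
```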
Core Intuition: “Don’t Change Too Much”
PPO directly limits how much the policy's behavior can change, not just how much the parameters change. It does this by:
- Tracking the probability ratio: How much more/less likely is this action under the new policy vs the old?
- Clipping: If the ratio gets too far from 1.0, stop the gradient signal
When the advantage is positive ($\hat{A}_t > 0$, a good action): allow the ratio to increase up to $1 + \epsilon$, then clip. When the advantage is negative ($\hat{A}_t < 0$, a bad action): allow the ratio to decrease down to $1 - \epsilon$, then clip.
Mathematical Foundation
Policy Gradient Recap
The standard policy gradient objective is:

$$L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]$$
This has high variance and no mechanism to prevent large policy changes.
The Probability Ratio
PPO introduces the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
This measures how much the new policy differs from the old one for the specific action taken:
| $r_t(\theta)$ | Meaning |
|---|---|
| $r_t(\theta) = 1$ | Action equally likely under old and new policy |
| $r_t(\theta) > 1$ | New policy makes this action more likely |
| $r_t(\theta) < 1$ | New policy makes this action less likely |
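In implementations the ratio is usually computed from stored log-probabilities rather than raw probabilities, for numerical stability. A minimal sketch (the function name is illustrative):

```python
import math

def prob_ratio(logp_new, logp_old):
    """r_t(theta) = pi_new(a|s) / pi_old(a|s), from log-probs for stability."""
    return math.exp(logp_new - logp_old)

ratio_up = prob_ratio(math.log(0.6), math.log(0.3))    # more likely: ~2.0
ratio_same = prob_ratio(math.log(0.4), math.log(0.4))  # unchanged: 1.0
```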
PPO-Clip Objective
The core PPO objective (clipped version):

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon) \, \hat{A}_t \right) \right]$$
- $r_t(\theta) \hat{A}_t$: The unclipped objective (like standard policy gradient with importance sampling)
- $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t$: Constrains the ratio to $[1 - \epsilon, 1 + \epsilon]$
- $\min(\cdot, \cdot)$: Takes the more pessimistic (lower) estimate
Why the minimum?
The clipping is asymmetric depending on whether the advantage is positive or negative:
| Scenario | Advantage | Effect |
|---|---|---|
| Good action, ratio increases | $\hat{A}_t > 0$ | Benefit capped at $(1 + \epsilon)\hat{A}_t$ |
| Good action, ratio decreases | $\hat{A}_t > 0$ | No clipping (full penalty discourages the decrease) |
| Bad action, ratio decreases | $\hat{A}_t < 0$ | Benefit capped at $(1 - \epsilon)\hat{A}_t$ |
| Bad action, ratio increases | $\hat{A}_t < 0$ | No clipping (full penalty discourages the increase) |
Intuition: the min makes the objective a pessimistic lower bound. The policy gets no extra credit for pushing the ratio beyond the clip range, but any change that makes the objective worse is felt in full, so the gradient can always pull the policy back toward the old one.
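The four clipping cases can be checked directly with a few lines of Python (a sketch; `ppo_clip_term` is our own helper, using the common default $\epsilon = 0.2$):

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * adv
    return min(ratio * adv, clipped)

good_up = ppo_clip_term(1.5, 2.0)    # capped at (1 + eps) * A = 2.4
good_dn = ppo_clip_term(0.5, 2.0)    # unclipped: 0.5 * 2.0 = 1.0
bad_dn = ppo_clip_term(0.5, -2.0)    # capped at (1 - eps) * A = -1.6
bad_up = ppo_clip_term(1.5, -2.0)    # unclipped: 1.5 * -2.0 = -3.0
```

Note that in the two "capped" cases the gradient with respect to the ratio is zero, which is exactly how PPO stops pushing the policy further in that direction.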
The Full PPO Objective
In practice, PPO combines multiple objectives:

$$L_t(\theta) = \mathbb{E}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$
Where:
- $L_t^{CLIP}$: The clipped surrogate objective (policy improvement)
- $L_t^{VF} = (V_\phi(s_t) - V_t^{\text{targ}})^2$: Value function loss (critic training)
- $S[\pi_\theta](s_t)$: Entropy bonus (encourages exploration)
- $c_1, c_2$: Coefficients (typically $c_1 = 0.5$, $c_2 = 0.01$)
The entropy bonus prevents the policy from becoming too deterministic too quickly.
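Assembled over a batch, the combined objective might look like this (a NumPy sketch; the array arguments and names such as `returns` are ours, and real implementations compute entropy from the policy distribution):

```python
import numpy as np

def ppo_loss(ratio, adv, values, returns, entropy, eps=0.2, c1=0.5, c2=0.01):
    """Full PPO objective (to be MAXIMIZED), averaged over a batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    l_clip = np.minimum(unclipped, clipped).mean()  # clipped surrogate
    l_vf = np.mean((values - returns) ** 2)         # value (critic) MSE
    s = np.mean(entropy)                            # entropy bonus
    return l_clip - c1 * l_vf + c2 * s
```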
Generalized Advantage Estimation (GAE)
PPO typically uses GAE to compute advantages, which interpolates between high-bias/low-variance (TD) and low-bias/high-variance (Monte Carlo) estimates:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
The parameter $\lambda \in [0, 1]$ controls the trade-off:
- $\lambda = 0$: Pure TD (one-step), high bias, low variance
- $\lambda = 1$: Pure Monte Carlo, low bias, high variance
- $\lambda = 0.95$: Common default, good balance
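A minimal backward-recursion implementation of GAE (a sketch for a single trajectory; a real implementation also needs a `done` mask so bootstrapping stops at episode boundaries):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory.

    `values` must hold len(rewards) + 1 entries, V(s_0)..V(s_T): the last
    entry bootstraps the value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Work backwards so each step reuses the discounted sum from the future.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` this reduces to the one-step TD error; with `lam=1` it sums discounted TD errors, recovering the Monte Carlo estimate.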
PPO Algorithm
```
Initialize policy network πθ, value network Vφ
for iteration = 1, 2, ... do
    // COLLECT DATA
    for actor = 1, ..., N do
        Run policy πθ_old in environment for T timesteps
        Collect trajectories {s, a, r, s'}
    end for

    // COMPUTE ADVANTAGES
    Compute rewards-to-go R̂t
    Compute advantage estimates Ât using GAE

    // OPTIMIZE (multiple epochs on same data!)
    for epoch = 1, ..., K do
        for minibatch in shuffle(collected_data) do
            // Policy update
            Compute r(θ) = πθ(a|s) / πθ_old(a|s)
            L_CLIP = E[min(r(θ)Â, clip(r(θ), 1-ε, 1+ε)Â)]

            // Value update
            L_VF = E[(Vφ(s) - R̂)²]

            // Entropy bonus
            S = E[entropy(πθ)]

            // Combined update
            Maximize L_CLIP - c1·L_VF + c2·S
        end for
    end for

    θ_old ← θ
end for
```
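The optimization phase (the K-epoch minibatch loop above) can be sketched independently of any particular framework; here `update_fn` stands in for one gradient step on a minibatch:

```python
import random

def ppo_optimize(data, update_fn, num_epochs=10, minibatch_size=64):
    """Reuse one batch of rollout data for several epochs of minibatch updates."""
    indices = list(range(len(data)))
    for _ in range(num_epochs):
        random.shuffle(indices)  # fresh minibatch composition every epoch
        for start in range(0, len(indices), minibatch_size):
            minibatch = [data[i] for i in indices[start:start + minibatch_size]]
            update_fn(minibatch)  # one gradient step on this minibatch
```

A real implementation recomputes log-probs and values with the current θ on each minibatch while keeping the stored πθ_old log-probs fixed for the ratio.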
Practical Application
When to Use PPO
Good fit:
- Continuous or discrete action spaces
- When sample efficiency isn’t critical but stability is
- When you want a reliable, well-understood baseline
- Multi-task and transfer learning scenarios
- RLHF for language models
Not ideal for:
- Extremely sample-constrained settings
- Very simple problems where vanilla policy gradient suffices
- When you need guaranteed convergence properties (PPO is empirically stable but has fewer theoretical guarantees than TRPO)
Common Pitfalls
- Learning rate too high: Policy can still destabilize despite clipping. Start conservative.
- Not normalizing advantages: Advantages should be standardized (zero mean, unit variance) per batch for stable training.
- Value function divergence: If the critic is poorly trained, advantage estimates become unreliable. Ensure adequate value function training.
- Too few epochs: PPO's power comes from reusing data. Too few epochs waste samples.
- Too many epochs: Overfitting to old data. The clipping helps, but there's still a limit. 10 epochs is usually the upper bound.
- Reward scaling issues: Very large or very small rewards can cause numerical problems. Normalize if needed.
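Per-batch advantage normalization is a one-liner; a NumPy sketch (the epsilon in the denominator guards against zero variance):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize advantages to zero mean, unit variance per batch."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```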
Resources
Papers
- Proximal Policy Optimization Algorithms (Original Paper)
- Trust Region Policy Optimization (TRPO)
- High-Dimensional Continuous Control Using GAE
- Emergent Complexity from Multi-Agent Competition (OpenAI)
Articles & Tutorials
- OpenAI Spinning Up - PPO
- Lilian Weng - Policy Gradient Algorithms
- The 37 Implementation Details of PPO
- HuggingFace Deep RL Course - PPO
- Arxiv Insights - PPO Explained
- Pieter Abbeel - Policy Gradients and PPO
- Stable Baselines3 - PPO
- CleanRL - PPO Implementation
- SpinningUp - PPO Implementation