Overview
Proximal Policy Optimization (PPO) is a policy gradient algorithm that has become one of the most popular and practical deep RL algorithms. Developed by OpenAI in 2017, PPO addresses a fundamental challenge: policy gradient methods can be unstable when updates are too large, but being overly conservative wastes samples.
PPO strikes a balance by using a “clipped” objective function that prevents the policy from changing too drastically in any single update. It achieves competitive performance with algorithms like TRPO (Trust Region Policy Optimization) while being significantly simpler to implement and tune.
PPO is the core RL algorithm used in RLHF for LLMs.
The Problem PPO Solves
Standard policy gradient methods (like REINFORCE) update the policy by taking gradient steps proportional to the return:

$$\nabla_\theta J(\theta) = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{R}_t \right]$$
The problem: how big should the step be?
- Too small steps → Slow learning, wasted samples
- Too large steps → Policy collapses (performance tanks and may never recover)
Policy performance is extremely sensitive to parameter changes. A seemingly small update can dramatically shift action probabilities, leading to:
- Collecting bad data with the broken policy
- Using that bad data to make further bad updates
- Catastrophic performance spiral
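As a toy illustration of this sensitivity (the two-action softmax policy and logit values here are invented for the example):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A modest logit shift on a near-deterministic policy multiplies the
# probability of the rarer action several times over, which changes
# what data the policy collects next.
before = softmax([4.0, 0.0])   # second action taken ~1.8% of the time
after = softmax([2.0, 0.0])    # second action now taken ~11.9% of the time
```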
Core Intuition: “Don’t Change Too Much”
PPO directly limits how much the policy's behavior can change, not just how much the parameters change. It does this by:
- Tracking the probability ratio: How much more/less likely is this action under the new policy vs the old?
- Clipping: If the ratio gets too far from 1.0, stop the gradient signal
When the advantage is positive ($\hat{A}_t > 0$, a good action): allow the ratio to increase up to $1 + \epsilon$, then clip. When the advantage is negative ($\hat{A}_t < 0$, a bad action): allow the ratio to decrease down to $1 - \epsilon$, then clip.
Mathematical Foundation
Policy Gradient Recap
The standard policy gradient objective is:

$$L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]$$
This has high variance and no mechanism to prevent large policy changes.
The Probability Ratio
PPO introduces the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
This measures how much the new policy differs from the old one for the specific action taken:
| $r_t(\theta)$ | Meaning |
|---|---|
| $r_t(\theta) = 1$ | Action equally likely under old and new policy |
| $r_t(\theta) > 1$ | New policy makes this action more likely |
| $r_t(\theta) < 1$ | New policy makes this action less likely |
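In implementations the ratio is usually computed from stored log-probabilities rather than raw probabilities, for numerical stability. A minimal sketch (the function name is illustrative):

```python
import math

def prob_ratio(logp_new, logp_old):
    """r_t(theta) = pi_new(a|s) / pi_old(a|s), from log-probs for stability."""
    return math.exp(logp_new - logp_old)

ratio_up = prob_ratio(math.log(0.6), math.log(0.3))    # more likely: ~2.0
ratio_same = prob_ratio(math.log(0.4), math.log(0.4))  # unchanged: 1.0
```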
PPO-Clip Objective
The core PPO objective (clipped version):

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon) \, \hat{A}_t \right) \right]$$
- $r_t(\theta) \hat{A}_t$: The unclipped objective (like standard policy gradient with importance sampling)
- $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t$: Constrains the ratio to $[1 - \epsilon, 1 + \epsilon]$
- $\min(\cdot, \cdot)$: Takes the more pessimistic (lower) estimate
Why the minimum?
The clipping is asymmetric depending on whether the advantage is positive or negative:
| Scenario | Advantage | Effect |
|---|---|---|
| Good action, ratio increases | $\hat{A}_t > 0$ | Benefit capped at $(1 + \epsilon)\hat{A}_t$ |
| Good action, ratio decreases | $\hat{A}_t > 0$ | No clipping (full penalty discourages the decrease) |
| Bad action, ratio decreases | $\hat{A}_t < 0$ | Benefit capped at $(1 - \epsilon)\hat{A}_t$ |
| Bad action, ratio increases | $\hat{A}_t < 0$ | No clipping (full penalty discourages the increase) |
Intuition: the min makes the objective a pessimistic lower bound. The policy gets no extra credit for pushing the ratio beyond the clip range, but any change that makes the objective worse is felt in full, so the gradient can always pull the policy back toward the old one.
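The four clipping cases can be checked directly with a few lines of Python (a sketch; `ppo_clip_term` is our own helper, using the common default $\epsilon = 0.2$):

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * adv
    return min(ratio * adv, clipped)

good_up = ppo_clip_term(1.5, 2.0)    # capped at (1 + eps) * A = 2.4
good_dn = ppo_clip_term(0.5, 2.0)    # unclipped: 0.5 * 2.0 = 1.0
bad_dn = ppo_clip_term(0.5, -2.0)    # capped at (1 - eps) * A = -1.6
bad_up = ppo_clip_term(1.5, -2.0)    # unclipped: 1.5 * -2.0 = -3.0
```

Note that in the two "capped" cases the gradient with respect to the ratio is zero, which is exactly how PPO stops pushing the policy further in that direction.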
The Full PPO Objective
In practice, PPO combines multiple objectives:

$$L_t(\theta) = \mathbb{E}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$
Where:
- $L_t^{CLIP}$: The clipped surrogate objective (policy improvement)
- $L_t^{VF} = (V_\phi(s_t) - V_t^{\text{targ}})^2$: Value function loss (critic training)
- $S[\pi_\theta](s_t)$: Entropy bonus (encourages exploration)
- $c_1, c_2$: Coefficients (typically $c_1 = 0.5$, $c_2 = 0.01$)
The entropy bonus prevents the policy from becoming too deterministic too quickly.
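Assembled over a batch, the combined objective might look like this (a NumPy sketch; the array arguments and names such as `returns` are ours, and real implementations compute entropy from the policy distribution):

```python
import numpy as np

def ppo_loss(ratio, adv, values, returns, entropy, eps=0.2, c1=0.5, c2=0.01):
    """Full PPO objective (to be MAXIMIZED), averaged over a batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    l_clip = np.minimum(unclipped, clipped).mean()  # clipped surrogate
    l_vf = np.mean((values - returns) ** 2)         # value (critic) MSE
    s = np.mean(entropy)                            # entropy bonus
    return l_clip - c1 * l_vf + c2 * s
```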
Generalized Advantage Estimation (GAE)
PPO typically uses GAE to compute advantages, which interpolates between high-bias/low-variance (TD) and low-bias/high-variance (Monte Carlo) estimates:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
The parameter $\lambda \in [0, 1]$ controls the trade-off:
- $\lambda = 0$: Pure TD (one-step), high bias, low variance
- $\lambda = 1$: Pure Monte Carlo, low bias, high variance
- $\lambda = 0.95$: Common default, good balance
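A minimal backward-recursion implementation of GAE (a sketch for a single trajectory; a real implementation also needs a `done` mask so bootstrapping stops at episode boundaries):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory.

    `values` must hold len(rewards) + 1 entries, V(s_0)..V(s_T): the last
    entry bootstraps the value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Work backwards so each step reuses the discounted sum from the future.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` this reduces to the one-step TD error; with `lam=1` it sums discounted TD errors, recovering the Monte Carlo estimate.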
PPO Algorithm
```
Initialize policy network πθ, value network Vφ
for iteration = 1, 2, ... do
    // COLLECT DATA
    for actor = 1, ..., N do
        Run policy πθ_old in environment for T timesteps
        Collect trajectories {s, a, r, s'}
    end for

    // COMPUTE ADVANTAGES
    Compute rewards-to-go R̂t
    Compute advantage estimates Ât using GAE

    // OPTIMIZE (multiple epochs on same data!)
    for epoch = 1, ..., K do
        for minibatch in shuffle(collected_data) do
            // Policy update
            Compute r(θ) = πθ(a|s) / πθ_old(a|s)
            L_CLIP = E[min(r(θ)Â, clip(r(θ), 1-ε, 1+ε)Â)]

            // Value update
            L_VF = E[(Vφ(s) - R̂)²]

            // Entropy bonus
            S = E[entropy(πθ)]

            // Combined update
            Maximize L_CLIP - c1·L_VF + c2·S
        end for
    end for

    θ_old ← θ
end for
```
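The optimization phase (the K-epoch minibatch loop above) can be sketched independently of any particular framework; here `update_fn` stands in for one gradient step on a minibatch:

```python
import random

def ppo_optimize(data, update_fn, num_epochs=10, minibatch_size=64):
    """Reuse one batch of rollout data for several epochs of minibatch updates."""
    indices = list(range(len(data)))
    for _ in range(num_epochs):
        random.shuffle(indices)  # fresh minibatch composition every epoch
        for start in range(0, len(indices), minibatch_size):
            minibatch = [data[i] for i in indices[start:start + minibatch_size]]
            update_fn(minibatch)  # one gradient step on this minibatch
```

A real implementation recomputes log-probs and values with the current θ on each minibatch while keeping the stored πθ_old log-probs fixed for the ratio.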
Practical Application
When to Use PPO
Good fit:
- Continuous or discrete action spaces
- When sample efficiency isn’t critical but stability is
- When you want a reliable, well-understood baseline
- Multi-task and transfer learning scenarios
- RLHF for language models
Not ideal for:
- Extremely sample-constrained settings
- Very simple problems where vanilla policy gradient suffices
- When you need guaranteed convergence properties (PPO is empirically stable but has fewer theoretical guarantees than TRPO)
Common Pitfalls
- Learning rate too high: Policy can still destabilize despite clipping. Start conservative.
- Not normalizing advantages: Advantages should be standardized (zero mean, unit variance) per batch for stable training.
- Value function divergence: If the critic is poorly trained, advantage estimates become unreliable. Ensure adequate value function training.
- Too few epochs: PPO's power comes from reusing data. Too few epochs waste samples.
- Too many epochs: Overfitting to old data. The clipping helps, but there's still a limit. 10 epochs is usually the upper bound.
- Reward scaling issues: Very large or very small rewards can cause numerical problems. Normalize if needed.
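Per-batch advantage normalization is a one-liner; a NumPy sketch (the epsilon in the denominator guards against zero variance):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize advantages to zero mean, unit variance per batch."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```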
Resources
Papers
- Proximal Policy Optimization Algorithms (Original Paper)
- Trust Region Policy Optimization (TRPO)
- High-Dimensional Continuous Control Using GAE
- Emergent Complexity from Multi-Agent Competition (OpenAI)
Articles & Tutorials
- OpenAI Spinning Up - PPO
- Lilian Weng - Policy Gradient Algorithms
- The 37 Implementation Details of PPO
- HuggingFace Deep RL Course - PPO
- Arxiv Insights - PPO Explained
- Pieter Abbeel - Policy Gradients and PPO
- Stable Baselines3 - PPO
- CleanRL - PPO Implementation
- SpinningUp - PPO Implementation