Overview
Reinforcement Learning (RL) is a learning paradigm where an agent learns to make decisions by interacting with an environment. Unlike Supervised Learning where we have labeled input-output pairs, or Unsupervised Learning where we discover patterns in unlabeled data, RL learns through trial and error by receiving rewards or penalties for actions taken.
RL optimizes for long-term cumulative reward, not immediate feedback. An agent might take a seemingly poor action now (like sacrificing a chess piece) if it leads to better outcomes later (winning the game).
RL Mental Model
Think of training a dog: you don’t show it examples of “correct” sits, you reward good behavior and ignore (or penalize) bad behavior. Over time, the dog learns which actions lead to treats.
Exploration vs Exploitation
- Exploitation: Use current knowledge to maximize immediate reward
- Exploration: Try new actions to potentially discover better strategies
- Too much exploitation → stuck in local optima, missing better solutions
- Too much exploration → time wasted on suboptimal actions
Common solution: ε-greedy strategy
- With probability $\epsilon$: EXPLORE (random action)
- With probability $1 - \epsilon$: EXPLOIT (best known action)
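The ε-greedy rule above is a few lines of code. A minimal sketch (the function name and the example Q-values are illustrative, not from the source):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # EXPLORE: uniform random action
    # EXPLOIT: index of the highest-valued action
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; with `epsilon=1` it is purely random, which is why ε is typically decayed over training.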
Mathematical Foundation
Markov Decision Processes (MDPs)
RL problems are typically formalized as MDPs, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
| Symbol | Name | Definition |
|---|---|---|
| $\mathcal{S}$ | State space | Set of all possible states |
| $\mathcal{A}$ | Action space | Set of all possible actions |
| $P(s' \mid s, a)$ | Transition function | Probability of transitioning to $s'$ given state $s$ and action $a$ |
| $R(s, a, s')$ | Reward function | Immediate reward for the transition $s \to s'$ via $a$ |
| $\gamma$ | Discount factor | $\gamma \in [0, 1]$, determines importance of future rewards |
Markov Property: Future states depend only on the current state, not on the history of how we got there:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)$$
Return and Value Functions
Return $G_t$: The cumulative discounted reward from timestep $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Why discount?
- Mathematically ensures the sum converges (for $\gamma < 1$)
- Reflects preference for immediate rewards over uncertain future ones
- $\gamma = 0$: myopic (only care about immediate reward)
- $\gamma \to 1$: far-sighted (care equally about all future rewards)
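The discounted return is usually computed by folding the reward list backwards, since $G_t = R_{t+1} + \gamma G_{t+1}$. A minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 via the backwards recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` gives `1 + 0.5 + 0.25 = 1.75`; with `gamma=0` only the first reward counts (myopic), with `gamma=1` it is the plain sum (far-sighted).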
State-Value vs. Action-Value Functions
These two functions answer different questions:
| Function | Question It Answers | Notation | Depends On |
|---|---|---|---|
| State-Value | “How good is it to be in state $s$?” | $V^\pi(s)$ | State only |
| Action-Value | “How good is it to take action $a$ in state $s$?” | $Q^\pi(s, a)$ | State AND action |
State-Value Function $V^\pi(s)$: Expected return starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

Action-Value Function $Q^\pi(s, a)$: Expected return starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
Intuition:
- $V(s)$ tells you: “If I land in this chess position and play optimally from here, what’s my expected outcome?”
- $Q(s, a)$ tells you: “If I land in this chess position and make this specific move, then play optimally, what’s my expected outcome?”

$Q$ is actionable because it helps you compare different actions directly. Given $Q^*(s, a)$ for all actions, you immediately know the optimal policy: just pick $\arg\max_a Q^*(s, a)$.
When to use each:
- $V(s)$: When you already have a policy and want to evaluate states (e.g., policy iteration, actor-critic critics)
- $Q(s, a)$: When you need to choose actions without an explicit policy (e.g., Q-Learning, DQN)
i.e., The value of a state is the weighted average of action-values, weighted by the policy’s action probabilities: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$. If the policy says “I pick action $a_1$ 70% of the time and $a_2$ 30%,” then $V(s) = 0.7\, Q(s, a_1) + 0.3\, Q(s, a_2)$.
Bellman Equations
Core idea: the value of a state depends on the values of states you can reach from it. This recursive structure is what makes RL tractable.
Bellman Expectation Equation
For a fixed policy $\pi$, the value of being in state $s$ can be broken down as:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

- $\pi(a \mid s)$: Probability of taking action $a$ in state $s$ (given by the policy)
- $P(s' \mid s, a)$: Probability of landing in $s'$ after taking action $a$
- $R(s, a, s')$: Immediate reward for that transition
- $\gamma V^\pi(s')$: Discounted value of where we end up
- Sum over all possible actions and outcomes → expected value
i.e., Value here = average of (immediate reward + discounted future value), considering all the actions I might take and all the places I might end up.
Bellman Optimality Equation
For the optimal policy $\pi^*$, we don’t average over actions, we take the best one:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

Why this matters: If we can solve for $Q^*$, the optimal policy falls out trivially:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

i.e., “Just pick the action with the highest Q-value.”
How to Solve for $V^*$
The Bellman equation tells us what $V^*$ must satisfy, but how do we actually find it?
1. Model-Based (Value Iteration): If you know $P$ and $R$, repeatedly apply the Bellman optimality equation as an update:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Keep iterating until $V$ stops changing. This converges to $V^*$.
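Value iteration fits in a few lines on a toy problem. A minimal sketch, assuming a hypothetical deterministic chain MDP (states 0–3, actions left/right, reward 1 for reaching the terminal state; none of this is from the source):

```python
# Value iteration on a tiny deterministic chain MDP (hypothetical example).
GAMMA = 0.9
N_STATES, TERMINAL = 4, 3

def step(s, a):
    """Known model: a=0 moves left (floored at 0), a=1 moves right."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def value_iteration(tol=1e-8):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # terminal state keeps value 0
            # Bellman optimality backup: max over actions of r + gamma * V(s')
            best = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

On this chain the values converge to $V = [0.81, 0.9, 1.0, 0]$: each step away from the goal costs one factor of $\gamma = 0.9$.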
2. Model-Free (Q-Learning): If you don’t know the dynamics, learn from experience:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

Each time you take action $a$ in state $s$, observe reward $r$ and next state $s'$, then nudge your estimate toward the observed target. See the Q-Learning section below for details.
Policy Definitions
Deterministic Policy: $\pi(s) = a$ maps each state to exactly one action.

Stochastic Policy: $\pi(a \mid s)$ gives the probability of each action in each state.

The optimal policy satisfies:

$$\pi^* = \arg\max_\pi V^\pi(s) \quad \text{for all } s \in \mathcal{S}$$
Core Algorithms
Value-Based Methods
Q-Learning (Off-Policy)
The most famous RL algorithm. Learns $Q^*$ directly without following the optimal policy during training.

Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
Breaking it down:
- $\alpha$: Learning rate
- $r + \gamma \max_{a'} Q(s', a')$: TD (Temporal Difference) target (observed reward + best future value)
- $Q(s, a)$: Current estimate
- TD error $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$: How wrong we were (target - estimate)
Each update nudges our estimate toward what we actually observed.
Why “off-policy”? We use $\max_{a'} Q(s', a')$ in the update (the greedy action), even if we didn’t actually take the greedy action. Our exploration policy (e.g., ε-greedy) differs from what we’re learning about (the optimal policy).
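The full tabular loop can be sketched end-to-end. Everything here (the chain environment, the hyperparameters) is a hypothetical example, not from the source:

```python
import random

# Tabular Q-learning on a small deterministic chain: states 0..3,
# actions 0 = left / 1 = right; entering state 3 pays reward 1 and ends
# the episode. Hypothetical toy environment.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

def env_step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0), s2 == TERMINAL

def train(episodes=500, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy (exploration)
            if random.random() < EPSILON:
                a = random.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
            s2, r, done = env_step(s, a)
            # off-policy target: max over next actions, regardless of what
            # the behaviour policy will actually do next
            target = r + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right everywhere, and the Q-values match the value-iteration solution ($Q(2, \text{right}) \approx 1$, $Q(1, \text{right}) \approx 0.9$, $Q(0, \text{right}) \approx 0.81$).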
SARSA (On-Policy)
State-Action-Reward-State-Action: Uses the actual next action, not the best one.
Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$$

Key difference from Q-Learning: Uses $Q(s', a')$ where $a'$ is the action actually selected (following the current policy), not $\max_{a'} Q(s', a')$.
On-policy: Learning and behavior use the same policy. SARSA is more conservative in risky environments because it accounts for exploration in its updates.
See: https://www.youtube.com/watch?v=tbpBW5Yr44k&t=12s
Deep Q-Network (DQN) (TODO)
Q-Learning fails for large state spaces (you can’t store a table over all chess positions). DQN approximates $Q(s, a)$ with a neural network $Q(s, a; \theta)$.
- Experience Replay: Store transitions $(s, a, r, s')$ in a replay buffer. Sample random mini-batches to break correlation and improve data efficiency.
- Target Network: Use a separate, slowly-updated network $\theta^-$ for the TD target:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

This stabilizes training by preventing the target from changing too rapidly.
DQN Training Loop:

```
Initialize replay buffer D, Q-network θ, target network θ⁻ ← θ
For each episode:
    For each step:
        Select action a using ε-greedy from Q(s, ·; θ)
        Execute a, observe r, s'
        Store (s, a, r, s') in D
        Sample mini-batch from D
        Compute target: y = r + γ max_a' Q(s', a'; θ⁻)
        Gradient descent on (y - Q(s, a; θ))²
        Every C steps: θ⁻ ← θ
```
Policy-Based Methods
Instead of learning values and deriving a policy, directly learn a parameterized policy $\pi_\theta(a \mid s)$.
Why policy-based?
- Can handle continuous action spaces naturally
- Can learn stochastic policies (useful when optimal behavior is stochastic)
- Often have better convergence properties
Policy Gradient Theorem
The objective is to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G_0\right]$$

Policy Gradient Theorem (the fundamental result):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

i.e. Increase the probability of actions that lead to high returns. The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ points in the direction that increases that action’s probability; scale it by how good the action was.
REINFORCE Algorithm
A Monte Carlo policy gradient method: run a full episode, then update the policy based on actual returns observed.
Update Rule:

$$\theta \leftarrow \theta + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Breaking it down:
- $\alpha$: Learning rate
- $\gamma^t$: Discount factor for timestep $t$ (earlier actions weighted more)
- $G_t$: Actual return from timestep $t$ onwards
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$: Direction to increase the probability of action $a_t$
How it works:
- Run a complete episode using the current policy $\pi_\theta$
- For each timestep $t$, compute the return $G_t$ (sum of discounted rewards from $t$ to the end)
- Update: if $G_t$ is high, push $\theta$ to make $a_t$ more likely; if $G_t$ is low, push $\theta$ to make $a_t$ less likely
- Repeat for many episodes
Why $\nabla_\theta \log \pi_\theta$? The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ is the “score function.” Multiplying it by the return gives us an unbiased estimate of the policy gradient. High return → increase action probability. Low return → decrease it.
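The steps above can be sketched on the simplest possible case: a one-step bandit with a softmax policy, where the return $G$ is just the immediate reward. The environment and all numbers are hypothetical; for a softmax over logits $\theta$, the score is $\nabla_{\theta_k} \log \pi(a) = \mathbb{1}[k = a] - \pi(k)$:

```python
import math, random

# REINFORCE on a one-step, two-armed bandit (hypothetical example):
# action 1 pays reward 1, action 0 pays reward 0.
ALPHA = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=2000, seed=0):
    random.seed(seed)
    theta = [0.0, 0.0]  # one logit per action
    for _ in range(episodes):
        probs = softmax(theta)
        a = random.choices([0, 1], weights=probs)[0]  # sample from pi_theta
        G = float(a == 1)  # episode return = immediate reward
        # score function for softmax: grad of log pi(a) wrt theta[k]
        for k in range(2):
            theta[k] += ALPHA * G * ((1.0 if k == a else 0.0) - probs[k])
    return softmax(theta)
```

After training, the policy puts almost all its probability on the rewarding arm: high-return actions get pushed up, zero-return episodes leave $\theta$ unchanged.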
The Variance Problem
REINFORCE has high variance because $G_t$ can vary wildly between episodes (different random trajectories lead to very different returns). This means:
- Noisy gradient estimates
- Slow, unstable learning
- Needs many samples to average out the noise
Solution: Baseline Subtraction
Subtract a baseline $b(s_t)$ from the return:

$$\theta \leftarrow \theta + \alpha\, \gamma^t \left(G_t - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Why this helps:
- If $G_t > b(s_t)$: action was better than average → increase probability
- If $G_t < b(s_t)$: action was worse than average → decrease probability
- Centering around the baseline reduces the magnitude of updates, lowering variance
Common baseline: $b(s) = V^\pi(s)$, the state-value function. This is (near-)optimal in the sense that it minimizes variance without introducing bias.
Key insight: Subtracting a baseline that doesn’t depend on the action preserves the expected gradient (unbiased) while reducing variance. This leads naturally to Actor-Critic methods, where we learn $V(s)$ alongside the policy.
Actor-Critic Methods
Combine value-based and policy-based approaches:
- Actor: The policy that selects actions
- Critic: A value function $V(s)$ or $Q(s, a)$ that evaluates how good actions are
The critic reduces variance by replacing high-variance Monte Carlo returns with lower-variance value estimates.
Advantage Function:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Measures how much better action $a$ is compared to the average action from state $s$.

Actor Update (using advantage):

$$\theta \leftarrow \theta + \alpha\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

Critic Update (minimize TD error, semi-gradient TD(0)):

$$\delta = r + \gamma V_w(s') - V_w(s), \qquad w \leftarrow w + \beta\, \delta\, \nabla_w V_w(s)$$
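The actor and critic updates can be combined into a one-step loop. A tabular sketch on a hypothetical chain environment (states 0–3, left/right actions, reward 1 at the terminal state; hyperparameters are illustrative), using the TD error $\delta$ as the advantage estimate:

```python
import math, random

# One-step (TD) actor-critic on a tiny chain (hypothetical example).
# Actor: tabular softmax policy. Critic: tabular V learned by TD(0).
GAMMA, LR_ACTOR, LR_CRITIC = 0.9, 0.1, 0.1
N_STATES, TERMINAL = 4, 3

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=3000, seed=0):
    random.seed(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # actor logits per state
    V = [0.0] * N_STATES                           # critic values
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            probs = softmax(theta[s])
            a = random.choices([0, 1], weights=probs)[0]
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == TERMINAL else 0.0
            # TD error doubles as an advantage estimate
            delta = r + (0.0 if s2 == TERMINAL else GAMMA * V[s2]) - V[s]
            V[s] += LR_CRITIC * delta  # critic update
            for k in (0, 1):           # actor update along the score function
                theta[s][k] += LR_ACTOR * delta * ((1.0 if k == a else 0.0) - probs[k])
            s = s2
    return theta, V
```

The critic's value estimates replace full-episode returns, so each update uses only one transition, which is the variance reduction the section describes.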
Proximal Policy Optimization (PPO)
Refer to linked note.
Types of RL Problems
Episodic vs. Continuing Tasks
| Aspect | Episodic | Continuing |
|---|---|---|
| Structure | Clear start and end | Runs forever |
| Examples | Games, robot navigation | Stock trading, HVAC control |
| Return | $G = \sum_{t=0}^{T} \gamma^t R_t$ (finite sum) | $G = \sum_{t=0}^{\infty} \gamma^t R_t$ (infinite sum) |
| Gamma | Can use $\gamma = 1$ | Need $\gamma < 1$ for convergence |
Model-Based vs. Model-Free
Model-Free: Learn policy or value function directly from experience. No explicit model of environment dynamics.
- Pro: Works when dynamics are unknown or complex
- Con: Sample inefficient (need lots of experience)
- Examples: Q-Learning, SARSA, Policy Gradient, PPO
Model-Based: Learn a model of the environment (transition and reward functions), then use it for planning.
- Pro: Sample efficient (can simulate experience)
- Con: Model errors can compound; complex dynamics are hard to model
- Example: Dyna-Q (combines learning with planning)
On-Policy vs. Off-Policy
On-Policy: Learn about the policy currently being used for decisions.
- Must use fresh experience from current policy
- Examples: SARSA, REINFORCE, A2C/A3C, PPO
Off-Policy: Learn about a different policy than the one generating experience.
- Can reuse old experience (replay buffers)
- Examples: Q-Learning, DQN, DDPG, SAC
Practical Applications
When to Use Reinforcement Learning
Good fit:
- Sequential decision-making problems
- Clear reward signal (even if sparse)
- Ability to simulate or interact repeatedly with environment
- When optimal strategy is unknown or too complex to hand-code
Examples:
- Game playing (Go, Atari, Dota 2, StarCraft)
- Robotics and control (manipulation, locomotion)
- Recommendation systems (personalized content)
- Resource management (data centers, traffic control)
- LLM alignment (RLHF)
Pitfalls
- Reward Hacking: Agent finds unintended ways to maximize reward
- Delete tests to ensure all tests pass? Sure.
- Sparse Rewards: Agent receives feedback too rarely to learn
- Reward only for winning a chess game
- Sample Inefficiency: Deep RL often needs millions of environment steps
- Solutions: Model-based methods, better exploration, transfer learning
- Hyperparameter Sensitivity: RL algorithms are notoriously finicky
- Learning rate, discount factor, exploration schedule all matter greatly
- Solutions: Maybe use PPO?
- Non-Stationarity: In multi-agent settings, other agents are also learning
- The “environment” keeps changing
- Solutions: Self-play, population-based training
More Advanced Topics (TODO sometime later??)
Multi-Agent RL
Multiple agents learning simultaneously in shared environment. Challenges: non-stationarity, credit assignment among agents, emergent behavior.
Hierarchical RL
Learn policies at multiple levels of abstraction. High-level policy selects goals, low-level policies achieve them. Helps with long-horizon problems.
Inverse RL
Given demonstrations from an expert, infer the reward function they were optimizing. Useful when rewards are hard to specify but easy to demonstrate.
Offline RL (Batch RL)
Learn from a fixed dataset of experience without further environment interaction. Important for real-world applications where online learning is expensive or dangerous.
Safe RL
Learning while satisfying safety constraints. Critical for robotics and autonomous vehicles where exploration cannot be unconstrained.
Resources
Papers
- Playing Atari with Deep Reinforcement Learning (DQN)
- Proximal Policy Optimization Algorithms
- Soft Actor-Critic
- Human-level control through deep RL (Nature DQN)
Articles & Tutorials
- OpenAI Spinning Up - Excellent introduction to deep RL
- Lilian Weng’s RL Blog Posts
- David Silver’s RL Course (DeepMind)
- Pieter Abbeel’s Deep RL Course (Berkeley)
Code & Libraries
Back to: 01 - Core Fundamentals Index