Overview
Reinforcement Learning (RL) is a learning paradigm where an agent learns to make decisions by interacting with an environment. Unlike Supervised Learning where we have labeled input-output pairs, or Unsupervised Learning where we discover patterns in unlabeled data, RL learns through trial and error by receiving rewards or penalties for actions taken.
RL optimizes for long-term cumulative reward, not immediate feedback. An agent might take a seemingly poor action now (like sacrificing a chess piece) if it leads to better outcomes later (winning the game).
RL Mental Model
Think of training a dog: you don’t show it examples of “correct” sits, you reward good behavior and ignore (or penalize) bad behavior. Over time, the dog learns which actions lead to treats.
Exploration vs Exploitation
- Exploitation: Use current knowledge to maximize immediate reward
- Exploration: Try new actions to potentially discover better strategies
- Too much exploitation → stuck in local optima, missing better solutions
- Too much exploration → time wasted on suboptimal actions
Common solution: ε-greedy strategy
- With probability $\epsilon$: EXPLORE (random action)
- With probability $1 - \epsilon$: EXPLOIT (best known action)
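The ε-greedy rule above is a few lines of code. A minimal sketch (the function name and the example Q-values are illustrative, not from the source):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # EXPLORE: uniform random action
    # EXPLOIT: index of the highest-valued action
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; with `epsilon=1` it is purely random, which is why ε is typically decayed over training.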
Mathematical Foundation
Markov Decision Processes (MDPs)
RL problems are typically formalized as MDPs, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
| Symbol | Name | Definition |
|---|---|---|
| $\mathcal{S}$ | State space | Set of all possible states |
| $\mathcal{A}$ | Action space | Set of all possible actions |
| $P(s' \mid s, a)$ | Transition function | Probability of transitioning to $s'$ given state $s$ and action $a$ |
| $R(s, a, s')$ | Reward function | Immediate reward for the transition $s \to s'$ via $a$ |
| $\gamma$ | Discount factor | $\gamma \in [0, 1]$, determines importance of future rewards |
Markov Property: Future states depend only on the current state, not on the history of how we got there:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)$$
Return and Value Functions
Return $G_t$: The cumulative discounted reward from timestep $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Why discount?
- Mathematically ensures the sum converges (for $\gamma < 1$)
- Reflects preference for immediate rewards over uncertain future ones
- $\gamma = 0$: myopic (only care about immediate reward)
- $\gamma \to 1$: far-sighted (care equally about all future rewards)
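The discounted return is usually computed by folding the reward list backwards, since $G_t = R_{t+1} + \gamma G_{t+1}$. A minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 via the backwards recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` gives `1 + 0.5 + 0.25 = 1.75`; with `gamma=0` only the first reward counts (myopic), with `gamma=1` it is the plain sum (far-sighted).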
State-Value vs. Action-Value Functions
These two functions answer different questions:
| Function | Question It Answers | Notation | Depends On |
|---|---|---|---|
| State-Value | “How good is it to be in state $s$?” | $V^\pi(s)$ | State only |
| Action-Value | “How good is it to take action $a$ in state $s$?” | $Q^\pi(s, a)$ | State AND action |
State-Value Function $V^\pi(s)$: Expected return starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

Action-Value Function $Q^\pi(s, a)$: Expected return starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
Intuition:
- $V(s)$ tells you: “If I land in this chess position and play optimally from here, what’s my expected outcome?”
- $Q(s, a)$ tells you: “If I land in this chess position and make this specific move, then play optimally, what’s my expected outcome?”

$Q$ is actionable because it helps you compare different actions directly. Given $Q^*(s, a)$ for all actions, you immediately know the optimal policy: just pick $\arg\max_a Q^*(s, a)$.
When to use each:
- $V(s)$: When you already have a policy and want to evaluate states (e.g., policy iteration, actor-critic critics)
- $Q(s, a)$: When you need to choose actions without an explicit policy (e.g., Q-Learning, DQN)
i.e., The value of a state is the weighted average of action-values, weighted by the policy’s action probabilities: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$. If the policy says “I pick action $a_1$ 70% of the time and $a_2$ 30%,” then $V(s) = 0.7\, Q(s, a_1) + 0.3\, Q(s, a_2)$.
Bellman Equations
Core idea: the value of a state depends on the values of states you can reach from it. This recursive structure is what makes RL tractable.
Bellman Expectation Equation
For a fixed policy $\pi$, the value of being in state $s$ can be broken down as:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

- $\pi(a \mid s)$: Probability of taking action $a$ in state $s$ (given by the policy)
- $P(s' \mid s, a)$: Probability of landing in $s'$ after taking action $a$
- $R(s, a, s')$: Immediate reward for that transition
- $\gamma V^\pi(s')$: Discounted value of where we end up
- Sum over all possible actions and outcomes → expected value
i.e., Value here = average of (immediate reward + discounted future value), considering all the actions I might take and all the places I might end up.
Bellman Optimality Equation
For the optimal policy $\pi^*$, we don’t average over actions, we take the best one:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

Why this matters: If we can solve for $Q^*$, the optimal policy falls out trivially:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

i.e., “Just pick the action with the highest Q-value.”
How to Solve for $V^*$
The Bellman equation tells us what $V^*$ must satisfy, but how do we actually find it?
1. Model-Based (Value Iteration): If you know $P$ and $R$, repeatedly apply the Bellman optimality equation as an update:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Keep iterating until $V$ stops changing. This converges to $V^*$.
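Value iteration fits in a few lines on a toy problem. A minimal sketch, assuming a hypothetical deterministic chain MDP (states 0–3, actions left/right, reward 1 for reaching the terminal state; none of this is from the source):

```python
# Value iteration on a tiny deterministic chain MDP (hypothetical example).
GAMMA = 0.9
N_STATES, TERMINAL = 4, 3

def step(s, a):
    """Known model: a=0 moves left (floored at 0), a=1 moves right."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def value_iteration(tol=1e-8):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # terminal state keeps value 0
            # Bellman optimality backup: max over actions of r + gamma * V(s')
            best = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

On this chain the values converge to $V = [0.81, 0.9, 1.0, 0]$: each step away from the goal costs one factor of $\gamma = 0.9$.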
2. Model-Free (Q-Learning): If you don’t know the dynamics, learn from experience:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

Each time you take action $a$ in state $s$, observe reward $r$ and next state $s'$, then nudge your estimate toward the observed target. See the Q-Learning section below for details.
Policy Definitions
Deterministic Policy: $\pi(s) = a$ maps each state to exactly one action.

Stochastic Policy: $\pi(a \mid s)$ gives the probability of each action in each state.

The optimal policy satisfies:

$$\pi^* = \arg\max_\pi V^\pi(s) \quad \text{for all } s \in \mathcal{S}$$
Core Algorithms
Value-Based Methods
Q-Learning (Off-Policy)
The most famous RL algorithm. Learns $Q^*$ directly without following the optimal policy during training.

Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
Breaking it down:
- $\alpha$: Learning rate
- $r + \gamma \max_{a'} Q(s', a')$: TD (Temporal Difference) target (observed reward + best future value)
- $Q(s, a)$: Current estimate
- TD error $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$: How wrong we were (target - estimate)
Each update nudges our estimate toward what we actually observed.
Why “off-policy”? We use $\max_{a'} Q(s', a')$ in the update (the greedy action), even if we didn’t actually take the greedy action. Our exploration policy (e.g., ε-greedy) differs from what we’re learning about (the optimal policy).
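The full tabular loop can be sketched end-to-end. Everything here (the chain environment, the hyperparameters) is a hypothetical example, not from the source:

```python
import random

# Tabular Q-learning on a small deterministic chain: states 0..3,
# actions 0 = left / 1 = right; entering state 3 pays reward 1 and ends
# the episode. Hypothetical toy environment.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

def env_step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0), s2 == TERMINAL

def train(episodes=500, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy (exploration)
            if random.random() < EPSILON:
                a = random.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
            s2, r, done = env_step(s, a)
            # off-policy target: max over next actions, regardless of what
            # the behaviour policy will actually do next
            target = r + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right everywhere, and the Q-values match the value-iteration solution ($Q(2, \text{right}) \approx 1$, $Q(1, \text{right}) \approx 0.9$, $Q(0, \text{right}) \approx 0.81$).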
SARSA (On-Policy)
State-Action-Reward-State-Action: Uses the actual next action, not the best one.
Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$$

Key difference from Q-Learning: Uses $Q(s', a')$ where $a'$ is the action actually selected (following the current policy), not $\max_{a'} Q(s', a')$.
On-policy: Learning and behavior use the same policy. SARSA is more conservative in risky environments because it accounts for exploration in its updates.
See: https://www.youtube.com/watch?v=tbpBW5Yr44k&t=12s
Deep Q-Network (DQN) (TODO)
Q-Learning fails for large state spaces (you can’t store a table over all chess positions). DQN approximates $Q(s, a)$ with a neural network $Q(s, a; \theta)$.
- Experience Replay: Store transitions $(s, a, r, s')$ in a replay buffer. Sample random mini-batches to break correlation and improve data efficiency.
- Target Network: Use a separate, slowly-updated network $\theta^-$ for the TD target:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

This stabilizes training by preventing the target from changing too rapidly.
DQN Training Loop:

```
Initialize replay buffer D, Q-network θ, target network θ⁻ ← θ
For each episode:
    For each step:
        Select action a using ε-greedy from Q(s, ·; θ)
        Execute a, observe r, s'
        Store (s, a, r, s') in D
        Sample mini-batch from D
        Compute target: y = r + γ max_a' Q(s', a'; θ⁻)
        Gradient descent on (y - Q(s, a; θ))²
        Every C steps: θ⁻ ← θ
```
Policy-Based Methods
Instead of learning values and deriving a policy, directly learn a parameterized policy $\pi_\theta(a \mid s)$.
Why policy-based?
- Can handle continuous action spaces naturally
- Can learn stochastic policies (useful when optimal behavior is stochastic)
- Often have better convergence properties
Policy Gradient Theorem
The objective is to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G_0\right]$$

Policy Gradient Theorem (the fundamental result):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

i.e. Increase the probability of actions that lead to high returns. The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ points in the direction that increases that action’s probability; scale it by how good the action was.
REINFORCE Algorithm
A Monte Carlo policy gradient method: run a full episode, then update the policy based on actual returns observed.
Update Rule:

$$\theta \leftarrow \theta + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Breaking it down:
- $\alpha$: Learning rate
- $\gamma^t$: Discount factor for timestep $t$ (earlier actions weighted more)
- $G_t$: Actual return from timestep $t$ onwards
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$: Direction to increase the probability of action $a_t$
How it works:
- Run a complete episode using the current policy $\pi_\theta$
- For each timestep $t$, compute the return $G_t$ (sum of discounted rewards from $t$ to the end)
- Update: if $G_t$ is high, push $\theta$ to make $a_t$ more likely; if $G_t$ is low, push $\theta$ to make $a_t$ less likely
- Repeat for many episodes
Why $\nabla_\theta \log \pi_\theta$? The gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ is the “score function.” Multiplying it by the return gives us an unbiased estimate of the policy gradient. High return → increase action probability. Low return → decrease it.
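The steps above can be sketched on the simplest possible case: a one-step bandit with a softmax policy, where the return $G$ is just the immediate reward. The environment and all numbers are hypothetical; for a softmax over logits $\theta$, the score is $\nabla_{\theta_k} \log \pi(a) = \mathbb{1}[k = a] - \pi(k)$:

```python
import math, random

# REINFORCE on a one-step, two-armed bandit (hypothetical example):
# action 1 pays reward 1, action 0 pays reward 0.
ALPHA = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=2000, seed=0):
    random.seed(seed)
    theta = [0.0, 0.0]  # one logit per action
    for _ in range(episodes):
        probs = softmax(theta)
        a = random.choices([0, 1], weights=probs)[0]  # sample from pi_theta
        G = float(a == 1)  # episode return = immediate reward
        # score function for softmax: grad of log pi(a) wrt theta[k]
        for k in range(2):
            theta[k] += ALPHA * G * ((1.0 if k == a else 0.0) - probs[k])
    return softmax(theta)
```

After training, the policy puts almost all its probability on the rewarding arm: high-return actions get pushed up, zero-return episodes leave $\theta$ unchanged.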
The Variance Problem
REINFORCE has high variance because $G_t$ can vary wildly between episodes (different random trajectories lead to very different returns). This means:
- Noisy gradient estimates
- Slow, unstable learning
- Needs many samples to average out the noise
Solution: Baseline Subtraction
Subtract a baseline $b(s_t)$ from the return:

$$\theta \leftarrow \theta + \alpha\, \gamma^t \left(G_t - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Why this helps:
- If $G_t > b(s_t)$: action was better than average → increase probability
- If $G_t < b(s_t)$: action was worse than average → decrease probability
- Centering around the baseline reduces the magnitude of updates, lowering variance
Common baseline: $b(s) = V^\pi(s)$, the state-value function. This is (near-)optimal in the sense that it minimizes variance without introducing bias.
Key insight: Subtracting a baseline that doesn’t depend on the action preserves the expected gradient (unbiased) while reducing variance. This leads naturally to Actor-Critic methods, where we learn $V(s)$ alongside the policy.
Actor-Critic Methods
Combine value-based and policy-based approaches:
- Actor: The policy that selects actions
- Critic: A value function $V(s)$ or $Q(s, a)$ that evaluates how good actions are
The critic reduces variance by replacing high-variance Monte Carlo returns with lower-variance value estimates.
Advantage Function:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Measures how much better action $a$ is compared to the average action from state $s$.

Actor Update (using advantage):

$$\theta \leftarrow \theta + \alpha\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

Critic Update (minimize TD error, semi-gradient TD(0)):

$$\delta = r + \gamma V_w(s') - V_w(s), \qquad w \leftarrow w + \beta\, \delta\, \nabla_w V_w(s)$$
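The actor and critic updates can be combined into a one-step loop. A tabular sketch on a hypothetical chain environment (states 0–3, left/right actions, reward 1 at the terminal state; hyperparameters are illustrative), using the TD error $\delta$ as the advantage estimate:

```python
import math, random

# One-step (TD) actor-critic on a tiny chain (hypothetical example).
# Actor: tabular softmax policy. Critic: tabular V learned by TD(0).
GAMMA, LR_ACTOR, LR_CRITIC = 0.9, 0.1, 0.1
N_STATES, TERMINAL = 4, 3

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=3000, seed=0):
    random.seed(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # actor logits per state
    V = [0.0] * N_STATES                           # critic values
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            probs = softmax(theta[s])
            a = random.choices([0, 1], weights=probs)[0]
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == TERMINAL else 0.0
            # TD error doubles as an advantage estimate
            delta = r + (0.0 if s2 == TERMINAL else GAMMA * V[s2]) - V[s]
            V[s] += LR_CRITIC * delta  # critic update
            for k in (0, 1):           # actor update along the score function
                theta[s][k] += LR_ACTOR * delta * ((1.0 if k == a else 0.0) - probs[k])
            s = s2
    return theta, V
```

The critic's value estimates replace full-episode returns, so each update uses only one transition, which is the variance reduction the section describes.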
Proximal Policy Optimization (PPO)
Refer to linked note.
Types of RL Problems
Episodic vs. Continuing Tasks
| Aspect | Episodic | Continuing |
|---|---|---|
| Structure | Clear start and end | Runs forever |
| Examples | Games, robot navigation | Stock trading, HVAC control |
| Return | $G = \sum_{t=0}^{T} \gamma^t R_t$ (finite sum) | $G = \sum_{t=0}^{\infty} \gamma^t R_t$ (infinite sum) |
| Gamma | Can use $\gamma = 1$ | Need $\gamma < 1$ for convergence |
Model-Based vs. Model-Free
Model-Free: Learn policy or value function directly from experience. No explicit model of environment dynamics.
- Pro: Works when dynamics are unknown or complex
- Con: Sample inefficient (need lots of experience)
- Examples: Q-Learning, SARSA, Policy Gradient, PPO
Model-Based: Learn a model of the environment (transition and reward functions), then use it for planning.
- Pro: Sample efficient (can simulate experience)
- Con: Model errors can compound; complex dynamics are hard to model
- Example: Dyna-Q (combines learning with planning)
On-Policy vs. Off-Policy
On-Policy: Learn about the policy currently being used for decisions.
- Must use fresh experience from current policy
- Examples: SARSA, REINFORCE, A2C/A3C, PPO
Off-Policy: Learn about a different policy than the one generating experience.
- Can reuse old experience (replay buffers)
- Examples: Q-Learning, DQN, DDPG, SAC
Practical Applications
When to Use Reinforcement Learning
Good fit:
- Sequential decision-making problems
- Clear reward signal (even if sparse)
- Ability to simulate or interact repeatedly with environment
- When optimal strategy is unknown or too complex to hand-code
Examples:
- Game playing (Go, Atari, Dota 2, StarCraft)
- Robotics and control (manipulation, locomotion)
- Recommendation systems (personalized content)
- Resource management (data centers, traffic control)
- LLM alignment (RLHF)
Pitfalls
- Reward Hacking: Agent finds unintended ways to maximize reward
- Delete tests to ensure all tests pass? Sure.
- Sparse Rewards: Agent receives feedback too rarely to learn
- Reward only for winning a chess game
- Sample Inefficiency: Deep RL often needs millions of environment steps
- Solutions: Model-based methods, better exploration, transfer learning
- Hyperparameter Sensitivity: RL algorithms are notoriously finicky
- Learning rate, discount factor, exploration schedule all matter greatly
- Solutions: Maybe use PPO?
- Non-Stationarity: In multi-agent settings, other agents are also learning
- The “environment” keeps changing
- Solutions: Self-play, population-based training
More Advanced Topics (TODO sometime later??)
Multi-Agent RL
Multiple agents learning simultaneously in shared environment. Challenges: non-stationarity, credit assignment among agents, emergent behavior.
Hierarchical RL
Learn policies at multiple levels of abstraction. High-level policy selects goals, low-level policies achieve them. Helps with long-horizon problems.
Inverse RL
Given demonstrations from an expert, infer the reward function they were optimizing. Useful when rewards are hard to specify but easy to demonstrate.
Offline RL (Batch RL)
Learn from a fixed dataset of experience without further environment interaction. Important for real-world applications where online learning is expensive or dangerous.
Safe RL
Learning while satisfying safety constraints. Critical for robotics and autonomous vehicles where exploration cannot be unconstrained.
Resources
Papers
- Playing Atari with Deep Reinforcement Learning (DQN)
- Proximal Policy Optimization Algorithms
- Soft Actor-Critic
- Human-level control through deep RL (Nature DQN)
Articles & Tutorials
- OpenAI Spinning Up - Excellent introduction to deep RL
- Lilian Weng’s RL Blog Posts
- David Silver’s RL Course (DeepMind)
- Pieter Abbeel’s Deep RL Course (Berkeley)
Code & Libraries
Back to: 01 - Core Fundamentals Index