⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’
Excalidraw Data
Text Elements
SARSA vs Q-Learning: The Cliff Walking Problem
Why SARSA learns a “suboptimal” but SAFER path | α=0.5, γ=1.0, ε=0.1
The Cliff World (4x12 grid, showing key columns)
·   ·   ·   …   ·   ·   ·
·   ·   ·   …   ·   ·   ·
·   ·   ·   …   ·   ·   ·
START   CLIFF(-100)   CLIFF(-100)   …   CLIFF(-100)   CLIFF(-100)   GOAL
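The grid above can be sketched as a tiny environment in Python. This is a minimal re-implementation with my own names (gym's CliffWalking-v0 has the same dynamics); the standard rules are -1 per step, and stepping into the cliff gives -100 and teleports the agent back to START.

```python
# 4x12 cliff-walking gridworld. Rows 0..3 top to bottom, so the bottom row
# holds START, the cliff, and GOAL. (Names here are my own, not a library's.)
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply one move; return (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)   # clamp to the grid
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt in CLIFF:            # fell off: -100 and back to START, episode continues
        return START, -100, False
    if nxt == GOAL:
        return nxt, -1, True
    return nxt, -1, False
```

Bumping into a wall leaves the agent in place (the clamp), which matches the usual formulation of this gridworld.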
Q-Learning Path (Optimal but Risky)
Learns: START ↑ → → → → … → ↓ GOAL
(walks right along the cliff edge)
Reward per episode: ~-13 (optimal)
BUT with ε=0.1: every step beside the cliff risks an exploratory move off the edge!
Actual avg reward: ~-30 to -50
SARSA Path (Suboptimal but Safe)
Learns: START ↑↑↑ → → → … → ↓↓↓ GOAL
(detours through the top row, far from the cliff)
Reward per episode: ~-17 (suboptimal)
With ε=0.1: Almost never falls!
Actual avg reward: ~-17 (consistent!)
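The two per-episode rewards quoted above (~-13 and ~-17) can be checked by rolling out the two deterministic paths by hand. A self-contained sketch (the action letters and the `rollout` helper are my own shorthand):

```python
# Roll out a fixed action sequence on the 4x12 grid: -1 per step,
# -100 and back to START on a cliff cell, stop at GOAL.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def rollout(actions):
    """Total reward of a fixed action string from START."""
    s, total = START, 0
    for a in actions:
        dr, dc = MOVES[a]
        s = (min(max(s[0] + dr, 0), ROWS - 1),
             min(max(s[1] + dc, 0), COLS - 1))
        if s in CLIFF:
            total -= 100
            s = START
        else:
            total -= 1
        if s == GOAL:
            break
    return total

risky = "U" + "R" * 11 + "D"          # hug the cliff edge: 13 steps -> -13
safe  = "U" * 3 + "R" * 11 + "D" * 3  # detour along the top: 17 steps -> -17
```

So the risky path costs -13 and the safe one -17 when executed perfectly; the gap between them is only 4, while a single fall costs 100.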
Why This Happens
Q-Learning: target uses maxₐ′ Q(s′,a′) → assumes the greedy action will be taken next (ignores exploration)
SARSA: target uses Q(s′,a′) where a′ is the action actually taken next → the 10% random moves flow into its value estimates!
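In update-rule form the only difference is the TD target. A minimal sketch, with `Q` as a plain dict-of-dicts and α, γ taken from the title:

```python
ALPHA, GAMMA = 0.5, 1.0  # alpha=0.5, gamma=1.0 from the title

def q_learning_target(Q, r, s_next):
    # Off-policy: bootstrap on the best action in s', as if no exploration follows.
    return r + GAMMA * max(Q[s_next].values())

def sarsa_target(Q, r, s_next, a_next):
    # On-policy: bootstrap on the action a' the eps-greedy policy actually chose.
    return r + GAMMA * Q[s_next][a_next]

def td_update(Q, s, a, target):
    # Shared rule: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
    Q[s][a] += ALPHA * (target - Q[s][a])
```

Beside the cliff, s′ always has one action worth ≈ -100 (stepping off). `max()` never looks at it, but SARSA's Q(s′,a′) samples it whenever exploration picks it, which is exactly why SARSA's values along the edge end up pessimistic and push the policy upward.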
When to Use Which?
Q-Learning: When you’ll deploy with ε=0 (pure exploitation) | SARSA: When the deployed policy keeps exploring, or safety during learning matters
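The deployment distinction comes down to the behaviour policy: at ε=0 an ε-greedy policy is purely greedy, which is exactly the assumption baked into Q-learning's max. A minimal ε-greedy sketch (names and the example Q-values are mine):

```python
import random

def eps_greedy(Q_s, eps):
    """Pick an action from one state's action-value dict Q_s."""
    if random.random() < eps:
        return random.choice(sorted(Q_s))  # explore: uniform over actions
    return max(Q_s, key=Q_s.get)           # exploit: greedy action

# Hypothetical values for a state on the cliff edge:
Q_s = {"up": -15.0, "right": -13.0, "down": -113.0}
greedy_action = eps_greedy(Q_s, 0.0)  # eps=0: always the greedy "right"
```

With eps=0 the agent follows Q-learning's risky optimal path safely; with eps=0.1 it keeps rolling the dice next to the cliff, and SARSA's safer policy wins on realized reward.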