⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’

Excalidraw Data

Text Elements

SARSA vs Q-Learning: The Cliff Walking Problem

Why SARSA learns a “suboptimal” but SAFER path | α=0.5, γ=1.0, ε=0.1

The Cliff World (4x12 grid, showing key columns)

START

CLIFF

-100

CLIFF

CLIFF

CLIFF

CLIFF

CLIFF

CLIFF

CLIFF

GOAL

Q-Learning Path (Optimal but Risky)

Learns: START ↑→ → → → → … →↓ GOAL

(walks right along the cliff edge)

Reward per episode: ~-13 (optimal)

BUT with ε=0.1: 10% random actions → exploratory moves keep knocking it off the cliff!

Actual avg reward: ~-30 to -50
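The fall risk above can be roughly quantified. A sketch, assuming the standard dynamics: ~11 steps run along the cliff edge, each step has a 10% chance of a random action, and 1 of the 4 random actions ("down") steps off the edge:

```python
# Rough per-episode fall-risk estimate for the cliff-edge path.
# Assumptions (not from the diagram): ~11 cliff-edge steps, uniform
# random action choice under epsilon-greedy exploration.
EPS = 0.1
p_fall_per_step = EPS * (1 / 4)                 # 0.025
steps_on_edge = 11
p_fall_episode = 1 - (1 - p_fall_per_step) ** steps_on_edge
print(f"fall prob per step:    {p_fall_per_step:.3f}")
print(f"fall prob per episode: {p_fall_episode:.2f}")   # ~0.24
```

So roughly one episode in four eats the -100 penalty, which is why the online average sits far below the -13 of the greedy path.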

SARSA Path (Suboptimal but Safe)

Learns: START ↑↑ → → → … → ↓↓ GOAL

(walks one row UP, away from cliff)

Reward per episode: ~-17 (suboptimal)

With ε=0.1: Almost never falls!

Actual avg reward: ~-17 (consistent!)

Why This Happens

Q-Learning: Uses max_a' Q(s',a') → Assumes it will pick the best action next time (ignores the 10% exploration)

SARSA: Uses Q(s’,a’) where a’ is actual next action → Accounts for 10% random moves!
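The two update rules differ by one line. A minimal sketch using the diagram's α=0.5, γ=1.0, ε=0.1 (the Q-table layout and helper names are illustrative, not from the diagram):

```python
import random

ALPHA, GAMMA, EPS = 0.5, 1.0, 0.1
ACTIONS = ["up", "down", "left", "right"]

def eps_greedy(Q, s):
    # With probability EPS take a random action, otherwise the greedy one.
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap from the BEST next action, as if no exploration.
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action ACTUALLY taken next (maybe random).
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

Because SARSA's target includes the occasional random a' near the cliff, cliff-edge states inherit part of the -100 penalty and the learned values steer one row up.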

When to Use Which?

Q-Learning: When you’ll deploy with ε=0 (pure exploitation) | SARSA: When exploration continues at deployment or safety during learning matters
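To experiment with the trade-off yourself, the 4x12 grid from the diagram can be sketched as a tiny environment (a minimal sketch assuming the standard cliff-walking dynamics: -1 per step, stepping onto the cliff gives -100 and teleports back to start):

```python
class CliffWalk:
    """4x12 cliff-walking grid: start (3,0), goal (3,11),
    cliff cells (3,1)..(3,10). Assumed standard dynamics:
    every step costs -1; stepping onto the cliff costs -100
    and teleports the agent back to start."""
    ROWS, COLS = 4, 12
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.pos = (3, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.MOVES[action]
        # Clamp to the grid so walking into a wall stays in place.
        r = min(max(self.pos[0] + dr, 0), self.ROWS - 1)
        c = min(max(self.pos[1] + dc, 0), self.COLS - 1)
        if r == 3 and 1 <= c <= 10:          # fell off the cliff
            self.pos = (3, 0)
            return self.pos, -100, False
        self.pos = (r, c)
        done = self.pos == (3, 11)           # reached the goal
        return self.pos, -1, done
```

Training both agents on this environment and comparing online reward per episode reproduces the gap shown above: Q-Learning's greedy path is shorter, but SARSA's online average is better.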