⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’

Excalidraw Data

Text Elements

Q-Learning Example: Robot Navigation

Agent learns optimal path through trial and error | α=0.5, γ=0.9
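Every Q-value in the walkthrough comes from the standard tabular Q-learning update, stated here once for reference (plain notation, matching the step-by-step numbers):

Q(s,a) ← Q(s,a) + α[ r + γ·max_a' Q(s',a') − Q(s,a) ]

with α=0.5, γ=0.9, and all Q-values initialized to 0. For a terminal transition (GOAL or TRAP) the max_a' Q(s',a') term is 0.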

The Grid World

S0

S1

S2

GOAL

+100

START

S3

WALL

S4

TRAP

-100

Each step: -1 reward

Actions: ↑ ↓ ← →

Episode 1: Agent Explores (random actions, ε=1.0)

Step 1: S3 Action: ↑ S0 (r=-1)

Q(S3,↑) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5

Step 2: S0 Action: → S1 (r=-1)

Q(S0,→) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5

Step 3: S1 Action: → S2 (r=-1)

Q(S1,→) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5

Step 4: S2 Action: → GOAL! (r=+100)

Q(S2,→) = 0 + 0.5[100 + 0 - 0] = +50 ⭐
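The four updates above can be reproduced with a minimal sketch of the tabular update rule. The state/action names and the `update` helper are illustrative, not from any library; missing Q-table entries count as 0, and the GOAL transition is terminal (no future term).

```python
# Tabular Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma = 0.5, 0.9
ACTIONS = ("up", "down", "left", "right")
Q = {}  # (state, action) -> value; missing entries are treated as 0

def update(s, a, r, s_next, terminal=False):
    # Terminal transitions (GOAL, TRAP) have no future value term.
    best_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
    return Q[(s, a)]

print(update("S3", "up", -1, "S0"))                       # -0.5
print(update("S0", "right", -1, "S1"))                    # -0.5
print(update("S1", "right", -1, "S2"))                    # -0.5
print(update("S2", "right", 100, "GOAL", terminal=True))  # 50.0
```

Each of the first three updates nudges an all-zero entry toward the −1 step reward; the goal transition jumps to +50 because α=0.5 moves the estimate halfway to the +100 target.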

Q-Table Evolution

After Episode 1:

State ↑ ↓ ← →

S0 0 0 0 -0.5

S1 0 0 0 -0.5

S2 0 0 0 +50

S3 -0.5 0 0 0

After Episode 10:

State ↑ ↓ ← →

S0 0 35 0 40.5

S1 0 0 0 +50

S2 0 -50 0 +50

S3 36 0 0 0

Converged (Ep 50+):

State ↑ ↓ ← →

S0 0 62.2 0 79.1

S1 0 0 0 89

S2 0 -100 0 100

S3 70.2 0 0 0

(Values include the -1 step cost: Q(S2,→)=100, Q(S1,→)=-1+0.9×100=89, Q(S0,→)=-1+0.9×89=79.1, Q(S3,↑)=-1+0.9×79.1≈70.2)

Learned Optimal Policy (greedy: pick max Q)

S3 ↑ (Q=70.2) : S0 → (Q=79.1) : S1 → (Q=89) : S2 → (Q=100) : GOAL!

Total reward: -1 + -1 + -1 + 100 = 97 (Optimal path found!)
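The full loop, from random exploration to the greedy optimal path, can be sketched end to end. The transition table `T` is an assumption about the drawing's grid (WALL blocks movement, → from S2 reaches GOAL, ↓ from S2 falls into TRAP); the fixed ε=0.2 schedule and the `step` helper are illustrative choices, not from the original.

```python
import random

random.seed(0)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2
ACTIONS = ("up", "down", "left", "right")

# Assumed deterministic transitions: (state, action) -> (next_state, reward, terminal).
# Unlisted moves hit a wall or the grid edge: stay put, reward -1.
T = {
    ("S3", "up"):    ("S0", -1, False),
    ("S0", "right"): ("S1", -1, False),
    ("S0", "down"):  ("S3", -1, False),
    ("S1", "right"): ("S2", -1, False),
    ("S1", "left"):  ("S0", -1, False),
    ("S2", "right"): ("GOAL", 100, True),
    ("S2", "down"):  ("TRAP", -100, True),
    ("S2", "left"):  ("S1", -1, False),
}

def step(s, a):
    return T.get((s, a), (s, -1, False))

Q = {(s, a): 0.0 for s in ("S0", "S1", "S2", "S3") for a in ACTIONS}

for _ in range(500):          # episodes
    s = "S3"                  # START
    for _ in range(50):       # step cap per episode
        if random.random() < EPS:
            a = random.choice(ACTIONS)          # explore
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])  # exploit
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        if done:
            break
        s = s2

# Greedy rollout from START: pick argmax_a Q(s,a) at each state.
s, path, total = "S3", ["S3"], 0
for _ in range(10):  # safety cap on rollout length
    a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s, r, _ = step(s, a)
    path.append(s)
    total += r
    if s in ("GOAL", "TRAP"):
        break
print(path, total)
```

After training, the greedy rollout follows S3 → S0 → S1 → S2 → GOAL for a total reward of 97, matching the path and total above.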