⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’
Excalidraw Data
Text Elements
Q-Learning Example: Robot Navigation
Agent learns optimal path through trial and error | α=0.5, γ=0.9
The Grid World
S0
S1
S2
GOAL
+100
START
S3
WALL
S4
TRAP
-100
Each step: -1 reward
Actions: ↑ ↓ ← →
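The grid above can be encoded as a small transition table. A minimal Python sketch; the exact adjacency (e.g. that ↓ from S2 falls into the TRAP, and that S4 lies off the learned path) is an assumption read off the drawing:

```python
# Hypothetical encoding of the grid world drawn above.
# Top row:    S0   S1   S2   GOAL(+100)
# Bottom row: S3  WALL  S4   TRAP(-100)
# Every non-terminal step costs -1. S4 is omitted for brevity.

STEP = -1
# state -> {action: (next_state, reward, episode_over)}
TRANSITIONS = {
    "S0": {"→": ("S1", STEP, False), "↓": ("S3", STEP, False)},
    "S1": {"←": ("S0", STEP, False), "→": ("S2", STEP, False)},
    "S2": {"←": ("S1", STEP, False), "→": ("GOAL", 100, True),
           "↓": ("TRAP", -100, True)},
    "S3": {"↑": ("S0", STEP, False)},
}

def step(state, action):
    """Apply an action; moves into walls or off the grid leave the agent in place."""
    return TRANSITIONS[state].get(action, (state, STEP, False))

print(step("S3", "↑"))  # ('S0', -1, False)
print(step("S2", "→"))  # ('GOAL', 100, True)
```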
Episode 1: Agent Explores Randomly (ε=1.0)
Step 1: S3 Action: ↑ S0 (r=-1)
Q(S3,↑) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5
Step 2: S0 Action: → S1 (r=-1)
Q(S0,→) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5
Step 3: S1 Action: → S2 (r=-1)
Q(S1,→) = 0 + 0.5[-1 + 0.9×0 - 0] = -0.5
Step 4: S2 Action: → GOAL! (r=+100)
Q(S2,→) = 0 + 0.5[100 + 0.9×0 - 0] = +50 ⭐ (terminal: no future reward)
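Each step above is one application of the Q-learning rule Q(s,a) ← Q(s,a) + α[r + γ·max Q(s′,a′) − Q(s,a)]. A minimal sketch reproducing the Episode 1 numbers:

```python
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount from the title

def td_update(q, reward, max_next_q):
    """One Q-learning update: Q <- Q + alpha * (r + gamma * max_a' Q(s',a') - Q)."""
    return q + ALPHA * (reward + GAMma * max_next_q - q) if False else \
           q + ALPHA * (reward + GAMMA * max_next_q - q)

# Steps 1-3: every Q starts at 0, so each -1 step gives -0.5.
print(td_update(0.0, -1, 0.0))   # -0.5
# Step 4: reaching GOAL (terminal, so max_next_q = 0) gives +50.
print(td_update(0.0, 100, 0.0))  # 50.0
```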
Q-Table Evolution
After Episode 1:
State ↑ ↓ ← →
S0 0 0 0 -0.5
S1 0 0 0 -0.5
S2 0 0 0 +50
S3 -0.5 0 0 0
After Episode 10:
State ↑ ↓ ← →
S0 0 35 0 40.5
S1 0 0 0 +50
S2 0 -50 0 +50
S3 36 0 0 0
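The converged column can be checked by sweeping the Q-learning target to its fixed point (equivalent to Q-value iteration on this deterministic grid). A sketch under the same assumed layout; note that the -1 step cost pulls on-path values slightly below 100·0.9ᵏ (89 rather than 90, 79.1 rather than 81, and so on). One difference from the drawn table: here Q(S2,↓) reaches its true fixed point of -100, whereas a -50 reflects a single α=0.5 update before the greedy policy stops visiting the TRAP.

```python
GAMMA = 0.9
# (state, action) -> (next_state, reward, terminal); layout assumed from the drawing
T = {
    ("S0", "→"): ("S1", -1, False), ("S0", "↓"): ("S3", -1, False),
    ("S1", "←"): ("S0", -1, False), ("S1", "→"): ("S2", -1, False),
    ("S2", "←"): ("S1", -1, False), ("S2", "→"): ("GOAL", 100, True),
    ("S2", "↓"): ("TRAP", -100, True),
    ("S3", "↑"): ("S0", -1, False),
}
Q = {sa: 0.0 for sa in T}

def best(state):
    """Highest Q over the actions available in `state` (0.0 for terminal states)."""
    return max((q for (s, _), q in Q.items() if s == state), default=0.0)

for _ in range(200):  # gamma = 0.9, so 200 sweeps is far past convergence
    for (s, a), (s2, r, done) in T.items():
        Q[(s, a)] = r + GAMMA * (0.0 if done else best(s2))

for sa in [("S3", "↑"), ("S0", "→"), ("S1", "→"), ("S2", "→")]:
    print(sa, round(Q[sa], 1))  # 70.2, 79.1, 89.0, 100.0 along the optimal path
```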
Converged (Ep 50+):
State ↑ ↓ ← →
S0 0 62.2 0 79.1
S1 0 0 0 89
S2 0 -50 0 100
S3 70.2 0 0 0
Learned Optimal Policy (greedy: pick max Q)
S3 ↑ (Q=70.2) : S0 → (Q=79.1) : S1 → (Q=89) : S2 → (Q=100) : GOAL!
Total reward: -1 + -1 + -1 + 100 = 97 (Optimal path found!)
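Reading the greedy policy out of a converged Q-table is a per-state argmax. A sketch using assumed converged values (recomputed with the -1 per-step reward included, so on-path entries sit slightly below 100·0.9ᵏ):

```python
# Assumed converged Q-table: rows are states, keys are actions.
Q = {
    "S0": {"↑": 0.0, "↓": 62.2, "←": 0.0, "→": 79.1},
    "S1": {"↑": 0.0, "↓": 0.0, "←": 0.0, "→": 89.0},
    "S2": {"↑": 0.0, "↓": -50.0, "←": 0.0, "→": 100.0},
    "S3": {"↑": 70.2, "↓": 0.0, "←": 0.0, "→": 0.0},
}

# Greedy policy: in each state, pick the action with the largest Q-value.
policy = {s: max(acts, key=acts.get) for s, acts in Q.items()}
print(policy)  # {'S0': '→', 'S1': '→', 'S2': '→', 'S3': '↑'}

# Following it from START (S3): three -1 steps, then +100 at the GOAL.
rewards = [-1, -1, -1, 100]
print(sum(rewards))  # 97
```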