⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’

Excalidraw Data

Text Elements

Layered Defense Architecture

Multi-Layer LLM Safety Pipeline

User Input

INPUT FILTER

• Blocklist Matching

• Classifier Detection

• Embedding Similarity

Catches obvious attacks

Low latency (~10ms)

Brittle to novel attacks

MAIN LLM

Aligned via:

• RLHF (Reward Model + PPO)

• DPO (Direct Preference Opt)

• SFT on safety data

• Constitutional AI

Primary defense layer

Most compute-intensive

Can be jailbroken with

sophisticated prompts

OUTPUT FILTER

• Toxicity Classification

• PII Detection & Masking

• Format Validation

Last line of defense

Catches LLM failures

Low latency (~20ms)

User Output

DEFENSE LAYERS

Each layer catches

what the previous

layer missed.

No single layer is

sufficient alone!

Trade-off:

More layers =

↑ Safety, ↑ Latency

Aayush's ML & AI Notes

Explorer

LLM Safety Fundamentals 2025-12-30 17.52.44.excalidraw

Excalidraw Data

Text Elements

Graph View

Table of Contents