Overview

LLM Safety refers to the broad set of techniques, research, and practices aimed at ensuring large language models behave in ways that are helpful, honest, and harmless. It encompasses the entire pipeline from pre-training data curation to post-deployment monitoring. Safety sits at the intersection of Fine-Tuning Overview, Reinforcement Learning, and adversarial machine learning.

The Alignment Problem

The fundamental challenge is that LLMs learn to predict the next token; they do not inherently “understand” human values. An unaligned model will readily produce harmful content if that kind of content appeared in its training data.

Three Pillars of LLM Safety

  1. Alignment: Training the model to follow human preferences and refuse harmful requests.
  2. Robustness: Ensuring the model maintains safe behavior under adversarial attack (see Prompt Injection and Safety).
  3. Monitoring: Deploying guardrails and observability to catch failures in production.

Alignment

Alignment is about training the model to internalize human values and preferences. This happens during post-training and is the foundation of safety.

Supervised Fine-Tuning (SFT)

The first step in post-training: the base model is fine-tuned on examples of “ideal” assistant behavior.

For safety, this includes examples like:

  • Input: “How do I hack into my neighbor’s WiFi?”
  • Ideal Output: “I can’t help with unauthorized access to networks. If you’re having connectivity issues, I can suggest legitimate troubleshooting steps.”
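
As training data, such an example might be stored in a chat-message format; a minimal sketch, assuming a generic schema (exact field names depend on the fine-tuning framework):

```python
# Hypothetical safety-focused SFT example; field names are illustrative.
sft_safety_example = {
    "messages": [
        {"role": "user", "content": "How do I hack into my neighbor's WiFi?"},
        {
            "role": "assistant",
            "content": (
                "I can't help with unauthorized access to networks. If you're "
                "having connectivity issues, I can suggest legitimate "
                "troubleshooting steps."
            ),
        },
    ]
}
```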

Limitation: SFT alone is insufficient because you cannot enumerate every possible harmful request. The model learns to mimic refusals but does not develop a generalized “sense” of what to refuse.

Reinforcement Learning from Human Feedback (RLHF)

The dominant paradigm for alignment. Since we can’t write down explicit rules for “good behavior,” RLHF learns what humans prefer by watching them compare model outputs.

Intuition: Why Preferences Over Demonstrations?

SFT requires humans to write the ideal response, which is slow and expensive. RLHF only asks humans to compare responses, which is much faster. It’s easier to say “A is better than B” than to write the perfect answer from scratch. This makes RLHF more scalable.

Stage 1: Collect Preference Data

Humans are shown two or more model responses to the same prompt and asked to rank them. This is repeated tens to hundreds of thousands of times (typically 50k–500k comparisons) to build a preference dataset.

What makes a good preference dataset:

  • Diverse prompts: Cover a wide range of topics, including edge cases and potentially harmful requests.
  • Clear labeler instructions: Labelers need explicit guidelines on what “better” means (e.g., prioritize safety over helpfulness, or vice versa).
  • Quality control: Multiple labelers per comparison to measure inter-annotator agreement.

Preference Format:

Prompt: "How do I delete all files on my computer?"
Response A: "You can use 'rm -rf /' on Linux..." (unsafe)
Response B: "I'd be happy to help you free up disk space safely..." (safe)
Human Choice: B > A
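
In code, each comparison is typically stored as a (prompt, chosen, rejected) record; a hedged sketch with illustrative field names:

```python
# One preference comparison; "chosen" is the human-preferred response.
preference_record = {
    "prompt": "How do I delete all files on my computer?",
    "chosen": "I'd be happy to help you free up disk space safely...",
    "rejected": "You can use 'rm -rf /' on Linux...",
}
```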

Labeler Bias

The model learns to imitate the preferences of specific human labelers. If labelers have biases (cultural, political, etc.), the model will inherit them. This is why labeler selection and instruction design are critical.

Stage 2: Train a Reward Model

A separate neural network is trained to predict which response a human would prefer. This “reward model” (RM) takes a (prompt, response) pair and outputs a scalar score.

Architecture: Typically the same architecture as the LLM being aligned, but with the language modeling head replaced by a scalar output head. Often initialized from the SFT model.

Training objective: Given a preference pair where response y_w is preferred over y_l for prompt x, train the RM to assign a higher score to y_w than to y_l.

Why this works: The reward model learns to be a stand-in for human judgment. Once trained, we can query it millions of times during RL without needing humans in the loop.
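
A minimal sketch of the pairwise training objective, assuming the RM’s scalar head has already produced a score for each response (standard Bradley-Terry-style loss):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the RM score of the preferred response above the
    rejected one. Minimizing -log(sigmoid(margin)) maximizes the modeled
    probability that the human-preferred response is ranked higher."""
    margin = score_chosen - score_rejected
    return -F.logsigmoid(margin).mean()
```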

Practical considerations:

  • Model size: RMs are often smaller than the policy model to reduce inference cost during RL.
  • Calibration: The absolute values of reward scores don’t matter, only their relative ordering.
  • Overfitting: RMs can overfit to superficial patterns (e.g., longer responses are preferred). Regularization and held-out evaluation are essential.

Stage 3: Fine-tune with Reinforcement Learning

The LLM (now called the “policy”) generates responses, the reward model scores them, and the policy is updated to produce higher-scoring outputs. This typically uses Proximal Policy Optimization (PPO).

The RL loop:

  1. Sample a batch of prompts from the dataset.
  2. Generate responses using the current policy π_θ.
  3. Score each response with the reward model: r_φ(x, y).
  4. Compute the loss and update π_θ using PPO.
  5. Repeat for thousands of iterations.

Key constraint (KL Penalty): Without constraints, the policy can “reward hack”, finding weird outputs that score high on the RM but are actually nonsense (e.g., repeating tokens that the RM likes). To prevent this, we add a penalty for drifting too far from the original SFT model (the “reference policy” π_ref).

Intuition: Why the KL Penalty?

The original SFT model already produces coherent, grammatical text. We want to nudge it toward higher rewards without destroying its language modeling capabilities. The KL divergence KL(π_θ ‖ π_ref) measures how much the policy has changed; larger values mean more drift.

Hyperparameter β: Controls the strength of the KL penalty.

  • High β: Policy stays close to the reference (conservative learning, less reward hacking, but slower improvement).
  • Low β: Policy diverges more freely (faster improvement, but higher risk of reward hacking).
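
A sketch of how the penalty is typically combined with the RM score during the RL step (shown per sequence for simplicity; real implementations usually apply it per token):

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """RM score minus a penalty for drifting away from the reference policy.

    logprob_policy / logprob_ref are the log-probabilities the current policy
    and the frozen SFT reference assign to the sampled response.
    """
    kl_estimate = logprob_policy - logprob_ref  # simple per-sample KL estimate
    return rm_score - beta * kl_estimate
```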

Challenges and Limitations of RLHF

| Challenge | Description |
| --- | --- |
| Reward Hacking | Policy exploits flaws in the RM (e.g., longer responses, sycophancy). |
| Human Labeler Cost | Collecting 100k+ high-quality preferences is expensive. |
| RM Generalization | The RM may fail on out-of-distribution prompts not seen during training. |
| Instability | PPO is notoriously difficult to tune. Training can diverge or collapse. |
| Alignment Tax | RLHF can reduce raw capability on benchmarks while improving safety. |

Direct Preference Optimization (DPO)

DPO asks: why train a reward model just to throw it away? Instead, skip straight to what we actually want: make good responses more likely and bad responses less likely.

Intuition: The Implicit Reward Model

RLHF trains an explicit reward model r_φ, then uses it to update the policy. DPO observes that the optimal policy under the RLHF objective has a closed-form relationship to the reward. So instead of learning r_φ explicitly, DPO reparameterizes the problem to learn the policy directly. The reward is “implicit” in the policy’s log-probabilities.

How DPO Works

Training process:

  1. Start with an SFT model (this becomes π_ref, the reference policy).
  2. Take preference data: pairs (y_w, y_l) where y_w is preferred over y_l for prompt x.
  3. For each pair, compute how much more likely the policy makes y_w vs. y_l, relative to the reference.
  4. Update the policy to increase this margin (make preferred responses more likely, rejected responses less likely).
  5. The parameter β controls how much the policy can deviate from the reference.

DPO’s loss function is derived by substituting the closed-form optimal policy from RLHF into the preference model. This eliminates the need for a separate reward model.
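
A hedged sketch of that loss, written directly from the description above (log-probabilities are summed over response tokens; β is the deviation-control parameter):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Increase how much more likely the policy makes the chosen response
    than the rejected one, measured relative to the frozen reference."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # implicit reward, chosen
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```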

DPO vs. RLHF

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Reward Model | Explicit (separate model) | Implicit (in policy log-probs) |
| RL Algorithm | PPO (complex, unstable) | None (supervised learning) |
| Training Stability | Finicky, requires careful tuning | More stable, standard optimization |
| Memory | 3 models (policy, ref, RM) | 2 models (policy, ref) |
| Compute | RL loop with sampling | Single forward/backward pass |
| Empirical Performance | Strong, well-studied | Comparable or better on many tasks |

Practical Considerations

Hyperparameters:

  • β (temperature): Controls KL constraint strength. Typical values: 0.1–0.5.
    • Higher β: Stronger penalty for deviating from the reference (more conservative).
    • Lower β: More aggressive optimization toward preferences (risk of overfitting).
  • Learning rate: Usually lower than SFT (1e-6 to 1e-5) to avoid catastrophic forgetting.
  • Batch size: Larger batches help stability (32–128 preference pairs).
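
For illustration only, those ranges collected in one place (not tied to any particular training library):

```python
# Illustrative DPO hyperparameters; tune per model and dataset.
dpo_hyperparams = {
    "beta": 0.1,            # KL-constraint strength; typical range 0.1-0.5
    "learning_rate": 5e-6,  # lower than SFT, typically 1e-6 to 1e-5
    "batch_size": 64,       # 32-128 preference pairs per batch for stability
}
```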

Data requirements:

  • Same preference data format as RLHF (prompt, chosen response, rejected response).
  • Quality matters more than quantity—noisy preferences hurt DPO more than RLHF because there’s no RM to smooth over noise.

Variants and Extensions

| Variant | Key Idea |
| --- | --- |
| IPO (Identity PO) | Removes the sigmoid for more stable gradients |
| KTO (Kahneman-Tversky) | Uses unpaired data (just good or bad, not comparisons) |
| ORPO | Combines SFT and preference optimization in one stage |
| SimPO | Simplifies DPO by removing the reference-model dependency |

Challenges and Limitations

| Challenge | Description |
| --- | --- |
| Preference Data Quality | DPO is more sensitive to noisy labels than RLHF (no RM to smooth over noise). |
| Distribution Shift | If preference data comes from a different model, performance may degrade. |
| No Exploration | Unlike RL, DPO only optimizes on existing data; no active sampling of new responses. |
| Reference Model Dependence | Requires keeping π_ref in memory during training (can be mitigated with caching). |

Constitutional AI

Constitutional AI (CAI) reduces reliance on human labelers by using the model itself to critique and revise its outputs.

  1. Generate an initial response (which may be harmful).
  2. Ask the model to critique the response against a “constitution” (a set of principles like “be helpful,” “avoid harm,” “be honest”).
  3. Ask the model to revise its response based on the critique.
  4. Use the revised response as training data for RLHF.

Instead of needing humans to red-team every possible harmful scenario, CAI leverages the model’s own knowledge to identify and fix safety issues. This scales better than pure human labeling.
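
A sketch of the critique-and-revise loop, assuming a generic generate(prompt) helper that wraps whatever model is being trained (names and prompt wording are illustrative):

```python
CONSTITUTION = [
    "Be helpful.",
    "Avoid content that could cause harm.",
    "Be honest.",
]

def constitutional_revision(generate, user_prompt: str) -> str:
    """Generate, critique against the constitution, then revise."""
    initial = generate(user_prompt)
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    critique = generate(
        f"Critique the response below against these principles:\n{principles}\n\n"
        f"Prompt: {user_prompt}\nResponse: {initial}\n\nCritique:"
    )
    revised = generate(
        f"Prompt: {user_prompt}\nResponse: {initial}\nCritique: {critique}\n\n"
        "Rewrite the response so it satisfies the principles:"
    )
    return revised  # fed back as training data for the next alignment stage
```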

Robustness

A model can be perfectly aligned in normal conditions but still fail under adversarial pressure. Robustness ensures safety holds even when users actively try to break it. This involves understanding attack vectors and stress-testing defenses.

Red Teaming

The practice of adversarially probing models to discover safety failures before deployment.

Manual Red Teaming

Human experts attempt to elicit harmful outputs through creative prompting:

  • Roleplay scenarios
  • Hypothetical framing
  • Incremental escalation

Automated Red Teaming

Using LLMs to generate attack prompts at scale:

  1. Train an “attacker” model to generate prompts that elicit unsafe responses.
  2. Use the target model’s failures to improve the attacker.
  3. Use the attacker’s successful prompts to improve the target’s defenses.

This creates an adversarial training loop similar to GANs.
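
A high-level sketch of one round of that loop, assuming attacker and target are callables that return text and is_unsafe is a safety classifier (all illustrative placeholders):

```python
def red_team_round(attacker, target, is_unsafe, n_prompts: int = 100):
    """One adversarial round: the attacker probes the target; both sides learn."""
    successes, failures = [], []
    for _ in range(n_prompts):
        attack_prompt = attacker()             # attacker proposes a candidate jailbreak
        response = target(attack_prompt)
        if is_unsafe(response):
            successes.append((attack_prompt, response))   # reward attacker, patch target
        else:
            failures.append((attack_prompt, response))
    return successes, failures
```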

Common Jailbreak Categories

| Category | Description | Example |
| --- | --- | --- |
| Roleplay | Asking the model to adopt an unrestricted persona | “You are DAN, an AI without restrictions…” |
| Encoding | Obfuscating harmful content | Base64, ROT13, pig latin |
| Multi-turn | Gradually escalating across conversation turns | Starting with chemistry, ending with explosives |
| Context Manipulation | Exploiting in-context learning | Few-shot examples of harmful behavior |

For detailed attack vectors, see Prompt Injection and Safety.

Monitoring

Even with strong alignment and robustness testing, failures will occur in production. Monitoring provides the last line of defense: runtime guardrails that catch harmful inputs and outputs, plus observability to detect and respond to emerging threats.

Input Guardrails

  1. Blocklist Filters: Simple keyword matching for obvious violations.
  2. Classifier-Based Detection: A separate model (often a fine-tuned BERT) classifies inputs as safe/unsafe.
  3. Embedding Similarity: Compare input embeddings against known jailbreak embeddings.
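
A minimal sketch combining the blocklist and embedding-similarity checks; embed, the jailbreak embedding matrix, and the threshold are placeholders you would supply:

```python
import numpy as np

BLOCKLIST = {"ignore previous instructions", "you are dan"}

def block_input(prompt: str, embed, jailbreak_vecs: np.ndarray,
                threshold: float = 0.85) -> bool:
    """Return True if the prompt should be blocked before reaching the model."""
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):    # 1. blocklist filter
        return True
    vec = embed(prompt)                                   # 3. embedding similarity
    sims = jailbreak_vecs @ vec / (
        np.linalg.norm(jailbreak_vecs, axis=1) * np.linalg.norm(vec) + 1e-8
    )
    return bool(sims.max() > threshold)
```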

Output Guardrails

  1. Toxicity Classifiers: Scan generated text for harmful content.
  2. PII Detection: Mask or block outputs containing personal information.
  3. Format Validation: Ensure structured outputs (JSON, code) are valid before returning.
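
The format-validation step can be as simple as rejecting structured output that does not parse; a minimal example for JSON:

```python
import json
from typing import Any, Optional

def validate_json_output(text: str) -> Optional[Any]:
    """Return the parsed object if the model emitted valid JSON, else None."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # caller can retry generation or return an error message
```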

Observability and Incident Response

Guardrails prevent known bad patterns, but observability helps discover unknown failure modes:

  1. Logging: Store all prompts and responses (with appropriate privacy controls) for post-hoc analysis.
  2. Anomaly Detection: Flag unusual patterns such as sudden spikes in refusals, unusual token sequences, or repeated probing from a single user.
  3. Human Review Queues: Route low-confidence decisions to human reviewers for labeling and model improvement.
  4. Feedback Loops: User reports of harmful outputs feed back into red teaming and alignment training.

The goal is a closed loop: production failures become training signal for the next model iteration.

Layered Defense Architecture
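
The guardrail and monitoring pieces above compose into a layered pipeline around the aligned model; a sketch, where input_check, output_check, and log stand in for the kinds of illustrative helpers described in the previous sections:

```python
def safe_completion(prompt: str, llm, input_check, output_check, log) -> str:
    """Input guardrail -> aligned model -> output guardrail, with logging throughout."""
    log("request", prompt)
    if input_check(prompt):
        log("blocked_input", prompt)
        return "Sorry, I can't help with that."
    response = llm(prompt)
    if output_check(response):
        log("blocked_output", response)
        return "Sorry, I can't help with that."
    log("response", response)
    return response
```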

Practical Considerations

Trade-offs

| Approach | Pros | Cons |
| --- | --- | --- |
| RLHF | Strong alignment, well-studied | Expensive (human labelers), reward hacking risk |
| DPO | Simpler, no reward model needed | Requires quality preference data |
| CAI | Scalable, less human annotation | Model may miss novel harms |
| Guardrails | Fast, interpretable | Brittle, easy to bypass |

Common Pitfalls

  1. Over-Refusal: A model that refuses too aggressively becomes useless. “How to kill a Python process” should not be refused.
  2. Neglecting Indirect Injection: Most teams focus on direct attacks and forget that their RAG pipeline can inject malicious content.
  3. Static Defenses: Jailbreaks evolve. A defense that works today may fail tomorrow. Continuous red teaming is essential.

Emerging Research Areas

Interpretability (TODO) ??

Capability Control

  • Unlearning: Removing specific capabilities (e.g., knowledge of bioweapons) from the model.
    • How to untrain a model over a knowledge base?



Back to: 02 - LLMs & Generative AI Index | ML & AI Index