Overview

Evaluating Large Language Models is fundamentally different from evaluating traditional ML models. Unlike classification (where ground truth is objective) or regression (where error is measurable), LLM outputs are often subjective, creative, and multi-dimensional. There is no single “correct” answer to “Write me a poem about autumn.”

LLM evaluation spans multiple dimensions:

  • Accuracy: Is the information factually correct?
  • Relevance: Does the output address the query?
  • Fluency: Is the text grammatically correct and natural?
  • Coherence: Does the text flow logically?
  • Safety: Is the output harmful, biased, or toxic?
  • Instruction Following: Did the model do what was asked?

The challenge is that many of these dimensions are subjective and task-dependent. A metric that works for summarization may be useless for code generation.

Key Ideas

The Three Pillars of LLM Evaluation

graph LR
    A[LLM Evaluation] --> B[Automatic Metrics]
    A --> C[Human Evaluation]
    A --> D[LLM-as-Judge]

    B --> B1[N-gram: BLEU, ROUGE]
    B --> B2[Semantic: BERTScore]
    B --> B3[Perplexity]

    C --> C1[Likert Scales]
    C --> C2[Pairwise Comparison]
    C --> C3[Task Completion]

    D --> D1[G-Eval]
    D --> D2[RAGAS]
    D --> D3[MT-Bench]

Automatic Reference-Based Metrics

These metrics compare the model’s output against a known “reference” or “ground truth” answer. They work well when there’s a canonical correct answer (translation, summarization).

BLEU (Bilingual Evaluation Understudy)

Originally designed for machine translation. Measures n-gram overlap between generated text and reference.

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

  • $N$ = maximum n-gram order (typically 4)
  • $p_n$ = precision of n-grams (what fraction of generated n-grams appear in reference)
  • $w_n$ = weights (typically $w_n = 1/N$ for uniform weighting)
  • $BP$ = Brevity Penalty (penalizes outputs shorter than reference):

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

Where $c$ = candidate length, $r$ = reference length.

Intuition: “How many chunks of my output also appear in the correct answer?”

Limitations:

  • Ignores synonyms (“happy” vs “joyful” get no credit)
  • Ignores word order importance
  • Poor for creative or open-ended tasks
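
Despite these limitations, BLEU is cheap to compute. A minimal sentence-level sketch in pure Python (assuming whitespace tokenization, uniform weights, and no smoothing — production implementations such as sacreBLEU handle much more):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and clipped n-gram precision."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip: each candidate n-gram is credited at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0:
            return 0.0  # without smoothing, any zero precision drives BLEU to 0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate scores 1.0; a candidate sharing no bigram with the reference scores 0.0 — which is exactly the "no credit for synonyms" limitation above.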

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

An improvement over BLEU that addresses synonym matching and word order.

Alignment Process: METEOR aligns words between candidate and reference in priority order:

  1. Exact matches (highest priority)
  2. Stem matches
  3. Synonym matches (via WordNet)

$$\text{METEOR} = F_{mean} \cdot (1 - \text{Penalty})$$

Where:

  • $F_{mean} = \frac{10\,P\,R}{R + 9P}$ = harmonic mean of precision and recall (weighted toward recall)
  • $\text{Penalty}$ = fragmentation penalty based on number of “chunks” needed to align texts

Fragmentation Penalty:

$$\text{Penalty} = 0.5 \cdot \left(\frac{\#\text{chunks}}{\#\text{matched unigrams}}\right)^3$$

Fewer contiguous chunks = better word order = lower penalty.
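
A simplified sketch of the scoring stage, using exact matches only (no stemming or WordNet synonyms, so this understates real METEOR on paraphrases):

```python
def meteor_exact(candidate, reference):
    """Simplified METEOR: exact unigram matches only (no stems, no synonyms)."""
    cand, ref = candidate.split(), reference.split()
    # Greedy one-to-one alignment on exact matches, left to right
    used, alignment = set(), []
    for i, word in enumerate(cand):
        for j, ref_word in enumerate(ref):
            if ref_word == word and j not in used:
                alignment.append((i, j))
                used.add(j)
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision, recall = m / len(cand), m / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a maximal run of matches contiguous in both strings
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Scrambling word order keeps precision and recall intact but multiplies the chunk count, so the fragmentation penalty lowers the score.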

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Designed for summarization. Unlike BLEU (precision-focused), ROUGE is recall-focused: “How much of the reference appears in my output?”

| Variant | Description | Use Case |
|---------|-------------|----------|
| ROUGE-N | N-gram recall | General overlap |
| ROUGE-L | Longest Common Subsequence | Sentence structure |
| ROUGE-S | Skip-bigram (allows gaps) | Flexible matching |

$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

Where:

  • $R_{lcs} = LCS(X, Y) / m$ (recall, with $m$ = reference length)
  • $P_{lcs} = LCS(X, Y) / n$ (precision, with $n$ = candidate length)
  • $LCS(X, Y)$ = Longest Common Subsequence length
  • $\beta$ = weight for recall vs precision ($\beta > 1$ favors recall; typically 1.2 for summarization)
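
A minimal ROUGE-L sketch in pure Python (assuming whitespace tokenization):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Because LCS allows gaps but preserves order, `rouge_l` rewards candidates that keep the reference's sentence structure even if they drop words.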

BERTScore

A semantic similarity metric that uses contextual embeddings (BERT) instead of exact string matching.

How it works:

  1. Encode both candidate and reference with BERT
  2. Compute cosine similarity between each token pair
  3. Use greedy matching to find optimal alignment
  4. Return precision, recall, and F1

Advantages:

  • Captures semantic similarity (“dog” ≈ “canine”)
  • Considers context (“bank” in finance vs river)
  • Better correlation with human judgment

Limitations:

  • Computationally expensive
  • Still reference-based (needs ground truth)
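
The greedy-matching step can be sketched with toy vectors. The `emb` dictionary below is hypothetical; real BERTScore uses contextual BERT embeddings (and optionally IDF weighting):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_like(cand_vecs, ref_vecs):
    """Greedy-matching precision/recall/F1 over token embeddings."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(c, r) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy "embeddings" for illustration only
emb = {"dog": [1.0, 0.1], "canine": [0.9, 0.2], "rock": [0.0, 1.0]}
```

With these toy vectors, "dog" vs "canine" scores near 1 while "dog" vs "rock" scores near 0 — credit that exact string matching would never give.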

COMET (Crosslingual Optimized Metric for Evaluation of Translation)

A neural machine translation metric that uses multilingual embeddings (XLM-RoBERTa) and is trained on human quality judgments. Considered the state-of-the-art for translation evaluation.

How it works:

  1. Encode source sentence, hypothesis (model output), and reference with multilingual encoder
  2. Pool representations and concatenate features
  3. Pass through regression head trained on human DA (Direct Assessment) scores
  4. Output: quality score (typically 0-1, higher is better)

Unlike BLEU/BERTScore which compare strings, COMET is trained to predict human judgments directly.

Advantages:

  • Highest correlation with human judgment for translation
  • Handles multilingual evaluation well
  • Reference-free variants available (COMET-QE)
  • Captures fluency, adequacy, and errors simultaneously

Limitations:

  • Computationally expensive (requires GPU for speed)
  • Model-dependent (different COMET versions give different scores)
  • Less interpretable than n-gram metrics

Comparison of Reference-Based Metrics

| Metric | Type | Synonym Support | Order Sensitivity | Compute Cost | Best For |
|--------|------|-----------------|-------------------|--------------|----------|
| BLEU | N-gram Precision | No | Partial | Low | Translation |
| ROUGE | N-gram Recall | No | Partial (ROUGE-L) | Low | Summarization |
| METEOR | Hybrid | Yes (WordNet) | Yes | Medium | Translation |
| BERTScore | Embedding | Yes (Contextual) | No | High | General semantic |
| COMET | Neural (trained) | Yes (Learned) | Yes | High | Translation (SOTA) |

Reference-Free Metrics

These metrics evaluate outputs without needing ground truth. Essential for creative or open-ended tasks.

Perplexity

Measures how “surprised” a language model is by the text. Lower perplexity = more fluent/natural text.

$$\text{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)$$

Where $P(x_i \mid x_{<i})$ is the probability of token $x_i$ given all previous tokens.

Intuition:

  • If a model assigns high probability to every token → low perplexity → fluent text
  • If a model is constantly surprised → high perplexity → unusual text

| Use Case | How Perplexity Helps |
|----------|----------------------|
| Model Comparison | Compare versions of the same model (lower PPL = better LM) |
| Training Monitoring | Track PPL on validation set during fine-tuning |
| Fluency Check | Flag unusually high-PPL outputs for review |
| Hallucination Detection | Hallucinated facts often have higher perplexity |
| Domain Adaptation | Measure how well model fits new domain text |
| Prompt Engineering | Compare output quality across different prompts |
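
Given per-token log probabilities (most LM APIs can return these, e.g. via a `logprobs` option — check your provider), perplexity is just the exponentiated mean negative log-likelihood:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

A model that assigns every token probability 1/10 has perplexity exactly 10, which matches the "effective number of equally likely choices" intuition discussed below.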

Evaluating LLM-Generated Text with Perplexity

You can use an external judge model to compute perplexity on generated text:

                    ┌─────────────────────┐
   Generated Text   │   Judge Model       │   Perplexity
   ───────────────► │   (e.g., GPT-2,     │ ───────────────►
   "The cat sat..." │   LLaMA, Mistral)   │   Score: 12.3
                    └─────────────────────┘

Use a different model as the judge. If Model A generates text, use Model B to score its perplexity. This prevents self-evaluation bias.

Interpretation:

  • Low perplexity: Fluent, natural-sounding text
  • Medium perplexity: Acceptable, may have awkward phrasing
  • High perplexity: Unusual constructions, potential issues
  • Very high perplexity: Likely gibberish, hallucinations, or domain mismatch

NOTE

Perplexity thresholds are model-dependent. Always calibrate on known-good examples first.

Perplexity for Hallucination Detection

Hallucinated content often exhibits locally high perplexity because:

  1. Made-up entities have unusual token sequences
  2. False facts create inconsistent contexts
  3. Non-existent relationships surprise the model

Approach:

1. Compute token-level log probabilities
2. Identify spans with unusually low probability
3. Flag these as potential hallucinations
4. Cross-reference with source for verification

Perplexity vs. Cross-Entropy

They’re closely related:

  • Cross-entropy: Measured in bits/nats, used in loss functions
  • Perplexity: More interpretable (effective vocabulary size per token)

Perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 options.

Limitations:

  • Measures fluency, not accuracy or relevance
  • A grammatically perfect lie has low perplexity
  • Different models have different perplexity scales (not directly comparable)
  • Doesn’t capture semantic correctness or factuality
  • Short sequences have high variance

When NOT to Use Perplexity:

  • Comparing models with different tokenizers (unfair comparison)
  • Evaluating factual correctness (fluent lies score well)
  • Cross-model comparison without normalization
  • Creative tasks where unusual = good

LLM-as-Judge Evaluation

Uses a powerful LLM to evaluate another LLM’s outputs. This has become the standard for subjective evaluation.

How It Works

G-Eval Framework

A systematic approach using Chain-of-Thought prompting for evaluation.

Steps:

  1. Define evaluation criteria (coherence, relevance, fluency, etc.)
  2. Provide detailed rubric to judge LLM
  3. Ask for step-by-step reasoning before scoring
  4. Extract numerical score from reasoning

Pairwise Comparison (Arena Style)

Compares two outputs directly.

Advantages:

  • Easier for humans and LLMs to judge
  • More reliable than absolute scoring
  • Basis for Elo-style leaderboards (e.g., Chatbot Arena)

Limitations of LLM-as-Judge

| Issue | Description | Mitigation |
|-------|-------------|------------|
| Position Bias | Prefers first/last option in pairwise | Randomize order, average both orderings |
| Verbosity Bias | Prefers longer responses | Explicitly penalize unnecessary length |
| Self-Enhancement Bias | GPT-4 prefers GPT-4 outputs | Use different judge than model tested |
| Sycophancy | Agrees with user’s stated preference | Blind evaluation |
| Limited Reasoning | Struggles with math/code verification | Use specialized checkers |
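
The position-bias mitigation — judge both orderings and average — can be sketched as follows. `judge` is a hypothetical callable wrapping your judge model, returning a preference score in [0, 1] for the first answer shown:

```python
def debiased_pairwise(judge, prompt, answer_a, answer_b):
    """Query the judge in both orders and average, cancelling position bias."""
    s_ab = judge(prompt, answer_a, answer_b)        # A shown first
    s_ba = 1.0 - judge(prompt, answer_b, answer_a)  # B shown first, flipped back
    return (s_ab + s_ba) / 2  # preference for A, order effects averaged out

# A toy judge with extreme position bias: always prefers whichever answer is first
def toy_judge(prompt, first, second):
    return 0.8
```

For this maximally biased toy judge, averaging both orders yields exactly 0.5 — the bias cancels and only genuine preference (here, none) remains.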

RAG-Specific Evaluation

See RAG Evaluation Metrics.

Human Evaluation

Gold standard, but expensive and hard to scale.

Evaluation Methods

| Method | Best For |
|--------|----------|
| Likert Scales | Absolute quality |
| Pairwise Comparison | Relative ranking |
| Best-Worst Scaling | Efficient ranking |
| Task Completion | Functional evaluation |

Inter-Annotator Agreement

Measure consistency between human evaluators:

Cohen’s Kappa (higher is better):

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where:

  • $p_o$ = observed agreement
  • $p_e$ = expected agreement by chance
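
A minimal sketch for two annotators labeling the same items (chance agreement from each annotator's marginal label frequencies):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement: product of marginal label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement gives κ = 1; agreement no better than chance gives κ ≈ 0.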

Task-Specific Metrics

Summarization

  • ROUGE-L: Longest common subsequence
  • Factual Consistency: Claims in summary supported by source
  • Compression Ratio: Summary length / Source length

Translation

  • BLEU: N-gram precision
  • COMET: Neural MT metric (better correlation with humans)
  • chrF: Character-level F-score

Question Answering

  • Exact Match (EM): Binary correct/incorrect
  • F1: Token-level overlap with gold answer
  • Accuracy: For multiple choice

Dialogue

  • Perplexity: Fluency
  • Distinct-N: Diversity
  • Engagement: Follow-up question rate
  • Task Success Rate: For goal-oriented dialogue
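
Distinct-N from the list above is simple to compute — unique n-grams over total n-grams across a set of responses (whitespace tokenization assumed):

```python
def distinct_n(responses, n=2):
    """Distinct-N: unique n-grams / total n-grams across a set of responses."""
    all_ngrams = []
    for text in responses:
        tokens = text.split()
        all_ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)
```

A model that repeats itself verbatim scores low; fully distinct responses score 1.0.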

Code Generation

  • pass@k: Functional correctness
  • CodeBLEU: Syntax + semantic + dataflow match
  • Execution Accuracy: Output matches expected
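
pass@k is usually computed with the unbiased estimator from the Codex paper: draw $n$ samples per problem, count $c$ that pass the tests, and estimate $1 - \binom{n-c}{k}/\binom{n}{k}$:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples drawn, c of them passed the tests."""
    if n - c < k:
        return 1.0  # fewer failures than slots, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

This estimates the probability that at least one of k sampled completions is correct, without the high variance of literally sampling k and checking.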

Practical Evaluation Framework

The Evaluation Stack

Choosing the Right Metrics

| Task Type | Primary Metrics | Secondary |
|-----------|-----------------|-----------|
| Translation | BLEU, COMET | Human preference |
| Summarization | ROUGE-L, Faithfulness | BERTScore, Human |
| RAG/QA | Faithfulness, Context Relevance, EM | Answer Relevance |
| Chat/Assistant | MT-Bench, Human preference | Helpfulness, Harmlessness |
| Code | pass@k, Execution | CodeBLEU |
| Creative Writing | Human eval, Distinct-N | Perplexity |

Building an Evaluation Pipeline

TODO - Link Project specific details here.

  1. Define Success Criteria: What does “good” look like for your task?
  2. Create Golden Dataset: Hand-curated examples with expected outputs
  3. Layer Metrics:
    • Automated metrics for CI/CD (fast feedback)
    • LLM-as-Judge for periodic deeper analysis
    • Human evaluation for major releases
  4. Track Over Time: Monitor metric drift as model changes
  5. A/B Test: Compare model versions on real users

Tools & Libraries

| Tool | Purpose | Key Features |
|------|---------|--------------|
| RAGAS | RAG evaluation | Faithfulness, context relevance |
| TruLens | LLM observability | Feedback functions, tracing |
| LangSmith | LLM debugging | Evaluation datasets, comparison |
| Weights & Biases | Experiment tracking | Table comparison, prompts |
| OpenAI Evals | Custom benchmarks | Extensible framework |
| lm-evaluation-harness | Benchmark suite | 200+ tasks, reproducible |
| HELM | Holistic evaluation | Multi-dimensional scoring |
| DeepEval | Unit testing for LLMs | Pytest-style assertions |

Metric Selection Framework

Decision Matrix: When to Use What

| Scenario | Primary Metrics | Why | Avoid |
|----------|-----------------|-----|-------|
| Closed-domain QA (factual, single correct answer) | EM, F1, Accuracy | Ground truth exists | BLEU (too lenient) |
| Open-domain QA (multiple valid phrasings) | BERTScore, LLM-as-Judge | Captures semantic equivalence | EM (too strict) |
| Summarization | ROUGE-L, Faithfulness, BERTScore | Coverage + factual consistency | BLEU (wrong focus) |
| Translation | BLEU, COMET, chrF | Established benchmarks | ROUGE (recall-focused) |
| Creative Writing | Human Eval, Distinct-N, Perplexity | Subjective quality | All exact-match metrics |
| Code Generation | pass@k, Execution Accuracy | Functional correctness | BLEU/ROUGE (syntax ≠ function) |
| RAG Systems | Faithfulness, Context Relevance, Answer Relevance | RAGAS triad | Single-dimension metrics |
| Chatbots / Assistants | MT-Bench, Arena Elo, Human Preference | Multi-turn, subjective | Static metrics |
| Safety / Alignment | Refusal Rate, Toxicity Score, Bias Metrics | Risk mitigation | Accuracy-only metrics |

Metric Trade-offs Analysis

Understanding the trade-offs helps you pick the right tool for the job:

| Metric | Pros | Cons | Computational Cost | Human Correlation |
|--------|------|------|--------------------|--------------------|
| BLEU | Fast, reproducible, established | No synonyms, ignores meaning | Very Low | Low-Medium |
| ROUGE | Good for coverage | Recall-biased, no semantics | Very Low | Low-Medium |
| METEOR | Synonym support, order-aware | WordNet dependency, slower | Medium | Medium |
| BERTScore | Semantic similarity, context-aware | Expensive, model-dependent | High | High |
| Perplexity | Fast fluency check | No accuracy/relevance signal | Low | Low |
| LLM-as-Judge | Flexible, multi-dimensional | Expensive, biased, non-deterministic | Very High | High |
| Human Eval | Gold standard | Expensive, slow, hard to scale | N/A (human time) | Perfect (by definition) |
| pass@k | Functional correctness | Expensive (needs execution) | High | High (for code) |

The Speed vs. Depth Trade-off

Fast ←────────────────────────────────────────────→ Thorough
│                                                           │
│   Perplexity    BLEU/ROUGE    BERTScore    LLM-Judge    Human
│       ↓              ↓            ↓            ↓          ↓
│   Fluency only   Surface match  Semantic   Multi-dim   Complete
│   $0.001/eval    $0.01/eval    $0.10/eval  $0.50/eval  $5+/eval

Intuition: Layer your evaluation—fast metrics for CI/CD, deep metrics for releases.


High-Stakes Systems: Financial & Medical Evaluation

CAUTION

In high-stakes domains (healthcare, finance, legal), incorrect outputs can cause real harm. Standard LLM evaluation is insufficient. You need defense-in-depth evaluation strategies.

Why Standard Metrics Fail in Critical Systems

| Standard Metric | Failure Mode in High-Stakes |
|-----------------|------------------------------|
| BLEU/ROUGE | High score doesn’t mean factually correct |
| Perplexity | Fluent lies score well |
| LLM-as-Judge | Judges can miss domain-specific errors |
| BERTScore | Semantic similarity ≠ medical accuracy |

The High-Stakes Evaluation Stack

graph TD
    subgraph "Layer 1: Automated Safety Gates"
        A1[Hallucination Detection]
        A2[Citation Verification]
        A3[Constraint Checking]
    end
    
    subgraph "Layer 2: Domain-Specific Validation"
        B1[Knowledge Graph Grounding]
        B2[Structured Output Validation]
        B3[Regulatory Compliance Checks]
    end
    
    subgraph "Layer 3: Human-in-the-Loop"
        C1[Domain Expert Review]
        C2[Adversarial Red-Teaming]
        C3[Audit Trail Analysis]
    end
    
    A1 --> B1
    A2 --> B1
    A3 --> B2
    B1 --> C1
    B2 --> C1
    B3 --> C2

Essential Metrics for High-Stakes Systems

1. Faithfulness & Hallucination Detection

Definition: Does the output contain ANY claims not supported by the provided context/sources?

| Metric | Description | Use Case |
|--------|-------------|----------|
| Claim-level Faithfulness | Decompose output → verify each claim | Medical summaries |
| Entailment Score | NLI model checks if context entails output | RAG systems |
| Citation Precision | % of citations that actually support claims | Research assistants |
| Self-Consistency | Same question → consistent answers? | Financial advice |

Implementation Approach:

1. Decompose output into atomic claims
2. For each claim:
   - Check if source/context contains supporting evidence
   - Use NLI model: context → claim (entailment check)
   - Flag unsupported claims as hallucinations
3. Faithfulness = supported_claims / total_claims
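
The steps above can be sketched as follows. `supports` stands in for an NLI entailment model or an LLM verifier; the `naive_support` containment check is a placeholder for illustration only and is far too weak for production:

```python
def faithfulness(claims, context, supports):
    """Fraction of atomic claims supported by the context, plus the flagged rest."""
    verdicts = [supports(context, claim) for claim in claims]
    flagged = [claim for claim, ok in zip(claims, verdicts) if not ok]
    return sum(verdicts) / len(claims), flagged

def naive_support(context, claim):
    """Placeholder for an NLI entailment check: verbatim containment."""
    return claim.lower() in context.lower()
```

The score is the supported-claims ratio from step 3; the flagged list feeds the human-verification loop the WARNING below calls for.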

WARNING

A 95% faithfulness score means 1 in 20 claims may be hallucinated. In medical contexts, that’s unacceptable. Target >99% with human verification for flagged cases.

2. Confidence Calibration

Definition: Does the model know what it doesn’t know?

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \,\bigl|\,\text{acc}(b) - \text{conf}(b)\,\bigr|$$

Where:

  • $n_b$ = samples in confidence bin $b$ (of $N$ total)
  • $\text{acc}(b)$ = accuracy in bin $b$
  • $\text{conf}(b)$ = average confidence in bin $b$

Intuition: A well-calibrated model saying “I’m 90% confident” should be correct 90% of the time.

Why it matters:

  • Overconfident wrong answers are dangerous
  • Underconfident correct answers reduce trust
  • Enables appropriate human escalation
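
Expected Calibration Error can be sketched directly from the definition — bin predictions by confidence and take the weighted average of the |accuracy − confidence| gaps:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, average the |accuracy - confidence| gaps."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; confidence 0 falls in the first bin
        idx = [i for i, conf in enumerate(confidences)
               if lo < conf <= hi or (b == 0 and conf == 0)]
        if not idx:
            continue
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        bin_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(bin_acc - bin_conf)
    return ece
```

A model saying "75% confident" and being right 75% of the time contributes zero; a model saying "95% confident" while always wrong contributes the full 0.95 gap.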

3. Abstention Rate & Quality

Definition: Does the model refuse to answer when it should?

| Metric | Formula | Target |
|--------|---------|--------|
| Appropriate Abstention Rate | Correctly refused / Should have refused | High |
| Inappropriate Abstention Rate | Wrongly refused / Could have answered | Low |
| Abstention Precision | Correct abstentions / Total abstentions | High |

Critical Insight: In high-stakes systems, “I don’t know” is often the correct answer.

4. Structured Output Compliance

For systems that must output in specific formats (e.g., ICD codes, financial tickers):

| Check | Description |
|-------|-------------|
| Schema Validation | Output matches expected JSON/XML schema |
| Ontology Compliance | Terms exist in domain ontology (SNOMED-CT, FIBO) |
| Value Range Checks | Numerical outputs within valid ranges |
| Cross-field Consistency | Related fields are logically consistent |

5. Worst-Case Performance (Robustness)

| Metric | Description |
|--------|-------------|
| Tail Accuracy | Accuracy on bottom 5% performing samples |
| Adversarial Robustness | Performance under input perturbations |
| Distribution Shift Performance | Accuracy on OOD examples |
| Stress Test Failure Rate | % failures under edge cases |

IMPORTANT

Average metrics hide dangerous failures. A model with 95% average accuracy but 50% accuracy on rare-but-critical cases is unsafe.

Domain-Specific Considerations

Medical/Healthcare

| Requirement | Evaluation Approach |
|-------------|---------------------|
| No hallucinated conditions | Claim extraction + medical KB grounding |
| No contraindicated advice | Drug interaction checking |
| Appropriate uncertainty | Confidence calibration + abstention |
| HIPAA compliance | PII detection in outputs |
| Up-to-date guidelines | Knowledge freshness verification |

Recommended Stack:

  1. Automated: Faithfulness (>99%), Schema validation, Toxicity
  2. Domain: Medical NER + KB linking, Drug interaction check
  3. Human: Physician review for flagged cases, periodic audits

Financial Systems

| Requirement | Evaluation Approach |
|-------------|---------------------|
| No made-up numbers | Numerical claim verification |
| Regulatory compliance | SEC/FINRA rule checking |
| No forward-looking statements | Temporal claim analysis |
| Audit trail | Full provenance tracking |
| Consistency | Same data → same output |

Recommended Stack:

  1. Automated: Faithfulness, Citation verification, Numerical accuracy
  2. Domain: Regulatory keyword detection, Disclaimer presence
  3. Human: Compliance officer review, Red-teaming

High-Stakes Evaluation Checklist

## Pre-Deployment Checklist
 
### Safety Gates (Must Pass All)
- [ ] Hallucination rate < 1% on test set
- [ ] Zero harmful/dangerous outputs in adversarial testing
- [ ] Appropriate abstention on out-of-scope queries
- [ ] All outputs traceable to sources
 
### Robustness Testing
- [ ] Tested on distribution shift
- [ ] Tested on adversarial inputs  
- [ ] Worst-case performance acceptable
- [ ] Consistent outputs for same inputs
 
### Compliance
- [ ] Domain expert validation (sample review)
- [ ] Regulatory requirement check
- [ ] Audit logging implemented
- [ ] Human escalation path defined
 
### Monitoring
- [ ] Confidence drift detection
- [ ] Output distribution monitoring
- [ ] User feedback collection
- [ ] Periodic re-evaluation scheduled

Build vs. Buy: Evaluation Tools for Critical Systems

| Tool | Best For | High-Stakes Features |
|------|----------|----------------------|
| Patronus AI | Enterprise LLM security | Hallucination detection, PII leakage |
| Galileo | LLM observability | Factuality scoring, drift detection |
| Weights & Biases/Weave | Experiment tracking | Evaluation datasets, comparison |
| Arize Phoenix | Production monitoring | Drift detection, troubleshooting |
| Custom pipelines | Domain-specific needs | Full control, compliance |

Comprehensive Metric Reference

Complete Metric Taxonomy

| Category | Metric | Type | Reference Required | Cost | Best Domain |
|----------|--------|------|--------------------|------|-------------|
| N-gram | BLEU | Precision | Yes | Low | Translation |
| N-gram | ROUGE-N | Recall | Yes | Low | Summarization |
| N-gram | ROUGE-L | LCS | Yes | Low | Summarization |
| N-gram | METEOR | Hybrid | Yes | Medium | Translation |
| N-gram | chrF | Char-level | Yes | Low | Translation |
| Semantic | BERTScore | Embedding | Yes | High | General |
| Semantic | MoverScore | Embedding | Yes | High | General |
| Semantic | BLEURT | Learned | Yes | High | General |
| Semantic | COMET | Neural MT | Yes | High | Translation |
| Fluency | Perplexity | LM prob | No | Low | Fluency |
| Diversity | Distinct-N | N-gram | No | Low | Dialogue |
| Diversity | Self-BLEU | N-gram | No | Low | Generation |
| Code | pass@k | Execution | Yes | High | Code |
| Code | CodeBLEU | Hybrid | Yes | Medium | Code |
| Factuality | Faithfulness | NLI/LLM | Yes (context) | High | RAG, Summary |
| Factuality | FactScore | Claim-level | Yes (KB) | Very High | Long-form |
| LLM-Judge | G-Eval | LLM prompt | Optional | Very High | General |
| LLM-Judge | MT-Bench | LLM scoring | No | Very High | Chat |
| LLM-Judge | Arena Elo | Pairwise | No | Very High | Chat |
| Human | Likert Scale | Rating | No | N/A | All |
| Human | Pairwise Pref | Comparison | No | N/A | All |
| RAG | Context Relevance | LLM/Embed | Yes (query) | High | RAG |
| RAG | Answer Relevance | LLM/Embed | Yes (query) | High | RAG |
| Safety | Toxicity | Classifier | No | Low | All |
| Safety | Bias Score | Statistical | No | Medium | All |

Resources

Core Papers:

High-Stakes Evaluation:

Tools & Platforms:

