Context Compression Techniques

Overview

Context compression refers to methods that reduce the length of input prompts while preserving the essential information needed for accurate LLM responses. It operates at the input level, aiming either to fit more meaningful content within a model’s context window or to reduce computational cost by processing fewer tokens.

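The core motivation: API costs scale with token count, latency increases with sequence length, and the “Lost in the Middle” effect (models use information in the middle of a long context less reliably than information near the start or end) means very long contexts can hurt accuracy anyway.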

Why Compress Context?

Consider a RAG pipeline that retrieves 10 documents of 500 tokens each. That is 5,000 tokens of context before we even add the user query or system prompt. Most of those tokens are filler words, redundant phrases, or marginally relevant content. Compression identifies and removes the noise.

The Compression Spectrum

The Information Bottleneck Intuition

Think of compression as finding the minimal sufficient representation. If you have a 2,000-token document and someone asks “What year was the company founded?”, perhaps only 15 tokens in that document actually matter. Compression techniques try to identify and retain those 15 tokens (or their semantic equivalent).

Main Techniques

1. LLMLingua Family

The LLMLingua approach uses a small “compressor” LLM to identify which tokens can be dropped without losing meaning.

LLMLingua (Original)

Uses perplexity from a small model (e.g., GPT-2, LLaMA-7B) as an importance signal
High-perplexity tokens = surprising/informative = keep them
Low-perplexity tokens = predictable/redundant = safe to remove
Achieves up to 20x compression with minimal performance loss on certain benchmarks

Process:

Run the context through a small LLM
Compute token-level perplexity scores
Remove tokens below a threshold (or keep top-k% by perplexity)
Feed the compressed prompt to the target LLM
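The four steps above can be sketched end to end. This is a toy illustration, not the LLMLingua implementation: an add-one-smoothed bigram model fitted on a tiny made-up corpus stands in for the small LLM, and the names `train_bigram`, `token_perplexities`, `compress`, and `keep_ratio` are assumptions introduced here.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Fit an add-one-smoothed bigram model (a stand-in for the small compressor LLM)."""
    vocab = set(corpus_tokens)
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))

    def prob(prev, tok):
        # P(tok | prev); the extra +1 in the denominator reserves mass for unseen tokens.
        return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + len(vocab) + 1)

    return prob

def token_perplexities(tokens, prob):
    """Per-token perplexity 1 / P(x_i | x_{i-1}); the first token follows a start marker."""
    prev = "<s>"
    scores = []
    for tok in tokens:
        scores.append(1.0 / prob(prev, tok))
        prev = tok
    return scores

def compress(tokens, prob, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of tokens by perplexity, in original order."""
    scores = token_perplexities(tokens, prob)
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]

# Hypothetical corpus and context, chosen only to make the example self-contained.
corpus = "the company was founded in the city . the company is in the city .".split()
prob = train_bigram(corpus)
context = "the company was founded in 1998 in the city".split()
compressed = compress(context, prob, keep_ratio=0.5)
print(" ".join(compressed))  # this string would then be sent to the target LLM
```

The year survives because the toy model never saw it, so it is maximally surprising. A real compressor would take its probabilities from a pretrained causal LM instead.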

LongLLMLingua

Extension for long-context scenarios. Adds:

Question-aware compression: Considers the query when deciding what to keep
Document reordering: Moves important content to the beginning and end (addressing “Lost in the Middle”)
Contrastive perplexity: Measures how surprising a token is given the question vs. in isolation
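The reordering idea can be illustrated with a small helper. This is one simple way to realize the "beginning and end" placement described above, not necessarily LongLLMLingua's exact procedure; `reorder_for_edges` and the relevance scores are hypothetical.

```python
def reorder_for_edges(docs, scores):
    """Place the most relevant documents at the two ends of the context and the
    least relevant ones in the middle.

    docs and scores are parallel lists; a higher score means more relevant.
    """
    ranked = [doc for _, doc in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]
    front, back = [], []
    for rank, doc in enumerate(ranked):
        # Ranks 0, 2, 4, ... fill the front; ranks 1, 3, 5, ... fill the back.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["A", "B", "C", "D", "E"]
scores = [0.1, 0.9, 0.4, 0.8, 0.2]  # hypothetical query-relevance scores
print(reorder_for_edges(docs, scores))  # ['B', 'C', 'A', 'E', 'D']
```

The best-scoring document lands first, the second best lands last, and the weakest ends up in the middle, where it is least likely to matter.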

LLMLingua-2

Uses a trained token classifier (a small BERT-style bidirectional encoder) instead of perplexity:

Binary classification: keep or discard each token
Trained on data distilled from GPT-4 compression judgments
Faster than perplexity-based methods: the encoder is much smaller than a 7B causal LM and scores every token in a single pass using both left and right context

2. Selective Context

A simpler approach that uses self-information (negative log probability) to filter tokens:

$$I(x_i) = -\log P(x_i \mid x_{<i})$$

Tokens with low self-information (highly predictable given prior context) are removed. The intuition: if you can predict a word perfectly from context, including it adds no new information.

Algorithm:

Compute conditional probabilities for each token using a causal LM
Calculate self-information scores
Apply a threshold or keep a fixed percentage
Concatenate remaining tokens
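A minimal sketch of the filtering steps, assuming the per-token probabilities have already been computed by a causal LM. The probabilities below are made up for illustration, and `selective_context` and `reduce_ratio` are names introduced here rather than an existing API.

```python
import math

def self_information(logprobs):
    """I(x_i) = -log P(x_i | x_{<i}), in nats, from per-token log-probabilities."""
    return [-lp for lp in logprobs]

def selective_context(tokens, logprobs, reduce_ratio=0.4):
    """Drop the `reduce_ratio` fraction of tokens with the lowest self-information."""
    info = self_information(logprobs)
    n_drop = int(len(tokens) * reduce_ratio)
    # Indices of the n_drop least informative (most predictable) tokens.
    drop = set(sorted(range(len(tokens)), key=lambda i: info[i])[:n_drop])
    return [tok for i, tok in enumerate(tokens) if i not in drop]

tokens = ["The", "company", "was", "founded", "in", "1998"]
# Hypothetical probabilities a causal LM might assign to each token given its prefix.
probs = [0.20, 0.05, 0.60, 0.10, 0.70, 0.01]
logprobs = [math.log(p) for p in probs]
print(selective_context(tokens, logprobs, reduce_ratio=0.4))
# ['The', 'company', 'founded', '1998']
```

The two most predictable tokens ("in" and "was") are removed, while the rare, content-bearing tokens remain.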

3. Gist Tokens / Gisting

Gist tokens are learned virtual tokens that summarize longer contexts into a fixed number of embeddings.

┌──────────────────────────────────────────────────────────────────┐
│  Original: "The quick brown fox jumps over the lazy dog"         │
│                              ↓                                   │
│  Gist Compression (k=2 gist tokens)                              │
│                              ↓                                   │
│  [GIST_1] [GIST_2]  ←  Dense vectors encoding the sentence       │
│                                                                  │
│  These 2 "virtual tokens" replace 9 real tokens                  │
└──────────────────────────────────────────────────────────────────┘

Training:

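Fine-tune the model so that the representations at the gist positions carry the information in the context
Gist tokens are inserted between the context to be compressed and the actual input
An attention mask can stop later tokens from attending to the original context, so the model learns to condition on the gist tokens alone for downstream tasks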
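One way to force the model to rely on the gist tokens is through the attention mask. The sketch below assumes a decoder-only layout of [context][gist][input]; the function name and the layout are illustrative assumptions, not code from the gisting paper.

```python
def gist_attention_mask(n_context, n_gist, n_input):
    """Build a causal attention mask (True = may attend) for the layout
    [context tokens][gist tokens][input tokens].

    Input tokens may attend to gist tokens and earlier input tokens, but not to
    the original context, so the context must flow through the gist tokens.
    """
    total = n_context + n_gist + n_input
    gist_start = n_context
    input_start = n_context + n_gist
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: attend only to self and earlier positions
            blocked = q >= input_start and k < gist_start
            mask[q][k] = not blocked
    return mask

mask = gist_attention_mask(n_context=3, n_gist=2, n_input=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the last two rows (the input tokens) have dots in the first three columns: they cannot see the context directly, only the two gist positions and themselves.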

Relation to Prefix-Tuning: Gisting is conceptually similar, but the goal is compression rather than task adaptation. The “prefix” here summarizes the context.

4. AutoCompressors

AutoCompressors take gisting further by training the model to compress its own context iteratively:

Process the first segment, generate “summary vectors”
Prepend summary vectors to the next segment
Repeat, accumulating compressed representations
Final summary vectors encode the entire document

This allows processing documents longer than the context window by compressing as you go.

Key Equation (conceptual):

$$\text{summary}_t = f_\theta(\text{segment}_t, \text{summary}_{t-1})$$

where $f_\theta$ is the autocompressor model that takes a text segment and the previous summary, producing a new summary.
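The recurrence can be sketched as a simple fold over segments. Here `toy_compressor` is a deliberately crude stand-in for the trained model: it keeps the `k` most frequent words seen so far, whereas a real AutoCompressor emits learned summary vectors. All names and the sample segments are assumptions for illustration.

```python
from collections import Counter

def toy_compressor(segment, prev_summary, k=2):
    """Stand-in for f_theta: merge the previous summary with the new segment and
    keep only the k most frequent words, so the summary size stays fixed."""
    counts = Counter(prev_summary)
    counts.update(segment.lower().split())
    return dict(counts.most_common(k))

def compress_document(segments, f, k=2):
    """Apply summary_t = f(segment_t, summary_{t-1}) across all segments."""
    summary = {}  # summary_0 is empty
    for segment in segments:
        summary = f(segment, summary, k)
    return summary

segments = ["alpha beta alpha", "beta gamma beta", "alpha delta"]
print(compress_document(segments, toy_compressor, k=2))
```

Because the summary has a fixed size, anything dropped at one step ("gamma" here) is unavailable to later steps, which is the same constraint a real summary-vector budget imposes.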

5. Summarization-Based Compression

The most straightforward approach: use an LLM to summarize retrieved documents before including them in the prompt.

Two Variants:

Extractive: Select important sentences verbatim
Abstractive: Generate a condensed paraphrase

Trade-offs:

Aspect	Extractive	Abstractive
Faithfulness	High (original text)	Risk of hallucination
Compression ratio	Moderate (sentence-level)	High (can be very brief)
Latency	Low	Higher (LLM call)
Cost	Low	Additional inference cost

6. RECOMP (Retrieval Compression)

Designed specifically for RAG pipelines. Two components:

Extractive Compressor: Selects relevant sentences from retrieved documents
Abstractive Compressor: Generates a summary conditioned on the query

The compressor is trained to produce outputs that maximize downstream QA accuracy, not just general summarization quality.
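A rough sketch of query-aware sentence selection follows. RECOMP's extractive compressor uses a trained model to score sentences; here plain word overlap with the query stands in for that learned score, and `select_sentences` plus the sample documents are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a learned encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_sentences(query, documents, top_n=2):
    """Return the top_n sentences with the highest word overlap with the query,
    kept in their original order."""
    sentences = []
    for doc in documents:
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip())
    q = tokenize(query)
    scored = [(len(q & tokenize(s)), i) for i, s in enumerate(sentences)]
    # Highest overlap first; ties broken by earlier position.
    best = sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]
    return [sentences[i] for i in sorted(i for _, i in best)]

docs = [
    "Acme was founded in 1998. It sells anvils. The office is in Ohio.",
    "Anvils are heavy. Acme was founded by two engineers.",
]
print(select_sentences("When was Acme founded?", docs, top_n=2))
# ['Acme was founded in 1998.', 'Acme was founded by two engineers.']
```

Swapping the overlap score for a model trained against downstream QA accuracy is what distinguishes the approach described above from generic sentence ranking.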

7. Nugget-Based Compression

Identifies atomic “nuggets” of information in a document:

Each nugget is a self-contained fact
Nuggets are scored for relevance to the query
Only high-scoring nuggets are included

This is more semantic than token-level pruning, operating at the fact/claim level.

Mathematical Foundation

Perplexity-Based Token Selection

For a token sequence $x_1, x_2, \ldots, x_n$, the perplexity of token $x_i$ given its prefix is:

$$\mathrm{PPL}(x_i) = \exp\bigl(-\log P(x_i \mid x_{<i})\bigr) = \frac{1}{P(x_i \mid x_{<i})}$$

Selection Rule: Keep token $x_i$ if $\mathrm{PPL}(x_i) > \tau$ (threshold).

Tokens with high perplexity are “surprising” given context, meaning they carry more information.
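A short worked example may help connect this rule to the self-information score used by Selective Context. The probabilities and the threshold $\tau = 10$ are hypothetical, and natural logarithms are assumed.

```latex
\begin{align*}
\mathrm{PPL}(x_i) &= \exp\bigl(I(x_i)\bigr), \qquad I(x_i) = -\log P(x_i \mid x_{<i}) \\
P(x_i \mid x_{<i}) = 0.5 &\;\Rightarrow\; \mathrm{PPL}(x_i) = 2 \le \tau \quad \text{(drop)} \\
P(x_i \mid x_{<i}) = 0.01 &\;\Rightarrow\; \mathrm{PPL}(x_i) = 100 > \tau \quad \text{(keep)}
\end{align*}
```

Because $\exp$ is monotonic, thresholding perplexity at $\tau$ selects the same tokens as thresholding self-information at $\log \tau$.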

Compression Ratio

$$\text{Compression Ratio} = \frac{|\text{Original Tokens}|}{|\text{Compressed Tokens}|}$$

A ratio of 10x means you reduced a 1,000-token input to 100 tokens.

Information-Theoretic View

Compression fundamentally trades off rate (number of bits/tokens) against distortion (information loss). The optimal compressed representation minimizes:

$$L = R + \lambda D$$

where $R$ is the length of the compressed representation and $D$ is a distortion measure (e.g., performance drop on downstream tasks). $\lambda$ controls the rate-distortion trade-off.
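To make the role of $\lambda$ concrete, the sketch below picks an operating point from a few candidate compression settings. The candidate measurements are made up for illustration, and `best_operating_point` is a name introduced here.

```python
def best_operating_point(candidates, lam):
    """Pick the candidate minimizing L = R + lam * D.

    candidates: list of (label, rate_in_tokens, distortion) tuples.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical measurements: (label, compressed length in tokens, accuracy drop in points)
candidates = [
    ("2x", 2000, 0.5),
    ("5x", 800, 3.0),
    ("10x", 400, 8.0),
]
print(best_operating_point(candidates, lam=10)[0])    # 10x: length dominates
print(best_operating_point(candidates, lam=100)[0])   # 5x: balanced
print(best_operating_point(candidates, lam=1000)[0])  # 2x: accuracy dominates
```

A small $\lambda$ tolerates accuracy loss in exchange for shorter prompts; a large $\lambda$ pushes toward lighter compression.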

Practical Application

When to Use

High API costs: Compression can cut token usage by 50-90%
Latency-sensitive applications: Fewer tokens = faster inference
Long document QA: Fitting multiple documents in context
RAG pipelines with many retrieved chunks: Compress before generation

When NOT to Use

Short prompts: Overhead of compression outweighs benefits
Tasks requiring exact wording: Legal documents, code generation from specs
Low-resource languages: Compression models may perform poorly
When you need full auditability: Hard to debug what was removed

Common Pitfalls

Over-compression: Removing too much loses critical information
Domain mismatch: Compressor trained on news may fail on medical text
Ignoring query context: Generic compression loses query-relevant details
Cascading errors: If compressor misses something, LLM cannot recover

Trade-offs and Calculations

Latency Analysis:

Compression adds overhead (small model forward pass)
But reduces main LLM processing time
Break-even depends on: compression ratio, relative model sizes, context length
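The break-even point can be estimated with a simple model. The sketch assumes latency is linear in input tokens for both models, which ignores attention's super-linear cost and output generation time; the per-token timings are hypothetical.

```python
def net_latency_saving(n_tokens, ratio, ms_per_token_large, ms_per_token_small):
    """Estimated input-processing latency saved (ms) by compressing before the
    main LLM call. Positive means compression pays off."""
    baseline = n_tokens * ms_per_token_large
    compressed = n_tokens * ms_per_token_small + (n_tokens / ratio) * ms_per_token_large
    return baseline - compressed

# Compressor 5x cheaper per token than the main model: clear win.
print(net_latency_saving(4000, ratio=5, ms_per_token_large=0.5, ms_per_token_small=0.1))
# Compressor costing 80% of the main model per token: break-even at 5x compression.
print(net_latency_saving(4000, ratio=5, ms_per_token_large=0.5, ms_per_token_small=0.4))
```

Under this model, compression helps only when the compressor's per-token cost is below a fraction $1 - 1/r$ of the main model's, where $r$ is the compression ratio.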

Example Calculation:

Original: 4,000 tokens at $0.01/1K = $0.04 per request
Compressed (5x): 800 tokens = $0.008 per request
Savings: 80% cost reduction (before accounting for the cost of running the compressor)
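The same arithmetic as a small helper, with an optional term for compressor overhead. The $0.002 overhead figure is a made-up value for illustration, and `request_cost` is a name introduced here.

```python
def request_cost(n_tokens, price_per_1k, compressor_cost=0.0):
    """Dollar cost of one request: input-token charge plus optional compressor overhead."""
    return n_tokens / 1000 * price_per_1k + compressor_cost

original = request_cost(4000, 0.01)
compressed = request_cost(4000 / 5, 0.01)
savings = 1 - compressed / original
print(f"${original:.3f} -> ${compressed:.3f} ({savings:.0%} saved)")

# With a hypothetical $0.002 per request spent on running the compressor:
with_overhead = request_cost(4000 / 5, 0.01, compressor_cost=0.002)
print(f"{1 - with_overhead / original:.0%} saved after overhead")
```

Even a cheap compressor trims the headline saving, so the overhead term is worth measuring for the actual deployment.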

Quality Degradation: Reported figures vary by method, model, and task; as a rough guide:

2-4x compression: <1% accuracy drop
5-10x compression: 2-5% accuracy drop
10-20x compression: 5-15% accuracy drop (varies heavily by task)

Comparisons

Method	Compression Type	Typical Ratio	Requires Training	Query-Aware
LLMLingua	Hard (token pruning)	5-20x	No (uses pretrained LM)	Optional
LLMLingua-2	Hard (token pruning)	5-15x	Yes (classifier)	No (task-agnostic)
Selective Context	Hard (token pruning)	2-5x	No	No
Gist Tokens	Soft (embeddings)	Variable	Yes	Depends
RECOMP	Hard (sentence selection)	3-10x	Yes	Yes
Summarization	Hard (abstractive)	5-20x	Optional	Optional

Comparison with Related Concepts

Concept	Goal	Operates On
Context Compression	Reduce input length	Prompt/retrieved docs
Token Optimization	Faster inference	Attention/KV cache
Chunking Strategies	Better retrieval	Document indexing
Re-ranking	Improve relevance	Retrieved results

Aayush's ML & AI Notes
