Overview

Embeddings are dense, fixed-dimensional vector representations of data (text, images, audio) that capture semantic meaning. In the context of RAG pipelines, embeddings enable semantic search by converting both documents and queries into a shared vector space where similarity can be measured geometrically.

The core insight is that semantically similar content should have similar embeddings:

  • “The quick brown fox” ≈ “A fast auburn fox” (close in vector space)
  • “The quick brown fox” ≠ “Quarterly earnings report” (far apart in vector space)
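This geometric notion of similarity can be sketched with cosine similarity; the toy 3-d vectors below are made up for illustration (real models produce hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for the three example sentences
fox_1 = np.array([0.9, 0.1, 0.0])     # "The quick brown fox"
fox_2 = np.array([0.8, 0.2, 0.1])     # "A fast auburn fox"
earnings = np.array([0.0, 0.2, 0.9])  # "Quarterly earnings report"

print(cosine_similarity(fox_1, fox_2))     # high: close in vector space
print(cosine_similarity(fox_1, earnings))  # low: far apart
```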

Formally, an embedding model is a function $f_\theta: \text{text} \rightarrow \mathbb{R}^d$, where $f_\theta$ is the embedding model (a neural network with learned parameters $\theta$), and $d$ is the embedding dimension.

How Embeddings Work

From Words to Vectors

Embedding models learn to map text to vectors during training. The training process optimizes the model so that semantically similar texts produce similar vectors.

Training Objective (Contrastive Learning): Given a query $q$, a positive document $d^+$ (relevant), and negative documents $d_1^-, \ldots, d_n^-$ (irrelevant), the model is trained with a contrastive (InfoNCE) loss:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(\mathrm{sim}(q, d_i^-)/\tau)}$$

which acts to:

  • Maximize similarity between $q$ and $d^+$
  • Minimize similarity between $q$ and each $d_i^-$

where $\tau$ is a temperature parameter and $\mathrm{sim}$ is typically cosine similarity.
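The loss above can be sketched in NumPy for a single query, assuming the embeddings are already L2-normalized so a dot product equals cosine similarity:

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """InfoNCE for one query: -log softmax score of the positive document."""
    sims = np.array([q @ d_pos] + [q @ d for d in d_negs]) / tau
    sims -= sims.max()  # subtract max for numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return float(-np.log(probs[0]))  # index 0 is the positive document

def normalize(v):
    return v / np.linalg.norm(v)

q = normalize(np.array([1.0, 0.2]))
d_pos = normalize(np.array([0.9, 0.3]))     # near the query: low loss
d_negs = [normalize(np.array([-0.5, 1.0]))]  # far from the query
print(info_nce_loss(q, d_pos, d_negs))       # close to zero
```

Swapping the positive and negative documents makes the loss large, which is exactly the gradient signal that pulls relevant pairs together during training.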

Understanding the Vector Space

Embeddings encode semantic relationships as geometric relationships:

Linear Relationships: The famous Word2Vec analogy:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This property extends to sentence embeddings, though less perfectly. Related concepts cluster together, and vector arithmetic can sometimes discover relationships.
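The analogy can be checked mechanically with nearest-neighbor search over word vectors. The 3-d vectors below are made up purely to illustrate the arithmetic; real Word2Vec vectors are learned and have around 300 dimensions:

```python
import numpy as np

# Hypothetical word vectors, constructed so the analogy holds exactly
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def nearest(target, exclude):
    """Vocabulary word whose vector is most cosine-similar to `target`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # queen
```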

Clustering by Topic: Documents about similar topics naturally cluster:

  • Legal documents form one cluster
  • Medical research forms another cluster
  • Sports news forms yet another

This clustering is what enables retrieval: a query about “contract law” will be close to legal documents in vector space.

Embedding Model Architectures

Bi-Encoder Architecture

The standard architecture for retrieval embeddings. Encodes queries and documents independently.

Characteristics:

  • Query and document encoded separately
  • Document embeddings can be pre-computed and cached
  • Fast at query time (just encode the query, then run a vector search in the Vector Databases)
  • No cross-attention between query and document

Training: Typically uses contrastive learning with in-batch negatives.
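The bi-encoder flow can be sketched as below. `embed` is a hypothetical stand-in (here, deterministic random unit vectors); in practice it would be an API call or a Sentence Transformers model:

```python
import numpy as np

def embed(texts):
    """Stand-in for a real embedding model; returns unit-norm vectors.
    Replace with an actual model or API call in a real pipeline."""
    rng = np.random.default_rng(sum(map(ord, "".join(texts))))
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Index time: encode every document once and cache the matrix.
documents = ["contract law basics", "acute myocardial infarction", "fox habitats"]
doc_matrix = embed(documents)            # shape (n_docs, 384)

# Query time: encode only the query, then one matrix-vector product.
query_vec = embed(["what is tort law?"])[0]
scores = doc_matrix @ query_vec          # cosine scores (unit vectors)
best = documents[int(np.argmax(scores))]
```

Because documents and queries never interact inside the model, `doc_matrix` can be computed offline and stored in a vector database, which is what makes bi-encoders fast at query time.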

Asymmetric vs Symmetric Models

TypeQuery StyleDocument StyleUse Case
SymmetricSame as documentSame as querySemantic similarity, deduplication
AsymmetricShort questionsLong passagesQ&A retrieval, RAG

Asymmetric models are trained specifically for the query-document retrieval task. The query encoder learns to represent questions, while the document encoder learns to represent answer-containing passages.

Most production RAG systems use asymmetric models (OpenAI, Cohere, BGE).

Choosing an Embedding Model

Factors to Consider:

  1. Quality: How well does it perform on your specific domain/task?
  2. Dimensionality: Higher dimensions = more expressive, but more storage/compute
  3. Max Tokens: Can it handle your chunk sizes?
  4. Latency: How fast is inference?
  5. Cost: API pricing or self-hosting compute
  6. Multilingual: Does it support your languages?

MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) is the standard benchmark for evaluating embedding models across diverse tasks.

Task Categories

| Task | Description | Relevance to RAG |
|---|---|---|
| Retrieval | Find relevant documents for a query | Direct RAG relevance |
| Semantic Textual Similarity (STS) | Score similarity between sentence pairs | Related to retrieval quality |
| Classification | Categorize text into classes | Less relevant |
| Clustering | Group similar documents | Document organization |
| Reranking | Reorder candidates by relevance | Post-retrieval refinement |
| Pair Classification | Binary similarity decisions | Deduplication |

For RAG, prioritize Retrieval scores. The benchmark reports:

  • NDCG@10: Normalized Discounted Cumulative Gain at rank 10
  • Recall@k: Percentage of relevant documents in top-k
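Both metrics follow directly from their definitions, given a ranked list of retrieved document ids and the set of relevant ids (binary relevance shown here):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of this ranking / DCG of the ideal ranking.
    A hit at rank i (1-based) contributes 1 / log2(i + 1)."""
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 4))  # 1.0: both relevant docs retrieved
print(ndcg_at_k(ranked, relevant, 4))    # < 1.0: they are not ranked first
```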

Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Dimensionality and Its Effects

What Dimensionality Means

The embedding dimension $d$ determines the size of the vector:

  • text-embedding-3-small: $d = 1536$
  • all-MiniLM-L6-v2: $d = 384$

Higher dimensions can theoretically capture more nuanced semantic information, but with diminishing returns.

Dimension Reduction (Matryoshka Embeddings)

Some modern models (OpenAI text-embedding-3-*, Nomic) support Matryoshka Representation Learning (MRL), which allows truncating embeddings to smaller dimensions while retaining most quality.

The model is trained so that the first $k$ dimensions form a valid embedding on their own. You can truncate from 1536 to 512 dimensions and still get useful embeddings.

Benefits:

  • Reduce storage by 3x (1536 → 512)
  • Faster similarity computation
  • Minimal quality loss (typically 1-3% on benchmarks)
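Truncating an MRL embedding is just slicing and re-normalizing. This assumes the model was trained with MRL; truncating a plain model's embeddings loses far more quality:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length,
    so cosine similarity stays meaningful after truncation."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

full = np.random.default_rng(0).normal(size=1536)  # stand-in embedding
small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```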

Practical Application

Batch Embedding Calls

For large-scale indexing, batch multiple texts per API call:

```python
all_embeddings = []

# Inefficient: one API call per document
for doc in documents:
    all_embeddings.append(embed(doc))  # N API calls

# Efficient: batch calls
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    all_embeddings.extend(embed(batch))  # ceil(N/100) API calls
```

Domain Adaptation

The Domain Shift Problem

General embedding models are trained on web text (Wikipedia, news, forums). When applied to specialized domains, performance degrades:

| Domain | Challenge |
|---|---|
| Legal | Specific terminology (“tort”, “habeas corpus”, “estoppel”) |
| Medical | Technical vocabulary (“acute myocardial infarction”) |
| Scientific | Jargon, abbreviations, formulas |
| Code | Syntax, variable names, non-natural language |

Strategies for Domain Adaptation

1. Fine-Tuning (Best Quality, Most Effort)

Train on domain-specific data:

  • Collect query-document pairs from your domain
  • Fine-tune using contrastive loss
  • Requires labeled data (or synthetic generation)

Frameworks: Sentence Transformers, LoRA (Low-Rank Adaptation)

2. Use Domain-Specific Models

Pre-trained models for specific domains:

  • Legal: Legal-BERT embeddings
  • Medical: PubMedBERT, BiomedCLIP
  • Scientific: SciBERT, SPECTER2
  • Code: CodeBERT, StarCoder embeddings

3. Hybrid Retrieval

Combine embeddings with BM25 keyword search:

  • Embeddings catch semantic matches
  • BM25 catches exact terminology
  • Hybrid Search combines both for robustness
  • This is often the simplest mitigation for domain shift.
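One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which needs only rank positions, sidestepping the problem that dense and BM25 scores are not on comparable scales:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).
    k=60 is the commonly used constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d2", "d1", "d5"]  # from embedding search
bm25_ranking = ["d1", "d4", "d2"]   # from keyword search
fused = rrf_fuse([dense_ranking, bm25_ranking])
print(fused)  # d1 and d2 come first: they appear in both rankings
```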

Multilingual Embeddings

Cross-Lingual Retrieval

Multilingual models map text from different languages into the same vector space:

  • Query in English → retrieve documents in French, German, Chinese
  • Single index for all languages

Leading Multilingual Models

| Model | Languages | Notes |
|---|---|---|
| BGE-M3 | 100+ | Excellent multilingual, open-source |
| Cohere embed-v3 | 100+ | API-based, strong performance |
| E5-multilingual | 100+ | Open-source |
| OpenAI text-embedding-3-* | 100+ | Good multilingual support |

Considerations

  • Cross-lingual performance is typically 5-15% lower than monolingual
  • Some language pairs work better than others (related languages transfer better)
  • For critical applications, consider language-specific models

Token Limits and Long Documents

The Context Window Problem

Most embedding models have limited context windows:

  • all-MiniLM-L6-v2: 256 tokens
  • BGE-base: 512 tokens
  • OpenAI text-embedding-3-small: 8191 tokens

Text beyond the limit is truncated, losing information.

Strategies for Long Documents

1. Chunk and Embed (Standard Approach)

Split documents into chunks that fit the context window. See Chunking Strategies.

Trade-off: Loses document-level context.

2. Use Long-Context Models

Choose models with larger windows:

  • nomic-embed-text: 8192 tokens
  • GTE-large: 8192 tokens
  • jina-embeddings-v2: 8192 tokens

Trade-off: Longer context = slower inference, higher cost.

3. Hierarchical Embeddings

Embed at multiple granularities:

  • Sentence-level embeddings for precision
  • Paragraph-level for broader context
  • Document-level for theme matching

Query against all levels, merge results.

4. Late Chunking

Newer technique where:

  1. Run the full document through the transformer (up to max tokens)
  2. Pool token embeddings into chunk embeddings afterward

Benefit: Chunks retain context from surrounding text.

Implementations: LlamaIndex, some embedding providers
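The pooling step (step 2 above) can be sketched as follows, assuming we already have per-token embeddings from a full-document forward pass (random stand-ins here):

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, boundaries):
    """Mean-pool contiguous token spans into one embedding per chunk.
    `boundaries` is a list of (start, end) token indices per chunk.
    Each token embedding already attended to the whole document, so
    the pooled chunk embeddings carry surrounding context."""
    return np.stack([token_embeddings[s:e].mean(axis=0)
                     for s, e in boundaries])

# Stand-in for transformer output over a 100-token document (dim 384)
token_embs = np.random.default_rng(0).normal(size=(100, 384))
chunks = late_chunk(token_embs, [(0, 40), (40, 80), (80, 100)])
print(chunks.shape)  # (3, 384)
```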

Common Pitfalls

1. Embedding Model Mismatch

Problem: Using different models for indexing vs. querying.

Why It Breaks: Vector spaces are incompatible. Similarity scores become meaningless.

Solution: Track which model created each embedding. Store model name as metadata.

2. Ignoring Instruction Prefixes

Problem: Using instruction-tuned models without the required prefixes.

Why It Matters: Model was trained with specific formats. Without them, embeddings are suboptimal.

Solution: Check model documentation. BGE, E5, and others require specific prefixes.
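For example, E5-family models expect `"query: "` and `"passage: "` prefixes (check the model card: BGE instead prepends an instruction string to queries). A small wrapper keeps the prefixes from being forgotten:

```python
# E5-style prefixes; adjust to whatever your model's card specifies.
def format_query(text: str) -> str:
    """Prefix applied to search queries before embedding."""
    return f"query: {text}"

def format_passage(text: str) -> str:
    """Prefix applied to documents before embedding."""
    return f"passage: {text}"

print(format_query("what is tort law?"))  # query: what is tort law?
```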

3. Truncation Without Awareness

Problem: Long text silently truncated.

Symptoms: Documents that should be similar are not. Key information at the end of chunks is lost.

Solution: Monitor token counts. Design chunking to stay within limits. Consider overlap.
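A lightweight guard can flag oversized chunks before indexing. The sketch below uses the rough 4-characters-per-token heuristic for English; swap in the model's real tokenizer (e.g. tiktoken or a Hugging Face tokenizer) for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate for English text: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def check_chunks(chunks, max_tokens=512, warn_ratio=0.9):
    """Return indices of chunks near or over the model's token limit."""
    return [i for i, chunk in enumerate(chunks)
            if estimate_tokens(chunk) > warn_ratio * max_tokens]

chunks = ["short chunk", "x" * 5000]
print(check_chunks(chunks))  # [1]: the second chunk would be truncated
```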

4. Not Normalizing Embeddings

Problem: Using cosine similarity with non-normalized embeddings.

Why It Matters: Cosine similarity assumes unit vectors. Results may be incorrect.

Solution: Check model documentation. Normalize if not done by default.
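Normalization is one line, and afterwards a plain dot product is the cosine similarity:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale to unit length so that dot product == cosine similarity."""
    return v / np.linalg.norm(v)

a = normalize(np.array([3.0, 4.0]))  # original norm was 5
b = normalize(np.array([4.0, 3.0]))
cosine = float(a @ b)                # cosine similarity of the originals
print(cosine)  # 0.96  (= (3*4 + 4*3) / (5*5))
```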

5. Overfitting to Benchmarks

Problem: Choosing model purely based on MTEB scores.

Reality: Benchmark performance does not always transfer to your specific domain.

Solution: Evaluate on your own data. Create a small test set of queries and relevant documents.

Comparisons

Embedding vs TF-IDF/BM25

| Aspect | Embeddings (Dense) | TF-IDF/BM25 (Sparse) |
|---|---|---|
| Representation | Dense vectors (fixed dimension $d$) | Sparse vectors (dimension = vocabulary size) |
| Semantic understanding | Yes (synonyms, paraphrases) | No (exact matches only) |
| Training | Required (pre-trained models) | None (statistical formula) |
| Storage | Fixed size per document | Variable (depends on document length) |
| Exact term matching | Weak | Strong |
| Novel vocabulary | Handled via subword tokenization | Fails on unseen terms |
| Best for | Semantic similarity | Keyword matching, domain terms |

Embedding Model Comparison

| Model | Pros | Cons |
|---|---|---|
| OpenAI | High quality, easy API | Cost, vendor lock-in |
| Cohere | Multilingual, compression | Cost |
| BGE-M3 | Open-source, hybrid, multilingual | Self-hosting complexity |
| E5-large-v2 | Strong open-source | Shorter context (512) |
| all-MiniLM-L6-v2 | Fast, tiny | Lower quality, short context |

Advanced Topics

Sparse-Dense Hybrid Embeddings

Models like BGE-M3 and SPLADE output both dense and sparse representations:

  • Dense: traditional semantic embedding (1024 dims)
  • Sparse: learned term weights (vocabulary-sized sparse vector)

Benefits:

  • Combines semantic understanding with keyword matching
  • Single model for Hybrid Search
  • No separate BM25 index needed

Embedding Quantization

Reduce storage and speed up search by quantizing embeddings:

| Type | Original | Quantized | Memory Savings |
|---|---|---|---|
| float32 | 4 bytes/dim | (baseline) | 0% |
| float16 | 4 bytes/dim | 2 bytes/dim | 50% |
| int8 | 4 bytes/dim | 1 byte/dim | 75% |
| binary | 4 bytes/dim | 1 bit/dim | 97% |

Most Vector Databases support quantization options.
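Binary quantization, the most aggressive option above, keeps only the sign of each dimension; similarity is then approximated by Hamming distance over the packed bits. A minimal NumPy sketch:

```python
import numpy as np

def binary_quantize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: 1 where the value is positive, else 0.
    np.packbits stores 8 dimensions per byte (32x smaller than float32)."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; lower means more similar."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
v1 = rng.normal(size=1024).astype(np.float32)
v2 = rng.normal(size=1024).astype(np.float32)
q1, q2 = binary_quantize(v1), binary_quantize(v2)
print(q1.nbytes, "bytes vs", v1.nbytes)  # 128 bytes vs 4096
```

In practice, vector databases often use the binary codes for a fast first pass and re-score the top candidates with full-precision vectors.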

Resources

Documentation

Papers

Benchmarks


Back to: 01 - RAG Index