How Semantic Retrieval Works¶
Semantic retrieval is the core technology enabling RAG to find relevant information without exact keyword matches. This explanation explores embeddings, similarity metrics, and how retrieval models work in NERxiv.
The Problem with Keyword Search¶
Traditional keyword search matches exact words:
Query: "DFT calculation"
Document 1: "We performed DFT calculations..." ✅ Match
Document 2: "Density functional theory was used..." ❌ No match
Document 3: "First-principles methods..." ❌ No match
Problems:

- Misses synonyms and related terms
- Doesn't understand context
- Can't rank by relevance beyond word frequency
- Fails with paraphrasing
Semantic Search with Embeddings¶
Semantic search converts text into high-dimensional vectors (embeddings) that capture meaning:
Query: "DFT calculation"
Embedding: [0.23, -0.45, 0.12, 0.89, ..., -0.31] (384 dimensions)
Document 1: "DFT calculations"
Embedding: [0.22, -0.44, 0.11, 0.87, ..., -0.29] (very similar!)
Document 2: "Density functional theory"
Embedding: [0.19, -0.42, 0.08, 0.85, ..., -0.28] (also similar!)
Document 3: "The weather is nice"
Embedding: [-0.51, 0.23, -0.78, 0.15, ..., 0.62] (not similar)
Key insight: Semantically similar texts have similar embeddings, even with different words.
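To make this concrete, here is a minimal sketch using the sentence-transformers library; the exact scores are whatever the model produces and will vary slightly between model versions:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is the lightweight default model used throughout this page
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "DFT calculation"
candidates = [
    "We performed DFT calculations...",
    "Density functional theory was used...",
    "The weather is nice",
]

# Encode the query and candidates into 384-dimensional embeddings
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate
scores = util.cos_sim(query_emb, cand_embs).squeeze(0)
for text, score in zip(candidates, scores):
    print(f"{score.item():.2f}  {text}")
# The two DFT-related sentences score high; the weather sentence scores low.
```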
What are Embeddings?¶
Embeddings are dense vector representations of text that encode semantic meaning.
Creating Embeddings¶
A sentence transformer model converts text to embeddings:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

text = "The bandgap is 1.2 eV"
embedding = model.encode(text)
# Returns: numpy array of shape (384,)
# [0.23, -0.45, 0.12, ..., 0.89]
```
Properties of Embeddings¶
- Fixed dimension: All texts become the same size vector (illustrated in the sketch below)
    - all-MiniLM-L6-v2: 384 dimensions
    - all-mpnet-base-v2: 768 dimensions
- Semantic similarity: Similar meanings → similar vectors
- Continuous space: Embeddings exist in a continuous space, enabling similarity measurement
- Language understanding: Captures grammar, context, and relationships
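A quick way to see the fixed-dimension property: encode inputs of very different lengths and compare the output shapes (a minimal sketch; the example sentences are made up):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short_text = "DFT"
long_text = (
    "Density functional theory (DFT) calculations were performed with a "
    "plane-wave basis set and a 520 eV energy cutoff."
)

# Regardless of input length, the output vector always has the same size
print(model.encode(short_text).shape)  # (384,)
print(model.encode(long_text).shape)   # (384,)
```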
Measuring Similarity¶
Once we have embeddings, we need to measure how similar they are.
Cosine Similarity¶
The standard metric is cosine similarity, which measures the angle between vectors:
similarity = cos(θ) = (A · B) / (||A|| × ||B||)
Where:

- A · B = dot product of the two vectors
- ||A|| and ||B|| = magnitudes (Euclidean norms) of the vectors
Range: -1 to 1

- 1.0: Identical meaning
- 0.7-0.9: Very similar
- 0.5-0.7: Moderately similar
- 0.3-0.5: Somewhat related
- <0.3: Different topics
- 0.0: Orthogonal (no relation)
- <0.0: Opposite meaning (rare in practice)
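The formula translates directly into a few lines of NumPy. A sketch with toy 4-dimensional vectors (real embeddings have 384 or more dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: (A · B) / (||A|| × ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.23, -0.45, 0.12, 0.89])
b = np.array([0.22, -0.44, 0.11, 0.87])
print(cosine_similarity(a, b))  # ≈ 1.0 for near-identical vectors
```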
Visual Intuition¶
Imagine embeddings as arrows in space:
Query: "Find DFT methods"
↑
/|\
/ | \
/ | \
Chunk 1: "DFT used" (angle: 15°, sim: 0.97)
/ \
/ \
Chunk 2: "QMC methods" (angle: 45°, sim: 0.71)
\
\
Chunk 3: "Weather data" (angle: 90°, sim: 0.0)
where:
- Small angle = high similarity
- Large angle = low similarity
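Since cosine similarity is literally the cosine of the angle between the two vectors, the numbers in the diagram can be checked directly:

```python
import numpy as np

# Cosine similarity equals the cosine of the angle between the embedding vectors
for angle in (15, 45, 90):
    print(f"{angle}° → similarity {np.cos(np.radians(angle)):.2f}")
# 15° → similarity 0.97
# 45° → similarity 0.71
# 90° → similarity 0.00
```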
Retrieval in NERxiv¶
Here's how semantic retrieval works in NERxiv's RAG pipeline:
Step-by-Step Process¶
```python
from nerxiv.chunker import Chunker
from nerxiv.rag import CustomRetriever

# 1. Chunk the paper
chunker = Chunker(text=paper_text)
chunks = chunker.chunk_text()
# Result: [chunk_1, chunk_2, ..., chunk_100]

# 2. Initialize retriever with query
retriever = CustomRetriever(
    model="all-MiniLM-L6-v2",
    query="Find all mentions of chemical formulas",
    n_top_chunks=5,
)

# 3. Retrieve relevant chunks
top_chunks = retriever.get_relevant_chunks(chunks=chunks)
```
What Happens Inside get_relevant_chunks¶
```python
# Pseudo-code showing the internal process
def get_relevant_chunks(chunks, n_top_chunks):
    # 1. Extract text from chunks
    chunk_texts = [chunk.page_content for chunk in chunks]

    # 2. Encode query
    query_embedding = model.encode(query)
    # Shape: (384,)

    # 3. Encode all chunks
    chunk_embeddings = model.encode(chunk_texts)
    # Shape: (100, 384) for 100 chunks

    # 4. Compute cosine similarity
    similarities = cosine_similarity(query_embedding, chunk_embeddings)
    # Shape: (100,) - one score per chunk
    # Example: [0.87, 0.34, 0.92, 0.15, ..., 0.45]

    # 5. Sort by similarity (descending)
    sorted_indices = argsort(similarities, descending=True)
    # Example: [2, 0, 99, 42, 15, ...] (chunk 2 most similar)

    # 6. Select top N chunks
    top_indices = sorted_indices[:n_top_chunks]
    top_chunks = [chunk_texts[i] for i in top_indices]

    # 7. Join and return
    return "\n\n".join(top_chunks)
```
Example with Real Numbers¶
Query: "Find chemical formulas"
Query embedding: [0.23, -0.45, 0.12, ..., 0.89]
Chunk 0: "The material La₀.₈Sr₀.₂NiO₂ was synthesized..."
Embedding: [0.22, -0.44, 0.11, ..., 0.87]
Similarity: 0.923 ← High!
Chunk 1: "Previous studies have shown..."
Embedding: [-0.31, 0.15, -0.67, ..., 0.23]
Similarity: 0.342 ← Low
Chunk 2: "The formula Fe₂O₃ is commonly used..."
Embedding: [0.24, -0.46, 0.13, ..., 0.88]
Similarity: 0.947 ← Highest!
...
Top 5 chunks: [2, 0, 42, 78, 15] (by similarity)
Retrieval Models¶
Different models create different embeddings, affecting retrieval quality.
Model Characteristics¶
| Model | Dims | Training Data | Best For |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | General text (1B pairs) | General purpose, fast |
| all-mpnet-base-v2 | 768 | General text (1B pairs) | Higher quality, slower |
| msmarco-distilbert-base-v4 | 768 | MS MARCO (passage ranking) | Question answering |
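To confirm the dimensions in the table, each checkpoint can be loaded and asked for its embedding size (a minimal sketch; the models are downloaded on first use):

```python
from sentence_transformers import SentenceTransformer

# Each name in the table is a public sentence-transformers checkpoint
for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2", "msmarco-distilbert-base-v4"):
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())
# all-MiniLM-L6-v2 384
# all-mpnet-base-v2 768
# msmarco-distilbert-base-v4 768
```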
Why Model Choice Matters¶
Different models capture different semantic relationships:
Query: "superconductivity"
all-MiniLM-L6-v2 might rank highly:
1. "superconducting materials" (0.85)
2. "high-Tc compounds" (0.72)
3. "zero resistance" (0.68)
all-mpnet-base-v2 might rank highly:
1. "superconducting materials" (0.88)
2. "Cooper pairs" (0.79) ← Better physics understanding
3. "zero resistance" (0.75)
4. "high-Tc compounds" (0.73)
More sophisticated models understand deeper relationships.
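You can compare models directly by scoring the same candidates with each of them. A minimal sketch (the numbers above are illustrative, so expect different exact scores):

```python
from sentence_transformers import SentenceTransformer, util

query = "superconductivity"
candidates = ["superconducting materials", "Cooper pairs", "zero resistance", "high-Tc compounds"]

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    scores = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(candidates, convert_to_tensor=True),
    ).squeeze(0)
    # Rank candidates by similarity for this model
    ranking = sorted(zip(candidates, scores.tolist()), key=lambda pair: pair[1], reverse=True)
    print(name)
    for text, score in ranking:
        print(f"  {score:.2f}  {text}")
```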
Limitations and Solutions¶
Limitation 1: Domain Gap¶
General models may miss domain-specific terminology.
Solution: Use domain-specific models or fine-tune on your papers
Limitation 2: Very Short Queries¶
Short queries carry less semantic information: the embedding of the query "DFT" captures less context than the embedding of "Find all mentions of DFT calculations and parameters".
Solution: Use more descriptive queries
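A small experiment makes the difference visible: score a short and a descriptive query against the same chunk (a sketch; the chunk text is made up and exact scores depend on the model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk = "DFT calculations were performed with a 500 eV plane-wave cutoff and a 4x4x4 k-point mesh."

for query in ("DFT", "Find all mentions of DFT calculations and parameters"):
    score = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(chunk, convert_to_tensor=True),
    ).item()
    print(f"{score:.2f}  {query!r}")
# The more descriptive query usually aligns better with parameter-heavy chunks.
```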
Limitation 3: Chunk Size Matters¶
Very small chunks lack context, while very large chunks dilute relevance.
Solution: Balance chunk size based on the task (see Explanation: Understanding Chunking Strategies)
Advanced: How Models Learn Embeddings¶
Sentence transformer models are trained using contrastive learning:
- Positive pairs: Similar sentences (paraphrases, translations)
- Negative pairs: Different sentences
- Training objective: Make positive pairs close, negative pairs far (sketched below)
- Result: The model learns to encode semantic similarity
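For readers curious what this looks like in practice, here is a minimal fine-tuning sketch using the sentence-transformers training API (assuming the classic model.fit interface). MultipleNegativesRankingLoss needs only positive pairs, because the other pairs in each batch serve as negatives:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive pairs: sentences that should end up close together in embedding space.
# Other examples in the same batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["We performed DFT calculations", "Density functional theory was used"]),
    InputExample(texts=["The bandgap is 1.2 eV", "An energy gap of 1.2 eV was measured"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Pull positive pairs together and push in-batch negatives apart
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```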
Retrieval vs. Reranking¶
NERxiv uses single-stage retrieval, but some systems use two-stage retrieval:
Single-Stage (NERxiv)¶
The query and all chunks are embedded once, every chunk is ranked by cosine similarity, and the top N chunks go straight to the LLM.
Two-Stage¶
A fast bi-encoder first retrieves a larger candidate set, and a slower, more accurate model (typically a cross-encoder) then reranks those candidates before the top N are selected.
Trade-off: Two-stage is more accurate but slower
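For comparison, a hedged sketch of what a two-stage pipeline could look like, using a sentence-transformers cross-encoder as the reranker (the cross-encoder model name is illustrative, and the chunk texts are placeholders):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Find all mentions of chemical formulas"
chunk_texts = ["...chunk text 1...", "...chunk text 2...", "...chunk text 3..."]  # placeholders

# Stage 1: fast bi-encoder retrieval (the same idea as NERxiv's single stage)
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(
    bi_encoder.encode(query, convert_to_tensor=True),
    bi_encoder.encode(chunk_texts, convert_to_tensor=True),
).squeeze(0)
candidate_ids = scores.argsort(descending=True)[:20].tolist()  # keep a generous candidate set

# Stage 2: a slower cross-encoder reads each (query, chunk) pair jointly and rescores it
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, chunk_texts[i]) for i in candidate_ids])

# Final ranking uses the cross-encoder scores
reranked = sorted(zip(candidate_ids, rerank_scores), key=lambda pair: pair[1], reverse=True)
top_chunks = [chunk_texts[i] for i, _ in reranked[:5]]
```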
Practical Considerations¶
Number of Chunks to Retrieve¶
```bash
# Few chunks (3-5): Fast, focused, may miss information
nerxiv prompt --file-path paper.hdf5 --n-top-chunks 3

# Many chunks (10-15): Comprehensive, slower, may include noise
nerxiv prompt --file-path paper.hdf5 --n-top-chunks 12
```
Rule of thumb:

- Simple queries: 3-5 chunks
- Complex queries: 8-12 chunks
- Exploratory: 15+ chunks
Query Engineering¶
Better queries lead to better retrieval:
❌ Vague: "methods"
✅ Specific: "computational and experimental methods used in the study"

❌ Too narrow: "DFT"
✅ Inclusive: "DFT, density functional theory, and other ab initio methods"
Debugging Retrieval¶
Check similarity scores:
```python
from nerxiv.rag import CustomRetriever
from sentence_transformers import util

retriever = CustomRetriever(model="all-MiniLM-L6-v2", query="your query")

query_emb = retriever.model.encode(retriever.query, convert_to_tensor=True)
chunk_texts = [chunk.page_content for chunk in chunks]
chunk_embs = retriever.model.encode(chunk_texts, convert_to_tensor=True)
similarities = util.pytorch_cos_sim(query_emb, chunk_embs).squeeze(0)

# Print top 5
for i in similarities.argsort(descending=True)[:5]:
    print(f"Similarity: {similarities[i]:.4f}")
    print(f"Chunk: {chunk_texts[i][:200]}\n")
```
If top chunks have low similarity (<0.5), your query may need refinement.
Summary¶
Semantic retrieval works by:
- Encoding query and chunks into embeddings
- Computing cosine similarity between embeddings
- Ranking chunks by similarity score
- Selecting top N chunks for the LLM
This approach:
- Understands meaning, not just keywords
- Works with synonyms and paraphrasing
- Enables precise relevance ranking
- Scales to large document collections