How-to: Customize Chunking Strategies
This guide shows you how to choose and configure different chunking strategies for your RAG pipeline. You can read more about why chunking matters in Explanation: Understanding Chunking Strategies.
Available Chunkers
NERxiv provides three chunking strategies:
1. Fixed-Size Chunker (Default)
The Chunker class splits the text into fixed-size, character-based chunks, with a configurable overlap between consecutive chunks.
When to use:
- General-purpose chunking
- When you want consistent chunk sizes
- When processing speed is important
CLI usage:
Python usage:
from nerxiv.chunker import Chunker
chunker = Chunker(chunk_size=1000, chunk_overlap=200, text=paper_text)
chunks = chunker.chunk_text()
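To make the effect of chunk_size and chunk_overlap concrete, here is a minimal, standalone sketch of fixed-size chunking with overlap. It is an illustration only, not NERxiv's implementation; the function name `fixed_size_chunks` is hypothetical, and the real Chunker may split differently (for example, at separators).

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample input, small enough to check by hand.
sample = "abcdefghijklmnopqrstuvwxyz"
overlap_demo = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=4)
print(overlap_demo)
```

Each chunk starts chunk_size - chunk_overlap characters after the previous one, so a larger overlap produces more chunks but makes it less likely that a fact is cut in half at a chunk boundary.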
2. Semantic Chunker
The SemanticChunker uses spaCy to create chunks at sentence boundaries.
When to use:
- When you want to preserve sentence integrity
- When semantic coherence is important
- For extracting specific facts or statements
CLI usage:
Python usage:
from nerxiv.chunker import SemanticChunker
chunker = SemanticChunker(text=paper_text)
chunks = chunker.chunk_text()
This chunker automatically groups sentences together while maintaining semantic boundaries.
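The idea of grouping whole sentences can be sketched without any dependencies. The example below is an assumption-laden illustration: it swaps spaCy's sentence segmentation for a naive regular expression, and both the function name `sentence_chunks` and the `max_chars` parameter are hypothetical, not part of NERxiv.

```python
import re


def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Naive splitter (stand-in for spaCy): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it would overflow
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text for illustration.
sentence_demo = sentence_chunks(
    "We study FeSe. The gap is 2 meV. Results agree with ARPES.", max_chars=35
)
print(sentence_demo)
```

A sentence is never split across two chunks here; a single sentence longer than max_chars simply becomes its own oversized chunk.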
3. Advanced Semantic Chunker
The AdvancedSemanticChunker uses KMeans clustering on sentence embeddings to group semantically similar sentences.
When to use:
- When you want topically coherent chunks
- When extracting complex, multi-sentence information
- When you know approximately how many topics are in the paper
CLI usage:
Python usage:
from nerxiv.chunker import AdvancedSemanticChunker
chunker = AdvancedSemanticChunker(n_chunks=10, text=paper_text)
chunks = chunker.chunk_text()
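The clustering idea can be illustrated with a toy, dependency-free sketch. This is not how AdvancedSemanticChunker is implemented: it replaces real sentence embeddings with word-count vectors and uses a hand-written k-means with deterministic initialization. All function names below are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str, vocab: list[str]) -> list[float]:
    """Toy embedding (assumption): word counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"\w+", sentence.lower()))
    return [float(counts[word]) for word in vocab]


def kmeans_labels(vectors: list[list[float]], k: int, n_iter: int = 10) -> list[int]:
    """Plain k-means; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_chunks(sentences: list[str], n_chunks: int) -> list[str]:
    """Join sentences that fall into the same cluster into one chunk."""
    vocab = sorted({w for s in sentences for w in re.findall(r"\w+", s.lower())})
    vectors = [embed(s, vocab) for s in sentences]
    k = min(n_chunks, len(sentences))  # cannot have more clusters than sentences
    labels = kmeans_labels(vectors, k)
    groups: dict[int, list[str]] = {}
    for sentence, label in zip(sentences, labels):
        groups.setdefault(label, []).append(sentence)
    return [" ".join(group) for group in groups.values()]


# Hypothetical sample sentences for illustration.
demo_sentences = [
    "DFT was used for the band structure.",
    "The band structure shows a gap.",
    "Samples were grown by flux.",
    "Flux growth gave large samples.",
]
clustered = cluster_chunks(demo_sentences, n_chunks=2)
print(clustered)
```

Unlike the two previous strategies, sentences in one chunk need not be adjacent in the paper, which is why the number of clusters has to be chosen with the paper's topic count in mind.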
Choosing the Right Strategy
| Your Goal | Recommended Chunker | Why |
|---|---|---|
| Fast processing | Chunker | Simple, no NLP overhead |
| Extract formulas/numbers | Chunker or SemanticChunker | Preserves local context |
| Extract methodology descriptions | AdvancedSemanticChunker | Groups related methodological text |
| General metadata extraction | SemanticChunker | Good balance of speed and quality |
| Highly specific technical queries | AdvancedSemanticChunker | Better topical grouping |
Advanced Configuration
Adjusting Fixed-Size Chunks
The CLI does not accept chunk_size directly, but you can set it in your own Python scripts:
from pathlib import Path
import h5py
from nerxiv.chunker import Chunker
from nerxiv.rag import CustomRetriever, LLMGenerator
from nerxiv.prompts import PROMPT_REGISTRY
# Load paper text
paper_path = Path("paper.hdf5")
with h5py.File(paper_path, "r") as f:
    arxiv_id = paper_path.stem
    text = f[arxiv_id]["arxiv_paper"]["text"][()].decode("utf-8")
# Custom chunking
chunker = Chunker(chunk_size=1500, chunk_overlap=300, text=text)
chunks = chunker.chunk_text()
# Continue with retrieval and generation
retriever_query = PROMPT_REGISTRY["material_formula"].retriever_query
retriever = CustomRetriever(n_top_chunks=5, query=retriever_query)
top_text = retriever.get_relevant_chunks(chunks=chunks)
prompt = PROMPT_REGISTRY["material_formula"].prompt
generator = LLMGenerator(model="llama3.1:70b", text=top_text)
answer = generator.generate(prompt=prompt.build(text=top_text))
print(answer)
Adjusting Semantic Clusters
For papers that cover many distinct topics, increase n_chunks, the number of clusters:
from nerxiv.chunker import AdvancedSemanticChunker
chunker = AdvancedSemanticChunker(n_chunks=15, text=paper_text) # More granular clustering
chunks = chunker.chunk_text()
Debugging Chunks
To see what chunks are created, inspect them in Python:
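A minimal sketch of such an inspection helper follows. It assumes that each chunk is either a plain string or an object exposing its text through a `page_content` attribute; which of the two chunk_text() returns is not confirmed here, and `preview_chunks` is a hypothetical name, not a NERxiv function.

```python
def preview_chunks(chunks, max_chars: int = 80) -> list[str]:
    """Return one summary line per chunk: index, length, and the first characters."""
    lines = []
    for i, chunk in enumerate(chunks):
        # Accept plain strings or objects with a `page_content` attribute (assumption).
        content = str(getattr(chunk, "page_content", chunk))
        snippet = " ".join(content.split())[:max_chars]  # collapse whitespace for display
        lines.append(f"[{i}] {len(content)} chars: {snippet}")
    return lines


# Stand-in data; in practice pass the result of chunker.chunk_text().
example_chunks = ["First chunk of text.", "Second chunk,\nwith a line break."]
for line in preview_chunks(example_chunks, max_chars=40):
    print(line)
```

Checking the chunk count and the length of each chunk this way is a quick test of whether chunk_size, chunk_overlap, or n_chunks need adjusting before you run retrieval.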