How-to: Customize Chunking Strategies¶

This guide shows you how to choose and configure different chunking strategies for your RAG pipeline. You can read more about why chunking matters in Explanation: Understanding Chunking Strategies.

Available Chunkers¶

NERxiv provides three chunking strategies:

1. Fixed-Size Chunker (Default)¶

The Chunker class uses fixed character-based chunks with overlap.

When to use:

General-purpose chunking
When you want consistent chunk sizes
When processing speed is important

CLI usage:

nerxiv prompt --file-path paper.hdf5 --chunker Chunker

Python usage:

from nerxiv.chunker import Chunker

chunker = Chunker(chunk_size=1000, chunk_overlap=200, text=paper_text)
chunks = chunker.chunk_text()

2. Semantic Chunker¶

The SemanticChunker uses spaCy to create chunks at sentence boundaries.

When to use:

When you want to preserve sentence integrity
When semantic coherence is important
For extracting specific facts or statements

CLI usage:

nerxiv prompt --file-path paper.hdf5 --chunker SemanticChunker

Python usage:

from nerxiv.chunker import SemanticChunker

chunker = SemanticChunker(text=paper_text)
chunks = chunker.chunk_text()

This chunker automatically groups sentences together while maintaining semantic boundaries.

3. Advanced Semantic Chunker¶

The AdvancedSemanticChunker uses KMeans clustering on sentence embeddings to group semantically similar sentences.

When to use:

When you want topically coherent chunks
When extracting complex, multi-sentence information
When you know approximately how many topics are in the paper

CLI usage:

nerxiv prompt --file-path paper.hdf5 --chunker AdvancedSemanticChunker

Python usage:

from nerxiv.chunker import AdvancedSemanticChunker

chunker = AdvancedSemanticChunker(n_chunks=10, text=paper_text)
chunks = chunker.chunk_text()

Choosing the Right Strategy¶

Your Goal	Recommended Chunker	Why
Fast processing	`Chunker`	Simple, no NLP overhead
Extract formulas/numbers	`Chunker` or `SemanticChunker`	Preserves local context
Extract methodology descriptions	`AdvancedSemanticChunker`	Groups related methodological text
General metadata extraction	`SemanticChunker`	Good balance of speed and quality
Highly specific technical queries	`AdvancedSemanticChunker`	Better topical grouping

Advanced Configuration¶

Adjusting Fixed-Size Chunks¶

You can't directly pass chunk_size via CLI, but you can modify it in your Python scripts:

from pathlib import Path
import h5py
from nerxiv.chunker import Chunker
from nerxiv.rag import CustomRetriever, LLMGenerator
from nerxiv.prompts import PROMPT_REGISTRY

# Load paper text
paper_path = Path("paper.hdf5")
with h5py.File(paper_path, "r") as f:
    arxiv_id = paper_path.stem
    text = f[arxiv_id]["arxiv_paper"]["text"][()].decode("utf-8")

# Custom chunking
chunker = Chunker(chunk_size=1500, chunk_overlap=300, text=text)
chunks = chunker.chunk_text()

# Continue with retrieval and generation
retriever_query = PROMPT_REGISTRY["material_formula"].retriever_query
retriever = CustomRetriever(n_top_chunks=5, query=retriever_query)
top_text = retriever.get_relevant_chunks(chunks=chunks)

prompt = PROMPT_REGISTRY["material_formula"].prompt
generator = LLMGenerator(model="llama3.1:70b", text=top_text)
answer = generator.generate(prompt=prompt.build(text=top_text))
print(answer)

Adjusting Semantic Clusters¶

For papers with complex topics, increase the number of clusters:

from nerxiv.chunker import AdvancedSemanticChunker

chunker = AdvancedSemanticChunker(n_chunks=15, text=paper_text)  # More granular clustering
chunks = chunker.chunk_text()

Debugging Chunks¶

To see what chunks are created, inspect them in Python:

from nerxiv.chunker import SemanticChunker

chunker = SemanticChunker(text=paper_text)
chunks = chunker.chunk_text()

# Print first 3 chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"=== Chunk {i} ===")
    print(chunk.page_content)
    print()