API Reference¶
This API reference provides comprehensive documentation for all public classes and functions in NERxiv. For more detailed examples and usage patterns, see the How-to Guides and Tutorial sections.
nerxiv.chunker¶

Chunker¶

Bases: BaseChunker

Chunk text into smaller parts for processing and avoiding the token limit of an LLM model.

Source code in nerxiv/chunker.py

chunk_size = kwargs.get('chunk_size', 1000)¶

chunk_overlap = kwargs.get('chunk_overlap', 200)¶

__init__(text='', **kwargs)¶

chunk_text()¶

Chunk the text into smaller parts. This is done to avoid exceeding the token limit of the LLM.

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | The list of chunks. |

Source code in nerxiv/chunker.py
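
A minimal usage sketch based on the constructor and keyword arguments documented above; the input text is a placeholder:

```python
# Hypothetical usage of Chunker; chunk_size and chunk_overlap follow the
# documented defaults above.
from nerxiv.chunker import Chunker

paper_text = "..."  # placeholder: the full text of a paper
chunker = Chunker(text=paper_text, chunk_size=1000, chunk_overlap=200)
chunks = chunker.chunk_text()  # list[Document]
```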
SemanticChunker¶

Bases: BaseChunker

Sentence-level semantic chunker using spaCy.

Source code in nerxiv/chunker.py

__init__(text='', **kwargs)¶

chunk_text()¶

Chunk the text into smaller parts based on semantic meaning using spaCy.

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | The list of chunks. |

Source code in nerxiv/chunker.py
AdvancedSemanticChunker¶

Bases: BaseChunker

KMeans-based semantic chunker using SentenceTransformer embeddings.

Source code in nerxiv/chunker.py

n_chunks = kwargs.get('n_chunks', 10)¶

model = kwargs.get('model', 'all-MiniLM-L6-v2')¶

__init__(text='', **kwargs)¶

chunk_text()¶

Chunk the text into smaller parts based on semantic meaning using KMeans clustering on sentence embeddings.

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | The list of chunks. |

Source code in nerxiv/chunker.py
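
A minimal sketch of the KMeans-based chunker using the `n_chunks` and `model` keyword arguments documented above (the values shown are the documented defaults):

```python
# Hypothetical usage of AdvancedSemanticChunker with the documented kwargs.
from nerxiv.chunker import AdvancedSemanticChunker

chunker = AdvancedSemanticChunker(
    text=paper_text,            # placeholder text, as in the Chunker example
    n_chunks=10,                # number of clusters, i.e. resulting chunks
    model="all-MiniLM-L6-v2",   # SentenceTransformer model for the embeddings
)
semantic_chunks = chunker.chunk_text()  # list[Document]
```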
nerxiv.rag.retriever¶

RETRIEVER_VERSION = '1.0.0'¶

Retriever¶

Bases: ABC

Abstract base class for retrieving relevant chunks of text from a list of documents. This class is designed to be inherited from and implemented by specific retriever classes.

Source code in nerxiv/rag/retriever.py

logger = kwargs.get('logger', logger)¶

model_name = kwargs.get('model', 'all-MiniLM-L6-v2')¶

n_top_chunks = kwargs.get('n_top_chunks', 5)¶

query = kwargs.get('query')¶

__init__(**kwargs)¶

Source code in nerxiv/rag/retriever.py
CustomRetriever¶

Bases: Retriever

A custom retriever class that uses the SentenceTransformer model to retrieve relevant chunks of text from a list of documents.

Source code in nerxiv/rag/retriever.py

model = SentenceTransformer(self.model_name)¶

__init__(**kwargs)¶

get_relevant_chunks(chunks=[])¶

Retrieves the most relevant chunks of text from a list of documents using the SentenceTransformer model.

| PARAMETER | DESCRIPTION |
|---|---|
| `chunks` | The chunks to be ranked. Defaults to `[]`. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The top `n_top_chunks` relevant chunks of text. |

Source code in nerxiv/rag/retriever.py
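
A minimal retrieval sketch, assuming the keyword arguments inherited from Retriever (`query`, `model`, `n_top_chunks`) and that `chunks` is the output of a chunker's chunk_text(); the query string is made up:

```python
# Hypothetical usage of CustomRetriever; the query below is illustrative only.
from nerxiv.rag.retriever import CustomRetriever

retriever = CustomRetriever(
    query="Which material is simulated in this paper?",
    model="all-MiniLM-L6-v2",
    n_top_chunks=5,
)
relevant_text = retriever.get_relevant_chunks(chunks=chunks)  # str with the top chunks
```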
LangChainRetriever¶

Bases: Retriever

Source code in nerxiv/rag/retriever.py

embeddings = HuggingFaceEmbeddings(model_name=(self.model_name))¶

__init__(**kwargs)¶

get_relevant_chunks(chunks=[])¶

Retrieves the most relevant chunks of text from a list of documents using the HuggingFaceEmbeddings model.

| PARAMETER | DESCRIPTION |
|---|---|
| `chunks` | The chunks to be ranked. Defaults to `[]`. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The top `n_top_chunks` relevant chunks of text. |

Source code in nerxiv/rag/retriever.py
nerxiv.rag.generator¶

LLMGenerator¶

LLMGenerator class for generating answers with the generate method, using an LLM model specified by the user. The LLM model is loaded using the OllamaLLM implementation in LangChain. Read more at https://python.langchain.com/docs/integrations/llms/ollama/.

Source code in nerxiv/rag/generator.py

text = text¶

logger = kwargs.get('logger', logger)¶

llm = OllamaLLM(**ollama_kwargs)¶

__init__(text='', **kwargs)¶

Source code in nerxiv/rag/generator.py

generate(prompt='', regex='\\n\\nAnswer\\: *', del_regex='\\n\\nAnswer\\: *')¶

Generates an answer from the provided prompt using the specified LLM model, provided that the token limit is not exceeded.

Args:
    prompt (str, optional): The prompt to be used for generating the answer. Defaults to "".
    regex (str, optional): The regex pattern to search for in the answer. Defaults to r"\n\nAnswer\: *".
    del_regex (str, optional): The regex pattern to delete from the answer. Defaults to r"\n\nAnswer\: *".

Returns:
    str: The generated and cleaned answer from the LLM model.

Source code in nerxiv/rag/generator.py
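
A minimal generation sketch. How constructor kwargs are forwarded to OllamaLLM (e.g. a `model` name) is an assumption, and the prompt string is made up:

```python
# Hypothetical usage of LLMGenerator; forwarding `model` to OllamaLLM is an assumption.
from nerxiv.rag.generator import LLMGenerator

generator = LLMGenerator(text=relevant_text, model="llama3.1")
answer = generator.generate(prompt="List all material formulas mentioned in the text.")
```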
nerxiv.rag.agents¶

BaseAgent¶

Bases: ABC

Abstract base class for extraction agents. All agents should implement the run method, which executes the extraction workflow and returns structured results.

Source code in nerxiv/rag/agents.py

run(text, prompt, **kwargs)¶

Execute the extraction workflow.

| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Input text to process. |
| `prompt` | Prompt template for the LLM. |
| `**kwargs` | Additional parameters specific to the agent. |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | Dictionary containing extraction results. |

Source code in nerxiv/rag/agents.py
RAGExtractorAgent¶

Bases: BaseAgent

Source code in nerxiv/rag/agents.py
chunker = chunker¶

retriever = retriever¶

generator = generator¶

chunker_params = kwargs.get('chunker_params', {})¶

retriever_params = kwargs.get('retriever_params', {})¶

generator_params = kwargs.get('generator_params', {})¶

logger = kwargs.get('logger', logger)¶

__init__(chunker, retriever, generator, **kwargs)¶

Source code in nerxiv/rag/agents.py
parse(answer)¶

Parse JSON from the LLM answer if the prompt is of StructuredPrompt type. This method attempts to extract JSON from fenced markdown code blocks (json) and, if successful, returns the parsed data. If no code blocks are found, it tries to find JSON patterns directly in the text.

| PARAMETER | DESCRIPTION |
|---|---|
| `answer` | Raw LLM output string. |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] \| None | Parsed JSON data as a dictionary, or None if parsing fails. |

Source code in nerxiv/rag/agents.py
run(file=None, text='', prompt=None)¶

Runs the RAG extraction pipeline: chunking, retrieval, and generation. Chunking and retrieval results are cached in the provided HDF5 file to avoid redundant computations. If the prompt is of type StructuredPrompt, the generated answer is parsed into structured data.

| PARAMETER | DESCRIPTION |
|---|---|
| `file` | The file in which to store the metainformation. Defaults to None. |
| `text` | The text to process. Defaults to "". |
| `prompt` | The prompt used for the LLM prompting. Defaults to None. |

Source code in nerxiv/rag/agents.py
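
A sketch of wiring the full pipeline. Whether `chunker`, `retriever`, and `generator` are passed as classes or instances, the exact contents of the `*_params` dictionaries, and the form of the HDF5 `file` argument are assumptions based on the attributes listed above:

```python
# Hypothetical end-to-end extraction run; class-vs-instance arguments and the
# *_params contents are assumptions. `my_prompt` is a Prompt or StructuredPrompt
# instance (see nerxiv.prompts.prompts below).
from nerxiv.chunker import SemanticChunker
from nerxiv.rag.retriever import CustomRetriever
from nerxiv.rag.generator import LLMGenerator
from nerxiv.rag.agents import RAGExtractorAgent

agent = RAGExtractorAgent(
    chunker=SemanticChunker,
    retriever=CustomRetriever,
    generator=LLMGenerator,
    retriever_params={"query": "material simulated in the paper", "n_top_chunks": 5},
    generator_params={"model": "llama3.1"},
)
result = agent.run(file="paper.h5", text=paper_text, prompt=my_prompt)
```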
nerxiv.prompts.prompts¶

Example¶

Bases: BaseModel

Represents an example for a prompt, containing input text and expected output.

Source code in nerxiv/prompts/prompts.py

BasePrompt¶

Bases: BaseModel

Base class used as an interface for other prompt classes. It defines the common fields and methods that all prompts should implement. This class is not meant to be instantiated directly.

Source code in nerxiv/prompts/prompts.py
expert = Field(..., description="\n The expert or main field of expertise for the prompt. For example, 'Condensed Matter Physics'.\n ")¶

sub_field_expertise = Field(None, description="\n The sub-field of expertise for the prompt. For example, 'many-body physics simulations'.\n ")¶

examples = Field([], description="\n Examples to illustrate the prompt. These are formatted as:\n\n 'Examples of how to answer the prompt:\n Example 1:\n - Input text: `example.input`\n - Answer: `example.output`'\n\n They are used to guide the model on how to answer the prompt.\n ")¶

constraints = Field([], description="\n Constraints to be followed in the output of the prompt. These are formatted as\n\n 'Important constraints when generating the output: `constraints`'.\n\n They are mainly used as instructions to avoid unused text, broken formats or sentences, etc.\n ")¶
build()¶

Builds the prompt based on the fields defined in this class. This is used to format the prompt and append the text to be sent to the LLM for generation.

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | This method should be implemented in subclasses. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The formatted prompt ready to be sent to the LLM. |

Source code in nerxiv/prompts/prompts.py
Prompt¶

Bases: BasePrompt

Represents a prompt object with various fields to define its structure and content. The final prompt is built using the build() method, which formats the prompt based on the provided text and the fields defined in this class.

Source code in nerxiv/prompts/prompts.py

main_instruction = Field(..., description='\n Main instruction for the prompt. This has to be written in the imperative form, e.g. \'identify all mentions of the system being simulated\'.\n The format in the prompt is "Given the following scientific text, your task is `main_instruction`",\n ')¶

secondary_instructions = Field([], description='\n Secondary instructions for the prompt. These are additional instructions that complement `main_instruction`\n and are formatted as "Additionally, you also need to follow these instructions: `secondary_instructions`".\n ')¶
build(text)¶

Builds the prompt based on the fields defined in this class. This is used to format the prompt and append the text to be sent to the LLM for generation.

| PARAMETER | DESCRIPTION |
|---|---|
| `text` | The text to append to the prompt. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The formatted prompt ready to be sent to the LLM. |

Source code in nerxiv/prompts/prompts.py
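
A sketch of building a Prompt from the fields documented above; the concrete field values are illustrative only:

```python
# Hypothetical Prompt construction; the field values below are made up.
from nerxiv.prompts.prompts import Prompt

prompt = Prompt(
    expert="Condensed Matter Physics",
    sub_field_expertise="many-body physics simulations",
    main_instruction="identify all mentions of the system being simulated",
    constraints=["Answer only with the names of the systems, separated by commas."],
)
formatted_prompt = prompt.build(text=relevant_text)  # str sent to the LLM
```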
StructuredPrompt¶

Bases: BasePrompt

Represents a prompt object with various fields to define its structure and content. The final prompt is built using the build() method, which formats the prompt based on the provided text and the fields defined in this class.

Note: The main difference with the Prompt class is that StructuredPrompt is designed to work with a specific output schema, so instead of using main_instruction, secondary_instructions and constraints, the instructions are automatically defined by output_schema and target_fields.

Source code in nerxiv/prompts/prompts.py
output_schema = Field(..., description='\n The target `BaseModel` schema in which the fields to be extracted are defined.\n ')¶

target_fields = Field(..., description='\n The fields within `output_schema` that the prompt should extract. If set to `all`, all fields defined in `output_schema` will be extracted.\n ')¶
validate_target_fields_in_schema(data)¶

Validates that the target_fields are defined in the output_schema and that they are of type Field.

| PARAMETER | DESCRIPTION |
|---|---|
| `data` | The data containing the field values to validate. |

| RETURNS | DESCRIPTION |
|---|---|
| `Any` | The data with the validated fields. |

Source code in nerxiv/prompts/prompts.py
build(text)¶

Builds the prompt based on the fields defined in this class. This is used to format the prompt and append the text to be sent to the LLM for generation.

| PARAMETER | DESCRIPTION |
|---|---|
| `text` | The text to append to the prompt. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The formatted prompt ready to be sent to the LLM. |

Source code in nerxiv/prompts/prompts.py
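
A sketch of a StructuredPrompt. The output schema below is invented for illustration, and whether `output_schema` expects the model class itself and `target_fields` a list of field names (versus the string "all") are assumptions based on the field descriptions above:

```python
# Hypothetical StructuredPrompt usage; the Simulation schema is made up.
from pydantic import BaseModel, Field

from nerxiv.prompts.prompts import StructuredPrompt

class Simulation(BaseModel):
    material_formula: str = Field(..., description="Chemical formula of the simulated material.")
    method: str = Field(..., description="Simulation method, e.g. 'DFT+DMFT'.")

structured_prompt = StructuredPrompt(
    expert="Condensed Matter Physics",
    output_schema=Simulation,
    target_fields=["material_formula", "method"],  # or "all" for every field
)
formatted_prompt = structured_prompt.build(text=relevant_text)
```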
PromptRegistryEntry¶

Bases: BaseModel

Represents a registry entry for a prompt, containing the retriever query and the prompt itself. This is used to register prompts in the PROMPT_REGISTRY defined in nerxiv.prompts.prompts_registry.py.

Source code in nerxiv/prompts/prompts.py

retriever_query = Field(..., description='The query used in the retriever.')¶

prompt = Field(..., description='The prompt to use for the query.')¶
clean_retriever_query(value)¶

Cleans the retriever query by removing extra whitespace and newlines.

| PARAMETER | DESCRIPTION |
|---|---|
| `value` | The retriever query to clean. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The cleaned retriever query. |

Source code in nerxiv/prompts/prompts.py
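
A sketch of constructing a registry entry from the two documented fields; how entries are keyed inside PROMPT_REGISTRY is not shown here:

```python
# Hypothetical PromptRegistryEntry; `structured_prompt` is a Prompt or
# StructuredPrompt instance, as in the examples above.
from nerxiv.prompts.prompts import PromptRegistryEntry

entry = PromptRegistryEntry(
    retriever_query="Which material is simulated in this paper?",
    prompt=structured_prompt,
)
```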
nerxiv.utils.utils¶

answer_to_dict(answer='', logger=logger)¶

Converts the answer string to a list of dictionaries by removing unwanted characters. This is useful when prompting the LLM to return a list of objects containing metainformation in a structured way.

| PARAMETER | DESCRIPTION |
|---|---|
| `answer` | The answer string to be converted to a list of dictionaries. Defaults to "". |
| `logger` | The logger to log messages. Defaults to `logger`. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[dict]` | The list of dictionaries extracted from the answer string. |

Source code in nerxiv/utils/utils.py
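
A sketch of answer_to_dict applied to a made-up LLM answer:

```python
# Hypothetical use of answer_to_dict; the raw answer string is invented.
from nerxiv.utils.utils import answer_to_dict

raw_answer = '[{"material_formula": "SrVO3", "method": "DFT+DMFT"}]'
records = answer_to_dict(answer=raw_answer)  # list[dict]
```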
clean_description(description)¶

Cleans the description by removing extra spaces and leading/trailing whitespace.

| PARAMETER | DESCRIPTION |
|---|---|
| `description` | The description string to be cleaned. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The cleaned description string with extra spaces removed. |

Source code in nerxiv/utils/utils.py
filter_material_formula_predicate(answer)¶

Predicate function to determine if the answer indicates the presence of a material formula.

| PARAMETER | DESCRIPTION |
|---|---|
| `answer` | The answer string to be evaluated. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the answer is "model", indicating a material formula is present; False otherwise. |

Source code in nerxiv/utils/utils.py
filter_only_dmft_predicate(answer)¶

Predicate function to determine if the answer indicates the absence of the DMFT method.

| PARAMETER | DESCRIPTION |
|---|---|
| `answer` | The answer string to be evaluated. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the answer is not "True", indicating DMFT is not used; False if DMFT is used. |