Prompt Engineering for Metadata Extraction¶
Prompt engineering is the art and science of crafting instructions that guide LLMs to produce accurate, consistent, and useful outputs. This explanation covers principles, techniques, and best practices for extracting metadata from scientific papers.
A prompt is the instruction given to an LLM. In NERxiv, prompts guide the LLM to extract specific information from retrieved paper chunks.
Basic prompt: something as minimal as "What material is simulated in this text?"
Engineered prompt:
You are a Condensed Matter Physics assistant. Your task is to identify
all mentions of the system being simulated in the following text.
Look for:
- Chemical formulas (e.g., La₀.₈Sr₀.₂NiO₂)
- Specific model names (e.g., "square lattice")
- Material names (e.g., "graphene")
Only consider a mention if it corresponds to an actual simulation of that material.
Ignore references to similar materials used only for comparison.
Important constraints:
- Return only the extracted formulas/names
- Do not include explanations or thinking blocks
- Use pipe | to separate alternative names for the same material
[Text to analyze]
The engineered prompt provides:
- Role: Expert identity
- Task: Clear objective
- Instructions: What to look for
- Constraints: Output format
- Context: Retrieved text
Anatomy of a Good Prompt¶
NERxiv uses a structured approach with six components:
1. Expert Identity¶
Sets the context and expertise:
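For example, a minimal sketch (the `expert` field follows the `Prompt` pattern shown in the patterns section below):

```python
expert="Condensed Matter Physics"
```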
Generated prompt section:
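Something along the lines of:

```
You are a Condensed Matter Physics assistant.
```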
Why it matters: Primes the model to use domain-appropriate knowledge and terminology.
2. Main Instruction¶
The primary task:
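A sketch, reusing the computational-methods example from the rest of this section (the exact wording is illustrative):

```python
main_instruction="identify all computational methods used in the study"
```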
Generated:
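Something along the lines of:

```
Your task is to identify all computational methods used in the study.
```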
Best practices:

- Use action verbs: identify, extract, list, classify
- Be specific: "computational methods" not just "methods"
- Avoid ambiguity: "methods used in the study" not "methods mentioned"
3. Secondary Instructions¶
Detailed guidance:
secondary_instructions=[
"Look for abbreviations like DFT, DMFT, QMC",
"Include full method names when mentioned",
"Distinguish between methods used vs. methods mentioned for comparison",
"Check the methods section, computational details, and results"
]
Best practices:

- Provide 3-5 specific sub-instructions
- Cover edge cases
- Clarify ambiguities
- Guide where to look
4. Constraints¶
Output formatting and restrictions:
constraints=[
"Return only method names, one per line",
"Do not include explanations or reasoning",
"Use pipe | to separate alternative names (e.g., DFT | density functional theory)",
"Return 'None' if no methods are found"
]
Why constraints matter: LLMs tend to be verbose. Constraints ensure clean, parseable output.
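As an illustration of why clean output pays off downstream, a hypothetical post-processing step (not part of NERxiv) could parse the constrained output directly:

```python
def parse_methods(raw: str) -> list[list[str]]:
    """Split line-per-method output into lists of alternative names."""
    if raw.strip().lower() == "none":
        return []
    return [
        [alias.strip() for alias in line.split("|")]
        for line in raw.strip().splitlines()
        if line.strip()
    ]

# "DFT | density functional theory\nQMC" -> [["DFT", "density functional theory"], ["QMC"]]
print(parse_methods("DFT | density functional theory\nQMC"))
```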
5. Examples¶
Few-shot learning examples:
examples=[
Example(
input="We use DFT+DMFT to study the electronic structure.",
output="DFT+DMFT"
),
Example(
input="Our DFT results differ from previous DMFT studies.",
output="DFT"
)
]
Power of examples: Shows the model exactly what you want, especially for format and edge cases.
6. Schema and Target Fields to Extract Structured Outputs¶
Specify the pydantic BaseModel class and the target fields defined within that class from which metadata should be extracted.
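A sketch of what such a schema might look like; the field names and descriptions mirror the generated prompt below, but this is illustrative rather than the exact NERxiv class:

```python
from pydantic import BaseModel, Field

class ChemicalFormulation(BaseModel):
    """A descriptive representation of the chemical composition of a material
    system, expressed in one or more standardized formula formats (e.g., IUPAC,
    anonymous, Hill, or reduced)."""

    iupac: str = Field(
        description=(
            "Chemical formula where the elements are ordered using a formal list "
            "based on electronegativity as defined in the IUPAC nomenclature of "
            "inorganic chemistry (2005)."
        )
    )
    hill: str = Field(
        description=(
            "Chemical formula where Carbon is placed first, then Hydrogen, and "
            "then all the other elements in alphabetical order. If Carbon is not "
            "present, the order is alphabetical."
        )
    )
```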
NERxiv reads the descriptions of both the schema class and the targeted fields and generates a prompt from them:
Given the following scientific text, your task is: to identify all mentions of the ChemicalFormulation. This is defined as A ChemicalFormulation is a descriptive representation of the chemical composition of a material system, expressed in one or more standardized formula formats (e.g., IUPAC, anonymous, Hill, or reduced), each encoding the stoichiometry and elemental ordering according to specific conventions. For the compound H2O2 (hydrogen peroxide), the different formulations would be: iupac: H2O2 anonymous: AB hill: H2O2 reduced: H2O2. You must extract the values of the following fields:
- iupac defined as 'Chemical formula where the elements are ordered using a formal list based on electronegativity as defined in the IUPAC nomenclature of inorganic chemistry (2005): - https://en.wikipedia.org/wiki/List_of_inorganic_compounds Contains reduced integer chemical proportion numbers where the proportion number is omitted if it is 1.' and which is of type string
- hill defined as 'Chemical formula where Carbon is placed first, then Hydrogen, and then all the other elements in alphabetical order. If Carbon is not present, the order is alphabetical.' and which is of type string
You must return the extracted values in the following format:
```json
'ChemicalFormulation': {
'iupac': <parsed-value>,
'hill': <parsed-value>,
}
```
Principles of Effective Prompts¶
Principle 1: Clarity Over Brevity¶
❌ Too brief: an instruction like "Extract formulas." leaves the scope undefined.
✅ Clear:
Extract all chemical formulas representing materials that were
actually simulated or synthesized in this study.
Principle 2: Provide Context¶
❌ No context: simply asking "Is DMFT used?"
✅ With context:
You are a Condensed Matter Physics expert. Determine whether
DMFT (Dynamical Mean Field Theory) or its variants (DFT+DMFT, EDMFT)
were used as a primary computational method in this study.
Principle 3: Handle Edge Cases as Constraints¶
Common edge cases in scientific papers:
Case 1: Mentioned but not used
Input: "Our DFT results differ from previous DMFT studies."
Question: Was DMFT used?
Answer: No (only DFT was used, DMFT was referenced)
Case 2: Multiple representations (e.g., the same material written both as a name and as a formula, such as "strontium titanate (SrTiO₃)")
Case 3: Implicit information
Input: "The nickelate was synthesized at 800°C"
Expected: Extract "nickelate" even without exact formula
Add instructions for these:
constraints=[
"Only consider methods actually used, not just mentioned for comparison",
"Expand symbolic formulas (e.g., La₁₋ₓSrₓNiO₂ with x=0.2 → La₀.₈Sr₀.₂NiO₂)",
"Include material class names if specific formulas aren't given"
]
Principle 4: Use Examples Strategically¶
Include examples that cover:
- Simple case: The most straightforward scenario
- Edge case: Something tricky or ambiguous
- Negative case: When nothing should be extracted
- Complex case: Multiple entities or formats
examples=[
# Simple
Example(
input="The material is silicon (Si).",
output="Si"
),
# Multiple
Example(
input="We study Fe₂O₃ and its doped variant Fe₂O₃.₂₅.",
output="Fe₂O₃, Fe₂O₃.₂₅"
),
# Edge case - mentioned but not studied
Example(
input="SrVO₃ is similar to SrTiO₃ but has different properties.",
output="SrVO₃"
),
# Complex - symbolic to explicit
Example(
input="The system is doped La₁₋ₓSrₓNiO₂, for x=0.2.",
output="La₀.₈Sr₀.₂NiO₂"
)
]
Common Prompting Patterns¶
Pattern 1: Classification (Yes/No)¶
prompt = Prompt(
expert="Physics",
main_instruction="determine if DMFT methodology is used",
secondary_instructions=[
"DMFT includes DFT+DMFT, EDMFT, and other variants",
"Return True only if DMFT is a primary method used",
"Return False if DMFT is only mentioned as reference"
],
constraints=[
"Return only 'True' or 'False'",
"No explanations"
],
examples=[
Example(input="We use DFT+DMFT.", output="True"),
Example(input="We use DFT.", output="False"),
Example(input="Our DFT results differ from DMFT studies.", output="False")
]
)
Pattern 2: Extraction (List of Entities)¶
prompt = Prompt(
expert="Chemistry",
main_instruction="extract all chemical formulas mentioned",
secondary_instructions=[
"Include systematic names and common names",
"Expand symbolic notation if values are given",
"Include both reactants and products"
],
constraints=[
"One formula per line",
"Use standard chemical notation",
"Return 'None found' if no formulas present"
],
examples=[
Example(input="NaCl dissolved in water.", output="NaCl"),
Example(input="Synthesis of TiO₂ from Ti and O₂.", output="TiO₂\nTi\nO₂")
]
)
Pattern 3: Structured Extraction (JSON)¶
from pydantic import BaseModel, Field
class MaterialInfo(BaseModel):
formula: str = Field(description="Chemical formula")
temperature: str | None = Field(description="Synthesis temperature")
method: str | None = Field(description="Synthesis method")
prompt = StructuredPrompt(
expert="Materials Science",
output_schema=MaterialInfo,
target_fields=["formula", "temperature", "method"],
constraints=[
"Return valid JSON matching the schema",
"Use null for missing information",
"Include units with numerical values"
],
examples=[
Example(
input="Fe₂O₃ was synthesized at 800°C using sol-gel method.",
output='```json\n{"formula": "Fe₂O₃", "temperature": "800°C", "method": "sol-gel"}\n```'
)
]
)
Advanced Techniques¶
Chain of Thought (CoT)¶
For complex reasoning, encourage step-by-step thinking:
secondary_instructions=[
"First, identify all material mentions",
"Then, determine which were actually studied (not just referenced)",
"Finally, extract their chemical formulas"
]
The LLM naturally reasons through steps before answering.
Self-Consistency¶
For critical extractions, you can run the same prompt multiple times with temperature > 0 and take the most common answer:
# Run 3 times
for i in 1 2 3; do
nerxiv prompt --file-path paper.hdf5 --query material_formula -llmo temperature=0.3
done
# Compare outputs, use consensus
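A minimal sketch of the consensus step, assuming the three outputs have been collected as strings (function and variable names are hypothetical):

```python
from collections import Counter

def consensus(answers: list[str]) -> str:
    """Return the most common answer across repeated runs."""
    normalized = [a.strip() for a in answers if a.strip()]
    return Counter(normalized).most_common(1)[0][0]

# Three runs of the same extraction; the majority answer wins
print(consensus(["La0.8Sr0.2NiO2", "La0.8Sr0.2NiO2", "LaSrNiO2"]))
```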
Negative Instructions¶
Sometimes telling the model what NOT to do helps:
constraints=[
"Do NOT include author names",
"Do NOT return materials mentioned only in references",
"Do NOT include explanation or thinking process"
]
Format Examples in Output¶
Show exact format in examples:
Example(
input="Temperature was 300K, pressure 1 bar, duration 2 hours.",
output="Temperature: 300K\nPressure: 1 bar\nDuration: 2 hours"
)
The model learns the exact format you want.
Best Practices Summary¶
- Be specific: Clear tasks, detailed instructions
- Provide context: Set expert role and domain
- Use examples: Show exactly what you want (3-5 examples)
- Control format: Explicit output constraints
- Handle edge cases: Cover tricky scenarios in instructions and examples
- Test iteratively: Try on diverse inputs, refine based on failures
- Use low temperature: 0.0-0.2 for factual extraction
- Keep it focused: One clear task per prompt