This repository is part of a Master's thesis in AI: "Enhancing the Reasoning of LLMs in Materials Discovery via Causal Contracts"
This repository contains a modular, DAG‑based framework for experimenting with pipelines that test and improve the reasoning capabilities of large language models (LLMs). The system offers two end‑to‑end pipelines:
- Document Pipeline — downloads papers, extracts pages, performs multi‑stage analysis, and synthesizes structured artifacts (semantic summaries and contracts) before generating QA datasets.
- Answer & Evaluation Pipeline — answers the generated questions under configurable context settings and evaluates performance with transparent, reproducible artifacts.
The pipelines are built from interoperable agents orchestrated by a lightweight DAG runner, making it easy to plug in new components, models, or evaluation strategies. Code quality, reproducibility, and documentation are priorities.
📖 Paper: coming soon
- DAG‑orchestrated research pipelines with explicit data contracts between tasks.
- Agentic design: discrete agents for download, extraction, analysis, contract writing, QA generation, answering, and evaluation.
- Reproducible runs: each experiment writes all inputs/outputs to a timestamped folder under `runs/`.
- Configurable model runtime via CLI flags (`--model`, `--temperature`, `--retries`).
- Context controls for ablations (e.g., raw text vs. additional context flags).
- Python 3.10+
- macOS/Linux/WSL2 (Windows native should also work)
- Recommended: virtual environment (Conda/venv)
- Clone the repository
git clone <your-repo-url>
cd <your-repo>
- Create and activate a Conda environment
conda create -n llm-reasoning python=3.10 -y
conda activate llm-reasoning
- Install dependencies
- Using pip (from `requirements.txt`):
  pip install -r requirements.txt
- Or, if you maintain an `environment.yml`:
  conda env update -n llm-reasoning -f environment.yml
Tip: If you're on Apple Silicon or using CUDA, prefer installing any platform‑specific packages via conda first, then fall back to pip.
Minimal configuration is required for local runs. By default, artifacts are written to `./runs`. Ensure the directories under `data/` and `test_data/` exist and contain the input files referenced below.
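If you want a quick pre-flight check before launching a run, a small snippet like the one below (illustrative only, not part of the repository) verifies that the expected directories are present:

```python
# Optional pre-flight check (illustrative, not part of the repo):
# verify that the expected directories exist before launching a pipeline.
from pathlib import Path

for d in ("runs", "data", "test_data"):
    path = Path(d)
    if not path.is_dir():
        raise SystemExit(f"Missing directory: {path} - create it and add the required input files.")
print("All expected directories are present.")
```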
Document Pipeline — given a list of paper URLs:
python app.py document --working-dir runs \
--papers ./test_data/papers_1.lst \
--run-id doc_pipeline_test \
--model openai/o4-mini --temperature 1.0 --retries 3
Answer & Evaluation Pipeline — given a QA dataset and contracts:
python app.py answer --working-dir runs \
--contracts ./data/contracts2.json \
--dataset ./data/full_dataset2.json \
--flags RAW_TEXT \
--model openai/o4-mini --temperature 1.0 --retries 3
Design Pipeline — given a contracts file and a specific contract ID:
python app.py design --working-dir runs \
--contracts ./data/contracts2.json \
--contract_id 21-0 \
--run-id design_pipeline-20250915-065507 \
--model openai/o4-mini --temperature 1.0 --retries 3
Each command creates a timestamped experiment folder inside `runs/` that contains all intermediate and final artifacts.
The LLM client is implemented with the LiteLLM library, which makes the tool model-agnostic. Provide the appropriate API keys for whichever providers you use.
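Under the hood this boils down to a LiteLLM call such as the sketch below (illustrative only; the repository wraps this in its own client, and the exact environment variable depends on your provider):

```python
# Minimal LiteLLM sketch: the provider prefix in the model string selects the
# backend, and the matching API key (e.g. OPENAI_API_KEY) is read from the
# environment. Illustrative only - the repo's own client adds retries, etc.
from litellm import completion

response = completion(
    model="openai/o4-mini",  # provider/model_name, as passed via --model
    messages=[{"role": "user", "content": "Summarize this abstract in one sentence."}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```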
Goal: transform raw papers into structured, analysis‑ready artifacts and generate a high‑quality QA dataset for reasoning evaluation.
Stages (agents):
- DownloadAgent — fetches documents from URLs listed in `--papers`.
- ExtractionAgent — extracts pages and creates normalized document IDs.
- DocumentAnalyzerAgent — runs first‑pass analysis on the processed documents.
- SemanticAnalyzerAgent — performs deeper semantic analysis to derive structured representations.
- ContractWriterAgent — synthesizes contracts (concise, schema‑constrained summaries) to support controlled QA.
- QADatasetGeneratorAgent — produces a QA dataset aligned with the contracts and selected context flags.
Outputs: processed IDs, semantic documents, contracts, and a generated QA dataset — all stored under the current run directory.
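The `--papers` input consumed by the DownloadAgent is a plain-text file with one URL per line (see the CLI flags below). A quick way to inspect such a list:

```python
# Illustrative helper (not part of the repo): count the URLs in a paper list.
from pathlib import Path

urls = [
    line.strip()
    for line in Path("test_data/papers_1.lst").read_text().splitlines()
    if line.strip()
]
print(f"{len(urls)} papers queued for download")
```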
Goal: answer questions under specified context settings and evaluate performance.
Stages (agents):
- QAAnswerAgent — answers questions from a QA dataset using the selected model configuration and context flags.
- QAEvaluationAgent — computes evaluation metrics and produces analysis artifacts.
Outputs: model answers and evaluation reports (JSON/CSV/plots as configured) under the current run directory.
- DAG / DAGRunner — schedules and executes agents given explicit data dependencies; ensures artifact lineage, reproducibility, and fault isolation.
- Composability — use the DAG builders in `app.py` as examples to assemble new experimental pipelines.
- `utils.prompt_manager.PromptManager` — centralizes prompts used by analysis/answering agents.
- `utils.common.ModelConfig` — encapsulates model runtime knobs (`name`, `temperature`, `retries`).
- `middleware.ImageStorage` — manages image assets produced along the pipeline.
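A hypothetical usage sketch of `ModelConfig` is shown below; the actual constructor and field names in `utils.common` may differ, so treat it as an illustration of the three runtime knobs rather than the real API:

```python
# Hypothetical sketch - field names assumed from the description above,
# not taken from the actual utils.common implementation.
from utils.common import ModelConfig

config = ModelConfig(name="openai/o4-mini", temperature=1.0, retries=3)
```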
<repo-root>/
├─ agents/           # Agent implementations (download, extract, analyze, QA, eval, ...)
├─ core/             # DAG & runtime abstractions
├─ data/             # Generated artifacts (contracts, datasets, etc.)
├─ docs/             # Technical documentation
├─ docucache/        # Document cache (SQLite metadata + downloaded papers)
├─ middleware/       # Shared services (e.g., image storage)
├─ notebooks/        # Analytics for the thesis
├─ runs/             # Experiment outputs (created at runtime)
├─ test_data/        # Test lists / small fixtures (e.g., paper URLs)
├─ utils/            # Prompt manager, settings, model config, logging helpers
├─ requirements.txt  # Python dependencies
└─ app.py            # CLI entrypoint: build & run DAG pipelines
The `docucache/` directory is organized as follows:
docucache/
├── metadata.db <-- The SQLite database file
├── 1/ <-- First paper's folder (ID from DB)
│ ├── assets/
│ └── tmp/
│ └── 1706.03762.pdf
├── 2/
│ ├── assets/
│ └── tmp/
│ └── 2203.02155.pdf
└── 3/
├── assets/
└── tmp/
└── 2307.09288.pdf
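Given this layout, a small helper (illustrative, not part of the repo) can list the cached papers without touching `metadata.db`, whose schema is not documented here:

```python
# Walk docucache/ and list the downloaded PDFs per paper folder.
from pathlib import Path

def list_cached_papers(cache_root: str = "docucache") -> dict[str, list[str]]:
    cached: dict[str, list[str]] = {}
    for paper_dir in sorted(Path(cache_root).iterdir()):
        if paper_dir.is_dir():
            cached[paper_dir.name] = [p.name for p in (paper_dir / "tmp").glob("*.pdf")]
    return cached

if __name__ == "__main__":
    for paper_id, pdfs in list_cached_papers().items():
        print(paper_id, pdfs)
```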
Every run produces a folder `runs/<pipeline-name>-YYYYMMDD-HHMMSS/` containing:
- Inputs: exact copies or references to all inputs used by each agent.
- Intermediate Artifacts: JSON manifests (e.g., processed document IDs, semantic docs, contracts, model answers).
- Final Reports: evaluation summaries and any generated plots/tables.
- Logs: structured logs with timestamps for traceability.
You can pass a custom `--run-id` to name the experiment folder deterministically.
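For batch post-processing, a small helper (not part of the repo) can locate the most recent run folder:

```python
# Find the newest experiment folder under runs/ by modification time.
from pathlib import Path

def latest_run(working_dir: str = "runs") -> Path | None:
    candidates = [p for p in Path(working_dir).iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

print(latest_run())
```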
- `--working-dir` (default: `runs`) — base directory for artifacts.
- `--model` (default: `openai/o4-mini`) — model name/ID in the standard `provider/model_name` format.
- `--temperature` (default: `1.0`) — sampling temperature.
- `--retries` (default: `3`) — retry attempts for model calls.
- `--papers` — path to a text file containing one URL per line (default: `./test_data/papers_1.lst`).
- `--run-id` — optional experiment/run identifier.
- `--dataset` — path to a QA dataset JSON (default: `./data/test_dataset.json`).
- `--contracts` — path to a contracts JSON (default: `./data/test_contracts.json`).
- `--flags` — context flags (e.g., `RAW_TEXT`, `CC`, or `CC+RAW_TEXT`).
- `--run-id` — optional experiment/run identifier.
Import and extend the DAG builders from `app.py`:
from app import get_document_pipeline_dag, get_answer_pipeline_dag
# Build a custom DAG and run it with your own runner/configuration
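A hypothetical extension sketch is shown below; the real builder and runner signatures live in `app.py` and `core/` and may take additional arguments (model config, paths, flags), so adapt accordingly:

```python
# Hypothetical composition sketch - builder/runner call signatures are assumed,
# not copied from the repo.
from app import get_document_pipeline_dag, get_answer_pipeline_dag

doc_dag = get_document_pipeline_dag()   # assumed zero-argument call
answer_dag = get_answer_pipeline_dag()

# Modify or recombine the DAGs, then execute them with the runner from core/
# (class name assumed here):
# runner = DAGRunner(doc_dag)
# runner.run()
```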
If you use this repository in academic work, please cite it as:
@misc{kaliutau2025,
author = {Kaliutau, A.},
title = {{AI in Material Science}},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/stable-reasoning/ai-material-science}},
}
- Keep the structure of your input files stable; the agents rely on consistent schemas.
- For ablations, duplicate a prior run with only one change (e.g., `--flags`, `--temperature`).
- Use `--run-id` to align artifacts across multiple pipelines when running a batched study.