# Auto-ARGUE

This repository contains the source code for Auto-ARGUE, an LLM-based implementation of the ARGUE framework for report generation evaluation from "On the Evaluation of Machine-Generated Reports" (Mayfield et al., 2024).
The ARGUE framework evaluates generated reports based on citation usage and coverage of key information from the collection related to the query topic. Reports are evaluated sentence-by-sentence according to the following flow chart. Based on the outcomes shown in the flow chart, sentences accrue rewards (green circles) and penalties (red circles) for the report, depending on the correctness of their citations and their answers to a pre-defined set of questions (nuggets) for the topic. For a complete overview of ARGUE, see "On the Evaluation of Machine-Generated Reports" (Mayfield et al., 2024). For a brief summary, covering how sentences are processed by ARGUE (both those with citations and those without), see directly below.
The subsections below provide a description of how ARGUE operates in the abstract (i.e. not specifically in our implementation). Sentences that have citations attached to them are handled differently from those without citations.
**Sentences without citations:**

1. If the sentence contains a negative assertion (e.g. "X is not known" or "There is no evidence that Y"):
   - Reward (+) if there is a nugget confirming this assertion
   - Penalize (-) if no nugget confirms this assertion
2. For statements requiring citations:
   - Penalize (-) if this is the first occurrence of the statement
   - Ignore (0) if the statement was previously cited in the report
3. For statements not requiring citations (e.g. introductory text):
   - Ignore (0)

**Sentences with citations:**

1. Check whether each cited document supports the statement:
   - Penalize (-) if any cited document does not support the statement
   - Continue to step 2 if all cited documents support the statement
2. Check nugget matching:
   - Reward (+) for each nugget the sentence correctly answers
   - Ignore (0) each unmatched nugget

Additional notes on scoring:

- A sentence can receive multiple rewards for matching multiple nuggets
- Each unique nugget counts only once for recall
- Ignored sentences (score = 0) do not affect precision
- In our implementation, any penalized sentence or nugget (-) counts against precision only by being added to the denominator (not by being subtracted from the numerator)
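To make the flow above concrete, here is a minimal Python sketch of the per-sentence logic. It is illustrative only: the function, its parameter names, and the numeric reward scheme are ours, not the repository's implementation, and it assumes the individual judgments have already been made.

```python
# Illustrative sketch only: not the repository's implementation.
def score_sentence(
    has_citations: bool,
    is_negative_assertion: bool,
    negative_assertion_confirmed: bool,
    requires_citation: bool,
    is_first_instance: bool,
    all_citations_support: bool,
    num_nuggets_answered: int,
) -> int:
    """Return the net reward (+) / penalty (-) for one sentence, given its judgments."""
    if not has_citations:
        if is_negative_assertion:
            return 1 if negative_assertion_confirmed else -1
        if requires_citation:
            return -1 if is_first_instance else 0  # repeated statements are ignored
        return 0                                   # e.g. introductory text

    if not all_citations_support:
        return -1                  # some cited document fails to support the sentence
    return num_nuggets_answered    # one reward per correctly answered nugget (0 = ignored)


# Example: a cited sentence supported by all its citations that answers two nuggets
print(score_sentence(True, False, False, True, True, True, 2))   # -> 2
```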
Install UV (if not already installed):

```bash
curl -LsSf https://github.com/astral-sh/uv/releases/latest/download/uv-installer.sh | sh
```
Clone the repo and set up the virtual environment (which will be named `argue`):

```bash
git clone git@github.com:hltcoe/auto-argue.git && cd auto-argue
chmod +x uv_install.sh && source uv_install.sh
```
(Optional) install in development mode:

```bash
uv pip install -e .
```
Note: You may need to adjust the `torch` distribution depending on the system you're running on.
Auto-ARGUE uses LangChain to interact with several LLM providers:

- Together AI (`together`)
- OpenAI (`openai`)
- Anthropic (`anthropic`)
- HuggingFace, including:
  - The HuggingFace Inference Providers API (`huggingface`)
  - Models you download and run locally from the HuggingFace model hub (`huggingface_local`)
Create a `.env` file with your API key(s). You can also put these keys in your `.bashrc` file or similar (you need only supply keys for providers you intend to use):

```bash
TOGETHER_API_KEY=your_togetherai_api_key_here
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
HUGGINGFACEHUB_API_TOKEN=your_huggingface_key_here
```
ARGUE issues asynchronous requests to model providers. By default, a maximum of 10 parallel requests are permitted (enforced via a semaphore). This maximum can be adjusted by setting the `ARGUE_MAX_CONCURRENCY` environment variable, which may be necessary to avoid hitting rate limits, depending on your model, provider, and usage tier.
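For illustration, the cap works along the lines of the standard asyncio semaphore pattern below. This is a generic sketch, not the repository's code; `make_request` is a stand-in for an actual provider call.

```python
import asyncio
import os

MAX_CONCURRENCY = int(os.environ.get("ARGUE_MAX_CONCURRENCY", "10"))

async def make_request(prompt: str) -> str:
    # Stand-in for an actual LLM provider call.
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def bounded_request(semaphore: asyncio.Semaphore, prompt: str) -> str:
    async with semaphore:   # at most MAX_CONCURRENCY requests in flight at once
        return await make_request(prompt)

async def main() -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(
        *[bounded_request(semaphore, f"judgment {i}") for i in range(25)]
    )

if __name__ == "__main__":
    print(len(asyncio.run(main())))   # 25 responses, issued at most MAX_CONCURRENCY at a time
```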
Auto-ARGUE requires access to source documents in order to verify citations. The case study presented in our paper focuses on Chinese, Russian, and Farsi collections from the TREC 2024 NeuCLIR track (`neuclir/1/zh`, `neuclir/1/fa`, `neuclir/1/ru`). These collections can be downloaded from HuggingFace:

```bash
git clone https://huggingface.co/datasets/orionweller/neuclir-docs-lookup
```

Note: The collections are large (56G across all three) and thus should be stored on a large partition. Once you have downloaded the collections, set the `DOC_LOOKUP` environment variable to point to that path:

```bash
export DOC_LOOKUP=/path/to/neuclir-docs-lookup/
```
If you wish to evaluate the reports submitted to the TREC 2024 NeuCLIR track, you will need to obtain these through TREC: https://pages.nist.gov/trec-browser/trec33/neuclir/runs/. NOTE: If you are not a TREC participant, you may need to request access by emailing trec@nist.gov.
Nuggets for the NeuCLIR 2024 topics are included in `assets/neuclir-nuggets/`.
By default, the document lookup uses the path given by `DOC_LOOKUP`. If you want to use some other collection instead, you can pass a path to that collection via the `--collection-dir` (`-C`) option to `eval.py`.
Since the NeuCLIR collections are large, a mini collection is helpful for quick testing during development to verify the basic functionality of small changes. We provide such a collection (with only a handful of Russian documents) in `assets/sample-inputs/sample-docs-lookup/` in this repository, which can be provided as the argument to `--collection-dir` when evaluating on the sample report (`assets/sample-inputs/sample_report.jsonl`). The documents in this collection include all of those cited by `assets/sample-inputs/sample_report.jsonl` and all of those associated with the nuggets in `assets/sample-inputs/sample_nuggets_388.v3.json`. These are therefore the report and nuggets files you should use for testing on this mini collection.
A few points on how Auto-ARGUE works with documents in memory:
- Documents are loaded on first access and a document cache is created
- The cache is shared across evaluations
- The cache persists until the program exits
- Each collection typically requires 1-2GB of memory
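Conceptually, the caching behaves like the lazy lookup sketched below. This is an illustration of the access pattern only, not the repository's implementation; `load_from_disk` is a placeholder.

```python
# Illustration of the access pattern only (not the repository's implementation).
_DOC_CACHE: dict[str, str] = {}   # shared across evaluations; lives until the process exits

def load_from_disk(doc_id: str) -> str:
    # Placeholder for reading the document text from the on-disk lookup.
    return f"<text of {doc_id}>"

def get_document(doc_id: str) -> str:
    if doc_id not in _DOC_CACHE:          # loaded on first access...
        _DOC_CACHE[doc_id] = load_from_disk(doc_id)
    return _DOC_CACHE[doc_id]             # ...then served from memory thereafter
```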
Some additional tips for memory management:
- Process reports in smaller batches
- Clear cache between large batches
Before proceeding, please make sure you have chosen a model provider (e.g. `together`), set an API key for that provider (e.g. `TOGETHER_API_KEY`), and identified the appropriate path for your document lookup (i.e. `DOC_LOOKUP`). See Referencing the Lookup.
Once you have done this, you can run `auto_argue/eval.py`, which will produce judgments on a sample report (`annotate`) and then score those judgments (`score`). An example invocation is shown below, using Together AI as the model provider and running evaluation on the sample report:

```bash
python -m auto_argue.eval assets/sample-inputs/sample_report.jsonl assets/sample-inputs/sample_nuggets_388.v3.json results/sample -p together -a annotate score --verbose --validate-judgments
```
This will output the judgments to a file `sample.judgments.jsonl` and the corresponding scores to a file `sample.scores.tsv` — both in an output directory named `results/`. Examples of the judgments and scores files can be found in `assets/sample-outputs/`.
Here, we explain some of the most important command line options. Explanations of others can be found in `auto_argue/eval.py`:

- `-a` (`--actions`): specifies the actions that the script should run. There are two options: `annotate` (which produces the judgments) and `score` (which scores those judgments). In general, you will likely want to set both (i.e. `-a annotate score`). Note that when running `-a annotate score`, the code will try to find an existing judgments file to use for scoring that matches the provided `output_file_prefix` (above: `results/sample`). To force the model to regenerate the judgments file, specify the `--rerun` flag.
- `-p`: specifies the model provider (here, `together`). Other valid choices are `openai`, `anthropic`, `huggingface` (for models queried through the HuggingFace API), and `huggingface_local` (for HuggingFace models on disk).
- `-m` (`--model-name`): the unique identifier for the model to use from the chosen provider. Defaults for each provider are described below.
- `-C` (`--collection-dir`): path to the directory containing the collection(s) to use for judging and scoring.
- `-c` (`--collection`): the identifier(s) of a subset of the collection(s) within the `--collection-dir` to use (defaults to all three NeuCLIR collections: `neuclir/1/ru`, `neuclir/1/fa`, `neuclir/1/zh`).
- `--validate-judgments`: ensures that the judgments produced by the LLM are valid for scoring. We recommend always setting this flag.
You can specify the provider with the `-p` flag and the specific model to be used with the `-m` flag in the `eval.py` command described above. If you do not explicitly specify a provider, Auto-ARGUE will try to use Together AI by default. If you specify a provider (`-p`) without specifying a model (`-m`), the default model configured for that provider will be used:

- `openai`: `gpt-4o-mini-2024-07-18`
- `anthropic`: `claude-sonnet-4-5-20250929`
- `together`: `meta-llama/Llama-3.3-70B-Instruct-Turbo`
- `huggingface`: `meta-llama/Llama-3.3-70B-Instruct`
- `huggingface_local`: `meta-llama/Llama-3.1-8B-Instruct`
A note on reasoning models: Because Auto-ARGUE expects YES/NO responses to each judgment, maximum output tokens are restricted by default (to 10). This is fine for non-reasoning models, but can sometimes yield poor results when using a reasoning model. If you do use reasoning models, you may need to adjust the `max_tokens` parameter in `auto_argue/utils.py:get_model_response`.
The input JSONL file containing the reports to be judged/scored should contain entries with this structure, one per report:

```json
{
  "metadata": {
    "team_id": "my_fantastic_team",
    "run_id": "my_best_run_02",
    "topic_id": "101"
  },
  "responses": [
    {
      "text": "Sky is blue.",
      "citations": [
        "docid001",
        "docid003"
      ]
    },
    {
      "text": "The moon is made out of blue cheese.",
      "citations": [
        "docid002"
      ]
    },
    {
      "text": "This is all.",
      "citations": []
    }
  ],
  "references": [
    "docid001",
    "docid002",
    "docid003"
  ]
}
```
Alternatively, the `citations` field for each sentence in the `responses` can be a dictionary, with scores (e.g. for relevance) associated with each cited document:

```json
{
  "text": "Sky is blue.",
  "citations": {
    "docid001": 0.3,
    "docid003": 0.5
  }
}
```
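If you generate report files programmatically, a sketch like the following produces one valid JSONL line per report using only the fields shown above (the output filename and document IDs are placeholders):

```python
import json

report = {
    "metadata": {
        "team_id": "my_fantastic_team",
        "run_id": "my_best_run_02",
        "topic_id": "101",
    },
    "responses": [
        {"text": "Sky is blue.", "citations": ["docid001", "docid003"]},
        # Citations may also be a dict mapping doc IDs to scores:
        {"text": "The moon is made out of blue cheese.", "citations": {"docid002": 0.5}},
    ],
    "references": ["docid001", "docid002", "docid003"],
}

with open("my_reports.jsonl", "w") as f:
    f.write(json.dumps(report) + "\n")   # one JSON object per line, one line per report
```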
Nuggets are question-answer pairs associated with a topic that are used to evaluate how effectively a report on that topic recovers key information from the collection. Nuggets can have multiple answers, and they come in two varieties:

- `AND`: nuggets for which all answers must be provided by the report in order to be counted as correct.
- `OR`: nuggets for which at least one answer must be provided by the report in order to be counted as correct.
Each nugget answer is expected to be linked to a set of documents in the collection that attest that answer. Topics from the TREC 2024 NeuCLIR track each have between 10 and 20 nuggets associated with them.
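As a hypothetical illustration of the `AND`/`OR` distinction (not the repository's matching code, which relies on LLM judgments of the report's sentences):

```python
def nugget_matched(nugget_type: str, answers: list[str], answers_in_report: set[str]) -> bool:
    """Illustrative only: has the report provided enough of this nugget's answers?"""
    if nugget_type == "AND":
        return all(a in answers_in_report for a in answers)   # every answer must appear
    if nugget_type == "OR":
        return any(a in answers_in_report for a in answers)   # at least one answer must appear
    raise ValueError(f"unknown nugget type: {nugget_type}")

# An OR nugget is matched if the report provides either answer:
print(nugget_matched("OR", ["Paris", "the French capital"], {"Paris"}))   # True
# An AND nugget requires both:
print(nugget_matched("AND", ["Paris", "the French capital"], {"Paris"}))  # False
```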
The primary format for nugget files is illustrated in `assets/sample-inputs/sample_nuggets_388.v3.json`. (Other formats are supported but are deprecated and not discussed in this README.) All nugget files adopting this format are expected to follow a naming convention in which the topic ID is given as the last element of a '_'-separated list of terms and where the suffix is `.v3.json` (e.g. `nuggets_388.v3.json`). Failing to adhere to this convention will result in errors.
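For example, under this convention the topic ID for a nugget file can be recovered as sketched below (an illustration of the naming convention, not the repository's parsing code):

```python
from pathlib import Path

def topic_id_from_nugget_filename(path: str) -> str:
    """e.g. 'sample_nuggets_388.v3.json' -> '388'."""
    name = Path(path).name
    assert name.endswith(".v3.json"), "nugget files are expected to use the .v3.json suffix"
    stem = name[: -len(".v3.json")]   # e.g. 'sample_nuggets_388'
    return stem.split("_")[-1]        # the topic ID is the last '_'-separated element

print(topic_id_from_nugget_filename("assets/sample-inputs/sample_nuggets_388.v3.json"))  # 388
```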
The `auto_argue/eval.py` script produces a set of judgments for each sentence of a report. There are six types of judgments, each of which directly corresponds to one of the blue diamonds shown in the ARGUE framework diagram (see above). The judgment types are as follows and are specified in `auto_argue/utils.py`:

- `SENTENCE_ATTESTED`: Is the report sentence attested/supported by a specific document it cites?
- `SENTENCE_ANSWERS_QUESTION`: For a given nugget question, does the report sentence provide a correct answer to that question?
- `REQUIRES_CITATION`: Does the report sentence require a citation?
- `CITED_DOCUMENT_RELEVANCE`: For a given cited document, is this document relevant? (NOTE: in Auto-ARGUE, a document is considered relevant iff it provides at least part of an answer to at least one of the nugget questions associated with the report request.)
- `FIRST_INSTANCE`: Does the report sentence repeat information relayed earlier in the report?
- `NEGATIVE_ASSERTION`: Does the report sentence assert that some information is unknown or not answerable?
This last judgment type (`NEGATIVE_ASSERTION`), though part of the ARGUE framework, is not currently supported in our automatic implementation. While this would be interesting to explore, in practice there are very few of these sorts of statements in the NeuCLIR reports on which we evaluated our implementation.
The `CITED_DOCUMENT_RELEVANCE` judgment type is unique in being — by default — a deterministic check that DOES NOT invoke an LLM. Recall from above that a document is considered relevant iff it was annotated (by a human assessor) as containing at least part of an answer to at least one nugget question. We can therefore determine whether a document is relevant simply via lookup, checking whether it was annotated as answering a nugget question. Alternatively, you have the option to use an LLM to assess which nuggets a given cited document attests. To enable this behavior, set the `--always-check-all-nuggets` option for `eval.py`. WARNING: this is very expensive and will issue many requests.
All of the remaining judgments are evaluated via an LLM call. The system and user prompts used for these calls can be found in `auto_argue/prompts/`. These prompts all expect a "YES" or "NO" response. In general, recent models are competent at abiding by instructions to output only one of these two responses. Occasionally, however, a model may fail to do this. In such cases, we currently fall back to a "default" response for each judgment type. These default responses are as follows:

- `SENTENCE_ATTESTED`: false
- `SENTENCE_ANSWERS_QUESTION`: false
- `REQUIRES_CITATION`: true
- `FIRST_INSTANCE`: true
Judgments follow a Pydantic data model defined in `auto_argue/judge.py`. Each judgment is saved in the output as a JSON dictionary with the following keys:

- `judgment_type_id`: a unique identifier for the judgment type (see the `JudgmentType` class in `auto_argue/utils.py`)
- `response`: the model response(s) for this judgment
- `evaluator`: a unique identifier for the model or system that produced this judgment
- `provenance`: information needed to determine what the judgment is about (e.g. a document identifier for `SENTENCE_ATTESTED`)

The format of the `response` and `provenance` fields is dependent on the judgment type (`judgment_type_id`). The data models for specific judgment types can also be found in `auto_argue/judge.py`.
The JSONL file output by `auto_argue/eval.py` has one line per report with the following fields:

- `request_id`: the report request ID for this report (`str`)
- `run_id`: an identifier for the run that produced this report (`str`)
- `collection_ids`: a list of the names of the document collections used to produce and evaluate this report (`List[str]`)
- `segments`: a list of sentences in the report, including the citations and judgments associated with them (`List[Dict[str, Any]]`)
Each element of `segments` has the following fields:

- `segment_type`: what kind of text segment this is (here, always `sentence`)
- `text`: the sentence text (`str`)
- `citations`: a list of documents cited by this sentence, each represented by its document ID (`doc_id`) and the document text (`text`) (`List[Dict[str, str]]`)
- `judgments`: a list of ARGUE judgments produced for this sentence (`List[Judgment]`)
Auto-ARGUE produces a number of metrics, which are written to a `*.scores.tsv` file.

We note that ARGUE itself does not dictate specific metrics to be reported, but rather provides a framework for producing judgments about a report, which can then be scored using various metrics. Auto-ARGUE emphasizes two primary metrics that were suggested by Mayfield et al. (2024):

- Nugget Coverage (also called Nugget Recall) = (Number of unique nuggets correctly answered by the report) / (Total number of nuggets for the request)
  - This is reported as `nugget_coverage` in the score TSVs
- Sentence Support = (Number of rewarded sentences) / (Total scored sentences)
  - This is reported as `sentence_support` in the score TSVs
We also report an F1 score computed on the basis of the above two scores:

- `f1`: an F1 score computed from `nugget_coverage` and `sentence_support`

Importantly, there is also a weighted version of `nugget_coverage` (`nugget_coverage_weighted`) based on the importance label associated with each nugget — either `vital` or `okay` — as well as a corresponding `f1_weighted` score. Currently, `vital` nuggets carry a weight of 2.0 and `okay` nuggets carry a weight of 1.0. These weights were set manually based on close investigation and quality control of different nugget sets. If no importance labels are provided for a given nugget set, `nugget_coverage_weighted` will be equivalent to `nugget_coverage`.
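For concreteness, the relationship between these scores can be sketched as follows. The F1 computation assumes the standard harmonic mean; the weighted-coverage normalization shown here is our assumption for illustration, and the authoritative definitions live in the scoring code.

```python
WEIGHTS = {"vital": 2.0, "okay": 1.0}

def f1(nugget_coverage: float, sentence_support: float) -> float:
    # Standard harmonic mean of the two primary metrics.
    if nugget_coverage + sentence_support == 0:
        return 0.0
    return 2 * nugget_coverage * sentence_support / (nugget_coverage + sentence_support)

def nugget_coverage_weighted(correct_labels: list[str], all_labels: list[str]) -> float:
    # Assumed weight-normalized recall: weights of correctly answered nuggets over all weights.
    return sum(WEIGHTS[l] for l in correct_labels) / sum(WEIGHTS[l] for l in all_labels)

print(f1(0.4, 0.8))                                                     # ~0.533
print(nugget_coverage_weighted(["vital"], ["vital", "okay", "okay"]))   # 0.5
```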
We report per-topic results for all metrics, but we additionally report micro- and macro-averaged versions of the following metrics for each run across all topics:
- `nugget_coverage`
- `nugget_coverage_weighted`
- `sentence_support`
- `f1`
- `f1_weighted`
- `citation_support` (see below)
- `citation_relevance` (see below)

For these aggregated results, `_micro` or `_macro` is appended to the base metric name to distinguish between the micro- and macro-averaged versions.
Beyond `nugget_coverage`, `sentence_support`, and `f1`, there are a number of additional metrics that will be written to the output score TSVs:

- `sentences`: the total number of sentences in the report
- `correctly_cited_sentences`: the number of report sentences that (1) have at least one citation, and where (2) all of these citations (individually) support the sentence
- `sentences_missing_citation`: the number of report sentences that lack a citation and are judged to need one
- `first_instance_sentences_missing_citation`: of `sentences_missing_citation`, the number of sentences that are also judged to be "first instances" — i.e. sentences that present novel information not discussed earlier in the report
- `citations`: the total number of citations present in the report. (Note that a single document cited by two different sentences counts as two citations.)
- `relevant_citations`: of `citations`, the number of citations to relevant documents — i.e. documents that contain an answer to at least one nugget question
- `supporting_citations`: of `citations`, the number that are judged to support the sentence they're attached to (regardless of whether they are relevant)
- `citation_support`: the proportion of citations that are judged to support the sentence they're attached to
- `citation_relevance`: the proportion of cited documents that are relevant
- `correct_nuggets`: the total number of nuggets for which the report provides the full correct answer. For `AND` nuggets, a system must produce all answers for that nugget. For `OR` nuggets, a system must produce one answer for that nugget (and no incorrect answers).
The core ARGUE functionality can be invoked from within Python by calling the `evaluate_report` function on a valid report (see the expected report format above). Note that the `evaluate_report` function is asynchronous, so you will need to `await` it. Example usage is shown below.
```python
import asyncio
import json
import os
from pathlib import Path

from auto_argue.judge import evaluate_report, ModelProvider
from auto_argue.utils import Report


async def evaluate():
    my_reports = []
    with open("assets/sample-inputs/sample_report.jsonl", "r") as f:
        for line in f:
            my_reports.append(Report.model_validate(json.loads(line)))

    collection_dir = "assets/sample-inputs/sample-docs-lookup/"
    assert os.path.isdir(
        collection_dir
    ), f"Collection path {collection_dir} does not exist!"
    collection_ids = ["neuclir/1/ru", "neuclir/1/fa", "neuclir/1/zh"]

    return await asyncio.gather(
        *[
            evaluate_report(
                r,
                Path("assets/sample-inputs/sample_nuggets_388.v3.json"),
                collection_ids,
                collection_dir,
                provider=ModelProvider.TOGETHER,
            )
            for r in my_reports
        ]
    )


if __name__ == "__main__":
    results = asyncio.run(evaluate())
    print(results)
```
By default, ARGUE uses the prompts in `auto_argue/prompts/` to obtain judgments from the LLM judge. However, prompts for all ARGUE judgment types can be customized — and likely should be if you're working with novel data or models! To do so, you can supply a prompt configuration JSON file as an argument to the `--prompt_config` (`-P`) command line option. This file should contain a JSON dictionary with one or more of the following keys, each corresponding to a specific ARGUE judgment type (see Judgment Types):

- `negative_assertion`
- `requires_citation`
- `first_instance`
- `cited_document_relevance`
- `sentence_attested`
- `sentence_answers_question`
The value corresponding to each supplied key should itself be a dictionary with the following fields:

- `user_prompt` (`str`): the user prompt to be used for this judgment
- `system_prompt` (`str`): the system prompt to be used for this judgment
- `default_response` (`Optional[str]`; `YES` or `NO`): the default response to be returned in cases where the model's output cannot be parsed
Each judgment type expects certain template variables to be included in the `user_prompt` for that judgment (wrapped in curly brackets, e.g. `{sentence}`). These variables are as follows:

- `negative_assertion`: `sentence`
- `requires_citation`: `sentence`
- `first_instance`: `sentence`, `previous_sentences`
- `cited_document_relevance`: `sentence`, `document`
- `sentence_attested`: `sentence`, `document`
- `sentence_answers_question`: `sentence`, `nugget_question`, `nugget_answer`
Apart from these constraints, you have complete freedom with regard to the structure of the prompts. Note that the system prompts are not expected to contain any template variables. An example prompt configuration file that modifies only the `sentence_attested` prompts can be found in `assets/sample-inputs/sample_prompt_config.json`.
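For instance, a minimal configuration that overrides only the `sentence_attested` prompts could be generated as below. The prompt wording and output filename here are invented for illustration; only the keys, fields, and template variables follow the specification above.

```python
import json

prompt_config = {
    "sentence_attested": {
        "system_prompt": "You are a careful fact-checking assistant. Answer only YES or NO.",
        "user_prompt": (
            "Does the following document support the sentence?\n\n"
            "Sentence: {sentence}\n\nDocument: {document}\n\nAnswer YES or NO."
        ),
        "default_response": "NO",
    }
}

with open("my_prompt_config.json", "w") as f:
    json.dump(prompt_config, f, indent=2)

# Then pass it to eval.py via:  --prompt_config my_prompt_config.json  (or -P)
```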
Limited unit tests have been implemented in the `tests/` directory. To run all tests, `cd` into `tests/` and run `pytest`.
All bugs and feature requests should be submitted as an issue to this repository. When creating one, please describe the feature request or bug in as much detail as possible (including steps to reproduce in the latter case or desired functionality and suggested implementation in the former case).
If you use this evaluation framework in your research, please cite the following two papers:
```bibtex
@misc{walden2025autoarguellmbasedreportgeneration,
  title={Auto-ARGUE: LLM-Based Report Generation Evaluation},
  author={William Walden and Marc Mason and Orion Weller and Laura Dietz and John Conroy and Neil Molino and Hannah Recknor and Bryan Li and Gabrielle Kaili-May Liu and Yu Hou and Dawn Lawrie and James Mayfield and Eugene Yang},
  year={2025},
  eprint={2509.26184},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2509.26184},
}

@inproceedings{Mayfield2024OnTE,
  title={On the Evaluation of Machine-Generated Reports},
  author={James Mayfield and Eugene Yang and Dawn J Lawrie and Sean MacAvaney and Paul McNamee and Douglas W. Oard and Luca Soldaini and Ian Soboroff and Orion Weller and Efsun Kayi and Kate Sanders and Marc Mason and Noah Hibbler},
  booktitle={Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:269502216}
}
```