# Auto-ARGUE

This repository contains the source code for Auto-ARGUE, an LLM-based implementation of the ARGUE framework for report generation evaluation from "On the Evaluation of Machine-Generated Reports" (Mayfield et al., 2024).
The ARGUE framework evaluates generated reports based on citation usage and coverage of key information from the collection related to the query topic. Reports are evaluated sentence-by-sentence according to the following flow chart. Based on the outcomes shown in the flow chart, sentences accrue rewards (green circles) and penalties (red circles) for the report, depending on the correctness of their citations and their answers to a pre-defined set of questions (nuggets) for the topic. For a complete overview of ARGUE, see "On the Evaluation of Machine-Generated Reports" (Mayfield et al., 2024). For a brief summary, covering how sentences are processed by ARGUE (both those with citations and those without), see directly below.
The subsections below provide a description of how ARGUE operates in the abstract (i.e. not specifically in our implementation). Sentences that have citations attached to them are handled differently from those without citations.
**Sentences without citations:**

1. If the sentence contains a negative assertion (e.g. "X is not known" or "There is no evidence that Y"):
   - Reward (+) if there is a nugget confirming this assertion
   - Penalize (-) if no nugget confirms this assertion
2. For statements requiring citations:
   - Penalize (-) if this is the first occurrence of the statement
   - Ignore (0) if the statement was previously cited in the report
3. For statements not requiring citations (e.g. introductory text):
   - Ignore (0)

**Sentences with citations:**

1. Check whether each cited document supports the statement:
   - Penalize (-) if any cited document does not support the statement
   - Continue to step 2 if all cited documents support the statement
2. Check nugget matching:
   - Reward (+) for each nugget the sentence correctly answers
   - Ignore (0) each unmatched nugget

Additional notes on scoring:

- A sentence can receive multiple rewards for matching multiple nuggets
- Each unique nugget counts only once for recall
- Ignored sentences (score = 0) do not affect precision
- In our implementation, any penalized sentence or nugget (-) counts against precision only by being added to the denominator (not by being subtracted from the numerator)
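To make the flow above concrete, here is a minimal Python sketch of the per-sentence logic. It is illustrative only: the function, its parameter names, and the numeric reward scheme are ours, not the repository's implementation, and it assumes the individual judgments have already been made.

```python
# Illustrative sketch only: not the repository's implementation.
def score_sentence(
    has_citations: bool,
    is_negative_assertion: bool,
    negative_assertion_confirmed: bool,
    requires_citation: bool,
    is_first_instance: bool,
    all_citations_support: bool,
    num_nuggets_answered: int,
) -> int:
    """Return the net reward (+) / penalty (-) for one sentence, given its judgments."""
    if not has_citations:
        if is_negative_assertion:
            return 1 if negative_assertion_confirmed else -1
        if requires_citation:
            return -1 if is_first_instance else 0  # repeated statements are ignored
        return 0                                   # e.g. introductory text

    if not all_citations_support:
        return -1                  # some cited document fails to support the sentence
    return num_nuggets_answered    # one reward per correctly answered nugget (0 = ignored)


# Example: a cited sentence supported by all its citations that answers two nuggets
print(score_sentence(True, False, False, True, True, True, 2))   # -> 2
```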
Install UV (if not already installed):

```bash
curl -LsSf https://github.com/astral-sh/uv/releases/latest/download/uv-installer.sh | sh
```
Clone the repo and set up the virtual environment (which will be named `argue`):

```bash
git clone git@github.com:hltcoe/auto-argue.git && cd auto-argue
chmod +x uv_install.sh && source uv_install.sh
```
(Optional) install in development mode:

```bash
uv pip install -e .
```
Note: You may need to adjust the `torch` distribution depending on the system you're running on.
Auto-ARGUE uses LangChain to interact with several LLM providers:

- Together AI (`together`)
- OpenAI (`openai`)
- Anthropic (`anthropic`)
- HuggingFace, including:
  - The HuggingFace Inference Providers API (`huggingface`)
  - Models you download and run locally from the HuggingFace model hub (`huggingface_local`)
Create a `.env` file with your API key(s). You can also put these keys in your `.bashrc` file or similar (you need only supply keys for providers you intend to use):

```bash
TOGETHER_API_KEY=your_togetherai_api_key_here
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
HUGGINGFACEHUB_API_TOKEN=your_huggingface_key_here
```
ARGUE issues asynchronous requests to model providers. By default, a maximum of 10 parallel requests are permitted (enforced via a semaphore). This maximum can be adjusted by setting the `ARGUE_MAX_CONCURRENCY` environment variable, which may be necessary to avoid hitting rate limits, depending on your model, provider, and usage tier.
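For illustration, the cap works along the lines of the standard asyncio semaphore pattern below. This is a generic sketch, not the repository's code; `make_request` is a stand-in for an actual provider call.

```python
import asyncio
import os

MAX_CONCURRENCY = int(os.environ.get("ARGUE_MAX_CONCURRENCY", "10"))

async def make_request(prompt: str) -> str:
    # Stand-in for an actual LLM provider call.
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def bounded_request(semaphore: asyncio.Semaphore, prompt: str) -> str:
    async with semaphore:   # at most MAX_CONCURRENCY requests in flight at once
        return await make_request(prompt)

async def main() -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(
        *[bounded_request(semaphore, f"judgment {i}") for i in range(25)]
    )

if __name__ == "__main__":
    print(len(asyncio.run(main())))   # 25 responses, issued at most MAX_CONCURRENCY at a time
```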
Auto-ARGUE requires access to source documents in order to verify citations. The case study presented in our paper focuses on Chinese, Russian, and Farsi collections from the TREC 2024 NeuCLIR track (`neuclir/1/zh`, `neuclir/1/fa`, `neuclir/1/ru`). These collections can be downloaded from HuggingFace:

```bash
git clone https://huggingface.co/datasets/orionweller/neuclir-docs-lookup
```

Note: The collections are large (56G across all three) and thus should be stored on a large partition. Once you have downloaded the collections, set the `DOC_LOOKUP` environment variable to point to that path:

```bash
export DOC_LOOKUP=/path/to/neuclir-docs-lookup/
```
If you wish to evaluate the reports submitted to the TREC 2024 NeuCLIR track, you will need to obtain these through TREC: https://pages.nist.gov/trec-browser/trec33/neuclir/runs/. NOTE: If you are not a TREC participant, you may need to request access by emailing trec@nist.gov.
Nuggets for the NeuCLIR 2024 topics are included in `assets/neuclir-nuggets/`.
By default, the document lookup uses the path given by `DOC_LOOKUP`. If you want to use some other collection instead, you can pass a path to that collection via the `--collection-dir` (`-C`) option to `eval.py`.
Since the NeuCLIR collections are large, a mini collection is helpful for quick testing during development to verify the basic functionality of small changes. We provide such a collection (with only a handful of Russian documents) in `assets/sample-inputs/sample-docs-lookup/` in this repository, which can be provided as the argument to `--collection-dir` when evaluating on the sample report (`assets/sample-inputs/sample_report.jsonl`). The documents in this collection include all of those cited by `assets/sample-inputs/sample_report.jsonl` and all of those associated with the nuggets in `assets/sample-inputs/sample_nuggets_388.v3.json`. These are therefore the report and nuggets files you should use for testing on this mini collection.
A few points on how Auto-ARGUE works with documents in memory:
- Documents are loaded on first access and a document cache is created
- The cache is shared across evaluations
- The cache persists until the program exits
- Each collection typically requires 1-2GB of memory
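Conceptually, the caching behaves like the lazy lookup sketched below. This is an illustration of the access pattern only, not the repository's implementation; `load_from_disk` is a placeholder.

```python
# Illustration of the access pattern only (not the repository's implementation).
_DOC_CACHE: dict[str, str] = {}   # shared across evaluations; lives until the process exits

def load_from_disk(doc_id: str) -> str:
    # Placeholder for reading the document text from the on-disk lookup.
    return f"<text of {doc_id}>"

def get_document(doc_id: str) -> str:
    if doc_id not in _DOC_CACHE:          # loaded on first access...
        _DOC_CACHE[doc_id] = load_from_disk(doc_id)
    return _DOC_CACHE[doc_id]             # ...then served from memory thereafter
```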
Some additional tips for memory management:
- Process reports in smaller batches
- Clear cache between large batches
Before proceeding, please make sure you have chosen a model provider (e.g. `together`), set an API key for that provider (e.g. `TOGETHER_API_KEY`), and identified the appropriate path for your document lookup (i.e. `DOC_LOOKUP`). See Referencing the Lookup.
Once you have done this, you can run `auto_argue/eval.py`, which will produce judgments on a sample report (`annotate`) and then score those judgments (`score`). An example invocation is shown below, using Together AI as the model provider and running evaluation on the sample report:

```bash
python -m auto_argue.eval assets/sample-inputs/sample_report.jsonl assets/sample-inputs/sample_nuggets_388.v3.json results/sample -p together -a annotate score --verbose --validate-judgments
```
This will output the judgments to a file `sample.judgments.jsonl` and the corresponding scores to a file `sample.scores.tsv` — both in an output directory named `results/`. Examples of the judgments and scores files can be found in `assets/sample-outputs/`.
Here, we explain some of the most important command line options. Explanations of others can be found in `auto_argue/eval.py`:

- `-a` (`--actions`): specifies the actions that the script should run. There are two options: `annotate` (which produces the judgments) and `score` (which scores those judgments). In general, you will likely want to set both (i.e. `-a annotate score`). Note that when running `-a annotate score`, the code will try to find an existing judgments file to use for scoring that matches the provided `output_file_prefix` (above: `results/sample`). To force the model to regenerate the judgments file, specify the `--rerun` flag.
- `-p`: specifies the model provider (here, `together`). Other valid choices are `openai`, `anthropic`, `huggingface` (for models queried through the HuggingFace API), and `huggingface_local` (for HuggingFace models on disk).
- `-m` (`--model-name`): the unique identifier for the model to use from the chosen provider. Defaults for each provider are described below.
- `-C` (`--collection-dir`): path to the directory containing the collection(s) to use for judging and scoring.
- `-c` (`--collection`): the identifier(s) of a subset of the collection(s) within the `--collection-dir` to use (defaults to all three NeuCLIR collections: `neuclir/1/ru`, `neuclir/1/fa`, `neuclir/1/zh`).
- `--validate-judgments`: ensures that the judgments produced by the LLM are valid for scoring. We recommend always setting this flag.
You can specify the provider with the `-p` flag and the specific model to be used with the `-m` flag in the `eval.py` command described above. If you do not explicitly specify a provider, Auto-ARGUE will try to use Together AI by default. If you specify a provider (`-p`) without specifying a model (`-m`), the default model configured for that provider will be used:

- `openai`: `gpt-4o-mini-2024-07-18`
- `anthropic`: `claude-sonnet-4-5-20250929`
- `together`: `meta-llama/Llama-3.3-70B-Instruct-Turbo`
- `huggingface`: `meta-llama/Llama-3.3-70B-Instruct`
- `huggingface_local`: `meta-llama/Llama-3.1-8B-Instruct`
A note on reasoning models: Because Auto-ARGUE expects YES/NO responses to each judgment, maximum output tokens are restricted by default (to 10). This is fine for non-reasoning models, but can sometimes yield poor results when using a reasoning model. If you do use reasoning models, you may need to adjust the `max_tokens` parameter in `auto_argue/utils.py:get_model_response`.
The input JSONL file containing the reports to be judged/scored should contain entries with this structure, one per report:

```json
{
  "metadata": {
    "team_id": "my_fantastic_team",
    "run_id": "my_best_run_02",
    "topic_id": "101"
  },
  "responses": [
    {
      "text": "Sky is blue.",
      "citations": [
        "docid001",
        "docid003"
      ]
    },
    {
      "text": "The moon is made out of blue cheese.",
      "citations": [
        "docid002"
      ]
    },
    {
      "text": "This is all.",
      "citations": []
    }
  ],
  "references": [
    "docid001",
    "docid002",
    "docid003"
  ]
}
```
Alternatively, the `citations` field for each sentence in the `responses` can be a dictionary, with scores (e.g. for relevance) associated with each cited document:

```json
{
  "text": "Sky is blue.",
  "citations": {
    "docid001": 0.3,
    "docid003": 0.5
  }
}
```
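If you generate report files programmatically, a sketch like the following produces one valid JSONL line per report using only the fields shown above (the output filename and document IDs are placeholders):

```python
import json

report = {
    "metadata": {
        "team_id": "my_fantastic_team",
        "run_id": "my_best_run_02",
        "topic_id": "101",
    },
    "responses": [
        {"text": "Sky is blue.", "citations": ["docid001", "docid003"]},
        # Citations may also be a dict mapping doc IDs to scores:
        {"text": "The moon is made out of blue cheese.", "citations": {"docid002": 0.5}},
    ],
    "references": ["docid001", "docid002", "docid003"],
}

with open("my_reports.jsonl", "w") as f:
    f.write(json.dumps(report) + "\n")   # one JSON object per line, one line per report
```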
Nuggets are question-answer pairs associated with a topic that are used to evaluate how effectively a report on that topic recovers key information from the collection. Nuggets can have multiple answers, and they come in two varieties:

- `AND`: nuggets for which all answers must be provided by the report in order to be counted as correct.
- `OR`: nuggets for which at least one answer must be provided by the report in order to be counted as correct.
Each nugget answer is expected to be linked to a set of documents in the collection that attest that answer. Topics from the TREC 2024 NeuCLIR track each have between 10 and 20 nuggets associated with them.
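As a hypothetical illustration of the `AND`/`OR` distinction (not the repository's matching code, which relies on LLM judgments of the report's sentences):

```python
def nugget_matched(nugget_type: str, answers: list[str], answers_in_report: set[str]) -> bool:
    """Illustrative only: has the report provided enough of this nugget's answers?"""
    if nugget_type == "AND":
        return all(a in answers_in_report for a in answers)   # every answer must appear
    if nugget_type == "OR":
        return any(a in answers_in_report for a in answers)   # at least one answer must appear
    raise ValueError(f"unknown nugget type: {nugget_type}")

# An OR nugget is matched if the report provides either answer:
print(nugget_matched("OR", ["Paris", "the French capital"], {"Paris"}))   # True
# An AND nugget requires both:
print(nugget_matched("AND", ["Paris", "the French capital"], {"Paris"}))  # False
```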
The primary format for nugget files is illustrated in `assets/sample-inputs/sample_nuggets_388.v3.json`. (Other formats are supported but are deprecated and not discussed in this README.) All nugget files adopting this format are expected to follow a naming convention in which the topic ID is given as the last element of a '_'-separated list of terms and where the suffix is `.v3.json` (e.g. `nuggets_388.v3.json`). Failing to adhere to this convention will result in errors.
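For example, under this convention the topic ID for a nugget file can be recovered as sketched below (an illustration of the naming convention, not the repository's parsing code):

```python
from pathlib import Path

def topic_id_from_nugget_filename(path: str) -> str:
    """e.g. 'sample_nuggets_388.v3.json' -> '388'."""
    name = Path(path).name
    assert name.endswith(".v3.json"), "nugget files are expected to use the .v3.json suffix"
    stem = name[: -len(".v3.json")]   # e.g. 'sample_nuggets_388'
    return stem.split("_")[-1]        # the topic ID is the last '_'-separated element

print(topic_id_from_nugget_filename("assets/sample-inputs/sample_nuggets_388.v3.json"))  # 388
```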
The `auto_argue/eval.py` script produces a set of judgments for each sentence of a report. There are six types of judgments, each of which directly corresponds to one of the blue diamonds shown in the ARGUE framework diagram (see above). The judgment types are as follows and are specified in `auto_argue/utils.py`:

- `SENTENCE_ATTESTED`: Is the report sentence attested/supported by a specific document it cites?
- `SENTENCE_ANSWERS_QUESTION`: For a given nugget question, does the report sentence provide a correct answer to that question?
- `REQUIRES_CITATION`: Does the report sentence require a citation?
- `CITED_DOCUMENT_RELEVANCE`: For a given cited document, is this document relevant? (NOTE: in Auto-ARGUE, a document is considered relevant iff it provides at least part of an answer to at least one of the nugget questions associated with the report request.)
- `FIRST_INSTANCE`: Does the report sentence repeat information relayed earlier in the report?
- `NEGATIVE_ASSERTION`: Does the report sentence assert that some information is unknown or not answerable?
This last judgment type (`NEGATIVE_ASSERTION`), though part of the ARGUE framework, is not currently supported in our automatic implementation. While this would be interesting to explore, in practice there are very few of these sorts of statements in the NeuCLIR reports on which we evaluated our implementation.
The `CITED_DOCUMENT_RELEVANCE` judgment type is unique in being — by default — a deterministic check that DOES NOT invoke an LLM. Recall from above that a document is considered relevant iff it was annotated (by a human assessor) as containing at least part of an answer to at least one nugget question. We can therefore determine whether a document is relevant simply via lookup, checking whether it was annotated as answering a nugget question. Alternatively, you have the option to use an LLM to assess which nuggets a given cited document attests. To enable this behavior, set the `--always-check-all-nuggets` option for `eval.py`. WARNING: this is very expensive and will issue many requests.
All of the remaining judgments are evaluated via an LLM call. The system and user prompts used for these calls can be found in `auto_argue/prompts/`. These prompts all expect a "YES" or "NO" response. In general, recent models are competent at abiding by instructions to output only one of these two responses. Occasionally, however, a model may fail to do this. In such cases, we currently fall back to a "default" response for each judgment type. These default responses are as follows:

- `SENTENCE_ATTESTED`: false
- `SENTENCE_ANSWERS_QUESTION`: false
- `REQUIRES_CITATION`: true
- `FIRST_INSTANCE`: true
Judgments follow a Pydantic data model defined in `auto_argue/judge.py`. Each judgment is saved in the output as a JSON dictionary with the following keys:

- `judgment_type_id`: a unique identifier for the judgment type (see the `JudgmentType` class in `auto_argue/utils.py`)
- `response`: the model response(s) for this judgment
- `evaluator`: a unique identifier for the model or system that produced this judgment
- `provenance`: information needed to determine what the judgment is about (e.g. a document identifier for `SENTENCE_ATTESTED`)

The format of the `response` and `provenance` fields is dependent on the judgment type (`judgment_type_id`). The data models for specific judgment types can also be found in `auto_argue/judge.py`.
The JSONL file output by `auto_argue/eval.py` has one line per report with the following fields:

- `request_id`: the report request ID for this report (`str`)
- `run_id`: an identifier for the run that produced this report (`str`)
- `collection_ids`: a list of the names of the document collections used to produce and evaluate this report (`List[str]`)
- `segments`: a list of sentences in the report, including the citations and judgments associated with them (`List[Dict[str, Any]]`)
Each element of `segments` has the following fields:

- `segment_type`: what kind of text segment this is (here, always `sentence`)
- `text`: the sentence text (`str`)
- `citations`: a list of documents cited by this sentence, each represented by its document ID (`doc_id`) and the document text (`text`) (`List[Dict[str, str]]`)
- `judgments`: a list of ARGUE judgments produced for this sentence (`List[Judgment]`)
Auto-ARGUE produces a number of metrics, which are written to a `*.scores.tsv` file.

We note that ARGUE itself does not dictate specific metrics to be reported, but rather provides a framework for producing judgments about a report, which can then be scored using various metrics. Auto-ARGUE emphasizes two primary metrics that were suggested by Mayfield et al. (2024):

- Nugget Coverage (also called Nugget Recall) = (Number of unique nuggets correctly answered by the report) / (Total number of nuggets for the request)
  - This is reported as `nugget_coverage` in the score TSVs
- Sentence Support = (Number of rewarded sentences) / (Total scored sentences)
  - This is reported as `sentence_support` in the score TSVs
We also report an F1 score computed on the basis of the above two scores:

- `f1`: an F1 score computed from `nugget_coverage` and `sentence_support`

Importantly, there is also a weighted version of `nugget_coverage` (`nugget_coverage_weighted`) based on the importance label associated with each nugget — either `vital` or `okay` — as well as a corresponding `f1_weighted` score. Currently, `vital` nuggets carry a weight of 2.0 and `okay` nuggets carry a weight of 1.0. These weights were set manually based on close investigation and quality control of different nugget sets. If no importance labels are provided for a given nugget set, `nugget_coverage_weighted` will be equivalent to `nugget_coverage`.
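For concreteness, the relationship between these scores can be sketched as follows. The F1 computation assumes the standard harmonic mean; the weighted-coverage normalization shown here is our assumption for illustration, and the authoritative definitions live in the scoring code.

```python
WEIGHTS = {"vital": 2.0, "okay": 1.0}

def f1(nugget_coverage: float, sentence_support: float) -> float:
    # Standard harmonic mean of the two primary metrics.
    if nugget_coverage + sentence_support == 0:
        return 0.0
    return 2 * nugget_coverage * sentence_support / (nugget_coverage + sentence_support)

def nugget_coverage_weighted(correct_labels: list[str], all_labels: list[str]) -> float:
    # Assumed weight-normalized recall: weights of correctly answered nuggets over all weights.
    return sum(WEIGHTS[l] for l in correct_labels) / sum(WEIGHTS[l] for l in all_labels)

print(f1(0.4, 0.8))                                                     # ~0.533
print(nugget_coverage_weighted(["vital"], ["vital", "okay", "okay"]))   # 0.5
```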
We report per-topic results for all metrics, but we additionally report micro- and macro-averaged versions of the following metrics for each run across all topics:
- `nugget_coverage`
- `nugget_coverage_weighted`
- `sentence_support`
- `f1`
- `f1_weighted`
- `citation_support` (see below)
- `citation_relevance` (see below)

For these aggregated results, `_micro` or `_macro` is appended to the base metric name to distinguish between the micro- and macro-averaged versions.
Beyond `nugget_coverage`, `sentence_support`, and `f1`, there are a number of additional metrics that will be written to the output score TSVs:

- `sentences`: the total number of sentences in the report
- `correctly_cited_sentences`: the number of report sentences that (1) have at least one citation, and where (2) all of these citations (individually) support the sentence
- `sentences_missing_citation`: the number of report sentences that lack a citation and are judged to need one
- `first_instance_sentences_missing_citation`: of `sentences_missing_citation`, the number of sentences that are also judged to be "first instances" — i.e. sentences that present novel information not discussed earlier in the report
- `citations`: the total number of citations present in the report. (Note that a single document cited by two different sentences counts as two citations.)
- `relevant_citations`: of `citations`, the number of citations to relevant documents — i.e. documents that contain an answer to at least one nugget question
- `supporting_citations`: of `citations`, the number that are judged to support the sentence they're attached to (regardless of whether they are relevant)
- `citation_support`: the proportion of citations that are judged to support the sentence they're attached to
- `citation_relevance`: the proportion of cited documents that are relevant
- `correct_nuggets`: the total number of nuggets for which the report provides the full correct answer. For `AND` nuggets, a system must produce all answers for that nugget. For `OR` nuggets, a system must produce one answer for that nugget (and no incorrect answers).
The core ARGUE functionality can be invoked from within Python by calling the `evaluate_report` function on a valid report (see the expected report format above). Note that the `evaluate_report` function is asynchronous, so you will need to `await` it. Example usage is shown below.
```python
import asyncio
import json
import os
from pathlib import Path

from auto_argue.judge import evaluate_report, ModelProvider
from auto_argue.utils import Report


async def evaluate():
    my_reports = []
    with open("assets/sample-inputs/sample_report.jsonl", "r") as f:
        for line in f:
            my_reports.append(Report.model_validate(json.loads(line)))

    collection_dir = "assets/sample-inputs/sample-docs-lookup/"
    assert os.path.isdir(
        collection_dir
    ), f"Collection path {collection_dir} does not exist!"
    collection_ids = ["neuclir/1/ru", "neuclir/1/fa", "neuclir/1/zh"]

    return await asyncio.gather(
        *[
            evaluate_report(
                r,
                Path("assets/sample-inputs/sample_nuggets_388.v3.json"),
                collection_ids,
                collection_dir,
                provider=ModelProvider.TOGETHER,
            )
            for r in my_reports
        ]
    )


if __name__ == "__main__":
    results = asyncio.run(evaluate())
    print(results)
```
By default, ARGUE uses the prompts in `auto_argue/prompts/` to obtain judgments from the LLM judge. However, prompts for all ARGUE judgment types can be customized — and likely should be if you're working with novel data or models! To do so, you can supply a prompt configuration JSON file as an argument to the `--prompt_config` (`-P`) command line option. This file should contain a JSON dictionary with one or more of the following keys, each corresponding to a specific ARGUE judgment type (see Judgment Types):

- `negative_assertion`
- `requires_citation`
- `first_instance`
- `cited_document_relevance`
- `sentence_attested`
- `sentence_answers_question`
The value corresponding to each supplied key should itself be a dictionary with the following fields:

- `user_prompt` (`str`): the user prompt to be used for this judgment
- `system_prompt` (`str`): the system prompt to be used for this judgment
- `default_response` (`Optional[str]`; `YES` or `NO`): the default response to be returned in cases where the model's output cannot be parsed
Each judgment type expects certain template variables to be included in the `user_prompt` for that judgment (wrapped in curly brackets, e.g. `{sentence}`). These variables are as follows:

- `negative_assertion`: `sentence`
- `requires_citation`: `sentence`
- `first_instance`: `sentence`, `previous_sentences`
- `cited_document_relevance`: `sentence`, `document`
- `sentence_attested`: `sentence`, `document`
- `sentence_answers_question`: `sentence`, `nugget_question`, `nugget_answer`
Apart from these constraints, you have complete freedom with regard to the structure of the prompts. Note that the system prompts are not expected to contain any template variables. An example prompt configuration file that modifies only the `sentence_attested` prompts can be found in `assets/sample-inputs/sample_prompt_config.json`.
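For instance, a minimal configuration that overrides only the `sentence_attested` prompts could be generated as below. The prompt wording and output filename here are invented for illustration; only the keys, fields, and template variables follow the specification above.

```python
import json

prompt_config = {
    "sentence_attested": {
        "system_prompt": "You are a careful fact-checking assistant. Answer only YES or NO.",
        "user_prompt": (
            "Does the following document support the sentence?\n\n"
            "Sentence: {sentence}\n\nDocument: {document}\n\nAnswer YES or NO."
        ),
        "default_response": "NO",
    }
}

with open("my_prompt_config.json", "w") as f:
    json.dump(prompt_config, f, indent=2)

# Then pass it to eval.py via:  --prompt_config my_prompt_config.json  (or -P)
```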
Limited unit tests have been implemented in the `tests/` directory. To run all tests, `cd` into `tests/` and run `pytest`.
All bugs and feature requests should be submitted as an issue to this repository. When creating one, please describe the feature request or bug in as much detail as possible (including steps to reproduce in the latter case or desired functionality and suggested implementation in the former case).
If you use this evaluation framework in your research, please cite the following two papers:
```bibtex
@misc{walden2025autoarguellmbasedreportgeneration,
  title={Auto-ARGUE: LLM-Based Report Generation Evaluation},
  author={William Walden and Marc Mason and Orion Weller and Laura Dietz and John Conroy and Neil Molino and Hannah Recknor and Bryan Li and Gabrielle Kaili-May Liu and Yu Hou and Dawn Lawrie and James Mayfield and Eugene Yang},
  year={2025},
  eprint={2509.26184},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2509.26184},
}

@inproceedings{Mayfield2024OnTE,
  title={On the Evaluation of Machine-Generated Reports},
  author={James Mayfield and Eugene Yang and Dawn J Lawrie and Sean MacAvaney and Paul McNamee and Douglas W. Oard and Luca Soldaini and Ian Soboroff and Orion Weller and Efsun Kayi and Kate Sanders and Marc Mason and Noah Hibbler},
  booktitle={Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:269502216}
}
```