Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Arman Cohan

2025

NeurIPS

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and…

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

David Wadden, Kejian Shi, Jacob Daniel Morrison, Arman Cohan

2025

EMNLP

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span…

Intent-Aware Schema Generation And Refinement For Literature Review Tables

Vishakh Padmakumar, Joseph Chee Chang, Kyle Lo, Aakanksha Naik

2025

EMNLP

The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by…

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Daniel S. Weld

2025

arXiv

AI agents hold great real-world promise, with the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new…

Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Sergey Feldman

2025

ACL

Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2…

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi Gao, Yixin Liu, Xinyu Hu, Arman Cohan

2025

NAACL

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…

Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences

Ruotong Wang, Xinyi Zhou, Lin Qiu, Amy X. Zhang

2025

CHI

AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Peter Clark

2025

ACL (Findings)

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore…

Paloma: A Benchmark for Evaluating Language Model Fit

Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Jesse Dodge

2024

NeurIPS

Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains–varying distributions of…

ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models

Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Kyle Lo

2024

EMNLP

When conducting literature reviews, scientists often create literature review tables—tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and…
