Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and…
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span…
Intent-Aware Schema Generation And Refinement For Literature Review Tables
The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by…
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
AI agents hold great real-world promise, with the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new…
Ai2 Scholar QA: Organized Literature Synthesis with Attribution
Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2…
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…
Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences
AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…
CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore…
Paloma: A Benchmark for Evaluating Language Model Fit
Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains–varying distributions of…
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
When conducting literature reviews, scientists often create literature review tables—tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and…