Ai2

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.


SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Arman Cohan
2025
NeurIPS

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and… 

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

David Wadden, Kejian Shi, Jacob Daniel Morrison, Arman Cohan
2025
EMNLP

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span… 

Intent-Aware Schema Generation And Refinement For Literature Review Tables

Vishakh Padmakumar, Joseph Chee Chang, Kyle Lo, Aakanksha Naik
2025
EMNLP

The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by… 

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Daniel S. Weld
2025
arXiv

AI agents hold great real-world promise, with the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new… 

Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Sergey Feldman
2025
ACL

Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2… 

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi Gao, Yixin Liu, Xinyu Hu, Arman Cohan
2025
NAACL

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human… 

Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences

Ruotong Wang, Xinyi Zhou, Lin Qiu, Amy X. Zhang
2025
CHI

AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or… 

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Peter Clark
2025
ACL (Findings)

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore… 

Paloma: A Benchmark for Evaluating Language Model Fit

Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Jesse Dodge
2024
NeurIPS

Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains–varying distributions of… 

ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models

Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Kyle Lo
2024
EMNLP

When conducting literature reviews, scientists often create literature review tables—tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and… 
