
Showing 1–50 of 75 results for author: Eisenstein, J

Searching in archive cs.
  1. arXiv:2506.07949  [pdf, ps, other]

    cs.LG

    Cost-Optimal Active AI Model Evaluation

    Authors: Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch

    Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the u…

    Submitted 9 June, 2025; originally announced June 2025.
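
    The truncated abstract describes balancing cheap synthetic annotations against costly human ones. As a minimal illustration (not the paper's estimator; all names and numbers below are made up), a bias-corrected estimate in the spirit of prediction-powered inference combines a synthetic mean over a large pool with a human-measured correction on a small jointly annotated sample:

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy data: synthetic scores on a large pool, plus a small sample
        # that also has human labels.
        synth_pool = rng.normal(0.70, 0.10, size=10_000)
        human_sub = rng.normal(0.65, 0.10, size=200)
        synth_sub = human_sub + 0.05 + rng.normal(0, 0.05, size=200)

        # Synthetic mean on the pool, corrected by the human-vs-synthetic
        # gap measured where both annotations exist.
        estimate = synth_pool.mean() + (human_sub - synth_sub).mean()
        print(f"bias-corrected estimate: {estimate:.3f}")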

  2. arXiv:2503.14481  [pdf, other]

    cs.LG cs.CL

    Don't lie to your friends: Learning what you know from collaborative self-play

    Authors: Jacob Eisenstein, Reza Aghajani, Adam Fisch, Dheeru Dua, Fantine Huot, Mirella Lapata, Vicky Zayats, Jonathan Berant

    Abstract: To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We there…

    Submitted 31 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

  3. arXiv:2412.19792  [pdf, other]

    cs.LG cs.CL cs.IT

    InfAlign: Inference-aware language model alignment

    Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

    Abstract: Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch m…

    Submitted 6 February, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

  4. arXiv:2410.18077  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    ALTA: Compiler-Based Analysis of Transformers

    Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova

    Abstract: We propose a new programming language called ALTA and a compiler that can map ALTA programs to Transformer weights. ALTA is inspired by RASP, a language proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler from RASP programs to Transformer weights. ALTA complements and extends this prior work, offering the ability to express loops and to compile programs to Universal Trans…

    Submitted 19 June, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: TMLR 2025

  5. arXiv:2410.08146  [pdf, other]

    cs.LG cs.CL

    Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

    Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

    Abstract: A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically…

    Submitted 10 October, 2024; originally announced October 2024.
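
    A minimal sketch of the ORM/PRM contrast from the abstract; `orm` and `prm` below are hypothetical scoring functions standing in for trained models:

        def orm_score(trace, orm):
            # Outcome reward model: a single score for the final step only.
            return orm(trace[-1])

        def prm_scores(trace, prm):
            # Process reward model: a score after every prefix of the
            # reasoning trace, enabling finer-grained credit assignment.
            return [prm(trace[: i + 1]) for i in range(len(trace))]

        trace = ["step 1", "step 2", "final answer"]
        print(prm_scores(trace, prm=lambda prefix: len(prefix)))  # toy scorer -> [1, 2, 3]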

  6. arXiv:2409.00358  [pdf, other]

    cs.CL cs.AI

    Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models

    Authors: Dipankar Srirag, Aditya Joshi, Jacob Eisenstein

    Abstract: Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties ('dialects' for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speake…

    Submitted 31 January, 2025; v1 submitted 31 August, 2024; originally announced September 2024.

    Comments: Accepted to NAACL 2025

  7. arXiv:2405.19316  [pdf, other]

    cs.LG cs.CL

    Robust Preference Optimization through Reward Model Distillation

    Authors: Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

    Abstract: Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, the empirical evidence suggests that DPO typically assigns im…

    Submitted 3 March, 2025; v1 submitted 29 May, 2024; originally announced May 2024.
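
    For orientation, a sketch of the standard per-pair DPO objective that the abstract refers to (the textbook loss, not this paper's distillation method); the inputs are sequence log-likelihoods under the trained policy and the frozen reference:

        import numpy as np

        def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
            # -log(sigmoid(beta * margin)), where the margin is the gap in
            # policy/reference log-ratios between chosen (w) and rejected (l).
            margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
            return -np.log(1.0 / (1.0 + np.exp(-margin)))

        print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))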

  8. arXiv:2404.12318  [pdf, other]

    cs.CL

    Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

    Authors: Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami

    Abstract: Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is…

    Submitted 14 October, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    Comments: EMNLP 2024

  9. arXiv:2402.00742  [pdf, other]

    cs.CL cs.AI

    Transforming and Combining Rewards for Aligning Large Language Models

    Authors: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch

    Abstract: A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is ``better'' than others? Second, we oft…

    Submitted 19 July, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    MSC Class: 68T50 ACM Class: I.2
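
    A small worked example of the abstract's first question: any strictly monotone transformation of the reward preserves the preference ranking (log-sigmoid below is just one illustrative choice):

        import numpy as np

        rewards = np.array([-1.3, 0.2, 2.5])
        transformed = -np.log1p(np.exp(-rewards))  # log(sigmoid(r)), strictly monotone

        # The ranking of candidates is unchanged by the transformation.
        assert (np.argsort(rewards) == np.argsort(transformed)).all()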

  10. arXiv:2401.01879  [pdf, ps, other]

    cs.LG cs.CL cs.IT

    Theoretical guarantees on the best-of-n alignment policy

    Authors: Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh

    Abstract: A simple and effective method for the inference-time alignment and scaling test-time compute of generative models is best-of-$n$ sampling, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the referen…

    Submitted 28 May, 2025; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: ICML 2025
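
    A minimal sketch of the best-of-$n$ procedure the abstract analyzes:

        def best_of_n(prompt, policy, reward, n=8):
            # Draw n samples from the reference policy, rank by reward,
            # and keep the highest-scoring candidate.
            candidates = [policy(prompt) for _ in range(n)]
            return max(candidates, key=reward)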

  11. arXiv:2312.09244  [pdf, other]

    cs.LG

    Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

    Authors: Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant

    Abstract: Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust…

    Submitted 16 August, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Published at the 2024 Conference on Language Modeling (CoLM)
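
    A minimal sketch of reward-model ensembling as described in the abstract; mean and worst-case aggregation are two natural choices (the truncation does not show the paper's exact aggregators):

        import numpy as np

        def ensemble_reward(candidate, reward_models, how="mean"):
            # Conservative aggregations such as the minimum hedge against
            # any single reward model being exploited.
            scores = np.array([rm(candidate) for rm in reward_models])
            return scores.min() if how == "min" else scores.mean()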

  12. arXiv:2305.14613  [pdf, other]

    cs.CL cs.AI

    Selectively Answering Ambiguous Questions

    Authors: Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein

    Abstract: Trustworthy language models should abstain from answering questions when they do not know the answer. However, the answer to a question can be unknown for a variety of reasons. Prior research has focused on the case in which the question is clear and the answer is unambiguous but possibly unknown, but the answer to a question can also be unclear due to uncertainty of the questioner's intent or con…

    Submitted 14 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: To appear in EMNLP 2023. 9 pages, 5 figures, 2 pages of appendix
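
    One simple abstention recipe in the spirit of the abstract (a sampling-based sketch, not necessarily the paper's method): sample several answers and abstain when they disagree too much.

        from collections import Counter

        def answer_or_abstain(question, sample_answer, n=10, threshold=0.7):
            # `sample_answer` is any stochastic question-answering function.
            answers = [sample_answer(question) for _ in range(n)]
            top, count = Counter(answers).most_common(1)[0]
            return top if count / n >= threshold else None  # None = abstain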

  13. arXiv:2305.11355  [pdf, other]

    cs.CL

    MD3: The Multi-Dialect Dataset of Dialogues

    Authors: Jacob Eisenstein, Vinodkumar Prabhakaran, Clara Rivera, Dorottya Demszky, Devyani Sharma

    Abstract: We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States. The Multi-Dialect Dataset of Dialogues (MD3) strikes a new balance between open-ended conversational speech and task-oriented dialogue by prompting participants to perform a series of short information-sharing tasks. This facilitates quantitative cross-dialectal comparison, while av…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: InterSpeech 2023

  14. arXiv:2212.08037  [pdf, other]

    cs.CL

    Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

    Authors: Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Lierni Sestorain Saralegui, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan Das, Donald Metzler, Slav Petrov, Kellie Webster

    Abstract: Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of…

    Submitted 10 February, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

  15. arXiv:2211.00922  [pdf, other]

    cs.CL

    Dialect-robust Evaluation of Generated Text

    Authors: Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann

    Abstract: Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, currently, there exists no way to quantify how metrics respond to change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as…

    Submitted 2 November, 2022; originally announced November 2022.

  16. arXiv:2210.13628  [pdf, other]

    cs.CL cs.CY cs.SI

    Predicting Long-Term Citations from Short-Term Linguistic Influence

    Authors: Sandeep Soni, David Bamman, Jacob Eisenstein

    Abstract: A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps:…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 17 pages, 3 figures, to appear in the Findings of EMNLP 2022

  17. arXiv:2210.11005  [pdf, ps, other]

    cs.CL cs.AI

    Pre-trained Sentence Embeddings for Implicit Discourse Relation Classification

    Authors: Murali Raghu Babu Balusu, Yangfeng Ji, Jacob Eisenstein

    Abstract: Implicit discourse relations bind smaller linguistic units into coherent texts. Automatic sense prediction for implicit relations is hard, because it requires understanding the semantics of the linked arguments. Furthermore, annotated datasets contain relatively few labeled examples, due to the scale of the phenomenon: on average each discourse relation encompasses several dozen words. In this pap…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: 6 pages

  18. arXiv:2210.02498  [pdf, other]

    cs.CL cs.LG

    Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model

    Authors: Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, David Mimno

    Abstract: Explainable question answering systems should produce not only accurate answers but also rationales that justify their reasoning and allow humans to check their work. But what sorts of rationales are useful and how can we train systems to produce them? We propose a new style of rationale for open-book question answering, called \emph{markup-and-mask}, which combines aspects of extractive and free-…

    Submitted 24 April, 2024; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: added details about a human evaluation

  19. arXiv:2204.04487  [pdf, other]

    cs.CL cs.LG

    Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language

    Authors: Jacob Eisenstein

    Abstract: Spurious correlations are a threat to the trustworthiness of natural language processing systems, motivating research into methods for identifying and eliminating them. However, addressing the problem of spurious correlations requires more clarity on what they are and how they arise in language data. Gardner et al (2021) argue that the compositional nature of language implies that \emph{all} corre…

    Submitted 3 May, 2022; v1 submitted 9 April, 2022; originally announced April 2022.

    Comments: NAACL 2022

  20. arXiv:2109.00725  [pdf, other]

    cs.CL cs.LG

    Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond

    Authors: Amir Feder, Katherine A. Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E. Roberts, Brandon M. Stewart, Victor Veitch, Diyi Yang

    Abstract: A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the conver…

    Submitted 30 July, 2022; v1 submitted 2 September, 2021; originally announced September 2021.

    Comments: Accepted to Transactions of the Association for Computational Linguistics (TACL)

  21. arXiv:2108.00391  [pdf, other]

    cs.CL

    Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

    Authors: Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

    Abstract: Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these mode…

    Submitted 1 August, 2021; originally announced August 2021.

  22. arXiv:2106.16171  [pdf, other]

    cs.CL

    Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer

    Authors: Iulia Turc, Kenton Lee, Jacob Eisenstein, Ming-Wei Chang, Kristina Toutanova

    Abstract: Despite their success, large pre-trained multilingual models have not completely alleviated the need for labeled data, which is cumbersome to collect for all target languages. Zero-shot cross-lingual transfer is emerging as a practical solution: pre-trained models later fine-tuned on one transfer language exhibit surprising performance when tested on many target languages. English is the dominant…

    Submitted 30 June, 2021; originally announced June 2021.

  23. arXiv:2106.16163  [pdf, other]

    cs.CL

    The MultiBERTs: BERT Reproductions for Robustness Analysis

    Authors: Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick

    Abstract: Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure which includes the architecture, training data, initialization scheme, and loss function. Recent work has shown that r…

    Submitted 21 March, 2022; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: Accepted at ICLR'22. Checkpoints and example analyses: http://goo.gle/multiberts

  24. Time-Aware Language Models as Temporal Knowledge Bases

    Authors: Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, William W. Cohen

    Abstract: Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. But language models (LMs) are trained on snapshots of data collected at a specific moment in time, and this can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic datas…

    Submitted 23 April, 2022; v1 submitted 29 June, 2021; originally announced June 2021.

    Comments: Version accepted to TACL

    Journal ref: Transactions of the Association for Computational Linguistics 2022; 10 257-273

  25. arXiv:2106.00545  [pdf, other]

    cs.LG cs.AI stat.ML

    Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests

    Authors: Victor Veitch, Alexander D'Amour, Steve Yadlowsky, Jacob Eisenstein

    Abstract: Informally, a 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character; e.g., changing the gender of a sentence's subject changes a sentiment predictor's output. To check for spurious correlations, we can 'stress test' models by perturbing irrelevant parts of inp…

    Submitted 2 November, 2021; v1 submitted 31 May, 2021; originally announced June 2021.

    Comments: Published at NeurIPS 2021 (spotlight)
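
    A minimal sketch of the stress test described in the abstract: perturb an aspect of the input that should not matter and flag the example if the prediction moves (the perturbation and predictor below are toy stand-ins):

        def passes_stress_test(predictor, text, perturb):
            # A changed prediction under an irrelevant perturbation signals
            # a potential spurious correlation.
            return predictor(text) == predictor(perturb(text))

        swap_gender = lambda t: t.replace("He ", "She ")
        toy_predictor = lambda t: "positive" if "great" in t else "negative"
        print(passes_stress_test(toy_predictor, "He had a great time.", swap_gender))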

  26. arXiv:2103.07538  [pdf, other]

    cs.CL cs.CY cs.DL cs.SI

    Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers

    Authors: Sandeep Soni, Lauren Klein, Jacob Eisenstein

    Abstract: The abolitionist movement of the nineteenth-century United States remains among the most significant social and political movements in US history. Abolitionist newspapers played a crucial role in spreading information and shaping public opinion around a range of issues relating to the abolition of slavery. These newspapers also serve as a primary source of information about the movement for schola…

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: 23 pages, 6 figures, 2 tables

    Journal ref: Journal of Cultural Analytics (2021)

  27. arXiv:2102.13140  [pdf, other]

    cs.DC

    Checkpointing with cp: the POSIX Shared Memory System

    Authors: Lehman H. Garrison, Daniel J. Eisenstein, Nina A. Maksimova

    Abstract: We present the checkpointing scheme of Abacus, an $N$-body simulation code that allocates all persistent state in POSIX shared memory, or ramdisk. Checkpointing becomes as simple as copying files from ramdisk to external storage. The main simulation executable is invoked once per time step, memory mapping the input state, computing the output state directly into ramdisk, and unmapping the input st…

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: 3 pages, 1 figure. Extended abstract accepted by SuperCheck21. Symposium presentation at https://drive.google.com/file/d/1q63kk1TCyOuh15Lu47bUJ8K7iZ-pYP9U/view
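
    A sketch of the checkpointing pattern the abstract describes, with illustrative paths: because all persistent state already lives in POSIX shared memory, a checkpoint reduces to copying files out of the ramdisk.

        import shutil
        from pathlib import Path

        def checkpoint(state_dir=Path("/dev/shm/sim_state"),
                       dest=Path("/backup/step_0042")):
            # Assumes a flat directory of state files on the ramdisk.
            dest.mkdir(parents=True, exist_ok=True)
            for f in state_dir.iterdir():
                if f.is_file():
                    shutil.copy2(f, dest / f.name)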

  28. arXiv:2101.06368  [pdf, other]

    cs.CL

    Tuiteamos o pongamos un tuit? Investigating the Social Constraints of Loanword Integration in Spanish Social Media

    Authors: Ian Stewart, Diyi Yang, Jacob Eisenstein

    Abstract: Speakers of non-English languages often adopt loanwords from English to express new or unusual concepts. While these loanwords may be borrowed unchanged, speakers may also integrate the words to fit the constraints of their native language, e.g. creating Spanish "tuitear" from English "tweet." Linguists have often considered the process of loanword integration to be more dependent on language-inte…

    Submitted 15 January, 2021; originally announced January 2021.

    ACM Class: I.2.7

    Journal ref: Society for Computation in Linguistics, 2021

  29. arXiv:2011.03395  [pdf, other]

    cs.LG stat.ML

    Underspecification Presents Challenges for Credibility in Modern Machine Learning

    Authors: Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, et al. (15 additional authors not shown)

    Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predict…

    Submitted 24 November, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: Updates: Updated statistical analysis in Section 6; Additional citations

  30. arXiv:2010.12707  [pdf, other]

    cs.CL

    Learning to Recognize Dialect Features

    Authors: Dorottya Demszky, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran, Jacob Eisenstein

    Abstract: Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He {} running". In this paper, we introduce the task of dialect feature detection…

    Submitted 6 May, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: NAACL camera-ready

  31. arXiv:2009.09123  [pdf, other]

    cs.CL cs.AI

    Will it Unblend?

    Authors: Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

    Abstract: Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV b…

    Submitted 18 September, 2020; originally announced September 2020.

    Comments: Findings of EMNLP 2020

  32. arXiv:2006.11834  [pdf, other]

    cs.CL

    AdvAug: Robust Adversarial Augmentation for Neural Machine Translation

    Authors: Yong Cheng, Lu Jiang, Wolfgang Macherey, Jacob Eisenstein

    Abstract: In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, of which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. W…

    Submitted 2 July, 2020; v1 submitted 21 June, 2020; originally announced June 2020.

    Comments: published at ACL2020

  33. arXiv:2005.00181  [pdf, other]

    cs.CL

    Sparse, Dense, and Attentional Representations for Text Retrieval

    Authors: Yi Luan, Jacob Eisenstein, Kristina Toutanova, Michael Collins

    Abstract: Dual encoders perform retrieval by encoding documents and queries into dense low-dimensional vectors, scoring each document by its inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks. Using both theoretical and empirical analysis, we establish connections between the encoding dimension, the margin betw…

    Submitted 16 February, 2021; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: To appear in TACL 2020. The arXiv version is a pre-MIT Press publication version
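
    A minimal sketch of dual-encoder scoring as described in the abstract: documents are pre-encoded into a matrix, and each is scored by its inner product with the query vector (the encoders themselves are omitted):

        import numpy as np

        def retrieve(query_vec, doc_matrix, k=5):
            scores = doc_matrix @ query_vec   # one inner product per document
            return np.argsort(-scores)[:k]    # indices of the top-k documents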

  34. arXiv:1909.08784  [pdf, other]

    cs.CL cs.SI

    Characterizing Collective Attention via Descriptor Context: A Case Study of Public Discussions of Crisis Events

    Authors: Ian Stewart, Diyi Yang, Jacob Eisenstein

    Abstract: Social media datasets make it possible to rapidly quantify collective attention to emerging topics and breaking news, such as crisis events. Collective attention is typically measured by aggregate counts, such as the number of posts that mention a name or hashtag. But according to rationalist models of natural language communication, the collective salience of each entity will be expressed not onl…

    Submitted 31 March, 2020; v1 submitted 18 September, 2019; originally announced September 2019.

    Comments: ICWSM 2020

    ACM Class: H.5.3; I.2.7

  35. arXiv:1909.04189  [pdf, other]

    cs.CL cs.SI physics.soc-ph

    Follow the Leader: Documents on the Leading Edge of Semantic Change Get More Citations

    Authors: Sandeep Soni, Kristina Lerman, Jacob Eisenstein

    Abstract: Diachronic word embeddings -- vector representations of words over time -- offer remarkable insights into the evolution of language and provide a tool for quantifying sociocultural change from text documents. Prior work has used such embeddings to identify shifts in the meaning of individual words. However, simply knowing that a word has changed in meaning is insufficient to identify the instances…

    Submitted 1 October, 2020; v1 submitted 9 September, 2019; originally announced September 2019.

    Comments: 25 pages, 3 figures, To appear in the Journal of the Association of Information Sciences and Technology

  36. How we do things with words: Analyzing text as social and cultural data

    Authors: Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, Jane Winters

    Abstract: In this article we describe our experiences with computational text analysis. We hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concepts. Our guidance is based on our own experiences an…

    Submitted 2 July, 2019; originally announced July 2019.

    Journal ref: Front. Artif. Intell. 3:62 (2020)

  37. arXiv:1906.03380  [pdf, other]

    cs.CL

    Clinical Concept Extraction for Document-Level Coding

    Authors: Sarah Wiegreffe, Edward Choi, Sherry Yan, Jimeng Sun, Jacob Eisenstein

    Abstract: The text of clinical notes can be a valuable source of patient information and clinical assessments. Historically, the primary approach for exploiting clinical notes has been information extraction: linking spans of text to concepts in a detailed domain ontology. However, recent work has demonstrated the potential of supervised machine learning to extract document-level codes directly from the raw…

    Submitted 7 June, 2019; originally announced June 2019.

    Comments: ACL BioNLP workshop (2019)

  38. arXiv:1904.02817  [pdf, other]

    cs.CL cs.DL cs.LG

    Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling

    Authors: Xiaochuang Han, Jacob Eisenstein

    Abstract: Contextualized word embeddings such as ELMo and BERT provide a foundation for strong performance across a wide range of natural language processing tasks by pretraining on large corpora of unlabeled text. However, the applicability of this approach is unknown when the target domain varies substantially from the pretraining corpus. We are specifically interested in the scenario in which labeled dat…

    Submitted 4 September, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

    Comments: EMNLP 2019

  39. arXiv:1903.05041  [pdf, other]

    cs.CL

    Character Eyes: Seeing Language through Character-Level Taggers

    Authors: Yuval Pinter, Marc Marone, Jacob Eisenstein

    Abstract: Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations. In one popular architecture, character-level LSTMs are used to feed token representations into a sequence tagger predicting token-level annotations such as part-of-speech (POS) tags. In this work, we examine the behavior of POS…

    Submitted 12 March, 2019; originally announced March 2019.

  40. arXiv:1902.01541  [pdf, other]

    cs.CL cs.LG

    The Referential Reader: A Recurrent Entity Network for Anaphora Resolution

    Authors: Fei Liu, Luke Zettlemoyer, Jacob Eisenstein

    Abstract: We present a new architecture for storing and accessing entity mentions during online text processing. While reading the text, entity references are identified, and may be stored by either updating or overwriting a cell in a fixed-length memory. The update operation implies coreference with the other mentions that are stored in the same cell; the overwrite operation causes these mentions to be for…

    Submitted 9 July, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: Published at the 57th Annual Meeting of the Association for Computational Linguistics (ACL) 2019. Source code available at: https://github.com/liufly/refreader

  41. arXiv:1902.01509  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation

    Authors: Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad

    Abstract: We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and character-level encoders, but these methods are highly sensitive to spelling mistakes. We show how training on a mild amount of random synthetic noise can dramatica…

    Submitted 4 February, 2019; originally announced February 2019.
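
    A minimal sketch of mild synthetic character noise in the spirit of the abstract (random deletions and transpositions as a stand-in for natural typos; the truncation does not show the paper's exact noise distribution):

        import random

        def add_char_noise(sentence, p=0.05):
            out, chars, i = [], list(sentence), 0
            while i < len(chars):
                r = random.random()
                if r < p / 2 and i + 1 < len(chars):   # transpose neighbors
                    out.extend([chars[i + 1], chars[i]])
                    i += 2
                elif r < p:                            # delete a character
                    i += 1
                else:
                    out.append(chars[i])
                    i += 1
            return "".join(out)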

  42. arXiv:1809.06951  [pdf, other]

    cs.CL cs.SI

    Mind Your POV: Convergence of Articles and Editors Towards Wikipedia's Neutrality Norm

    Authors: Umashanthi Pavalanathan, Xiaochuang Han, Jacob Eisenstein

    Abstract: Wikipedia has a strong norm of writing in a 'neutral point of view' (NPOV). Articles that violate this norm are tagged, and editors are encouraged to make corrections. But the impact of this tagging system has not been quantitatively measured. Does NPOV tagging help articles to converge to the desired style? Do NPOV corrections encourage editors to adopt this style? We study these questions using…

    Submitted 18 September, 2018; originally announced September 2018.

    Comments: ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), 2018

    Journal ref: Umashanthi Pavalanathan, Xiaochuang Han, and Jacob Eisenstein. 2018. Mind Your POV: Convergence of Articles and Editors Towards Wikipedia's Neutrality Norm. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 137 (November 2018)

  43. arXiv:1808.08644  [pdf, ps, other]

    cs.CL

    Predicting Semantic Relations using Global Graph Properties

    Authors: Yuval Pinter, Jacob Eisenstein

    Abstract: Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers. On the local level, individual relations between synsets (semantic building blocks) such as hypernymy and meronymy enhance our understanding of the words used to express their meanings. Globally, analysis of graph-theoretic properties of the entire net sheds light on the structure of human…

    Submitted 26 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018

  44. arXiv:1804.07331  [pdf, other]

    cs.CL cs.AI

    Stylistic Variation in Social Media Part-of-Speech Tagging

    Authors: Murali Raghu Babu Balusu, Taha Merghani, Jacob Eisenstein

    Abstract: Social media features substantial stylistic variation, raising new challenges for syntactic analysis of online writing. However, this variation is often aligned with author attributes such as age, gender, and geography, as well as more readily-available social network metadata. In this paper, we report new evidence on the link between language and social networks in the task of part-of-speech tagg…

    Submitted 19 April, 2018; originally announced April 2018.

    Comments: 9 pages, Published in Proceedings of NAACL workshop on stylistic variation (2018)

  45. arXiv:1804.05088  [pdf, ps, other]

    cs.CL cs.SI

    Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media

    Authors: Ian Stewart, Yuval Pinter, Jacob Eisenstein

    Abstract: Political identity is often manifested in language variation, but the relationship between the two is still relatively unexplored from a quantitative perspective. This study examines the use of Catalan, a language local to the semi-autonomous region of Catalonia in Spain, on Twitter in discourse related to the 2017 independence referendum. We corroborate prior findings that pro-independence tweets…

    Submitted 13 April, 2018; originally announced April 2018.

    Comments: NAACL 2018

  46. arXiv:1802.06138  [pdf, other]

    cs.SI cs.LG physics.soc-ph

    Detecting Social Influence in Event Cascades by Comparing Discriminative Rankers

    Authors: Sandeep Soni, Shawn Ling Ramirez, Jacob Eisenstein

    Abstract: The global dynamics of event cascades are often governed by the local dynamics of peer influence. However, detecting social influence from observational data is challenging due to confounds like homophily and practical issues like missing data. We propose a simple discriminative method to detect influence from observational data. The core of the approach is to train a ranking algorithm to predict…

    Submitted 19 July, 2019; v1 submitted 16 February, 2018; originally announced February 2018.

    Comments: Accepted to the SIGKDD Workshop on Causal Discovery, 2019

  47. arXiv:1802.05695  [pdf, other]

    cs.CL cs.LG stat.ML

    Explainable Prediction of Medical Codes from Clinical Text

    Authors: James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, Jacob Eisenstein

    Abstract: Clinical notes are text documents that are created by clinicians for each patient encounter. They are typically accompanied by medical codes, which describe the diagnosis and treatment. Annotating these codes is labor intensive and error prone; furthermore, the connection between the codes and the text is not annotated, obscuring the reasons and details behind specific diagnoses and treatments. We…

    Submitted 16 April, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

    Comments: NAACL 2018

  48. arXiv:1802.04140   

    cs.CL

    Making "fetch" happen: The influence of social and linguistic context on nonstandard word growth and decline

    Authors: Ian Stewart, Jacob Eisenstein

    Abstract: In an online community, new words come and go: today's "haha" may be replaced by tomorrow's "lol." Changes in online writing are usually studied as a social process, with innovations diffusing through a network of individuals in a speech community. But unlike other types of innovation, language change is shaped and constrained by the system in which it takes part. To investigate the links between…

    Submitted 13 February, 2018; v1 submitted 9 February, 2018; originally announced February 2018.

    Comments: replaced by arXiv:1709.00345

    ACM Class: I.2.7

  49. arXiv:1712.01411  [pdf, other]

    cs.CL cs.SI

    #anorexia, #anarexia, #anarexyia: Characterizing Online Community Practices with Orthographic Variation

    Authors: Ian Stewart, Stevie Chancellor, Munmun De Choudhury, Jacob Eisenstein

    Abstract: Distinctive linguistic practices help communities build solidarity and differentiate themselves from outsiders. In an online community, one such practice is variation in orthography, which includes spelling, punctuation, and capitalization. Using a dataset of over two million Instagram posts, we investigate orthographic variation in a community that shares pro-eating disorder (pro-ED) content. We…

    Submitted 4 December, 2017; originally announced December 2017.

  50. arXiv:1709.00345  [pdf, other]

    cs.CL cs.SI physics.soc-ph

    Making "fetch" happen: The influence of social and linguistic context on nonstandard word growth and decline

    Authors: Ian Stewart, Jacob Eisenstein

    Abstract: In an online community, new words come and go: today's "haha" may be replaced by tomorrow's "lol." Changes in online writing are usually studied as a social process, with innovations diffusing through a network of individuals in a speech community. But unlike other types of innovation, language change is shaped and constrained by the system in which it takes part. To investigate the links between…

    Submitted 31 August, 2018; v1 submitted 1 September, 2017; originally announced September 2017.

    ACM Class: I.2.7

    Journal ref: EMNLP 2018