这是indexloc提供的服务,不要输入任何密码
Skip to main content

Showing 1–9 of 9 results for author: Noy, N

Searching in archive cs. Search in all archives.
.
  1. arXiv:2507.06261  [pdf, ps, other

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu , et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde… ▽ More

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  2. arXiv:2503.19786  [pdf, other

    cs.CL cs.AI

    Gemma 3 Technical Report

    Authors: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin , et al. (191 additional authors not shown)

    Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achie… ▽ More

    Submitted 25 March, 2025; originally announced March 2025.

  3. arXiv:2501.13779  [pdf, ps, other

    cs.LG cs.AI

    Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling

    Authors: Tanya Rodchenko, Natasha Noy, Nino Scherrer

    Abstract: While Large Language Models require more and more data to train and scale, rather than looking for any data to acquire, we should consider what types of tasks are more likely to benefit from data scaling. We should be intentional in our data acquisition. We argue that the shape of the data itself, such as its compositional and structural patterns, informs which tasks to prioritize in data scaling,… ▽ More

    Submitted 3 June, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  4. arXiv:2408.14636  [pdf, other

    cs.IR cs.CL cs.HC cs.LG

    Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web

    Authors: Kate Lin, Tarfah Alrashed, Natasha Noy

    Abstract: The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of user… ▽ More

    Submitted 26 August, 2024; originally announced August 2024.

  5. arXiv:2311.13028  [pdf, other

    cs.LG cs.AI cs.DC eess.SP

    DMLR: Data-centric Machine Learning Research -- Past, Present and Future

    Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš , et al. (13 additional authors not shown)

    Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow… ▽ More

    Submitted 1 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) at https://data.mlr.press/assets/pdf/v01-5.pdf

  6. How Developers Choose Names

    Authors: Dror G. Feitelson, Ayelet Mizrahi, Nofar Noy, Aviad Ben Shabat, Or Eliyahu, Roy Sheffer

    Abstract: The names of variables and functions serve as implicit documentation and are instrumental for program comprehension. But choosing good meaningful names is hard. We perform a sequence of experiments in which a total of 334 subjects are required to choose names in given programming scenarios. The first experiment shows that the probability that two developers would select the same name is low: in th… ▽ More

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: 17 pages (incl. appendix), 10 figures, 3 tables. Accepted to IEEE Trans. Software Engineering

    ACM Class: D.2.3

  7. arXiv:2006.06894  [pdf, other

    cs.IR cs.DB

    Google Dataset Search by the Numbers

    Authors: Omar Benjelloun, Shiyu Chen, Natasha Noy

    Abstract: Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata -- expressed using schema.org and similar vocabularies -- from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. Thus, this corp… ▽ More

    Submitted 11 June, 2020; originally announced June 2020.

  8. arXiv:1407.2002  [pdf, ps, other

    cs.SI cs.AI cs.DL physics.data-an

    Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains

    Authors: Simon Walk, Philipp Singer, Markus Strohmaier, Tania Tudorache, Mark A. Musen, Natalya F. Noy

    Abstract: Biomedical taxonomies, thesauri and ontologies in the form of the International Classification of Diseases (ICD) as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For… ▽ More

    Submitted 29 February, 2016; v1 submitted 8 July, 2014; originally announced July 2014.

    Comments: Published in the Journal of Biomedical Informatics

  9. How to Apply Markov Chains for Modeling Sequential Edit Patterns in Collaborative Ontology-Engineering Projects

    Authors: Simon Walk, Philipp Singer, Markus Strohmaier, Denis Helic, Natalya F. Noy, Mark Musen

    Abstract: With the growing popularity of large-scale collaborative ontology-engineering projects, such as the creation of the 11th revision of the International Classification of Diseases, we need new methods and insights to help project- and community-managers to cope with the constantly growing complexity of such projects. In this paper, we present a novel application of Markov chains to model sequential… ▽ More

    Submitted 16 February, 2016; v1 submitted 5 March, 2014; originally announced March 2014.