
Showing 1–21 of 21 results for author: Pinto, A S

Searching in archive cs.
  1. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu , et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  2. arXiv:2507.04858  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Towards Human-in-the-Loop Onset Detection: A Transfer Learning Approach for Maracatu

    Authors: António Sá Pinto

    Abstract: We explore transfer learning strategies for musical onset detection in the Afro-Brazilian Maracatu tradition, which features complex rhythmic patterns that challenge conventional models. We adapt two Temporal Convolutional Network architectures: one pre-trained for onset detection (intra-task) and another for beat tracking (inter-task). Using only 5-second annotated snippets per instrument, we fin…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Accepted at ISMIR 2025

  3. arXiv:2503.19786  [pdf, other]

    cs.CL cs.AI

    Gemma 3 Technical Report

    Authors: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin , et al. (191 additional authors not shown)

    Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achie…

    Submitted 25 March, 2025; originally announced March 2025.

  4. arXiv:2412.15129  [pdf, other]

    cs.CV cs.AI cs.LG

    Jet: A Modern Transformer-Based Normalizing Flow

    Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen

    Abstract: In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute log-likelihood of the input data, fast generation and simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as visual quality of the samples was…

    Submitted 19 December, 2024; originally announced December 2024.
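    The efficient log-likelihood computation this abstract refers to comes from the change-of-variables formula. A minimal sketch with a single toy affine coupling layer — the linear "networks" `W_s`/`W_t`, the sizes, and all names are illustrative assumptions, not taken from the paper:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for the coupling networks: single linear maps (assumption).
    W_s = rng.normal(scale=0.1, size=(2, 2))
    W_t = rng.normal(scale=0.1, size=(2, 2))

    def forward(x):
        # Affine coupling: keep x1, transform x2 conditioned on x1.
        x1, x2 = x[:2], x[2:]
        s, t = W_s @ x1, W_t @ x1
        y2 = x2 * np.exp(s) + t
        logdet = s.sum()  # log|det J| of this triangular map
        return np.concatenate([x1, y2]), logdet

    def log_likelihood(x):
        # log p(x) = log p_base(f(x)) + log|det df/dx|, base = standard normal
        z, logdet = forward(x)
        log_base = -0.5 * (z @ z + len(z) * np.log(2 * np.pi))
        return log_base + logdet

    x = rng.normal(size=4)
    print(log_likelihood(x))
    ```

    The triangular structure of the coupling is what makes the determinant a cheap sum rather than a full Jacobian computation.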

  5. arXiv:2412.03555  [pdf, other]

    cs.CV

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai

    Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broa…

    Submitted 4 December, 2024; originally announced December 2024.

  6. arXiv:2411.19722  [pdf, other]

    cs.LG cs.AI cs.CV

    JetFormer: An Autoregressive Generative Model of Raw Images and Text

    Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov

    Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder…

    Submitted 19 May, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

    Comments: ICLR 2025. Code available at https://github.com/google-research/big_vision

  7. arXiv:2407.07726  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    PaliGemma: A versatile 3B VLM for transfer

    Authors: Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer , et al. (10 additional authors not shown)

    Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more…

    Submitted 10 October, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: v2 adds Appendix H and I and a few citations

  8. arXiv:2403.19596  [pdf, other]

    cs.CV

    LocCa: Visual Pretraining with Location-aware Captioners

    Authors: Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai

    Abstract: Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface, to teach a model to read…

    Submitted 11 November, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

  9. arXiv:2401.06790  [pdf, other]

    cs.CL cs.AI

    Using Zero-shot Prompting in the Automatic Creation and Expansion of Topic Taxonomies for Tagging Retail Banking Transactions

    Authors: Daniel de S. Moraes, Pedro T. C. Santos, Polyana B. da Costa, Matheus A. S. Pinto, Ivan de J. P. Pinto, Álvaro M. G. da Veiga, Sergio Colcher, Antonio J. G. Busson, Rafael H. Rocha, Rennan Gaio, Rafael Miceli, Gabriela Tourinho, Marcos Rabaioli, Leandro Santos, Fellipe Marques, David Favaro

    Abstract: This work presents an unsupervised method for automatically constructing and expanding topic taxonomies using instruction-based fine-tuned LLMs (Large Language Models). We apply topic modeling and keyword extraction techniques to create initial topic taxonomies and LLMs to post-process the resulting terms and create a hierarchy. To expand an existing taxonomy with new terms, we use zero-shot promp…

    Submitted 11 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

  10. arXiv:2303.17376  [pdf, other]

    cs.CV cs.AI cs.LG

    A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

    Authors: Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai

    Abstract: There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions regarding design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answer…

    Submitted 30 March, 2023; originally announced March 2023.

  11. arXiv:2302.08242  [pdf, other]

    cs.CV

    Tuning computer vision models with task rewards

    Authors: André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai

    Abstract: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 11 pages

  12. arXiv:2205.10337  [pdf, other]

    cs.CV

    UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

    Authors: Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby

    Abstract: We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feed-forward) which is trained to directly predict raw vision outputs, guided by a…

    Submitted 14 October, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

    Comments: 22 pages. Accepted at NeurIPS 2022

  13. arXiv:2202.12015  [pdf, other]

    cs.CV cs.LG

    Learning to Merge Tokens in Vision Transformers

    Authors: Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme

    Abstract: Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In order for large-scale models to remain practical in real-world systems, there is a need for reducing their computational overhead. In this work, we present the Patc…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: 11 pages, 9 figures
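    The overhead reduction this abstract describes comes from shrinking the token count mid-network. A generic sketch of learned token merging — M output tokens formed as softmax-weighted mixtures of the N input tokens; the scoring matrix `W` and all sizes are illustrative assumptions, not the module from the paper:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N, M, D = 16, 4, 8  # tokens in, tokens out, embed dim (toy sizes)
    W = rng.normal(scale=D ** -0.5, size=(D, M))  # learned scoring matrix (assumption)

    def merge_tokens(X):
        # Score each input token against each of the M output slots, then
        # build each output token as a softmax-weighted mix of the inputs.
        scores = X @ W                                   # (N, M)
        attn = np.exp(scores - scores.max(axis=0, keepdims=True))
        attn = attn / attn.sum(axis=0, keepdims=True)    # normalise over inputs
        return attn.T @ X                                # (M, D)

    X = rng.normal(size=(N, D))
    Y = merge_tokens(X)
    print(Y.shape)  # (4, 8)
    ```

    Every layer after the merge then operates on M tokens instead of N, which is where the compute savings come from.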

  14. arXiv:2106.05974  [pdf, other]

    cs.CV cs.LG stat.ML

    Scaling Vision with Sparse Mixture of Experts

    Authors: Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby

    Abstract: Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When app…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: 44 pages, 38 figures
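    The "sparse" in this abstract means each input activates only a few experts rather than every parameter. A toy top-K routing sketch — the gating matrix, expert shapes, and sizes are illustrative assumptions, not the V-MoE implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    D, E, K = 8, 4, 1  # embed dim, number of experts, experts used per token (toy)
    W_gate = rng.normal(size=(D, E))
    experts = [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(E)]

    def moe_layer(x):
        # Router scores all experts but only the top-K actually run,
        # so compute grows with K, not with the total expert count E.
        logits = x @ W_gate
        top = np.argsort(logits)[-K:]
        gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
        return sum(g * (experts[e] @ x) for g, e in zip(gates, top))

    x = rng.normal(size=D)
    y = moe_layer(x)
    print(y.shape)  # (8,)
    ```

    This is why MoE models can grow total parameter count far beyond what a dense network of the same per-token FLOP budget could afford.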

  15. arXiv:2011.01637  [pdf, other]

    cs.SD cs.IR

    Shift If You Can: Counting and Visualising Correction Operations for Beat Tracking Evaluation

    Authors: A. Sá Pinto, I. Domingues, M. E. P. Davies

    Abstract: In this late-breaking abstract we propose a modified approach for beat tracking evaluation which poses the problem in terms of the effort required to transform a sequence of beat detections such that they maximise the well-known F-measure calculation when compared to a sequence of ground truth annotations. Central to our approach is the inclusion of a shifting operation conducted over an additiona…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: ISMIR 2020 Late Breaking/Demo
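    The F-measure calculation this abstract builds on matches each detection to an unmatched annotation within a tolerance window. A minimal sketch of that baseline metric — the 70 ms tolerance is a common convention in beat-tracking evaluation, assumed here rather than taken from the abstract:

    ```python
    def beat_f_measure(detections, annotations, tol=0.07):
        # A detection is a hit if it lies within +/- tol seconds of an
        # annotation that has not already been matched (greedy matching).
        hits, used = 0, set()
        for d in detections:
            for i, a in enumerate(annotations):
                if i not in used and abs(d - a) <= tol:
                    hits += 1
                    used.add(i)
                    break
        p = hits / len(detections) if detections else 0.0
        r = hits / len(annotations) if annotations else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # 3 of 3 detections hit, 3 of 4 annotations recalled -> F = 6/7
    print(beat_f_measure([0.5, 1.0, 1.52], [0.5, 1.0, 1.5, 2.0]))
    ```

    The paper's proposed evaluation then asks how many correction operations (including a shift) are needed before this score is maximised, rather than reporting the raw score alone.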

  16. arXiv:2010.06866  [pdf, other]

    cs.LG cs.CV stat.ML

    Deep Ensembles for Low-Data Transfer Learning

    Authors: Basil Mustafa, Carlos Riquelme, Joan Puigcerver, André Susano Pinto, Daniel Keysers, Neil Houlsby

    Abstract: In the low-data regime, it is difficult to train good supervised models from scratch. Instead practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for tra…

    Submitted 19 October, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

  17. arXiv:2010.06402  [pdf, other]

    cs.LG cs.CV

    Which Model to Transfer? Finding the Needle in the Growing Haystack

    Authors: Cedric Renggli, André Susano Pinto, Luka Rimanic, Joan Puigcerver, Carlos Riquelme, Ce Zhang, Mario Lucic

    Abstract: Transfer learning has been recently popularized as a data-efficient alternative to training models from scratch, in particular for computer vision tasks where it provides a remarkably solid baseline. The emergence of rich model repositories, such as TensorFlow Hub, enables the practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these r…

    Submitted 25 March, 2022; v1 submitted 13 October, 2020; originally announced October 2020.

  18. arXiv:2010.00332  [pdf, other]

    cs.CV cs.LG

    Training general representations for remote sensing using in-domain knowledge

    Authors: Maxim Neumann, André Susano Pinto, Xiaohua Zhai, Neil Houlsby

    Abstract: Automatically finding good and general remote sensing representations allows to perform transfer learning on a wide range of applications - improving the accuracy and reducing the required number of training samples. This paper investigates development of generic remote sensing representations, and explores which characteristics are important for a dataset to be a good source for representation le…

    Submitted 30 September, 2020; originally announced October 2020.

    Comments: Accepted at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2020. arXiv admin note: substantial text overlap with arXiv:1911.06721

  19. arXiv:2009.13239  [pdf, other]

    cs.LG cs.CV stat.ML

    Scalable Transfer Learning with Expert Models

    Authors: Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Cedric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, Neil Houlsby

    Abstract: Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploit…

    Submitted 28 September, 2020; originally announced September 2020.

  20. arXiv:1911.06721  [pdf, other]

    cs.CV

    In-domain representation learning for remote sensing

    Authors: Maxim Neumann, Andre Susano Pinto, Xiaohua Zhai, Neil Houlsby

    Abstract: Given the importance of remote sensing, surprisingly little attention has been paid to it by the representation learning community. To address it and to establish baselines and a common evaluation protocol in this domain, we provide simplified access to 5 diverse remote sensing datasets in a standardized form. Specifically, we investigate in-domain representation learning to develop generic remote…

    Submitted 15 November, 2019; originally announced November 2019.

  21. arXiv:1910.04867  [pdf, other]

    cs.CV cs.LG stat.ML

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    Authors: Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby

    Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, r…

    Submitted 21 February, 2020; v1 submitted 1 October, 2019; originally announced October 2019.