
Showing 1–44 of 44 results for author: Kolesnikov, A

Searching in archive cs.
  1. arXiv:2507.05249  [pdf, ps, other]

    cs.CV cond-mat.str-el cs.LG physics.data-an

    Physics-Guided Dual Implicit Neural Representations for Source Separation

    Authors: Yuan Ni, Zhantao Chen, Alexander N. Petsch, Edmund Xu, Cheng Peng, Alexander I. Kolesnikov, Sugata Chowdhury, Arun Bansil, Jana B. Thayer, Joshua J. Turner

    Abstract: Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions--such as background and signal distortions--that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2503.19786  [pdf, other]

    cs.CL cs.AI

    Gemma 3 Technical Report

    Authors: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin , et al. (191 additional authors not shown)

    Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achie…

    Submitted 25 March, 2025; originally announced March 2025.

  3. arXiv:2412.15129  [pdf, other]

    cs.CV cs.AI cs.LG

    Jet: A Modern Transformer-Based Normalizing Flow

    Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen

    Abstract: In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute log-likelihood of the input data, fast generation and simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as visual quality of the samples was…

    Submitted 19 December, 2024; originally announced December 2024.
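    The cheap exact log-likelihood this abstract refers to comes from the change-of-variables formula: when every layer is invertible with a tractable Jacobian log-determinant, the data likelihood is exact. Below is a minimal sketch of one affine coupling layer, a generic flow building block (not necessarily Jet's actual transformer-based design; `scale_net` and `shift_net` are stand-ins for learned networks):

```python
import numpy as np

def affine_coupling_forward(x, scale_net, shift_net):
    """One affine coupling step: split features, transform one half
    conditioned on the other. The Jacobian of this map is triangular,
    so its log-determinant is just the sum of the log-scales -- this is
    what makes exact log-likelihood cheap in normalizing flows."""
    x1, x2 = np.split(x, 2, axis=-1)
    log_s = scale_net(x1)          # predicted log-scale for x2
    t = shift_net(x1)              # predicted shift for x2
    y2 = x2 * np.exp(log_s) + t
    y = np.concatenate([x1, y2], axis=-1)
    log_det = log_s.sum(axis=-1)   # per-example log |det J|
    return y, log_det

def affine_coupling_inverse(y, scale_net, shift_net):
    """Exact inverse: x1 passes through unchanged, so the same
    conditioning values can be recomputed to undo the transform."""
    y1, y2 = np.split(y, 2, axis=-1)
    log_s = scale_net(y1)
    t = shift_net(y1)
    x2 = (y2 - t) * np.exp(-log_s)
    return np.concatenate([y1, x2], axis=-1)
```

    Because the log-determinant falls out of the forward pass for free, training maximizes exact log-likelihood at no extra cost over a plain forward evaluation.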

  4. arXiv:2411.19722  [pdf, other]

    cs.LG cs.AI cs.CV

    JetFormer: An Autoregressive Generative Model of Raw Images and Text

    Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov

    Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder…

    Submitted 19 May, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

    Comments: ICLR 2025. Code available at https://github.com/google-research/big_vision

  5. arXiv:2407.07726  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    PaliGemma: A versatile 3B VLM for transfer

    Authors: Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer , et al. (10 additional authors not shown)

    Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more…

    Submitted 10 October, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: v2 adds Appendix H and I and a few citations

  6. arXiv:2407.00503  [pdf, other]

    cs.CV

    Toward a Diffusion-Based Generalist for Dense Vision Tasks

    Authors: Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

    Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image g…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Published at CVPR 2024 as a workshop paper

  7. arXiv:2312.08077  [pdf, other]

    cs.GT math.FA

    Auctions and mass transportation

    Authors: Alexander V. Kolesnikov

    Abstract: In this survey paper we present classical and recent results relating the auction design and the optimal transportation theory. In particular, we discuss in detail the seminal result of Daskalakis, Deckelbaum and Tzamos \cite{DDT} about duality between auction design with $1$ bidder and the weak transportation problem. Later investigations revealed the connection of multi-bidder case to the Beckm…

    Submitted 4 June, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  8. arXiv:2310.09199  [pdf, other]

    cs.CV

    PaLI-3 Vision Language Models: Smaller, Faster, Stronger

    Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

    Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classific…

    Submitted 17 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

  9. arXiv:2305.18565  [pdf, other]

    cs.CV cs.CL cs.LG

    PaLI-X: On Scaling up a Multilingual Vision and Language Model

    Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic , et al. (18 additional authors not shown)

    Abstract: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-sh…

    Submitted 29 May, 2023; originally announced May 2023.

  10. arXiv:2305.13035  [pdf, other]

    cs.CV cs.LG

    Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

    Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer

    Abstract: Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size…

    Submitted 9 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 10 pages, 7 figures, 9 tables. Version 2: Layout fixes

    ACM Class: I.2.10; I.2.6

    Journal ref: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  11. arXiv:2304.03949  [pdf, other]

    cond-mat.str-el cs.AI cs.CV physics.data-an

    Capturing dynamical correlations using implicit neural representations

    Authors: Sathya Chitturi, Zhurun Ji, Alexander Petsch, Cheng Peng, Zhantao Chen, Rajan Plumley, Mike Dunne, Sougata Mardanya, Sugata Chowdhury, Hongwei Chen, Arun Bansil, Adrian Feiguin, Alexander Kolesnikov, Dharmalingam Prabhakaran, Stephen Hayden, Daniel Ratner, Chunjing Jia, Youssef Nashed, Joshua Turner

    Abstract: The observation and description of collective excitations in solids is a fundamental issue when seeking to understand the physics of a many-body system. Analysis of these excitations is usually carried out by measuring the dynamical structure factor, S(Q, $ω$), with inelastic neutron or x-ray scattering techniques and comparing this against a calculated dynamical model. Here, we develop an artific…

    Submitted 8 April, 2023; originally announced April 2023.

    Comments: 12 pages, 7 figures

  12. arXiv:2303.17376  [pdf, other]

    cs.CV cs.AI cs.LG

    A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

    Authors: Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai

    Abstract: There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions regarding design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answer…

    Submitted 30 March, 2023; originally announced March 2023.

  13. arXiv:2303.15343  [pdf]

    cs.CV cs.AI

    Sigmoid Loss for Language Image Pre-Training

    Authors: Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer

    Abstract: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller b…

    Submitted 27 September, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: ICCV'23 Oral. arXiv v2: fix typo in pseudocode; v3: clarify t vs t' init; v4: add SigLIP Base, Large, Shape-Optimized 400M results. Models released at: https://github.com/google-research/big_vision. Xiaohua and Lucas contributed equally
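    The abstract's key point, that every image-text pair becomes an independent binary classification problem, can be sketched following the paper's published pseudocode (here the learned temperature `t` and bias `b` are frozen to fixed illustrative values rather than trained):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss in the spirit of SigLIP. Matching pairs
    (the diagonal of the similarity matrix) are positives, all other
    pairs negatives -- no softmax over the batch, so no global
    normalization across devices is needed."""
    # cosine similarities between all image/text pairs in the batch
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0           # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit), summed over pairs, averaged per example
    pairwise = np.log1p(np.exp(-labels * logits))
    return pairwise.sum() / n
```

    A batch with correctly matched pairs scores a lower loss than the same embeddings with mismatched pairings, which is the only signal the objective needs.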

  14. arXiv:2302.08242  [pdf, other]

    cs.CV

    Tuning computer vision models with task rewards

    Authors: André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai

    Abstract: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 11 pages

  15. arXiv:2302.05442  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Vision Transformers to 22 Billion Parameters

    Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver , et al. (17 additional authors not shown)

    Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al…

    Submitted 10 February, 2023; originally announced February 2023.

  16. arXiv:2212.08013  [pdf, other]

    cs.CV cs.AI cs.LG

    FlexiViT: One Model for All Patch Sizes

    Authors: Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

    Abstract: Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of w…

    Submitted 23 March, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Code and pre-trained models available at https://github.com/google-research/big_vision. All authors made significant technical contributions. CVPR 2023
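    The speed/accuracy tradeoff the abstract describes follows from how patchification sets the sequence length; a sketch of just the patch slicing (not the paper's full recipe, which also resizes the patch-embedding weights across patch sizes):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into a sequence of flattened patches.
    Sampling patch_size per training step is the core FlexiViT idea:
    one set of weights then works at many patch sizes, i.e. many
    sequence lengths, at inference time."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)
```

    The same 32x32 image yields 16 tokens at patch size 8 but a single token at patch size 32: sequence length, and hence compute, shrinks quadratically as the patch grows.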

  17. arXiv:2211.09862  [pdf, other]

    q-bio.GN cs.LG

    Knowledge distillation for fast and accurate DNA sequence correction

    Authors: Anastasiya Belyaeva, Joel Shor, Daniel E. Cook, Kishwar Shafin, Daniel Liu, Armin Töpfer, Aaron M. Wenger, William J. Rowell, Howard Yang, Alexey Kolesnikov, Cory Y. McLean, Maria Nattestad, Andrew Carroll, Pi-Chuan Chang

    Abstract: Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus - a distilled transformer-encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled D…

    Submitted 17 November, 2022; originally announced November 2022.

    Journal ref: Learning Meaningful Representations of Life, NeurIPS 2022 workshop oral paper

  18. arXiv:2209.06794  [pdf, other]

    cs.CV cs.CL

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner , et al. (4 additional authors not shown)

    Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaL…

    Submitted 5 June, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: ICLR 2023 (Notable-top-5%)

  19. arXiv:2205.10337  [pdf, other]

    cs.CV

    UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

    Authors: Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby

    Abstract: We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feed-forward) which is trained to directly predict raw vision outputs, guided by a…

    Submitted 14 October, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

    Comments: 22 pages. Accepted at NeurIPS 2022

  20. arXiv:2205.01580  [pdf, other]

    cs.CV

    Better plain ViT baselines for ImageNet-1k

    Authors: Lucas Beyer, Xiaohua Zhai, Alexander Kolesnikov

    Abstract: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT mo…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: Code available at https://github.com/google-research/big_vision

  21. arXiv:2203.06837  [pdf, other]

    econ.TH cs.GT math.FA math.OC

    Beckmann's approach to multi-item multi-bidder auctions

    Authors: Alexander V. Kolesnikov, Fedor Sandomirskiy, Aleh Tsyvinski, Alexander P. Zimin

    Abstract: We consider the problem of revenue-maximizing Bayesian auction design with several bidders having independent private values over several items. We show that it can be reduced to the problem of continuous optimal transportation introduced by Beckmann (1952) where the optimal transportation flow generalizes the concept of ironed virtual valuations to the multi-item setting. We establish the strong…

    Submitted 6 September, 2022; v1 submitted 13 March, 2022; originally announced March 2022.

  22. arXiv:2111.07991  [pdf, other]

    cs.CV cs.CL cs.LG

    LiT: Zero-Shot Transfer with Locked-image text Tuning

    Authors: Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer

    Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good rep…

    Submitted 22 June, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: Xiaohua, Xiao, Basil, Andreas and Lucas contributed equally; CVPR 2022

  23. arXiv:2106.10270  [pdf, other]

    cs.CV cs.AI cs.LG

    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

    Authors: Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer

    Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugR…

    Submitted 23 June, 2022; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: Andreas, Alex, Xiaohua and Lucas contributed equally. We release more than 50'000 ViT models trained under diverse settings on various datasets. Available at https://github.com/google-research/big_vision, https://github.com/google-research/vision_transformer and https://github.com/rwightman/pytorch-image-models TMLR review at https://openreview.net/forum?id=4nPswr1KcP

    Journal ref: Transactions on Machine Learning Research (05/2022)

  24. arXiv:2106.05237  [pdf, other]

    cs.CV cs.AI cs.LG

    Knowledge distillation: A good teacher is patient and consistent

    Authors: Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov

    Abstract: There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not aim to necessarily propose a new method, but strive to identify a robu…

    Submitted 21 June, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

    Comments: Lucas, Xiaohua, Amélie, Larisa, and Alex contributed equally; CVPR 2022
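    A hedged reading of the recipe: the distillation objective itself is the standard temperature-scaled KL divergence; the paper's contribution is applying it consistently (teacher and student see the exact same augmented crop) and patiently (very long training schedules). The objective, sketched:

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax, computed stably."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on logits from matching views. With
    consistent augmentation, both sets of logits come from the same
    crop of the same image; tau softens both distributions so the
    teacher's relative class preferences carry signal."""
    p = softmax(teacher_logits, tau)
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits, tau))
    return (p * (log_p - log_q)).sum(axis=-1).mean()
```

    The loss is zero exactly when the student reproduces the teacher's distribution and strictly positive otherwise, so long schedules keep yielding gradient signal until the match is close.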

  25. arXiv:2106.04560  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Vision Transformers

    Authors: Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer

    Abstract: Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model's scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it…

    Submitted 20 June, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Xiaohua, Alex, and Lucas contributed equally; CVPR 2022

  26. arXiv:2105.01601  [pdf, other]

    cs.CV cs.AI cs.LG

    MLP-Mixer: An all-MLP Architecture for Vision

    Authors: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy

    Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-…

    Submitted 11 June, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: v2: Fixed parameter counts in Table 1. v3: Added results on JFT-3B in Figure 2(right); Added Section 3.4 on the input permutations. v4: Updated the x label in Figure 2(right)
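    The architecture the abstract describes alternates two MLPs per block, one mixing across patches and one across channels; a minimal sketch (layer norm and the paper's GELU are omitted for brevity, with ReLU as a stand-in):

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU (the paper uses GELU)."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """One MLP-Mixer block, minus layer norm: a token-mixing MLP
    applied across the patch axis (mixing spatial locations within
    each channel), then a channel-mixing MLP applied within each
    patch. No attention, no convolution. x has shape
    (patches, channels); both sub-blocks use residual connections."""
    x = x + mlp(x.T, token_w1, token_w2).T   # mix across tokens
    x = x + mlp(x, chan_w1, chan_w2)         # mix across channels
    return x
```

    The transpose before the token-mixing MLP is the whole trick: the same dense layers that mix channels, applied to the transposed tensor, mix spatial locations instead.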

  27. arXiv:2104.04191  [pdf, other]

    cs.CV cs.AI cs.LG

    SI-Score: An image dataset for fine-grained analysis of robustness to object location, rotation and size

    Authors: Jessica Yung, Rob Romijnders, Alexander Kolesnikov, Lucas Beyer, Josip Djolonga, Neil Houlsby, Sylvain Gelly, Mario Lucic, Xiaohua Zhai

    Abstract: Before deploying machine learning models it is critical to assess their robustness. In the context of deep neural networks for image understanding, changing the object location, rotation and size may affect the predictions in non-trivial ways. In this work we perform a fine-grained analysis of robustness with respect to these factors of variation using SI-Score, a synthetic dataset. In particular,…

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: 4 pages (10 pages including references and appendix), 10 figures. Accepted at the ICLR 2021 RobustML Workshop. arXiv admin note: text overlap with arXiv:2007.08558

  28. arXiv:2010.11929  [pdf, other]

    cs.CV cs.AI cs.LG

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

    Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not nece…

    Submitted 3 June, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer. ICLR camera-ready version with 2 small modifications: 1) Added a discussion of CLS vs GAP classifier in the appendix, 2) Fixed an error in exaFLOPs computation in Figure 5 and Table 6 (relative performance of models is basically not affected)
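    The title's "16x16 words" are literal: the ViT input pipeline can be sketched as slicing the image into patches, flattening and linearly projecting each one, then prepending a [CLS] token and adding position embeddings (`proj`, `cls_token` and `pos_emb` stand in for learned parameters):

```python
import numpy as np

def vit_embed(image, proj, cls_token, pos_emb, patch=16):
    """Turn an H x W x C image into the (N + 1, d) token sequence a
    Transformer encoder consumes: N flattened patches projected to
    width d, plus one prepended classification token, with learned
    position embeddings added to the whole sequence."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = (image.reshape(gh, patch, gw, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, patch * patch * c))
    tokens = patches @ proj                              # (N, d)
    tokens = np.concatenate([cls_token[None], tokens], axis=0)
    return tokens + pos_emb                              # (N + 1, d)
```

    From here on, the model is a standard Transformer encoder: nothing downstream knows the tokens came from pixels rather than words.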

  29. arXiv:2007.08558  [pdf, other]

    cs.CV cs.LG

    On Robustness and Transferability of Convolutional Neural Networks

    Authors: Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic

    Abstract: Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we study the interplay between out-of-distribution and transfer performance of m…

    Submitted 23 March, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Accepted at CVPR 2021

  30. arXiv:2006.07159  [pdf, other]

    cs.CV cs.LG

    Are we done with ImageNet?

    Authors: Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, Aäron van den Oord

    Abstract: Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accu…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: All five authors contributed equally. New labels at https://github.com/google-research/reassessed-imagenet

  31. arXiv:1912.11370  [pdf, other]

    cs.CV cs.LG

    Big Transfer (BiT): General Visual Representation Learning

    Authors: Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby

    Abstract: Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components,…

    Submitted 5 May, 2020; v1 submitted 24 December, 2019; originally announced December 2019.

    Comments: The first three authors contributed equally. Results on ObjectNet are reported in v3

  32. arXiv:1910.04867  [pdf, other]

    cs.CV cs.LG stat.ML

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    Authors: Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby

    Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, r…

    Submitted 21 February, 2020; v1 submitted 1 October, 2019; originally announced October 2019.

  33. arXiv:1905.03670  [pdf, other]

    cs.CV cs.LG

    S4L: Self-Supervised Semi-Supervised Learning

    Authors: Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, Lucas Beyer

    Abstract: This work tackles the problem of semi-supervised learning of image classifiers. Our main insight is that the field of semi-supervised learning can benefit from the quickly advancing field of self-supervised visual representation learning. Unifying these two approaches, we propose the framework of self-supervised semi-supervised learning and use it to derive two novel semi-supervised image classifi…

    Submitted 23 July, 2019; v1 submitted 9 May, 2019; originally announced May 2019.

    Comments: All four authors contributed equally

  34. arXiv:1901.09005  [pdf, other]

    cs.CV

    Revisiting Self-Supervised Visual Representation Learning

    Authors: Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer

    Abstract: Unsupervised visual representation learning remains a largely unsolved problem in computer vision research. Among a big body of recently proposed approaches for unsupervised learning of visual representations, a class of self-supervised techniques achieves superior performance on many challenging benchmarks. A large number of the pretext tasks for self-supervised learning have been studied, but ot…

    Submitted 25 January, 2019; originally announced January 2019.

    Comments: All three authors contributed equally. Code is available at https://github.com/google/revisiting-self-supervised

  35. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale

    Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari

    Abstract: We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an in…

    Submitted 21 February, 2020; v1 submitted 2 November, 2018; originally announced November 2018.

    Comments: Accepted to International Journal of Computer Vision, 2020

  36. arXiv:1807.02136  [pdf, other]

    cs.CV

    Detecting Visual Relationships Using Box Attention

    Authors: Alexander Kolesnikov, Alina Kuznetsova, Christoph H. Lampert, Vittorio Ferrari

    Abstract: We propose a new model for detecting visual relationships, such as "person riding motorcycle" or "bottle on table". This task is an important step towards comprehensive structured image understanding, going beyond detecting individual objects. Our main novelty is a Box Attention mechanism that allows modeling pairwise interactions between objects using standard object detection pipelines. The resu…

    Submitted 2 May, 2019; v1 submitted 5 July, 2018; originally announced July 2018.

  37. arXiv:1705.04258  [pdf, other

    cs.CV

    Probabilistic Image Colorization

    Authors: Amelie Royer, Alexander Kolesnikov, Christoph H. Lampert

    Abstract: We develop a probabilistic technique for colorizing grayscale natural images. In light of the intrinsic uncertainty of this task, the proposed probabilistic framework has numerous desirable properties. In particular, our model is able to produce multiple plausible and vivid colorizations for a given grayscale image and is one of the first colorization models to provide a proper stochastic sampling…

    Submitted 11 May, 2017; originally announced May 2017.

  38. arXiv:1612.08185  [pdf, other

    cs.CV

    PixelCNN Models with Auxiliary Variables for Natural Image Modeling

    Authors: Alexander Kolesnikov, Christoph H. Lampert

    Abstract: We study probabilistic models of natural images and extend the autoregressive family of PixelCNN architectures by incorporating auxiliary variables. Subsequently, we describe two new generative image models that exploit different image transformations as auxiliary variables: a quantized grayscale view of the image or a multi-resolution image pyramid. The proposed models tackle two known shortcomin…

    Submitted 1 July, 2017; v1 submitted 24 December, 2016; originally announced December 2016.

    Comments: ICML 2017

  39. arXiv:1611.07725  [pdf, other

    cs.CV cs.LG stat.ML

    iCaRL: Incremental Classifier and Representation Learning

    Authors: Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, Christoph H. Lampert

    Abstract: A major open problem on the road to artificial intelligence is the development of incrementally learning systems that learn about more and more concepts over time from a stream of data. In this work, we introduce a new training strategy, iCaRL, that allows learning in such a class-incremental way: only the training data for a small number of classes has to be present at the same time and new class…

    Submitted 14 April, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

    Comments: Accepted paper at CVPR 2017
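
    The abstract describes class-incremental learning; one concrete component of iCaRL is classifying by nearest mean of stored exemplar features. The sketch below illustrates only that classification step, assuming fixed feature vectors; the full method also learns the representation with a distillation loss and selects exemplars by herding, which are omitted here.

    ```python
    # Minimal sketch of nearest-mean-of-exemplars classification (one piece of
    # iCaRL). Feature vectors are plain lists of floats; in the real system they
    # come from a learned deep network.

    def mean_vector(vectors):
        """Component-wise mean of a list of equal-length feature vectors."""
        n = len(vectors)
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / n for i in range(dim)]

    def classify(feature, exemplar_sets):
        """Assign the class whose exemplar mean is nearest in Euclidean distance.

        exemplar_sets: dict mapping class label -> list of stored exemplar features.
        """
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        means = {label: mean_vector(ex) for label, ex in exemplar_sets.items()}
        return min(means, key=lambda label: dist2(feature, means[label]))

    # Hypothetical toy data: two classes, two exemplars each.
    exemplars = {
        "cat": [[1.0, 0.0], [0.9, 0.1]],
        "dog": [[0.0, 1.0], [0.1, 0.9]],
    }
    print(classify([0.8, 0.2], exemplars))  # -> cat
    ```

    Because new classes only require storing a small exemplar set and computing a mean, this classifier extends naturally as classes arrive over time.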

  40. arXiv:1605.05538  [pdf, other

    cs.CV

    Improving Weakly-Supervised Object Localization By Micro-Annotation

    Authors: Alexander Kolesnikov, Christoph H. Lampert

    Abstract: Weakly-supervised object localization methods tend to fail for object classes that consistently co-occur with the same background elements, e.g. trains on tracks. We propose a method to overcome these failures by adding a very small amount of model-specific additional annotation. The main idea is to cluster a deep network's mid-level representations and assign object or distractor labels to each c…

    Submitted 18 May, 2016; originally announced May 2016.

  41. arXiv:1603.06098  [pdf, other

    cs.CV

    Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation

    Authors: Alexander Kolesnikov, Christoph H. Lampert

    Abstract: We introduce a new loss function for the weakly-supervised training of semantic image segmentation models based on three guiding principles: to seed with weak localization cues, to expand objects based on the information about which classes can occur in an image, and to constrain the segmentations to coincide with object boundaries. We show experimentally that training a deep convolutional neural…

    Submitted 6 August, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

    Comments: ECCV 2016
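
    The abstract names three loss terms: seed, expand, and constrain. As a toy illustration, the snippet below sketches only the seeding term, assuming per-pixel class probabilities and a sparse set of localization cues; the expansion term (a global pooling over the image) and the CRF-based constrain term are not shown.

    ```python
    # Illustrative seeding loss: penalize the predicted probability of the cued
    # class only at the sparse seed pixels, leaving all other pixels unconstrained.
    import math

    def seeding_loss(probs, seeds):
        """Average negative log-probability of the cued class at seed pixels.

        probs: dict pixel -> dict class -> predicted probability
        seeds: dict pixel -> cued class label (sparse weak localization cues)
        """
        total = sum(-math.log(probs[pixel][label]) for pixel, label in seeds.items())
        return total / len(seeds)

    # Hypothetical two-pixel image with one seed cue.
    probs = {
        (0, 0): {"bg": 0.1, "train": 0.9},
        (0, 1): {"bg": 0.8, "train": 0.2},
    }
    seeds = {(0, 0): "train"}
    print(seeding_loss(probs, seeds))  # -ln(0.9), about 0.105
    ```

    Leaving non-seed pixels out of this term is what lets the other two principles, expansion and boundary constraints, determine the rest of the segmentation.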

  42. arXiv:1504.07460  [pdf, other

    cs.CV

    Identifying Reliable Annotations for Large Scale Image Segmentation

    Authors: Alexander Kolesnikov, Christoph H. Lampert

    Abstract: Challenging computer vision tasks, in particular semantic image segmentation, require large training sets of annotated images. While obtaining the actual images is often unproblematic, creating the necessary annotation is a tedious and costly process. Therefore, one often has to work with unreliable annotation sources, such as Amazon Mechanical Turk or (semi-)automatic algorithmic techniques. In t…

    Submitted 28 April, 2015; originally announced April 2015.

  43. arXiv:1411.5995  [pdf, other

    cs.SI

    Algebraic reputation model RepRank and its application to spambot detection

    Authors: G. V. Ovchinnikov, D. A. Kolesnikov, I. V. Oseledets

    Abstract: Owing to their surge in popularity, social networks have become lucrative targets for spammers and guerrilla marketers, who try to game ranking systems and broadcast their messages at little to no cost. Ranking systems, for example Twitter's Trends, can be gamed by scripted users, also called bots, which automatically or semi-automatically tweet essentially the same message. Judging by the prices and a…

    Submitted 20 November, 2014; originally announced November 2014.
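
    The abstract does not spell out the RepRank recursion, so the sketch below shows a generic algebraic reputation score computed by power iteration on a column-normalized endorsement matrix, the standard construction for this family of models; the actual RepRank formulation may differ.

    ```python
    # Generic reputation-by-power-iteration sketch. Entry matrix[i][j] is how
    # strongly user j endorses user i; every user is assumed to endorse someone,
    # so no column sums to zero.

    def reputation_scores(matrix, iters=100):
        """Leading eigenvector of the column-stochastic endorsement matrix."""
        n = len(matrix)
        col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
        score = [1.0 / n] * n  # uniform starting reputation
        for _ in range(iters):
            score = [
                sum(matrix[i][j] / col_sums[j] * score[j] for j in range(n))
                for i in range(n)
            ]
        return score

    # Hypothetical two-user network: user 1 is endorsed by both users.
    endorsements = [[0.0, 1.0], [1.0, 1.0]]
    print(reputation_scores(endorsements))
    ```

    Bots that only endorse each other form a low-weight block of the matrix, so their scores stay small, which is the intuition behind using such scores for spambot detection.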

  44. arXiv:1403.7057  [pdf, other

    cs.LG cs.CV

    Closed-Form Training of Conditional Random Fields for Large Scale Image Segmentation

    Authors: Alexander Kolesnikov, Matthieu Guillaumin, Vittorio Ferrari, Christoph H. Lampert

    Abstract: We present LS-CRF, a new method for very efficient large-scale training of Conditional Random Fields (CRFs). It is inspired by existing closed-form expressions for the maximum likelihood parameters of a generative graphical model with tree topology. LS-CRF training requires only solving a set of independent regression problems, for which closed-form expressions as well as efficient iterative solver…

    Submitted 27 March, 2014; originally announced March 2014.
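
    The key efficiency claim in the abstract is that training reduces to independent regression problems with closed-form solutions. The sketch below solves one such hypothetical one-dimensional least-squares subproblem in closed form; the actual LS-CRF regressions are over CRF factor parameters and image features, which are not reproduced here.

    ```python
    # Closed-form ordinary least squares for a single 1-D regression subproblem:
    # fit y = w * x + b by the textbook normal-equation solution. In LS-CRF, many
    # such independent subproblems replace joint CRF likelihood optimization.

    def fit_linear(xs, ys):
        """Return (slope, intercept) minimizing the sum of squared residuals."""
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)          # variance term
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance term
        w = sxy / sxx
        b = my - w * mx
        return w, b

    print(fit_linear([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # -> (2.0, 1.0)
    ```

    Because each subproblem is independent, they can be solved in parallel, which is what makes this style of training attractive at large scale.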