In recent years, the use of big data has greatly advanced Computer Vision and Machine Learning applications. However, the majority of these tasks have focused on a single modality, such as vision, with only a few incorporating additional modalities such as audio or thermal imaging. Moreover, handling multimodal datasets remains a challenge, particularly in data acquisition, synchronization, and annotation. As a result, many research investigations have been limited to a single modality, and even when multiple modalities are considered, processing them independently tends to underperform an integrated multimodal learning approach.

There has been a growing focus on leveraging the synchronization of multimodal streams to enhance the transfer of semantic information. Various works have successfully exploited combinations such as audio/video, RGB/depth, RGB/Lidar, visual/text, and text/audio, achieving exceptional outcomes. Intriguing applications have also emerged that employ self-supervised methodologies, enabling multiple modalities to learn associations without manual labeling and yielding richer feature sets than individual modality processing. Moreover, researchers have explored training paradigms that allow neural networks to perform well even when one modality is absent due to sensor failure, impaired functioning, or unfavorable environmental conditions. These topics have garnered significant interest in the computer vision community, particularly in the field of autonomous driving. Recent attention has also been directed towards the fusion of language (including Large Language Models) and vision, such as the generation of images/videos from text (e.g., DALL-E, text2video), from audio (wav2clip), or vice versa (image2speech). Diffusion models have likewise emerged as a fascinating framework for exploiting multimodal scenarios.

The “MUltimodal Learning and Applications” (MULA) workshop series, held in conjunction with CVPR, has generated momentum around this topic of growing interest and encouraged interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities. This special issue of the International Journal of Computer Vision (IJCV) consolidates this trend by presenting a curated selection of papers that offer a comprehensive overview of the most recent advancements and applications in multimodal learning. The special issue garnered a total of 50 submissions. After rigorous peer review in line with the journal’s high standards, 23 papers were accepted.

The benefits of multimodal deep learning approaches are showcased in applications such as object detection, vision and language, object recognition, vision and sound, multimodal data fusion, and multimodal scene understanding.

The proposed approaches also cover a wide array of learning methods, such as multimodal representation learning, cross-modal learning, multimodal transfer learning, zero-shot learning, and self-supervised learning.

We now provide a brief summary of each paper:

The first article, by Qiang Qi, Zhenyu Qiu, Yan Yan, Yang Lu and Hanzi Wang, presents a contrastive learning network for high-performance video object detection. An inter-modality contrastive learning module is designed to enforce that visual features belonging to the same class are compactly gathered around the corresponding textual semantic representations.
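To make this mechanism concrete, the snippet below gives a minimal, generic sketch of an inter-modality contrastive objective of this kind in PyTorch: visual features are pulled toward the text embedding of their class and pushed away from those of other classes. All names (visual_feats, text_embeds, labels) and the exact formulation are illustrative assumptions for exposition, not the authors' implementation.

```python
# Generic sketch of an inter-modality (visual-to-text) contrastive objective.
import torch
import torch.nn.functional as F


def inter_modality_contrastive_loss(visual_feats, text_embeds, labels, temperature=0.07):
    """visual_feats: (N, D) features of N detected objects.
    text_embeds:  (C, D) semantic embeddings, one per class.
    labels:       (N,) ground-truth class index of each object.
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = v @ t.T / temperature  # (N, C) visual-to-text similarities
    # Cross-entropy against the ground-truth class gathers same-class visual
    # features around the corresponding textual representation.
    return F.cross_entropy(logits, labels)


# Usage with random tensors:
loss = inter_modality_contrastive_loss(
    torch.randn(32, 256), torch.randn(20, 256), torch.randint(0, 20, (32,))
)
```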

The second article, by Jiuniu Wang, Wenjia Xu, Qingzhong Wang and Antoni Chan, presents a group-based differential distinctive captioning method to enhance the distinctiveness of image captions. The Distinctive Word Rate is proposed as a new evaluation metric to quantitatively assess caption distinctiveness.

The third article, by Lucas Ventura, Cordelia Schmid and Gül Varol, investigates the text-to-video retrieval problem with a focus on utilizing unlabeled videos for training. It explores a text-to-image retrieval model to provide an initial backbone and image captioning models to provide a supervision signal for unlabeled videos.

The fourth article, by Wu Wang, Liang-Jian Deng, Ran Ran and Gemine Vivone, investigates the applicability of invertible networks to the image fusion task. It proposes a conditional invertible network that learns the detail mapping based on auxiliary features.

The fifth article, by Jie Ma, Jun Liu, Qi Chai, Pinghui Wang and Jing Tao, addresses the challenge of textbook question answering, given multimodal contexts that include rich paragraphs and diagrams. It proposes an end-to-end diagram perception network for textbook question answering, which is jointly optimized under the supervision of relation prediction, diagram classification, and question answering.

The sixth article, by Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni and Rita Cucchiara, addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. It proposes to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.

The seventh article, by Patrick Ruhkamp, Daoyi Gao, Nassir Navab and Benjamin Busam, addresses 6D object pose prediction from multimodal RGB and polarimetric images in a self-supervised manner. The networks leverage the physical properties of polarized light to learn robust geometric representations by encoding shape priors and polarization characteristics derived from the physical model.

The eighth article, by Kohei Uehara and Tatsuya Harada, introduces a framework for acquiring external knowledge by generating questions that enable the model to instantly recognize novel objects. It also contributes with a real-world evaluation in which humans responded to the generated questions, and the model used the acquired knowledge to retrain the object classifier.

The ninth article, by Mirco Planamente, Chiara Plizzari, Simone Alberto Peirone, Barbara Caputo and Andrea Bottino, proposes a new relative norm alignment loss to address a key challenge in multimodal learning: the imbalance between modalities. It exploits the observation that variations in the marginal distributions of the modalities manifest as discrepancies in their mean feature norms, and it rebalances feature norms across domains, modalities, and classes.
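As a rough illustration of norm rebalancing, the sketch below penalizes the deviation of the ratio of mean feature norms between two modalities from one, so that neither modality dominates the joint representation. The names and the simplified formulation are assumptions for exposition, not the authors' exact loss.

```python
# Generic sketch of a feature-norm rebalancing loss between two modalities.
import torch


def norm_rebalancing_loss(feats_mod_a, feats_mod_b):
    """feats_mod_a, feats_mod_b: (N, D) features from two modalities."""
    mean_norm_a = feats_mod_a.norm(dim=-1).mean()
    mean_norm_b = feats_mod_b.norm(dim=-1).mean()
    # Drive the ratio of mean norms toward 1 so both modalities contribute
    # comparably; the same idea can be applied per domain or per class.
    return (mean_norm_a / mean_norm_b - 1.0) ** 2


loss = norm_rebalancing_loss(torch.randn(16, 512), torch.randn(16, 512))
```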

The tenth article, by Xue-Feng Zhu, Tianyang Xu, Zong-tao Liu, Zhangyong Tang, Xiao-Jun Wu and Josef Kittler, presents a new dataset named UniMod1K to address the data deficiency in multimodal learning. UniMod1K covers three data modalities: vision, depth, and language.

The eleventh article, by Shuyuan Lin, Feiran Huang, Tao-tao Lai, Jianhuang Lai, Hanzi Wang and Jian Weng, proposes a simple yet effective heterogeneous model fitting method for multi-source image correspondences. It can effectively estimate the parameters of the transformation model while alleviating the influence of outliers.

The twelfth article, by Chenyi Jiang, Yuming Shen, Dubing Chen, Haofeng Zhang, Ling Shao and Philip Torr, addresses the task of zero-shot learning. It introduces a near-instance-level attribute bottleneck that alters class-level attribute vectors as well as visual features throughout the training phase to better reflect their naturalistic correspondences.

The thirteenth article, by Xihang Hu, Fuming Sun, Jing Sun, Fasheng Wang and Haojie Li, addresses the task of RGB-D salient object detection. It introduces a cross-modal fusion and progressive decoding network. It demonstrates that the addition of the feature enhancement module and the edge generation module is not conducive to improving the detection performance.

The fourteenth article, by Hao Zhang and Jiayi Ma, addresses the task of pansharpening. It introduces a scale-transfer learning framework based on estimating the spectral observation model.

The fifteenth article, by Guofeng Mei, Cristiano Saltori, Elisa Ricci, Nicu Sebe, Qiang Wu, Jian Zhang and Fabio Poiesi, proposes an augmentation-free unsupervised approach for point clouds to learn transferable point-level features by leveraging uni-modal information for soft clustering and cross-modal information for neural rendering. Experiments on downstream applications demonstrate the effectiveness of the framework.

The sixteenth article, by Lin Zhu, Weihan Yin, Yiyao Yang, Fan Wu, Zhaoyu Zeng, Qinying Gu, Xinbing Wang, Chenghu Zhou and Nanyang Ye, proposes vision-language alignment learning under affinity and divergence principles to adapt vision-language pre-trained models to robust few-shot out-of-distribution generalization. It offers theoretical evidence highlighting the superiority of general language knowledge in achieving more robust out-of-distribution generalization performance.

The seventeenth article, by Elisa Warner, Joonsang Lee, William Hsu, Tanveer Syeda-Mahmood, Charles E. Kahn Jr., Olivier Gevaert and Arvind Rao, provides a survey on multimodal learning in image-based clinical biomedicine. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, it explores the transformative potential of multimodal models for clinical predictions.

The eighteenth article, by Muhammad Ferjad Naeem, Yongqin Xian, Luc Van Gool and Federico Tombari, addresses the task of zero-shot image classification. It proposes I2DFormer+, a novel transformer-based zero-shot learning framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space.

The nineteenth article, by Yu Wang, Xinjie Yao, Pengfei Zhu, Weihao Li, Meng Cao and Qinghua Hu, addresses the challenge of incomplete multimodal clustering. It presents a heterogeneous graph attention network. This network can form a complete latent space where incomplete information can be supplemented by other related samples via the learned intrinsic structure.

The twentieth article, by Xixi Wang, Xiao Wang, Bo Jiang, Jin Tang and Bin Luo, extends the Transformer for multimodal data representation. Rather than the cross-attention used in standard Transformers, the new framework employs cross-diffusion attention to conduct information communication among different modalities.

The twenty-first article, by Yawen Huang, Hao Zheng, Yuexiang Li, Feng Zheng, Xiantong Zhen, GuoJun Qi, Ling Shao and Yefeng Zheng, presents a brain generative adversarial network that explores GANs with multi-constraint and transferable property for cross-modal brain image synthesis. This network can learn meaningful tissue representations with rich variability of brain images.

The twenty-second article, by Junhao Lin, Lei Zhu, Jiaxing Shen, Huazhu Fu, Qing Zhang and Liansheng Wang, presents a new dataset named ViDSOD-100 for the RGB-D video salient object detection task. All frames in each video are manually annotated with high-quality saliency annotations.

The twenty-third article, by Yuanyuan Liu, Haoyu Zhang, Yibing Zhan, Zijing Chen, Guanghao Yin, Lin Wei and Zhe Chen, addresses the task of multimodal emotion recognition. It introduces a multimodal fusion Transformer and a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information.

The guest editors would like to thank all the authors for their outstanding contributions. We also thank the reviewers for their excellent and timely work, which made this special issue possible, and the editors-in-chief, Svetlana Lazebnik, Jiri Matas, and Yasuyuki Matsushita, who guided the guest editors throughout the process. Finally, we wish to express our appreciation to the diligent editorial team at Springer.