Abstract
Audio-Visual Question Answering (AVQA) requires the model to answer questions with complex dynamic audio-visual information. Prior works on this task mainly consider single question-answer pairs during training, overlooking the rich semantic associations between questions. In this work, we propose a novel Collective Question-Guided Network (CoQo), which accepts multiple question-answer pairs as input and leverages the reasoning over these questions to assist the model training process. The core module is the proposed Question Guided Transformer (QGT), which uses collective question reasoning to perform question-guided feature extraction. Since multiple question-answer pairs are not always available, especially during inference, our QGT uses a set of learnable tokens to learn the collective information from multiple questions during training. At inference time, these learnable tokens bring additional reasoning information even when only one question is used as input. We employ QGT in both spatial and temporal dimensions to extract question-related features effectively and efficiently. To better capture detailed audio-visual associations, we train the model at a finer level by distinguishing feature pairs of different questions within the same video. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three AVQA datasets while reducing training time significantly. We also observe strong performance of our method on three VQA benchmarks. Detailed ablation studies further confirm the effectiveness of our proposed collective question reasoning scheme, both quantitatively and qualitatively.
1 Introduction
With the growing popularity of intelligent assistants, researchers have begun to focus on the task of Audio Visual Question Answering (AVQA). Humans rely on both visual and audio signals to perceive their surroundings and often engage in question-answering to interact with others. Integrating audio-visual sensing into intelligent assistants enables a deeper understanding of the environment, enhancing the overall interaction experience (Jiang et al., 2022; Huang et al., 2018; Cartas et al., 2019; Huang et al., 2024b).
Fig. 1 Reasoning among multiple questions on the same video can provide hints for the Audio-Visual Question Answering task. In this example, Q3 contains cues that the violin is on the right, which can help to answer Q2. We can also reason through Q2 and Q3 to provide help to Q4. Based on this observation, in this work, we propose a Collective Question-Guided Network, leveraging the reasoning over multiple questions to guide the audio-visual feature extraction
Being a challenging task, AVQA requires models to not only perceive the spatial-temporal audio-visual relations in the video, but also understand the textual information to provide accurate answers to the given questions. Existing methods (Li et al., 2022b; Lin et al., 2023; Yun et al., 2021) typically first fuse visual and audio features and then use the joint feature to interact with the question to generate an answer. Since videos usually contain numerous components at different times, the fused audio-visual features contain excessive irrelevant information to the given question. As a result, these methods often fail to answer questions regarding the detailed audio-visual relations.
Recent methods (Li et al., 2023a; Chen et al., 2023) make improvements by leveraging the given question to guide the spatial-temporal audio-visual feature extraction. However, these question-guided methods exhibit two limitations: 1) They only learn from single video-question pairs, ignoring the rich semantic associations between different questions. For example, in Figure 1, Q3 “Is the violin on the right more rhythmic than the flute?” provides hints that the violin is on the right to answer Q2. Moreover, Q2 and Q3 contain information that can be used to reason through Q4. Such reasoning over multiple questions can provide additional hints for question answering. 2) They perform audio-visual alignment by contrastive learning between videos and randomly selected audio from other videos. Since these samples are easy to distinguish, they offer limited assistance in capturing fine-grained audio-visual associations relevant to given questions.
In this work, we propose a novel Collective Question-Guided Network (CoQo) to address the above limitations in the AVQA task. To leverage semantic correlations between different questions and allow questions to provide related hints to each other, we give multiple related questions as input for each video. Next, we design a novel Question Guided Transformer module to jointly perform collective question reasoning and extract critical question-guided audio-visual information. The QGT module will first conduct reasoning over multiple questions and then use questions as guidance for audio/visual feature extraction. More specifically, the QGT includes a set of learnable tokens, which are appended to each question during training. Through a series of attention operations (Vaswani et al., 2017a), these tokens incorporate information learned via reasoning over collective questions, and thus can better extract key information and achieve effective modality fusion. We deploy QGT in separate spatial and temporal dimensions to efficiently extract question-related features from different perspectives. Regarding the second limitation, we perform audio-visual alignment through contrastive learning at both coarse and fine-grained levels. In addition to conventional contrastive learning, we select pairs within the same video to encourage the QGT to capture finer-grained audio-visual associations and focus on different aspects of the same video based on the given question.
We evaluate our model on three popular AVQA datasets: MUSIC-AVQA, MUSIC-AVQA2.0, and AVQA. Results on these benchmarks demonstrate the effectiveness of our method compared to previous baselines, even surpassing recent large-scale pretrained multi-modal models (Chen et al., 2023a). Meanwhile, the training time of our method is greatly reduced thanks to the collective question reasoning scheme. We also apply our method to several VQA benchmarks, including MSRVTT-QA, MSVD-QA, and ActivityNet-QA, which further verifies the robustness of our model design. In addition, we conduct extensive ablation studies, showing the effectiveness of our collective question reasoning scheme for the AVQA task.
In summary, our main contributions include: (1) We propose CoQo, a novel model with a collective question reasoning strategy that utilizes rich semantic associations among different questions for better fine-grained information learning and modality fusion. (2) We propose a novel Question Guided Transformer that performs reasoning over the questions and extracts representations guided by the given questions. (3) To establish cross-modality associations that enable joint-modal reasoning for question answering, we introduce a new audio-visual alignment mechanism that utilizes contrastive learning at the intra-video level to improve the model’s ability to understand more complex questions and scenarios. (4) Our method achieves state-of-the-art performance over existing methods on three AVQA datasets and improvements on three VQA benchmarks.
2 Related Work
2.1 Question Answering
Question answering is a critical task in vision-language understanding and has been extensively studied in recent years (Fan et al., 2019; Gao et al., 2018; Ben-Younes et al., 2017; Xiao et al., 2021; Antol et al., 2015). Many works focus on question answering from modalities other than language, such as Visual Question Answering (VQA) (Rahman et al., 2021; Li et al., 2019; Le et al., 2020; Qian et al., 2020; Lei et al., 2020) and Audio Question Answering (AQA) (Fayek & Johnson, 2020; Abdelnour et al., 2022; Behera et al., 2023). While remarkable progress has been made, these works are limited in comprehending scenarios in natural videos due to the complex associations between the audio and visual modalities. Hence, to improve practicality, several AVQA datasets have been introduced recently (Yang et al., 2022b; Yun et al., 2021; Li et al., 2022b; Liu et al., 2024b), providing a more comprehensive and realistic setting for question answering tasks in video understanding research.
2.2 Audio-Visual Question Answering
Audio-visual question answering is a nascent and challenging task, which requires the model to perceive complex audio-visual information and integrate it to answer the given question. Existing methods usually follow a two-stage framework for the AVQA task, which first fuses the audio and visual modalities with dedicated modules and then combines the joint audio-visual features with the given question. ST-AVQA (Li et al., 2022b) uses a spatial and temporal grounding module to highlight the correlation between the audio and visual modalities. Taking advantage of adapters (Hu et al., 2023), LAVISH (Lin et al., 2023) proposes a latent audio-visual hybrid adapter to fuse visual and audio cues with few trainable model parameters. APL (Li et al., 2023c) considers fine-grained visual objects and proposes an object-aware adaptive-positivity learning strategy to select multi-modal positive pairs. Due to the large number of audio-visual components in videos, recent works (Gao et al., 2018) focus on extracting features that are most relevant to the given question. PSTP-Net (Li et al., 2023a) uses a temporal segment selection module to select the audio-visual segments most relevant to the given question. Chen et al. (2023) propose a question-aware global-local understanding network that uses the given question to further refine the fusion of global and local features. CAD (Nadeem et al., 2024) proposes a new pre-training task for dynamically aligning audio and visual information.
While these methods achieve promising performance, they ignore the rich semantic associations between different questions. In this work, we propose a collective question reasoning scheme, CoQo, in which we design a Question Guided Transformer module to extract question-related features based on reasoning over the associations among multiple questions. One prior work with a similar idea is Lei et al. (2020) on the VQA task. However, their method mainly focuses on directly finding answers to a question from another question. In contrast, although targeting a different task, our CoQo performs reasoning among multiple questions, using simple questions to provide logical cues for hard ones.
2.3 Multimodal Learning with Audio
Since AVQA naturally involves the visual, audio, and language modalities, our work also falls within the scope of multimodal learning. Natural videos contain rich audio and visual components, and integrating audio and visual information provides richer cues than a single modality (Yang et al., 2022a, 2023; Huang et al., 2024a). Thus, many tasks focus on audio-visual learning, such as event localization (Tian et al., 2018; Wu et al., 2019; Xuan et al., 2020; Duan et al., 2021), audio-visual segmentation (Liu et al., 2024a; Zhou et al., 2022; Gao et al., 2023), sound source localization (Gan et al., 2020; Mo & Tian, 2023; Chen et al., 2021), audio-visual parsing (Lin et al., 2021; Tian et al., 2020; Cheng et al., 2022), and audio-visual source separation (Gao & Grauman, 2021; Tian et al., 2021; Zhao et al., 2018, 2020). Existing methods (Zhao et al., 2018; Xiao et al., 2020; Rao et al., 2022; Alwassel et al., 2020; Korbar et al., 2018; Cheng et al., 2022; Tian et al., 2020; Lin et al., 2021; Zhu et al., 2022; Chen et al., 2023a) for these tasks mainly focus on leveraging audio and visual representations, learning the alignment between the two distinct modalities from videos by methods such as contrastive learning. In this work, to improve the effect of our QGT module in fine-grained audio-visual alignment, we introduce a contrastive loss to learn audio-visual associations at both the inter-video and intra-video levels.
3 Method
An overview of our proposed Collective Question-Guided Network (CoQo) is illustrated in Figure 2. The inputs to the network are the video V, the audio A, and N questions related to this video. Three encoders first encode the inputs into video, audio, and question features, respectively. The core of our CoQo network is the Question Guided Transformer (QGT) module, which utilizes attention mechanisms to first perform reasoning over the multiple questions and then extract the most relevant question-guided modality information. For the audio modality, we additionally design a Question-Guided Gate to highlight question-relevant segments without destroying the inherent temporal causality in the audio features. To establish cross-modality associations that enable joint-modal reasoning for question answering, we propose a joint inter- and intra-video audio-visual alignment mechanism to obtain robust audio-visual joint representations. In the following sections, we introduce the details of our method.
3.1 Input Representation
3.1.1 Feature Encoding
For a fair comparison with previous works, we use the widely adopted VGGish (Gemmeke et al., 2017a) model to encode each audio segment into a feature vector, obtaining audio features \(F_{a} \in R^{T \times d}\), where T is the number of segments and d is the feature dimension. Given a video V, we use a pretrained Swin-Transformer (Liu et al., 2021) to extract visual features \(F_{v} \in R^{T \times H \times W \times d}\). For encoding questions, given N questions \(Q=\{q_1, \cdots , q_N\}\), we use an LSTM encoder to process the words and encode each question into a feature, yielding \(F_{q} \in R^{N \times d}\). The question encoder is trained from scratch, while the audio and visual feature extraction is performed offline, following the practice of previous works (Li et al., 2022b).
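To make the expected input shapes concrete, below is a minimal sketch of how the pre-extracted features could be assembled. The spatial grid size (H = W = 6), the vocabulary size, and all class and variable names are illustrative assumptions rather than the authors' implementation; only the tensor shapes \((T \times d)\), \((T \times H \times W \times d)\), and \((N \times d)\) follow the text.

```python
import torch
import torch.nn as nn

# Illustrative shapes only; VGGish and Swin-B features are assumed to be
# pre-extracted offline, as described in the text.
T, H, W, d = 60, 6, 6, 512   # segments, spatial grid (assumed), feature dim
N = 4                        # number of questions sampled for one video

F_a = torch.randn(T, d)          # audio features F_a from VGGish (projected to d)
F_v = torch.randn(T, H, W, d)    # visual features F_v from Swin-B

class QuestionEncoder(nn.Module):
    """Minimal LSTM question encoder sketch: word embeddings -> last hidden state."""
    def __init__(self, vocab_size=10000, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, word_ids):              # word_ids: (N, L) token indices
        h, _ = self.lstm(self.embed(word_ids))
        return h[:, -1]                        # (N, d) question features F_q

F_q = QuestionEncoder()(torch.randint(0, 10000, (N, 12)))
print(F_a.shape, F_v.shape, F_q.shape)         # (T, d), (T, H, W, d), (N, d)
```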
3.2 Collective Question-Guided Network
3.2.1 Collective Question-Answering Training
Existing methods in AVQA typically utilize a single video-question pair during training, which has two notable disadvantages. Firstly, these methods overlook the rich semantic associations that exist between different questions. Secondly, they construct negative samples from randomly selected videos as supervision signals; such samples are easy to distinguish and provide only limited assistance.
Observing that there is often more than one question for each video, and that these multiple questions contain rich semantic associations which can provide extra cues for selecting more discriminative features, we adopt a collective question reasoning strategy that utilizes multiple question-video pairs during training. We design a Question Guided Transformer (QGT) module to perform reasoning over multiple questions and extract the most relevant question-guided modality information from the redundant input. We choose Transformers (Pei et al., 2025) over graph-based reasoning (Huang et al., 2020) due to their superior multimodal modeling abilities. Specifically, we first provide the QGT module with a set of N questions \(Q=\{q_1, \cdots , q_N\}\) and M learnable tokens \(R=\{r_1, \cdots , r_M\}\) to enable comprehensive reasoning. With this design, the model can learn and store the collective question reasoning ability during training. At inference time, for practical reasons, the model does not require multiple questions as input. In the following, we detail how the QGT performs the reasoning and extracts question-guided features for the AVQA task.
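As a sketch of this training setup (the function name and data layout are hypothetical; the sampling range of 1 to 6 questions follows Section 4.1.2), a training sample could bundle several question-answer pairs of one video, while inference falls back to a single question:

```python
import random

def build_training_sample(video_id, qa_pairs, n_min=1, n_max=6):
    """Sample up to n_max question-answer pairs belonging to one video.

    qa_pairs: list of (question, answer) tuples for `video_id`.
    At inference time the model instead receives a single given question.
    """
    n = min(len(qa_pairs), random.randint(n_min, n_max))
    chosen = random.sample(qa_pairs, n)
    return {
        "video": video_id,
        "questions": [q for q, _ in chosen],
        "answers": [a for _, a in chosen],
    }
```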
3.2.2 Question Guided Transformer
As shown in the lower part of Figure 2, the QGT primarily consists of self-attention and cross-attention operations. Without loss of generality, we denote by \(F_m\) a “modality feature”, which can represent either \(F_v\) or \(F_a\). Given the modality feature \(F_m\) and the question feature \(F_q\), we first append several learnable tokens \(F_r\) to the questions to better integrate semantic associations and interact with other modalities. Then, we use self-attention operations \(\theta _{sa}\) to perform reasoning over the N questions and M tokens:
\(\hat{F}_t = \theta _{sa}([F_q; F_r])\),     (1)
where \([\cdot ;\cdot ]\) denotes concatenation.
Next, we execute self-attention \(\theta _{sa}\) and cross-attention \(\theta _{ca}\) operations simultaneously, where the modality features serve as the query and \(\hat{F}_t\) serves as the key and value. The process can be written as:
\(F'_m = \theta _{sa}(F_m) + \lambda \, \theta _{ca}(F_m, \hat{F}_t), \quad F''_m = \mathrm{FFN}(F'_m) + F'_m\),     (2)
where FFN denotes the feed-forward network and \(\lambda \) denotes the weight of the cross-attention operation. This operation is repeated n times. By combining the two attention mechanisms within a transformer layer, we achieve a balance between the extraction of modality features and the extraction of question-related information.
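A minimal sketch of one QGT layer is given below. The learnable tokens, the self-attention over questions and tokens (Equation 1), and the \(\lambda \)-weighted mix of self- and cross-attention followed by an FFN (Equation 2) follow the description above; the batch-first layout, normalization, and residual placements are our own assumptions.

```python
import torch
import torch.nn as nn

class QGTLayer(nn.Module):
    """Sketch of one Question Guided Transformer layer (batch-first tensors)."""
    def __init__(self, d=512, heads=8, lam=0.3, num_tokens=4):
        super().__init__()
        self.lam = lam                                               # weight of cross-attention
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, d))    # M learnable tokens
        self.q_sa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.m_sa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.m_ca = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, F_m, F_q):
        # F_m: (B, L, d) modality features; F_q: (B, N, d) question features.
        B = F_q.size(0)
        Ft = torch.cat([F_q, self.tokens.expand(B, -1, -1)], dim=1)  # [questions; tokens]
        Ft, _ = self.q_sa(Ft, Ft, Ft)                                # reasoning over questions (Eq. 1)
        sa, _ = self.m_sa(F_m, F_m, F_m)                             # modality self-attention
        ca, _ = self.m_ca(F_m, Ft, Ft)                               # question-guided cross-attention
        F_m = self.norm1(F_m + sa + self.lam * ca)                   # lambda-weighted mix (Eq. 2)
        return self.norm2(F_m + self.ffn(F_m))
```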
To better aggregate question-guided features, we further pass the modality features through m transformer encoder layers \(\phi _{enc}\) (dark blue) to obtain global output representations \({F_{mt}} = \phi _{enc}(F''_m)\).
3.2.3 Divide Spatial-Temporal QGTs for Video
Since videos contain both spatial and temporal dimensions, which require a large computational cost (Huang et al., 2022, 2024), we do not directly apply the vanilla attention mechanism (Vaswani et al., 2017b). Instead, we follow the practice of previous works (Li et al., 2022b; Lin et al., 2023) and divide the QGTs for videos into the spatial and temporal dimensions.
Given the visual feature \(F_{v} \in R^{T \times H \times W \times d}\), we first apply the spatial QGT \(QGT_s\) on the spatial dimensions only:
\(F_{vs} = \mathrm{Pool}\big (QGT_s(F_{v}, F_{q})\big )\),     (3)
where Pool denotes the attention pooling operation (Girdhar & Ramanan, 2017) over the spatial dimensions and \(F_{vs} \in R^{T \times d}\) is the resulting frame-level visual feature.
The architecture of the temporal QGT module \(QGT_t\) is similar to that of the spatial QGT, except that a final transformer decoder \(\phi _{dec}\) is added on top to obtain the class token for answering. The question features \(F_{q}\) are used as the query for \(\phi _{dec}\). The outputs of the temporal QGT module can be expressed as:
\(F_{vt}, F_{(cls,v)} = QGT_t(F_{vs}, F_{q})\).     (4)
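The spatial-then-temporal factorization can be sketched as follows, reusing the QGTLayer sketch above. Mean pooling stands in for the attention pooling of Girdhar and Ramanan (2017), and the decoder wiring and per-question expansion are our assumptions.

```python
import torch
import torch.nn as nn

def spatial_then_temporal(F_v, F_q, qgt_s, qgt_t, decoder):
    """F_v: (T, H, W, d) visual features; F_q: (N, d) question features."""
    T, H, W, d = F_v.shape
    N = F_q.size(0)
    q = F_q.unsqueeze(0).expand(T, -1, -1)               # share the questions across frames
    tokens = F_v.reshape(T, H * W, d)                    # spatial tokens per frame
    F_vs = qgt_s(tokens, q).mean(dim=1)                  # (T, d); mean pool in place of attention pooling
    F_vt = qgt_t(F_vs.unsqueeze(0).expand(N, -1, -1),    # temporal QGT, one stream per question
                 F_q.unsqueeze(1))                       # (N, T, d)
    F_cls_v = decoder(F_q.unsqueeze(1), F_vt)            # questions query the decoder -> class tokens
    return F_vt, F_cls_v.squeeze(1)                      # (N, T, d), (N, d)

# Example wiring (QGTLayer is the sketch from Section 3.2.2):
dec = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
F_vt, F_cls_v = spatial_then_temporal(torch.randn(60, 6, 6, 512), torch.randn(4, 512),
                                      QGTLayer(), QGTLayer(), dec)
```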
3.2.4 Adding Question Guided Temporal Gate for Audio
Previous works (Li et al., 2022b; Chen et al., 2023) utilized the same question-guiding technique on both visual and audio modalities. However, audio features reflect crucial temporal information about ordering and have much less redundancy than visual features. Thus, we design a Question Guided Temporal Gate to highlight question-relevant audio segments in a simple yet effective fashion.
We first apply dot-product attention between each question feature \(F_q \in R^{1\times d}\) and the audio features \(F_a \in R^{T \times d}\) to obtain cross-modal temporal attention maps. Then we add residual connections to extract question-aware weighted audio representations, which can be formulated as:
\(F^{out}_{a} = F_a + \sigma \big ((F_q W_q)(F_a W_a)^{\top }\big )^{\top } \odot F_a\),     (5)
where \(W_q\) and \(W_a\) are learnable matrices and \(\sigma \) is the sigmoid activation function. The output audio features \(F^{out}_{a}\) are then passed into the temporal QGT module as the modality input of Equation 2. For simplicity, when the modality is audio, we omit the superscript out and write the input of Equation 2 simply as \(F_m\).
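A minimal sketch of the gate is shown below, assuming a scaled dot-product between projected question and audio features followed by a sigmoid and a residual connection; the scaling factor and the exact residual placement are our assumptions on top of Equation 5.

```python
import torch
import torch.nn as nn

class QuestionGuidedTemporalGate(nn.Module):
    """Sketch: highlight question-relevant audio segments with a sigmoid gate."""
    def __init__(self, d=512):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_a = nn.Linear(d, d, bias=False)

    def forward(self, F_a, F_q):
        # F_a: (T, d) audio features; F_q: (1, d) one question feature.
        scale = F_a.size(-1) ** 0.5                                       # assumed scaling
        attn = torch.sigmoid(self.W_q(F_q) @ self.W_a(F_a).t() / scale)   # (1, T) temporal map
        return F_a + attn.t() * F_a                                       # residual keeps the original signal

gate = QuestionGuidedTemporalGate()
F_a_out = gate(torch.randn(60, 512), torch.randn(1, 512))                 # (60, 512)
```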
3.3 Audio-Visual Alignment
Because of the gap between the audio and visual modalities, most existing methods (Li et al., 2022b, 2023a; Chen et al., 2023) employ audio-visual alignment during training to leverage the cross-modal complementarity. They perform the alignment by contrastive learning, where the audio and visual features from the same video are treated as positive pairs and features from different videos are regarded as negative pairs.
However, this type of inter-video pair construction is relatively easy to distinguish and thus cannot help the model capture fine-grained audio-visual associations within the same video. To address this limitation, we additionally use intra-video pair construction (Huang et al., 2023) to encourage the model to distinguish highly confusing samples. Since our input includes multiple questions, we can regard the audio and visual features that have interacted with the same question as positive pairs. Conversely, features from the same video that have interacted with different questions are treated as negative pairs. Below we give the formal definition of the alignment.
3.3.1 Inter-video alignment
The inter-video alignment is composed of two losses. We first use a matching loss (Li et al., 2022b) to predict whether a pair is matched or not with a binary classifier g. Given a sample, we randomly select a sample from another video as the negative visual input \(\overline{F_{vt}}\), and we have:
\(\mathcal {L}_{AVM} = \mathcal {L}_{CE}\big (g([F_{at}; F_{vt}]), 1\big ) + \mathcal {L}_{CE}\big (g([F_{at}; \overline{F_{vt}}]), 0\big )\),     (6)
where positive pairs are labeled as 1, negative pairs as 0, and \(\mathcal {L}_{CE}\) denotes the cross-entropy loss.
Meanwhile, we use a bi-directional contrastive loss to bridge the gap between the two modalities, as in Radford et al. (2021):
\(\mathcal {L}_{AVC} = -\frac{1}{2B}\sum _{i=1}^{B}\Big [\log \frac{\exp (s(F_a^i, F_v^i)/\tau )}{\sum _{j=1}^{B}\exp (s(F_a^i, F_v^j)/\tau )} + \log \frac{\exp (s(F_a^i, F_v^i)/\tau )}{\sum _{j=1}^{B}\exp (s(F_a^j, F_v^i)/\tau )}\Big ]\),     (7)
where \(F_a^i\) and \(F_v^i\) denote the audio and visual features of the i-th sample in a batch of size B, s is the cosine similarity function, and \(\tau \) denotes a learned temperature parameter. The total \(\mathcal {L}_{inter}\) loss is the sum of the above two losses: \(\mathcal {L}_{inter} = \mathcal {L}_{AVM} + \mathcal {L}_{AVC}\).
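The two inter-video objectives can be sketched as follows; the binary classifier architecture and the temperature handling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def avm_loss(F_at, F_vt, F_vt_neg, classifier):
    """Matching loss: (audio, visual) from the same video -> label 1, mismatched -> 0."""
    pos = classifier(torch.cat([F_at, F_vt], dim=-1))        # (B, 2) logits
    neg = classifier(torch.cat([F_at, F_vt_neg], dim=-1))
    return (F.cross_entropy(pos, torch.ones(pos.size(0), dtype=torch.long))
            + F.cross_entropy(neg, torch.zeros(neg.size(0), dtype=torch.long)))

def avc_loss(F_a, F_v, temperature=0.07):
    """Bi-directional contrastive loss over a batch of size B (CLIP-style, Eq. 7)."""
    a, v = F.normalize(F_a, dim=-1), F.normalize(F_v, dim=-1)
    logits = a @ v.t() / temperature                         # (B, B) cosine similarities
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# L_inter = avm_loss(F_at, F_vt, F_vt_neg, nn.Linear(2 * 512, 2)) + avc_loss(F_at, F_vt)
```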
3.3.2 Intra-video alignment
Negative pairs sampled only from different videos in the same batch are often easy to distinguish and cannot provide strong supervision signals. Thanks to our collective question reasoning strategy, we can perform intra-video alignment. Given the class audio features \(F_{(cls,a)} \in R^{N \times d}\) and class visual features \(F_{(cls,v)} \in R^{N \times d}\) guided by the N questions, we regard the audio and visual features that have interacted with the same question as positive pairs, and then use a bi-directional contrastive loss to distinguish these question-related features:
\(\mathcal {L}_{intra} = -\frac{1}{2N}\sum _{i=1}^{N}\Big [\log \frac{\exp (s(F_{(cls,a)}^i, F_{(cls,v)}^i)/\tau )}{\sum _{j=1}^{N}\exp (s(F_{(cls,a)}^i, F_{(cls,v)}^j)/\tau )} + \log \frac{\exp (s(F_{(cls,a)}^i, F_{(cls,v)}^i)/\tau )}{\sum _{j=1}^{N}\exp (s(F_{(cls,a)}^j, F_{(cls,v)}^i)/\tau )}\Big ]\).     (8)
Note the temporal dimension is omitted for simplicity.
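Under the same assumptions, the intra-video loss reuses the bi-directional contrastive form from the avc_loss sketch above, but over the N question-guided class features of a single video, so that audio and visual features guided by the same question are positives and those guided by different questions act as negatives.

```python
def intra_video_loss(F_cls_a, F_cls_v, temperature=0.07):
    """F_cls_a, F_cls_v: (N, d) audio/visual class features for the N questions of one video.

    Sketch of Equation 8: same-question audio-visual pairs are positives; pairs
    guided by different questions of the same video act as negatives.
    """
    return avc_loss(F_cls_a, F_cls_v, temperature)   # avc_loss from the inter-video sketch
```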
3.4 Modality Fusion and Answer Prediction
After alignment, the audio and visual features from the decoder \(\phi _{dec}\) are fused to generate the final answer to the input question. In detail, we first concatenate the features and use a linear layer to obtain an audio-visual joint representation \(F_{av} = FC(\mathrm{Tanh}(\mathrm{concat}([F_{(cls,a)}, F_{(cls,v)}])))\). We then combine \(F_{av}\) with the original question feature \(F_q\), forming the final joint multimodal feature:
\(F_{fuse} = F_{av} \circ F_q + F_{av}\),     (9)
where \(\circ \) denotes element-wise multiplication. We empirically find that this fusion technique improves the generalization of the joint features better than using only addition or multiplication. To predict the final answer, we follow the baseline method (Li et al., 2022b), which uses a linear layer and a softmax function to output probabilities for each answer. We then use a cross-entropy loss for supervision: \(\mathcal {L}_{qa} = -\sum _{k=1}^{K}y_k\log (P_k)\), where \(y_k\) is the label of class k and \(P_k\) is the predicted likelihood of class k. The total loss is the combination of the cross-entropy loss, the inter-video alignment loss, and the intra-video alignment loss:
\(\mathcal {L} = \mathcal {L}_{qa} + \alpha _1 \mathcal {L}_{inter} + \alpha _2 \mathcal {L}_{intra}\),     (10)
where \(\alpha _1\) and \(\alpha _2\) are hyperparameters.
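A sketch of the fusion head and the total objective is given below; since the exact fusion formula is only partially specified in the text, the combination \(F_{av} \circ F_q + F_{av}\) and the answer vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Sketch of modality fusion and answer classification."""
    def __init__(self, d=512, num_answers=42):       # answer vocabulary size is dataset-dependent
        super().__init__()
        self.fuse = nn.Sequential(nn.Tanh(), nn.Linear(2 * d, d))   # FC(Tanh(concat(...)))
        self.classifier = nn.Linear(d, num_answers)

    def forward(self, F_cls_a, F_cls_v, F_q):
        F_av = self.fuse(torch.cat([F_cls_a, F_cls_v], dim=-1))      # audio-visual joint feature
        joint = F_av * F_q + F_av                                    # assumed combination (Eq. 9)
        return self.classifier(joint)                                # logits -> softmax + cross-entropy

def total_loss(l_qa, l_inter, l_intra, alpha1=0.5, alpha2=0.4):
    """Total objective of Equation 10 with the reported default weights."""
    return l_qa + alpha1 * l_inter + alpha2 * l_intra
```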
4 Experiments
4.1 Experimental setup
4.1.1 Dataset and Evaluation Metric
We conduct experiments mainly on two public datasets (Li et al., 2022b; Liu et al., 2024b). The MUSIC-AVQA dataset (Li et al., 2022b) is a large-scale dataset that contains 9,288 YouTube videos spanning more than 150 hours, 45,867 question-answer pairs, and 9 audio-visual question types across different scenarios. The dataset contains complex and dynamic audio-visual components, making it suitable for exploring scene understanding and spatio-temporal reasoning over the audio and visual modalities. The MUSIC-AVQA2.0 dataset (Liu et al., 2024b) tackles the data bias in MUSIC-AVQA by adding 1,204 real videos to the existing 7.4k real videos and including an additional 8.1k QA pairs. In our experiments, we do not use extra data for training. Following previous works (Li et al., 2022b; Lin et al., 2023), we use the answer prediction accuracy of each question type as the evaluation metric.
4.1.2 Implementation Details
In our experiments, unless specified otherwise, we set T to 60 and d to 512 as default values. For the question input, we randomly select N from 1 to 6 questions from each video for training. To follow the settings of existing methods, we use one given question at test time. For the visual input, we sample frames at 1 fps and randomly select 30 frames per video as input. For each frame, we resize the spatial resolution to 192 \(\times \) 192 and use Swin-B (Liu et al., 2021) as the visual encoder to extract visual features. For the audio input, we sample at 16 kHz following previous works (Li et al., 2022b; Lin et al., 2023) and divide the audio into one segment per second. For each audio segment, we use a pretrained VGGish (Gemmeke et al., 2017a) model to extract audio features and then project them into 512-dimensional features. In the spatial QGT, we use \(n = m = 1\) for feature extraction, and in the temporal QGT, we use \(n = 2, m = 6\). For the loss weights, we use \(\alpha _1 = 0.5\) and \(\alpha _2 = 0.4\). The model is trained for 20 epochs using the AdamW optimizer with an initial learning rate of 9e-5, decayed by a factor of 10 every 8 epochs. We set the batch size to 8 per GPU, and the training is done using two A100 GPUs.
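As a small sketch of the reported training configuration (the surrounding training-loop names are hypothetical):

```python
import torch

def build_optimizer(model):
    """AdamW with lr 9e-5, decayed by a factor of 10 every 8 epochs, as reported."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)
    return optimizer, scheduler

# optimizer, scheduler = build_optimizer(model)
# for epoch in range(20):                   # 20 epochs, batch size 8 per GPU
#     train_one_epoch(model, optimizer)     # hypothetical training step
#     scheduler.step()
```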
4.2 Comparison with State-of-the-Art Methods
4.2.1 Audio-visual Question Answering Task
Table 1 shows the quantitative comparisons on the MUSIC-AVQA (Li et al., 2022b) dataset. For fairness, we use different backbones in our model and compare them with previous methods. Note that we only use multiple questions as input during training; at inference time, we use one question as input, the same as all previous works. From the table, using the same VGGish encoder and ResNet encoder, our method consistently outperforms previous works. In particular, we significantly outperform the state-of-the-art method COCA (Lao et al., 2023) by 2% in overall accuracy and by 1.30%, 2.76%, and 1.80% on the three question types. To enable comparison with more recent works that use stronger Transformer-based (Vaswani et al., 2017b) visual encoders, in the middle block of Table 1 we conduct experiments using Swin-B (Liu et al., 2021) as the visual encoder and compare with previous works that use encoders of similar capabilities. The results also demonstrate the superiority of our proposed method. Specifically, our CoQo outperforms other methods by 4% on visual questions. We attribute this to our spatial QGT module, since it can not only learn various objects and their locations but also select the information most related to the given question. Notably, our method even outperforms LAVISH (Lin et al., 2023), an end-to-end trained model, while using significantly fewer parameters (37.9M vs 238.8M).
Recently, large-scale multimodal pretraining has been proven effective in producing highly generalizable representations, with VALOR (Chen et al., 2023a) being one representative work that involves the audio modality. Also, CAD (Nadeem et al., 2024) is trained end-to-end with a large-scale pretraining task on ACAV100M (Lee et al., 2021), which contains 100 million videos. For a more comprehensive analysis, we adopt the recent video foundation model InternVideo2 (Wang et al., 2024) as our audio and visual encoder. In the bottom block of Table 1, the results show that our method improves the performance from 77.4 to 79.6 when using InternVideo2 as the backbone. It is noteworthy that our method, using smaller backbones (VGGish and Swin-B) and without additional training data, outperforms VALOR-B (Chen et al., 2023a) by 1.0% in overall accuracy. This outcome underscores the effectiveness and generalization capability of our model.
In Table 2, we compare our method with existing works on the MUSIC-AVQA2.0 (Liu et al., 2024b) dataset. It covers several question types, such as Ext (Existential), Temp (Temporal), Cnt (Counting), Loc (Location), and Comp (Comparative). While the current state-of-the-art method LAST-Att (Liu et al., 2024b) uses stronger audio and visual encoders, our method still outperforms it by 1.17% in total accuracy and achieves state-of-the-art performance on 6 out of 9 question types. Our CoQo does not obtain the best performance on some question types, as LAVISH, LAST, and LAST-Att all use a large joint audio-visual backbone with significantly more parameters than ours. Interestingly, our model demonstrates stronger capabilities on question types that are considered more difficult, such as comparative (Comp), location (Loc), and temporal (Temp) questions. In particular, the temporal questions witness a 5.27% performance gap. We attribute this success to our collective question reasoning strategy, as challenging questions often require more complex reasoning, and information from other questions can provide valuable cues for this reasoning process. We show more quantitative and qualitative evidence in the sections below.
In addition to the MUSIC-AVQA dataset, we conduct experiments on the AVQA dataset, which contains various real-life scenes. Specifically, we use HCRN as the baseline and integrate the QGT structure into the model. The results are shown in Table 3. The experimental results show that the performance partially improves after inserting our QGT module. We observe that in the AVQA dataset, videos are associated with only one or two questions; thus, our collective training strategy cannot be fully utilized. Nevertheless, our QGT module still performs well in aggregating textual and multimodal information in real-world scenarios.
4.2.2 Visual Question Answering Task
To further validate the robustness of our model design, we apply our method to the more general VQA datasets, including MSRVTT-QA (Xu et al., 2017), ActivityNet-QA (Yu et al., 2019), and MSVD-QA (Xu et al., 2017).
The results on three VQA datasets are presented in Table 4. Specifically, we leverage UMT (Li et al., 2023b) as the backbone and incorporate our QGT for combining visual and textual features. To assess the efficacy of our QGT, we contrast our model with both self-attention and cross-attention mechanisms. The table reflects our findings across the three VQA datasets, indicating that our approach surpasses the original UMT performance by 0.5%, 2.1%, and 0.2% on the MSRVTT, MSVD, and ANet datasets, respectively. In contrast, employing self-attention operations yields marginal improvements on the ANet-QA dataset. Notably, the utilization of cross-attention operations often leads to a decline in performance. This decline can be primarily attributed to the observation that direct cross-modal interactions can be detrimental to the nuanced understanding of single-modality features.
4.3 Ablation study
4.3.1 Comparing Different Audio And Visual Encoders
To investigate the impact of different backbones on the AVQA task, we conduct additional experiments using several representative audio and visual encoders. The results are presented in Table 5. The combination of VGGish (Gemmeke et al., 2017a) and Swin-B (Liu et al., 2021) achieves the best results. Surprisingly, employing larger encoders such as Swin-L does not lead to enhanced performance. We attribute this to the size constraints of the MUSIC-AVQA dataset, which can potentially trigger overfitting in larger models. In the bottom row of Table 5, when employing audio-visual backbones pretrained with audio-visual alignment (VALOR (Chen et al., 2023a)), a remarkable enhancement in performance is observed. This shows that our method works well with strong audio-visual representations.
4.3.2 Collective Question Reasoning
We systematically analyze the key components of our collective question reasoning approach, focusing on the number of questions, the integration of learnable tokens, and training efficiency. We first analyze the impact of the number of questions in Table 6. The experimental results show that collective question reasoning with \(2 \le N \le 4\) is the most effective, as the results outperform the \(N=1\) case. When N becomes large, drops in performance are observed, although the results are still better than the single-question case. We believe this is due to the varying number of available questions per video: when N is too large, a sample contains many duplicate questions, which negatively impacts performance. Remarkably, when employing a randomized selection of 2 to 4 questions as input, we achieve surprisingly high performance, underscoring the effectiveness of this approach in optimizing question reasoning capabilities.
Next, in Table 7 we present results on varying the number of learnable tokens. There is a clear correlation between the number of learnable tokens and performance, with a significant performance shift observed across different configurations. Specifically, using 4 learnable tokens yields the best performance in our evaluation. We observe that when the number is 0 or 1, the performance is sub-optimal. This disparity can be attributed to the necessity for the model to grasp the semantic associations between multiple questions when they are presented as input. Leveraging multiple learnable tokens proves to be a favorable strategy for effectively addressing this semantic association challenge, thereby enhancing overall performance.
Furthermore, a comprehensive comparison encompassing accuracy, training speed, and parameter counts across various methods is illustrated in Figure 3. In the figure, CoQo(S) indicates our method using \(N=1\) question as input, and CoQo(M) denotes our method with \(N=4\) questions as input. We observe that CoQo(M) achieves state-of-the-art performance (77.24% accuracy) while maintaining a moderate number of parameters. Regarding speed, Figure 3 shows that in training, CoQo with collective question reasoning achieves the fastest speed of 85 questions per second (q/s) along with the best accuracy. This is impressively faster than ST-AVQA (65 q/s) and nearly 4 times faster than CoQo(S). LAVISH, which uses a large joint audio-visual backbone, is the slowest (5 q/s). The insights gleaned from Figure 3 underscore our model’s efficiency and performance, positioning it favorably within the landscape of AVQA methods.
We then explore the impact of incorrect questions on the results. First, we randomly select questions from all available questions during inference. Then we replace the questions with incorrect ones, altering location, temporal, and comparison words. Table 8 shows the results.
From the results, we observe that the performance difference between randomly selected questions and the original setting is negligible. However, when the questions are incorrect, the performance drops significantly, with a notable decline on the audio question type. We believe this is due to the model’s limited ability to comprehend information in the audio modality, making it more susceptible to the influence of incorrect semantic information, which affects the accuracy of audio-related questions.
4.3.3 QGT module and Q-guided Gate
We further study the effectiveness of our Question Guided Transformer with ablation studies on MUSIC-AVQA. First, we scrutinize the design of our QGT by substituting it with standard self-attention (Self-Attn) and cross-attention (Cross-Attn) modules. As illustrated in Figure 4(a), the results underscore that leveraging QGT outperforms the direct application of these vanilla attention modules. We also test the design by varying n in our spatial QGT module. The results in Figure 4(b) show that the performance first increases and then begins to decrease when \(n>1\). We believe this is due to the strong visual backbone, where more QGT layers can result in overfitting. Additionally, we test the design of our temporal QGT by varying n and m in the temporal QGT module. The results in Figure 4(c) and (d) show that the performance first increases as n and m increase, then begins to decrease when \(n>2\) and \(m>6\). This decline is attributed to the onset of overfitting, underscoring the balance required to optimize the architecture of the QGT module for enhanced performance.
Taking one step further, we delve into the individual efficacy of each module and their synergy with the collective question reasoning strategy, as outlined in Table 9. Our findings are as follows: (i) Across all settings, we observed that utilizing the collective question reasoning strategy (top block of Table 9) consistently leads to better performance compared to single question training (bottom block of Table 9). This strongly proves the effectiveness of our collective question reasoning strategy, since it can improve the performance for each module. (ii) From rows 1,2,6,7, the temporal QGT has a greater impact on performance compared to the spatial QGT module, particularly when used in conjunction with the collective question reasoning strategy. This observation suggests that while the visual encoder of our model captures visual information, the temporal information is not adequately modeled without the temporal QGT. As a result, incorporating the temporal QGT module leads to a significant performance improvement of 3.10% in our experiments.
Additionally, comparing the results with and without our Q-Guided Gate in rows 3 and 5, we can find that our Q-Guided Gate can slightly enhance overall accuracy. Delving into the specific question types, the temporal questions enjoy a significant performance increase of 1.5%. This indicates that the Q-Guided Gate effectively extracts relevant temporal information with the guidance of questions.
Transitioning to examining the attention operations within our QGT, we discuss the ratio of the two attention operations in QGT. The core of our QGT module lies in the integration of self-attention and cross-attention operations, which allows for a balance between modality feature extraction and question-related feature extraction. By varying the parameter \(\lambda \) from 0 to 1 and visualizing the outcomes in Figure 5, we uncover intriguing insights. The performance patterns do not follow a linear progression as \(\lambda \) changes. The best performance is achieved when \(\lambda \) = 0.3 and \(\lambda \) = 0.9. However, when \(\lambda \) is very small (lower than 0.3), the performance drops significantly, indicating that the question-guided cross-attention is an essential part of our QGT design.
4.3.4 Impact of the alignment Loss
In Table 10, we investigate the effectiveness of our alignment losses. We fix \(\alpha _1\) = 0.5 and \(\alpha _2\) = 0.4 as default values and change the weight of each of the two losses from 0 to 0.5. We observe that our network is more sensitive to \(\alpha _2\), as evidenced by a 0.6% improvement in performance when this weight is adjusted. On the other hand, the performance of our model does not exhibit consistent patterns with changes in \(\alpha _1\). This phenomenon suggests that the supervision signal within the same video, as captured by the intra-video loss, is more useful for our model’s training and performance.
Fig. 6 Qualitative results of our Question Guided Transformer module and audio-visual grounding by ST-AVQA. Questions in grey are collective auxiliary questions. ST-AVQA focuses on the same spatial location in most frames. In contrast, our model can highlight key spatial information that varies as time evolves
Additionally, we observe that the intra-video loss is more helpful for the audio modality than for the visual modality. We attribute this to the fact that the visual component in videos often exhibits relatively small variations; thus, for question types such as counting and location, the focus is similar across different questions. In contrast, the audio component typically changes as the video progresses, such as variations in volume or changes in the sequence of instruments being played. Correspondingly, the focus for audio-related questions also differs. This further demonstrates the effectiveness of our intra-video loss.
4.4 Qualitative Results
In Figure 6, we conduct qualitative analyses by visualizing the weights between questions and visual patches. Regions with higher values (red) indicate that the model considers this part to be more related to the question. We compare our CoQo against ST-AVQA (Li et al., 2022b), which can perform audio-visual grounding, to provide a clearer illustration of our model’s strengths. In the left example of Figure 6, we observe that ST-AVQA consistently focuses on the flute located in the middle. In contrast, our QGT can dynamically shift attention between the guitar in the 1st and 4th frames and the flute in the 2nd frame. This ability to selectively attend to different objects is due to our model’s spatial-temporal guidance by the question. Moving to a more complex scenario on the right, our model demonstrates significantly better performance. It accurately counts the number of instruments, leveraging complex spatial-temporal information. This showcases the capability of our CoQo model to effectively reason and integrate spatial-temporal cues to answer questions that require understanding multiple objects in dynamic scenes.
Figure 7 depicts the temporal attention map of our Question Guided Gate. In a video sequence, numerous frames may lack direct relevance to a specific question. The Question Guided Gate serves to filter out these extraneous frames, prioritizing those that are relevant to the given question. For example, in the first case, when the question is “Which instrument makes sounds after the piano?”, the Question Guided Gate helps to locate the frames after the piano sounds. The last example showcases the temporal cognition ability of the Question Guided Gate. When given the question “What is the initial instrument introduced?”, the Question Guided Gate directs attention to the early frames, demonstrating its capability to align with the temporal flow of the video content and cater to specific query requirements.
In Figure 8, we present additional visualizations to validate the efficacy of our collective question reasoning and the QGT module. As shown in the figure, we present the attention maps of the spatial QGT module between different questions and visual patches to highlight the process of extracting question-related features.
All examples in Figure 8 demonstrate that our model can focus on different objects and locations based on the given questions, thereby resulting in better performance. For instance, in the top-left example, our model generates distinct responses for three diverse questions. When the question is “How many types of musical instruments sound in the video?”, which requires comprehensive reasoning, our model appropriately allocates attention to all the instruments. When the question is “What kind of instrument makes sounds after the cello?”, our model accurately directs focus solely to the guzheng, the correct answer. Furthermore, our model adeptly considers spatial relationships, as evidenced by the attention map corresponding to the third question: “What kind of instrument is the rightmost?” Here, the attention map appropriately assigns greater weight to the rightmost instrument, showcasing our model’s spatial cognition capabilities.
5 Conclusions
In this work, we propose a Collective Question-Guided Network (CoQo) for the AVQA task. At the heart of CoQo lies the Question Guided Transformer (QGT) module, which first performs reasoning over multiple input questions and then uses the questions to guide the process of spatio-temporal audio-visual feature extraction. Through inter-video and intra-video contrastive learning, we establish fine-grained alignment between video and audio modalities. This alignment, in turn, empowers the QGT to discern and prioritize essential features, thereby enhancing the model’s performance and accuracy in the AVQA domain. We conduct extensive experiments on multiple AVQA datasets and the results show our proposed CoQo is both effective and efficient. We also demonstrate the robustness of our method on the VQA datasets. Additionally, the qualitative visualizations support our claim that the QGT module can extract key information related to the question from the redundant visual input.
Data Availability
All datasets used in this work are publicly available. The MUSIC-AVQA and MUSIC-AVQA2.0 datasets can be found at https://gewu-lab.github.io/MUSIC-AVQA/. MSRVTT-QA and MSVD-QA datasets can be found at https://github.com/xudejing/video-question-answering. The ActivityNet-QA dataset can be found at https://github.com/MILVLG/activitynet-qa.
References
Abdelnour, J., Rouat, J., & Salvi, G. (2022). Naaqa: A neural architecture for acoustic question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4997–5009.
Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems, 33, 9758–9770.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433.
Behera, S. R., Injeti, K. M., Patibandla, J. S. K., Pokala, P. K., & Pailla, B. R. (2023). Aquallm: Audio question answering data generation using large language models. arXiv preprint arXiv:2312.17343
Ben-Younes, H., Cadene, R., Cord, M., & Thome, N. (2017). Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision, (pp 2612–2620).
Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (2019). Seeing and hearing egocentric actions: How much can we learn? In Proceedings of the IEEE/CVF international conference on computer vision workshops, (pp 0–0).
Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (2021). Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp 16867–16876).
Chen, S., Wu, Y., Wang, C., Liu, S., Tompkins, D., Chen, Z., & Wei, F. (2022). Beats: Audio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058
Chen, S., He, X., Guo, L., Zhu, X., Wang, W., Tang, J., & Liu, J. (2023). Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345
Chen, Z., Wang, L., Wang, P., & Gao, P. (2023). Question-aware global-local video understanding network for audio-visual question answering. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 4109–4119.
Cheng, H., Liu, Z., Zhou, H., Qian, C., Wu, W., & Wang, L. (2022). Joint-modal label denoising for weakly-supervised audio-visual video parsing. In: European Conference on Computer Vision, Springer, pp 431–448.
Duan, B., Tang, H., Wang, W., Zong, Z., Yang, G., Yan, Y. (2021). Audio-visual event localization via recursive fusion by joint co-attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 4013–4022.
Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., & Huang, H. (2019). Heterogeneous memory enhanced multimodal attention model for video question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1999–2007.
Fayek, H. M., & Johnson, J. (2020). Temporal reasoning via audio question answering. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2283–2294.
Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (2020). Music gesture for visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10478–10487.
Gao, P., Li, H., Li, S., Lu, P., Li, Y., Hoi, S. C., & Wang, X. (2018). Question-guided hybrid convolution for visual question answering. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 469–485
Gao, R., Grauman, K. (2021). Visualvoice: Audio-visual speech separation with cross-modal consistency. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 15490–15500.
Gao, S., Chen, Z., Chen, G., Wang, W., & Lu, T. (2023). Avsegformer: Audio-visual segmentation with transformer. arXiv preprint arXiv:2307.01146
Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 776–780.
Girdhar, R., & Ramanan, D. (2017). Attentional pooling for action recognition. Advances in neural information processing systems 30.
Gong, Y., Chung, Y. A., & Glass, J. (2021). Ast: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778
Hu, Z., Lan, Y., Wang, L., Xu, W., Lim, E. P., Lee, R. K. W., Bing, L., & Poria, S. (2023). Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933
Huang, Y., Cai, M., Li, Z., & Sato, Y. (2018). Predicting gaze in egocentric video by learning task-dependent attention transition. In: Proceedings of the European conference on computer vision (ECCV), pp 754–769.
Huang, Y., Sugano, Y., & Sato, Y. (2020). Improving action segmentation via graph-based temporal reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14024–14034.
Huang, Y., Yang, L., & Sato, Y. (2022). Compound prototype matching for few-shot action recognition. In: ECCV.
Huang, Y., Yang, L., & Sato, Y. (2023). Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18908–18918.
Huang, Y., Chen, G., Xu, J., Zhang, M., Yang, L., Pei, B., Zhang, H., Dong, L., Wang, Y., Wang, L., et al. (2024). Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22072–22086.
Huang, Y., Xu, J., Pei, B., He, Y., Chen, G., Yang, L., Chen, X., Wang, Y., Nie, Z., Liu, J., et al. (2024). Vinci: A real-time embodied smart assistant based on egocentric vision-language model. arXiv preprint arXiv:2412.21080
Huang, Y., Yang, L., Chen, G., Zhang, H., Lu, F., & Sato, Y. (2024). Matching compound prototypes for few-shot action recognition. International Journal of Computer Vision, 132(9), 3977–4002.
Jiang, H., Murdock, C., & Ithapu, V. K. (2022). Egocentric deep multi-channel audio-visual active speaker localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10544–10552.
Korbar, B., Tran, D., & Torresani, L. (2018). Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems 31.
Lao, M., Pu, N., Liu, Y., He, K., Bakker, E. M., & Lew, M. S. (2023). Coca: Collaborative causal regularization for audio-visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 37, 12995–13003.
Le, T. M., Le, V., Venkatesh, S., & Tran, T. (2020). Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9972–9981.
Lee, S., Chung, J., Yu, Y., Kim, G., Breuel, T., Chechik, G., & Song, Y. (2021). Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 10274–10284.
Lei, C., Wu, L., Liu, D., Li, Z., Wang, G., Tang, H., & Li, H. (2020). Multi-question learning for visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 11328–11335.
Li, D., Li, J., Li, H., Niebles, J. C., & Hoi, S. C. (2022). Align and prompt: Video-and-language pre-training with entity prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4953–4963.
Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J. R., & Hu, D. (2022). Learning to answer questions in dynamic audio-visual scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19108–19118.
Li, G., Hou, W., & Hu, D. (2023). Progressive spatio-temporal perception for audio-visual question answering. In: Proceedings of the 31st ACM International Conference on Multimedia, pp 7808–7816.
Li, K., Wang, Y., Li, Y., Wang, Y., He, Y., Wang, L., & Qiao, Y. (2023). Unmasked teacher: Towards training-efficient video foundation models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 19948–19960.
Li, X., Song, J., Gao, L., Liu, X., Huang, W., He, X., & Gan, C. (2019). Beyond rnns: Positional self-attention with co-attention for video question answering. Proceedings of the AAAI conference on artificial intelligence, 33, 8658–8665.
Li, Z., Guo, D., Zhou, J., Zhang, J., & Wang, M. (2023). Object-aware adaptive-positivity learning for audio-visual question answering. arXiv preprint arXiv:2312.12816
Lin, Y. B., Tseng, H. Y., Lee, H. Y., Lin, Y. Y., & Yang, M. H. (2021). Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing. Advances in Neural Information Processing Systems, 34, 11449–11461.
Lin, Y. B., Sung, Y. L., Lei, J., Bansal, M., & Bertasius, G. (2023). Vision transformers are parameter-efficient audio-visual learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2299–2309.
Liu, J., Wang, Y., Ju, C., Ma, C., Zhang, Y., & Xie, W. (2024). Annotation-free audio-visual segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 5604–5614.
Liu, X., Dong, Z., & Zhang, P. (2024). Tackling data bias in music-avqa: Crafting a balanced dataset for unbiased question-answering. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 4478–4487.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022.
Mo, S., & Tian, Y. (2023). Audio-visual grouping network for sound localization from mixtures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10565–10574.
Nadeem, A., Hilton, A., Dawes, R., Thomas, G., Mustafa, A. (2024). Cad-contextual multi-modal alignment for dynamic avqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 7251–7263.
Pei, B., Huang, Y., Xu, J., Chen, G., He, Y., Yang, L., Wang, Y., Xie, W., Qiao, Y., Wu, F., et al. (2025). Modeling fine-grained hand-object dynamics for egocentric video representation learning. arXiv preprint arXiv:2503.00986
Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., & Lin, W. (2020). Multiple sound sources localization from coarse to fine. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, Springer, pp 292–308.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In: International conference on machine learning, PMLR, pp 8748–8763.
Rahman, T., Chou, S. H., Sigal, L., & Carenini, G. (2021). An improved attention for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1653–1662.
Rao, V., Khalil, M. I., Li, H., Dai, P., & Lu, J. (2022). Dual perspective network for audio-visual event localization. In: European Conference on Computer Vision, Springer, pp 689–704.
Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In: Proceedings of the European conference on computer vision (ECCV), pp 247–263.
Tian, Y., Li, D., & Xu, C. (2020). Unified multisensory perception: Weakly-supervised audio-visual video parsing. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, pp 436–454.
Tian, Y., Hu, D., & Xu, C. (2021). Cyclic co-learning of sounding object visual grounding and sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2745–2754.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In: NeurIPS, pp 5998–6008.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems 30
Wang, J., Chen, D., Wu, Z., Luo, C., Zhou, L., Zhao, Y., Xie, Y., Liu, C., Jiang, Y. G., & Yuan, L. (2022). Omnivl: One foundation model for image-language and video-language tasks. Advances in neural information processing systems, 35, 5696–5710.
Wang, J., Ge, Y., Yan, R., Ge, Y., Lin, K. Q., Tsutsui, S., Lin, X., Cai, G., Wu, J., Shan, Y., et al. (2023). All in one: Exploring unified video-language pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6598–6608.
Wang, Y., Li, K., Li, X., Yu, J., He, Y., Chen, G., Pei, B., Zheng, R., Xu, J., Wang, Z., et al. (2024). Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377
Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (2019). Dual attention matching for audio-visual event localization. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6292–6300.
Xiao, F., Lee, Y. J., Grauman, K., Malik, J., & Feichtenhofer, C. (2020). Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740
Xiao, J., Shang, X., Yao, A., & Chua, T. S. (2021). Next-qa: Next phase of question-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9777–9786.
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., & Zhuang, Y. (2017). Video question answering via gradually refined attention over appearance and motion. In: Proceedings of the 25th ACM international conference on Multimedia, pp 1645–1653.
Xuan, H., Zhang, Z., Chen, S., Yang, J., & Yan, Y. (2020). Cross-modal attention network for temporal inconsistent audio-visual event localization. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 279–286.
Yang, A., Miech, A., Sivic, J., Laptev, I., & Schmid, C. (2021). Just ask: Learning to answer questions from millions of narrated videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1686–1697.
Yang, L., Huang, Y., Sugano, Y., & Sato, Y. (2022). Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14722–14732.
Yang, L., Kong, Q., Yang, H. K., Kehl, W., Sato, Y., Kobori, N. (2023). Deco: Decomposition and reconstruction for compositional temporal grounding via coarse-to-fine contrastive ranking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 23130–23140.
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., & Zhu, W. (2022). Avqa: A dataset for audio-visual question answering on videos. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 3480–3491.
Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., & Tao, D. (2019). Activitynet-qa: A dataset for understanding complex web videos via question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 9127–9134.
Yun, H., Yu, Y., Yang, W., Lee, K., & Kim, G. (2021). Pano-avqa: Grounded audio-visual question answering on 360deg videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2031–2041.
Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., & Choi, Y. (2021). Merlot: Multimodal neural script knowledge models. Advances in neural information processing systems, 34, 23634–23651.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In: Proceedings of the European conference on computer vision (ECCV), pp 570–586.
Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (2020). Sep-stereo: Visually guided stereophonic audio generation by associating source separation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, Springer, pp 52–69.
Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., Guo, D., Kong, L., Wang, M., & Zhong, Y. (2022). Audio–visual segmentation. In: European Conference on Computer Vision, Springer, pp 386–403.
Zhu, X., Zhu, J., Li, H., Wu, X., Li, H., Wang, X., & Dai, J. (2022). Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 16804–16815.
Acknowledgements
This work is supported by National Key R&D Program of China (No.2022ZD0160102), and the Industry Collaboration Projects Grant, Shanghai Committee of Science and Technology, China (No.22YF1461500), and the JSPS KAKENHI Grant No.JP22KF0119.
Funding
Open Access funding provided by The University of Tokyo.
Additional information
Communicated by Dima Damen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pei, B., Huang, Y., Chen, G. et al. Guiding Audio-Visual Question Answering with Collective Question Reasoning. Int J Comput Vis 133, 6912–6929 (2025). https://doi.org/10.1007/s11263-025-02510-7