Abstract
Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which targets interpretable risk object detection together with intention and suggestion prediction for ego-car motions. Accurate ROLISP requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders used in existing MLLMs struggle to capture essential visual perception information, e.g., high-resolution details, multi-scale features, and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream is built around a well-designed P-Adapter, making it lightweight, training-efficient, and easy to integrate into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D’s significant improvements over current MLLMs, with gains of \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection. Further tests on the Shikra-RD dataset confirm our method’s generalization capabilities. DRAMA-ROLISP is available at https://github.com/xmed-lab/HiLM-D.
1 Introduction
Over the past decade, there has been remarkable growth in the field of autonomous driving, encompassing both academia and industry (Singh & Saini, 2021). Generally, a modern autonomous driving system integrates a range of tasks including perception, prediction and planning (Caesar et al., 2020). With the success of deep learning (He et al., 2016), data-driven learning-based methods have become a widespread component of modern autonomous driving systems (Ngiam et al., 2021; Hu et al., 2023). Using sensor data as input, e.g., RGB or LiDAR data, end-to-end autonomous driving models can directly predict planning or controls for vehicles. However, as black boxes, these methods generally predict only a single score, lacking interpretability and making it hard for humans to interact with them (Deruyttere et al., 2019).
Recently, some researchers have explored using natural language as a unified interface for interpretable end-to-end autonomous driving (Deruyttere et al., 2019; Xu et al., 2023; Dewangan et al., 2023; Jin et al., 2023; Kim et al., 2019). However, these approaches mainly focus on planning tasks, e.g., predicting vehicle actions and giving the reasons behind them. To explore perception tasks, DRAMA (Malla et al., 2023) was proposed for joint risk localization and description, i.e., identifying the most important traffic object (e.g., a car or a traffic cone), explaining why it poses a risk, and producing its bounding box. In this paper, we go a step further and extend DRAMA to DRAMA-ROLISP (Risk Object Localization and Intention and Suggestion Prediction), adding a planning description. As shown in Table 1, our task is more comprehensive than current interpretable autonomous driving tasks.
To perform accurate ROLISP, it is imperative for the network to have strong reasoning capabilities, i.e., to analyze and discern which traffic object in the current driving scene has the greatest impact on the ego-vehicle, and then infer the next intention. Recently, multimodal large language models (MLLMs) have demonstrated remarkable reasoning abilities in addressing many multimodal tasks (Liu et al., 2023; Zhu et al., 2023; Dai et al., 2023). Specifically, they generally align the vision encoder to large language models (LLMs) by instruction-tuning on image/video-text pairs, thus enabling the LLMs to analyze the context of images or videos following human instructions (Zhang et al., 2023; Zhu et al., 2023; Li et al., 2023). In this paper, we aspire to leverage the powerful ability of MLLMs to address the ROLISP task in autonomous driving systems.
a Limited referring detection performance. Current MLLMs suffer from failing to detect small objects (left), inaccurate detection of large objects (middle), and over-attention to salient objects (right). b Our proposed HiLM-D. Pre-trained CLIP-ViT (Radford et al., 2021) only focuses on global salient objects. Using 1.36M trainable parameters, our method can inject additional information, including temporal, multi-scale, high-resolution, and visual-related biases, into existing MLLMs. c Performance vs. Resolutions. Our method outperforms the base model (MiniGPT-4 (Zhu et al., 2023)) by a clear margin with much less computation and memory cost. See Fig. 7 for more experiments
Despite their great comprehension ability, existing MLLMs suffer from the limited visual perception performance of their vision encoders, i.e., CLIP-ViT (Radford et al., 2021), leading to several problems in autonomous driving, e.g., missing small objects, generating imprecise bounding boxes and misidentifying risks (refer to Fig. 1 for examples). The reason is that CLIP-ViT only focuses on the primary visual content, which is just enough for contrastive training (Li et al., 2023). We argue that in order to process ROLISP, the vision encoder should capture the following information: (i) Temporal Cues. To accurately determine which object poses a risk, it is essential to consider the motion of each candidate object across consecutive frames. (ii) Multi-scale Information. In autonomous driving, the sizes of objects vary significantly, ranging from small traffic cones to large trucks; single-scale information would miss objects. (iii) Visual-Related Inductive Biases. Lacking vision-specific inductive biases (e.g., local connectivity and spatial invariance) would result in slower convergence and lower performance in object detection (Chen et al., 2022; Wang et al., 2022). Solutions that adopt advanced vision encoders, such as PVT (Wang et al., 2022), ViViT (Arnab et al., 2021) or DINOv2 (Oquab et al., 2023), would demand significant training resources and still face challenges in capturing all the required information, e.g., temporal, multi-scale and high-resolution information as well as visual-related inductive biases.
We introduce HiLM-D, a resource-efficient approach to capture the full visual information within existing MLLM frameworks for ROLISP. Instead of training a new vision encoder, HiLM-D utilizes several tailored modules to incorporate temporal, multi-scale information and visual-related inductive biases into CLIP-ViT (Radford et al., 2021), as illustrated in Fig. 1b. Our main motivation is that in autonomous driving scenarios, the primary variations occur in the motion trajectories rather than the semantic or appearance information of objects. For example, the shapes and colors of vehicles surrounding the ego vehicle remain relatively constant, while only their motion states change. Hence, HiLM-D adopts a two-stream paradigm: a temporal reasoning stream for temporal cues from low-resolution videos, and a spatial perception stream for multi-scale and visual-related inductive biases from high-resolution images. To this end, the temporal reasoning stream consists of a static visual encoder enhanced with trainable lightweight ST-Adapters (Pan et al., 2022), enabling a pre-trained CLIP-ViT without temporal knowledge to reason about dynamic video content efficiently. The spatial perception stream has the following parts: a high-resolution spatial encoder to extract fine-grained semantics with visual inductive biases from one high-resolution image, and a P-Adapter to capture the multi-scale information and inject it into the temporal reasoning stream, providing the complete information for the LLM to perform ROLISP. Note that our spatial perception stream is very lightweight and training-efficient, and can act as a plug-and-play module easily integrated into existing MLLMs.
We conduct experiments on the ROLISP benchmark to demonstrate the superiority of HiLM-D, e.g., outperforming state-of-the-art MLLMs by \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection with only 1.36M trainable parameters. Experiments on the generic dataset Shikra-RD (Chen et al., 2023) further demonstrate the generalization of our method.
In summary, the main contributions of our method are as follows:
-
We introduce ROLISP, a unified task comprising risk object localization and intention and suggestion prediction for interpretable autonomous driving.
-
We propose a spatial perception stream to capture and inject multi-scale fine-grained details with vision-specific biases into existing MLLMs for autonomous driving, without requiring additional pre-training to align the vision encoder with LLMs.
-
We introduce a query-aware detector that leverages LLM hidden states as queries to localize objects in high-resolution feature maps, significantly improving object detection accuracy for MLLMs.
-
We conduct extensive experiments on ROLISP, trajectory prediction and general tasks to demonstrate the superiority and generalization of our method.
2 Related Work
2.1 Multimodal Large Language Models (MLLMs)
Natural language processing has witnessed significant strides with the advent of Large Language Models (LLMs), e.g., the GPT series (OpenAI, 2023; Radford et al., 2019), T5 (Raffel et al., 2020) and LLaMA (Touvron et al., 2023). Motivated by the potential of LLMs, numerous multimodal LLMs (MLLMs), e.g., LLaVA (Liu et al., 2023), MiniGPT-4 (Zhu et al., 2023), Video-LLaMA (Zhang et al., 2023) and InstructBLIP (Dai et al., 2023), have been proposed to expand LLMs to the multimodal field, i.e., perceiving image/video input and conversing with users over multiple rounds. Pre-trained on massive image/video-text pairs, the above models can only handle image-level tasks, such as image captioning and question answering. Hence, several works, e.g., ContextDET (Zang et al., 2023), KOSMOS-2 (Peng et al., 2023) and Shikra (Chen et al., 2023), have been proposed to equip MLLMs with grounding ability to produce bounding boxes. However, all of the current MLLMs are trained on low-resolution image-text pairs, yielding limited perception results in high-resolution autonomous driving scenarios. To capture fine-grained visual details from high-resolution inputs, some Large Vision-Language Models (LVLMs) split images into patches and project them with linear layers (Bavishi et al., 2023; Li et al., 2023), avoiding an image encoder but often resulting in poor visual representation and higher training costs. Up-resize methods like Qwen-VL (Bai et al., 2023) adjust Vision Transformer (ViT) embeddings from \(224 \times 224\) to \(448 \times 448\), adding a training phase to fine-tune the ViT. However, this can degrade visual representation by altering the original position encoding (Radford et al., 2021). Slicing-based methods (Li et al., 2023; Xu et al., 2024) divide high-resolution images into patches matching a pre-trained vision encoder’s input size, maintaining efficiency while achieving competitive performance. However, they damage the original context and spatial continuity across patches, and lose visual-related inductive biases (Wang et al., 2022).
In contrast, our HiLM-D maintains the full high-resolution details while capturing multi-scale information and visual-related inductive biases.
a Overall pipeline of HiLM-D. The visual process consists of two streams, i.e., a temporal reasoning stream and a spatial perception stream, to capture temporal, multi-scale, fine-grained visual information with visual-related inductive biases, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\). b P-Adapter. Through the cross-attention mechanism, the P-Adapter incorporates the enhanced multi-scale information, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\), into the pre-trained vision encoder CLIP-ViT. c Query-Aware Detector. To produce precise bounding boxes of risk objects, the query-aware detector regards the hidden states of the LLM as prior knowledge to infer the bounding box from \(\{ \textbf{I}^i \}_{i=1}^3\)
2.2 Risk Object Identification
Risk object identification methodologies can be grouped into explicit and implicit learning. Explicit methods (Zeng et al., 2017; Gao et al., 2019) use binary classification for agent importance estimation, while other approaches (Alletto et al., 2016; Tawari et al., 2018) mimic human gaze as a risk proxy via pixel-level attention maps. Implicit methods (Malla et al., 2020; Kim & Canny, 2017; Wang et al., 2019; Li et al., 2020) focus on related tasks like trajectory prediction, with intermediate activations indicating perceived risk. However, these methods lack reasoning about model decisions or natural language descriptions, limiting interpretability in autonomous driving and driver-assistance systems. Recently, ADAPT (Jin et al., 2023) leverages a transformer to generate action narration and reasoning for the intention of self-driving vehicles. DRAMA (Malla et al., 2023) proposes a new direction of risk object identification that gives the risk object and an explanation in natural language. In this paper, we go one step further than DRAMA and ADAPT, i.e., risk object localization and intention and suggestion prediction (ROLISP), which aims to identify, explain and localize the risk object for the ego-vehicle while also predicting the ego-vehicle’s intention.
2.3 Multi-tasks in Autonomous Driving
Traditional autonomous driving algorithms process different tasks individually, e.g., detection (Chen et al., 2017), tracking (Petrovskaya & Thrun, 2008), reasoning (Kim et al., 2018), and prediction (Ngiam et al., 2021). To exploit richer inter-task information, researchers have explored integrating multiple tasks in end-to-end training frameworks. For instance, combined training for detection and tracking has been demonstrated in works like D & T (Petrovskaya & Thrun, 2008). FaF (Luo et al., 2018) takes this further by unifying a detector with a trajectory predictor, yielding notable results. IntentNet (Casas et al., 2018) expanded this paradigm by also integrating intention prediction for actors. UniAD (Hu et al., 2023) stands out by amalgamating full-stack driving tasks within a single framework, albeit still relying on distinct sub-networks for each task. A novel direction in this domain is the use of natural language as a unified output across tasks. For instance, ADAPT (Jin et al., 2023) predicts intention and gives explanations using a single caption, while DRAMA (Malla et al., 2023) targets risk object detection and explanation. In this paper, we go one step further than DRAMA and ADAPT, i.e., ROLISP, which aims to identify, explain and localize the risk object for the ego-vehicle while also predicting its intention and giving suggestions.
3 Method
This section details our HiLM-D approach designed to tackle ROLISP using video inputs, as shown in Fig. 2. The vision encoder of HiLM-D consists of two streams, a temporal reasoning stream for temporal cues (Sect. 3.1) and a spatial perception stream for fine-grained multi-scale information as well as visual inductive biases (Sect. 3.2). The features from the two streams are combined cooperatively to obtain an enhanced visual feature, which is then fed to the LLM to perform risk object identification, reasoning and planning. Additionally, we propose the query-aware detector to produce the bounding box of the identified risk object, illustrated in Sect. 3.4.
3.1 Temporal Reasoning Stream
As shown in Fig. 2a, the temporal reasoning stream is built on the well pre-trained vision encoder, i.e., CLIP-ViT (Radford et al., 2021), equipped with new learnable spatial-temporal adapters following (Pan et al., 2022) for video reasoning. Formally, given a video with L frames, the CLIP-ViT maps each frame to its kth layer feature, resulting in \(\textbf{V}^k = \{ \textbf{v}_i^k \}_{i=1}^L\), where \(\textbf{v}_i^k\) is the feature of the ith frame, \(\textbf{v}_i^k \in \mathbb {R}^{N_v \times D_v}\), \(N_v\) is the patch number and \(D_v\) is the dimension. We then reshape it to \(\textbf{V}^{k\prime } \in \mathbb {R}^{L \times H_v \times W_v \times D_v}\), where \(H_v \times W_v = N_v\). Then, to obtain the spatial-temporal information, a standard depth-wise 3D convolution layer (Feichtenhofer, 2020) is used, which can be formulated as:
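(reconstructed here from the definitions below, following the bottleneck form of the ST-Adapter (Pan et al., 2022))

\[ \text{ST-Adapter}(\textbf{V}^{k\prime}) = \textbf{V}^{k\prime} + f\!\left(\operatorname{DWConv3D}\!\left(\textbf{V}^{k\prime}\,\textbf{W}_{\text{down}}\right)\right)\textbf{W}_{\text{up}}, \]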
where \(\operatorname {DWConv3D}\) denotes the depth-wise 3D convolution, and f, \(\textbf{W}_{\text{down}}\) and \(\textbf{W}_\text {up}\) are the activation function, the down-sampling weight and the up-sampling weight, respectively. Then, we reshape \(\text {ST-Adapter}(\textbf{V}^{k\prime })\) back to \(\textbf{V}^{k}\).
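For concreteness, a minimal PyTorch sketch of such an adapter is given below. The bottleneck width, activation and tensor layout are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Minimal ST-Adapter sketch: a linear bottleneck around a depth-wise
    3D convolution that mixes information across frames and patches."""

    def __init__(self, dim: int = 1024, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)               # W_down
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel_size=3,
                              padding=1, groups=bottleneck)  # depth-wise 3D conv
        self.act = nn.GELU()                                 # f
        self.up = nn.Linear(bottleneck, dim)                 # W_up

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (L, H_v, W_v, D_v) reshaped frame tokens V^{k'}
        x = self.down(v)                                     # (L, H, W, C)
        x = x.permute(3, 0, 1, 2).unsqueeze(0)               # (1, C, L, H, W)
        x = self.act(self.conv(x))
        x = x.squeeze(0).permute(1, 2, 3, 0)                 # back to (L, H, W, C)
        return v + self.up(x)                                # residual connection
```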
3.2 Spatial Perception Stream
Achieving effective perception requires high-resolution inputs, visual-related inductive biases, and multi-scale, fine-grained information. However, integrating these elements into existing Multimodal Large Language Models (MLLMs) presents a significant challenge. The primary issue arises from the fact that current MLLMs typically employ CLIP-ViT (Radford et al., 2021) as the vision backbone, which has been pre-aligned with language embeddings for LLMs. CLIP-ViT is designed with a fixed input resolution, such as \(224 \times 224\) or \(336 \times 336\), and lacks visual-specific inductive biases that are crucial for tasks like object detection. Directly training a new high-resolution Vision Transformer (ViT) with visual-related inductive biases, such as Swin Transformers (Liu et al., 2021), for video-based tasks would require vast amounts of data, substantial memory, and significant computational resources, resulting in prohibitive costs in both training time and infrastructure.
In autonomous driving scenarios, the primary variations occur in the motion trajectories rather than the semantic or appearance information of objects. For example, the shapes and colors of vehicles surrounding the ego vehicle remain relatively constant, while only their motion states change, and low-resolution input is sufficient to capture these motion trajectories. Motivated by this, we sample a single high-resolution frame from the original video to extract the detailed perceptual information (in this paper, we use the last frame of the video; see Table 6 for analysis).
To this end, we propose a spatial perception stream, which captures and integrates multi-scale vision-specific information from the high-resolution image into the MLLM. As shown in Fig. 2a, the proposed stream consists of two main parts: a high-resolution spatial (HRS) encoder to obtain the multi-scale features with visual-specific biases from the high-resolution frame, and a P-Adapter to incorporate the extracted features into the temporal reasoning stream.
HRS Encoder. To capture multi-scale vision-specific information for object detection, the HRS encoder is modified from the classic convolutional network (CNN) ResNet (He et al., 2016). Compared with the plain ViT in current MLLMs, a CNN offers several advantages: it reduces memory and computation costs and brings vision-specific priors for detection tasks (e.g., local connectivity and spatial invariance). We verify this in Table 9. For multi-scale features, we use a stack of stride-2 \(3 \times 3\) convolution layers to double the number of channels and reduce the size of the feature maps, resulting in hierarchical features \(\{ \textbf{I}^i \}_{i=1}^3\) with resolutions of 1/8, 1/16, and 1/32. Then, we flatten and concatenate these feature maps into feature tokens \(\textbf{I} \in \mathbb {R}^{N_{sp}\times D_i}\), where \(N_{sp}\) and \(D_i\) are the token number and the dimension respectively.
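A minimal PyTorch sketch of this design is given below, assuming a ResNet-50 trunk truncated at its 1/8-resolution stage and a shared token dimension across scales (a simplification of the channel-doubling schedule described above).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HRSEncoder(nn.Module):
    """Sketch of the HRS encoder: a ResNet trunk plus stride-2 3x3 convolutions
    producing 1/8, 1/16 and 1/32 feature maps, flattened into tokens."""

    def __init__(self, dim: int = 256):
        super().__init__()
        trunk = resnet50(weights=None)
        # keep layers up to the 1/8-resolution stage (conv1 ... layer2)
        self.stem = nn.Sequential(trunk.conv1, trunk.bn1, trunk.relu,
                                  trunk.maxpool, trunk.layer1, trunk.layer2)
        self.proj1 = nn.Conv2d(512, dim, kernel_size=1)                        # 1/8 scale
        self.down2 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)   # 1/16 scale
        self.down3 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)   # 1/32 scale

    def forward(self, hr_frame: torch.Tensor):
        # hr_frame: (B, 3, H, W) single high-resolution frame
        i1 = self.proj1(self.stem(hr_frame))       # (B, dim, H/8,  W/8)
        i2 = self.down2(i1)                        # (B, dim, H/16, W/16)
        i3 = self.down3(i2)                        # (B, dim, H/32, W/32)
        # flatten each map and concatenate into N_sp tokens of dimension dim
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in (i1, i2, i3)], dim=1)
        return (i1, i2, i3), tokens                # multi-scale maps + tokens I
```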
P-Adapter. The P-Adapter aims to incorporate the multi-scale high-resolution spatial features into the features from CLIP-ViT (Radford et al., 2021), thus allowing the LLM to receive the complete visual information to compare and decide which object needs the most attention. As shown in Fig. 2b, for the kth block of the vision encoder, the P-Adapter takes the temporal low-resolution visual embedding \(\textbf{V}^k\) (see Sect. 3.1) as the query, and the extracted multi-scale spatial visual features \(\textbf{I}\) (see Sect. 3.2) as the key and value. Then, the P-Adapter process can be formulated as:
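(a plausible reconstruction, consistent with the gating described below)

\[ \textbf{V}^{k} \leftarrow \textbf{V}^{k} + \alpha \cdot \text{Cross-Attn}\!\left(\text{norm}(\textbf{V}^{k}),\ \text{norm}(\textbf{I})\right), \]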
where \(\text {norm}(\cdot )\) and Cross-Attn are the LayerNorm and cross-attention layer, respectively. \(\alpha \) is a learnable gating factor that adaptively controls the importance of the enhanced features relative to \(\textbf{V}^k\). We initialize \(\alpha \) to zero to ensure that the feature distribution of the pre-trained CLIP-ViT is not modified drastically, for better knowledge preservation and stable optimization (Zhang et al., 2023).
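A minimal PyTorch sketch of such a gated cross-attention adapter follows; the head count and dimensions are illustrative assumptions, and the high-resolution tokens are assumed to be already projected to the ViT dimension.

```python
import torch
import torch.nn as nn

class PAdapter(nn.Module):
    """Sketch of a P-Adapter block: low-resolution temporal tokens V^k query
    the multi-scale high-resolution tokens I, gated by a zero-initialised alpha."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))   # gate starts at zero

    def forward(self, v_k: torch.Tensor, hr_tokens: torch.Tensor) -> torch.Tensor:
        # v_k: (B, N_v, D) CLIP-ViT tokens; hr_tokens: (B, N_sp, D) HRS tokens I
        q = self.norm_q(v_k)
        kv = self.norm_kv(hr_tokens)
        enhanced, _ = self.cross_attn(q, kv, kv)
        # residual injection; alpha = 0 at init keeps the CLIP-ViT features intact
        return v_k + self.alpha * enhanced
```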
Note that although our method adopts a two-stream framework, the motivation and technical details differ from previous video two-stream frameworks (such as SlowFast (Feichtenhofer et al., 2019)). Existing two-stream frameworks like SlowFast focus on temporal dynamics for general video semantic classification, typically using low-resolution frames (224\(\times \)224) with varying temporal resolutions (e.g., SlowFast (Feichtenhofer et al., 2019) uses different frame rates and others (Simonyan & Zisserman, 2014; Carreira & Zisserman, 2017) use an additional optical-flow stream). In contrast, our method targets both temporal dynamics and fine-grained spatial perception, crucial for tasks like autonomous driving. Our framework achieves this by introducing additional lightweight modules to capture holistic perception semantics and efficiently inject them into the MLLM without massive pre-training or architecture modification, e.g., using only 1.36M trainable parameters to achieve an \(8.7\%\) improvement on the detection task.
3.3 Large Language Model
Given the visual tokens and text instruction tokens, the pre-trained LLM (e.g., Vicuna (Chiang et al., 2023)) is leveraged for description generation, covering the risk object with explanations as well as intentions and suggestions for the ego-car. The input to the LLM is the concatenated multimodal tokens \([{\textbf{Z}_v}, \textbf{Z}_t] \in \mathbb {R}^{(N_v+N_t) \times D_t}\), where \(\textbf{Z}_v = \text {project}(\textbf{V}^K)\) and \(\text {project}(\cdot )\) is the projection layer that transfers the visual embedding into tokens that the LLM can understand. \(\textbf{Z}_t \in \mathbb {R}^{N_t \times D_t}\) are the text embeddings, tokenized from text prompts, e.g., ‘Which object is at the highest risk? Then predict the motions and suggestions for the ego-car’. Then, the pre-trained LLM receives the multimodal tokens \([{\textbf{Z}_v}, \textbf{Z}_t]\) to generate language in an autoregressive way as follows:
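(a standard autoregressive factorization, reconstructed to match the notation below)

\[ p\!\left(\textbf{Z}_a \mid \textbf{Z}_v, \textbf{Z}_t\right) = \prod_{i=1}^{N_a} p_{\theta}\!\left(\textbf{z}_i \mid \textbf{Z}_v, \textbf{Z}_{t,<i}, \textbf{Z}_{a,<i}\right), \]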
where \(\theta \) denotes the trainable parameters, \(\textbf{Z}_a \in \mathbb {R}^{N_a \times D_t}\) is the generated answer, and \(\textbf{Z}_{t,<i}\) and \(\textbf{Z}_{a,<i}\) are the prompt and answer tokens before the current prediction token \(\textbf{z}_i\). The ST-Adapters and the linear layer are supervised by the caption loss \(\mathcal {L}_{cap} = CE(\textbf{Z}_a, \hat{\textbf{Z}}_a)\), where \(\hat{\textbf{Z}}_a\) is the ground-truth answer and CE is the cross-entropy loss.
3.4 Query-Aware Detector
To obtain the precise bounding box of the identified risk object, we devise a query-aware detector that regards the hidden states as prior knowledge to find the bounding box in the multi-scale visual features, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\). Normally, the last token of the hidden states fully perceives the whole multimodal context and contains comprehensive instruction-aware semantics (Li et al., 2023). Therefore, we use the hidden state of the last token, \(\textbf{H}\), as the query, retrieving visual cues from \(\{ \textbf{I}^i \}_{i=1}^3\) for bounding boxes. This process can be presented as follows:
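(reconstructed to be consistent with the baseline variant in Sect. 4.2, where \(\textbf{I}^i\) is replaced by \(\textbf{Z}_v\))

\[ \overline{\textbf{H}}_i = \text{Cross-Attn}\!\left(\textbf{H},\ \textbf{I}^{i}\right), \quad i = 1, 2, 3. \]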
Finally, for each scale, \(\overline{\textbf{H}}_i\) is fed to two MLP heads to generate the bounding box \(\textbf{B}_i\) and the activation score \(c_i\).
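A plausible reconstruction of these two heads is:

\[ \textbf{B}_i = \text{MLP}_{\text{box}}\!\left(\overline{\textbf{H}}_i\right), \qquad c_i = \text{MLP}_{\text{act}}\!\left(\overline{\textbf{H}}_i\right). \]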
During training, we define a multi-task loss to supervise the bounding box prediction as \(\mathcal {L}_\text {det} = \mathcal {L}_\text {act} + \mathcal {L}_\text {box}\), where \(\mathcal {L}_\text {act}\) is the cross-entropy loss and \(\mathcal {L}_\text {box}\) is the L1 loss. The overall loss is defined as follows:
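(reconstructed from the caption loss \(\mathcal{L}_{cap}\) in Sect. 3.3 and the weight \(\lambda_\text{det}\))

\[ \mathcal{L} = \mathcal{L}_{cap} + \lambda_{\text{det}}\,\mathcal{L}_{\text{det}}, \]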
where \(\lambda _\text {det}\) is a hyper-parameter balancing the two losses.
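A minimal PyTorch sketch of this detector is given below; the hidden sizes are illustrative assumptions, and the last-token hidden state and the scale tokens are assumed to be projected to a common dimension.

```python
import torch
import torch.nn as nn

class QueryAwareDetector(nn.Module):
    """Sketch of the query-aware detector: the last LLM hidden state queries
    each high-resolution scale; MLP heads regress a box and an activation score."""

    def __init__(self, dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 4))   # (x1, y1, x2, y2)
        self.act_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))   # activation score c

    def forward(self, h_last: torch.Tensor, scales):
        # h_last: (B, 1, D) last-token hidden state H; scales: list of (B, N_i, D)
        boxes, scores = [], []
        for tokens in scales:
            h_bar, _ = self.cross_attn(h_last, tokens, tokens)   # (B, 1, D)
            boxes.append(self.box_head(h_bar).sigmoid())          # normalised box
            scores.append(self.act_head(h_bar))
        # at inference, the box from the scale with the highest score is kept
        return torch.cat(boxes, dim=1), torch.cat(scores, dim=1)
```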
Examples of generated descriptions by GPT-4 (OpenAI, 2023)
4 Experiments
4.1 Experimental Settings
Architecture Details In this paper, we adopt the well-known MLLM, MiniGPT-4 (Zhu et al., 2023), as our backbone. Notably, our proposed P-Adapter is a plug-and-play module that can be seamlessly integrated into existing MLLMs (see Table 5). As described in Sect. 3.2, the HRS encoder is a modified version of the classic CNN-based ResNet (He et al., 2016). Specifically, it consists of an initial 7\(\times \)7 convolutional block followed by four residual stages (64\(\rightarrow \)128\(\rightarrow \)256\(\rightarrow \)512) to extract spatial features. The query-aware detector comprises a cross-attention head (a multi-head attention block) and three MLP layers, each followed by a ReLU activation, to map features to bounding boxes.
Implementation Details Our proposed method is implemented in PyTorch (Paszke et al., 2019) and trained on a single machine with 8 NVIDIA V100 GPUs. The input video frames are resized and cropped to a spatial size of \(224 \times 224\). We uniformly sample \(L=5\) frames from the entire video and ensure that the last frame is sampled for producing bounding boxes. We set \(\lambda _\text {det}\) in Eq. 6 to 2. We use AdamW (Loshchilov & Hutter, 2017) as the optimizer and a cosine annealing scheduler (Loshchilov & Hutter, 2016) as the learning rate scheduler, with an initial learning rate of \(1e-4\) for the MLLM branch and \(4e-4\) for the spatial perception stream, and a global batch size of 64.
Datasets DRAMA (Malla et al., 2023) is a benchmark evaluating visual reasoning in driving scenarios with 17,785 two-second interactive scenarios. The videos in the DRAMA (Malla et al., 2023) dataset are recorded at a 30 Hz frame rate with \(1928 \times 1280\) resolution or at a 60 Hz frame rate with \(2704 \times 1520\) resolution. We split the 17,785 videos into \(70\%\) train, \(15\%\) validation and \(15\%\) test. The original captions of DRAMA only contain the risk object and its explanation. To add the intention and suggestions for the ego-car, we use the answers from the video question answering (VQA) annotations of the dataset to complete the captions for ROLISP (Risk Object Localization and Intention and Suggestion Prediction). We obtain the ground truth of intentions and suggestions for the ego-car from the questions ‘Suggestions (to the ego car)’ and ‘Intention of Ego-vehicle’. However, the intention and suggestion answers are short and simple, e.g., ‘straight’ or ‘yield’. Hence, we resort to GPT (OpenAI, 2023) to generate more diverse descriptions. For example, we use the following prompt:
Figure 3 illustrates the newly generated descriptions. Similarly, we can also obtain the descriptions for ‘yield’. We randomly select one of the 20 generated descriptions and add it to the original risk object’s caption in the order of intention and suggestion, obtaining the full caption for ROLISP. The whole process is illustrated in Fig. 4.
Evaluation Metrics ROLISP comprises two tasks: (1) captioning to identify and explain risk objects while predicting ego-vehicle intentions and motions, and (2) risk object detection. The captioning evaluation follows standard metrics (Malla et al., 2023), i.e., BLEU-4 (B4), METEOR (M), CIDEr (C), and SPICE (S). We also utilize an LLM, i.e., GPT-4o (OpenAI, 2024), to assess the degree of semantic alignment between the predicted results and the ground truth. The degree of semantic alignment is rated on a scale from 0 to 5, with 0 representing the lowest and 5 representing the highest score. The final result is the average score over all predicted samples. The detailed prompts for GPT-4o are shown in Fig. 5. We employ the mean intersection over union (mIoU) for detection assessment. Additionally, we present IoU scores categorized by object size: small (IoU\(_{S}\)), medium (IoU\(_{M}\)), and large (IoU\(_{L}\)).
4.2 Comparison with the State-of-the-Art Methods
We conduct experiments on DRAMA-ROLISP to compare with both image-based and video-based MLLMs, including BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023), LLaVA-1.5 (Liu et al., 2023), InstructBLIP (Dai et al., 2023), Shikra (Chen et al., 2023), Monkey (Li et al., 2023), Qwen-VL (Bai et al., 2023), LLaVA-Next (Liu et al., 2024), eP-ALM (Shukor et al., 2023), Video-LLaMA (Zhang et al., 2023), and Video-LLaMA 2 (Cheng et al., 2024). For all models, we select the 7B-parameter version of the LLM. Note that except for Shikra and Qwen-VL, the other models have no ability to detect objects. Hence, we integrate the query-aware detector into them for bounding boxes, i.e., replacing \(\textbf{I}^i\) with \(\textbf{Z}_v\) in Eq. 4, which can be formulated as \( \overline{\textbf{H}}_i = \text {Cross-Attn}(\textbf{H},\textbf{Z}_v)\).
Table 2 demonstrates the superior performance of our method in both captioning and detection tasks. Video-based approaches generally surpass image-based ones in captioning, benefiting from temporal data for enhanced risk identification and intention/suggestion prediction. However, their detection capabilities tend to decline, possibly due to the redundant noise in videos affecting object detection. Uniquely, our video-based approach enhances captioning performance (e.g., the B4 score increases from 56.3 to 58.3) without compromising detection accuracy. Notably, we observe significant improvements for small objects, underscoring the advantages of high-resolution input. Besides the standard caption metrics, our method also improves over the baselines on LLM-based semantic correctness. Figure 6 offers a visual comparison of captioning and detection results from Shikra (Chen et al., 2023), Video-LLaMA (Zhang et al., 2023), and our method, further highlighting our superior performance in both domains.
Besides the MLLMs, we also compare our method with some specialist SOTAs, i.e., ADAPT (Jin et al., 2023) and Grounding-DINO (Liu et al., 2023). ADAPT is designed for captioning tasks in autonomous driving, and Grounding-DINO is a model for language-grounded detection tasks. Since Grounding-DINO cannot generate descriptions for risk objects, we directly use ‘The most risk object’ as the prompt and select the object with the highest confidence score as the detection result. Our approach demonstrates comparable performance to existing specialized models. Notably, while Grounding-DINO performs well in detecting large objects, its efficacy is significantly reduced for small objects. The primary reason is Grounding-DINO’s inability to accurately identify risk objects; it tends to classify larger objects as risk objects.
Visualization comparison with the state-of-the-art. For clarity, we only show some keywords of the full captions. Inaccurate bounding boxes and error-predicted words in captions are marked in red. Compared with the current MLLMs, our model can a detect the small object, i.e., the traffic light, b generate more precise bounding boxes and c accurately identify the highest-risk object (Color figure online)
Comparison of the baseline and ours across different resolution inputs. The baseline processes varying-resolution videos, while our method, thanks to the spatial perception stream, uses a single high-resolution image. "OOM" indicates out of memory. FLOPs and memory metrics are based on a V100 GPU with a batch size of one, and FLOPs values are scaled to \(10^{10}\) for clarity. a FLOPs vs. Resolutions. b Memory cost vs. Resolutions. c Captioning (B4) vs. Resolutions. d Detection (mIoU) vs. Resolutions
4.3 Ablation Study
In this section, we conduct the ablation study to evaluate and analyze the spatial perception stream (illustrated in Sect. 3.2), the key part of the proposed HiLM-D. Unless otherwise specified, the base MLLM model used here is MiniGPT-4 (Zhu et al., 2023) excluding the ST-Adapter to reduce training overhead.
The Effect of Each Module in the Spatial Perception Stream Table 3 reports the effect of our proposed modules in the spatial perception stream (SPS). From the table, we have the following observations:
-
(i)
Comparing lines (a) and (f), we find that the improvements mainly come from our proposed SPS.
-
(ii)
Comparing lines (a)-(c), the high-resolution spatial encoder (HRES) and the query-aware detector (QAD) are essential for detection performance, e.g., removing HRES or QAD drops the mIoU score by \(16.8\%\) and \(13.0\%\), respectively.
-
(iii)
Comparing lines (a) and (d), the P-Adapter benefits both captioning and detection performance. Specifically, the model without the P-Adapter degrades by \(3.2\%\) and \(6.7\%\) in terms of B4 and mIoU, since the captured high-resolution features cannot be integrated into the MLLM.
Performance Comparison of the Baseline and Ours Across Different Resolution Inputs An intuitive way to enhance the perception ability is to increase the resolution of the inputs. To further study the effectiveness of our proposed spatial perception stream, we compare the baseline and our method across different resolution inputs in Fig. 7. Specifically, inputs of varying resolutions are fed into the baseline and our method respectively. Note that the baseline model in this ablation includes the ST-Adapter and hence can receive high-resolution videos, while ours is fed only a single current high-resolution frame. From Fig. 7, we can see that as the input video resolution increases, the FLOPs of the baseline model grow proportionally. For example, as the input resolution escalates from \(224^2\) to \(1000^2\), the FLOPs of the baseline increase from 156.9 to 2040.2, a 13-fold growth. In contrast, benefiting from the spatial perception stream, our approach requires only a single high-resolution image, so its FLOPs increase by only a factor of 1.05. Furthermore, the baseline model encounters OOM issues when the resolution exceeds \(400^2\), whereas our approach remains stable up to a resolution of \(1000^2\). In summary, our method outperforms the baseline model while using much less computation and memory.
The Effect of LLM To prove the effect of the LLM for processing ROLISP, we compare two LLMs, i.e., OPT-1.3B (Zhang et al., 2022) and Vicuna-7B (Chiang et al., 2023), with two text decoders specifically designed for autonomous driving, i.e., ADAPT (Jin et al., 2023) and DRAMA (Malla et al., 2023). Table 4 shows that the LLMs outperform the general text decoders by a clear margin. For example, using Vicuna-7B (Chiang et al., 2023), the model achieves \(56.3\%\) in terms of B4, outperforming ADAPT (Jin et al., 2023) and DRAMA (Malla et al., 2023) by \(3.7\%\) and \(5.0\%\) respectively. We also observe that a more powerful LLM benefits ROLISP, i.e., Vicuna-7B (Chiang et al., 2023) outperforms OPT-1.3B (Zhang et al., 2022) by \(1.5\%\) and \(2.2\%\) in terms of B4 and mIoU.
The Versatility of Spatial Perception Stream Table 5 demonstrates that our proposed spatial perception stream can be applied to existing MLLMs to improve their performance on ROLISP. Specifically, with our spatial perception stream, BLIP-2 (Li et al., 2023) sees improvements of \(4.8\%\) in B4 and \(13.5\%\) in mIoU scores, while Video-LLaMA (Zhang et al., 2023) benefits with gains of \(2.0\%\) and \(16.5\%\) in captioning and detection results, respectively.
Effect of the Sampled Frame Table 6 analyzes the effect of frames selected at different positions for the spatial perception stream. ‘First’, ‘middle’ and ‘last’ mean sampling the first, the middle and the last frame of the video as the high-resolution frame fed into the spatial perception stream. We observe that sampling the last frame clearly outperforms the other sampling strategies, e.g., by \(4.4\%\) and \(2.0\%\) over the first and the middle frames respectively. The reason is that the assessment of risk objects largely depends on the positions of objects in the last frame.
Furthermore, to address the concern of potentially missing dynamic objects due to sampling a single frame, we conduct additional experiments on our proposed dataset to explore this issue further. Specifically, we consider alternative frame sampling strategies to evaluate whether more frames could help capture sudden changes in object status. In these additional experiments, we compare the following frame sampling strategies:
-
One Frame (The last frame, as used in the original setting),
-
Two Frames (Uniformly sampling two frames from the last 1/3 of the video),
-
Three Frames (Uniformly sampling three frames from the last 1/3 of the video).
Formally, denote the sampled frames as \( \{ \textbf{F}_{t_i} \}_{i=1}^N \), where \( N \in \{2, 3\} \) is the number of frames sampled from the last 1/3 of the video. These frames are then processed by the high-resolution stream encoder (HRSE) to extract individual feature vectors \( \{ \textbf{V}_{t_i} \}_{i=1}^N \). The features are then combined using average pooling to produce a unified feature vector \( \overline{\textbf{V}} \), which has the same dimensionality as the feature vector from a single frame. This pooled feature vector is then used in subsequent stages, which are consistent with the settings used in the original experiments. The experimental setup and training/inference configurations were kept identical for all sampling strategies to ensure a fair comparison.
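A minimal sketch of this pooling step (assuming per-frame features of identical shape) is:

```python
import torch

def pool_hr_features(frame_feats: list) -> torch.Tensor:
    """Average-pool per-frame HRS features {V_t_i} into a single feature
    with the same shape as a one-frame feature (multi-frame ablation sketch)."""
    return torch.stack(frame_feats, dim=0).mean(dim=0)
```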
From the results in Table 7, we observe that the performance difference between using one or multiple frames is negligible. This indicates that, in most cases of ROLISP, the last frame already contains sufficient information for risk object detection and captioning tasks, as the temporal reasoning stream has already captured the full temporal semantics. Hence, additional frames do not provide significant additional information for these tasks.
Effect of Joint Training on Detection and Captioning. Table 8 demonstrates that training detection and captioning tasks independently leads to a clear decline in performance for both tasks. In contrast, joint training significantly enhances the results across both metrics, indicating a strong mutual benefit. One possible explanation is that improved object detection enhances risk identification, which in turn contributes to more accurate intent recognition and suggestion generation in the captioning task. Conversely, better captioning performance implies a more effective identification of risk-related objects, thereby boosting detection accuracy. The substantial improvements in both B4 and mIoU scores (from 54.3 to 56.3 and from 52.5 to 60.5, respectively) further validate the advantage of joint training over independent training.
4.4 Effect of Different Designs
Here, we conduct the experiment to analyze the effect of different designs, i.e., the high-resolution spatial encoder, the P-Adapter and the query-aware detector.
High-Resolution Spatial Encoder (HRES)
(1) Ablation of different backbones: In Table 9, we conduct experiments to analyze the effect of different backbones for the HRES. To save resources, we use a resolution of \(400^2\) to train all models in the same setting. From the table, we can find that (1) the plain ViT without vision-specific priors achieves limited performance on detection; (2) using more advanced backbones benefits the detection performance while bringing more memory and computation costs; (3) pre-trained weights improve the performance. Since the performance gain obtained by the advanced backbones is limited, we use ResNet50 as the backbone for efficiency.
(2) Effect of Multi-scale Features: We conduct an ablation study in Table 10 to evaluate the impact of the multi-scale features, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\) in Sect. 3.2. The results demonstrate that incorporating multi-scale features enhances the detection performance of our model, particularly for medium and small objects. For example, utilizing all scales of features, the model achieves \(33.5\%\) in IoU\(_S\), surpassing the single-scale model by \(6.8\%\). Furthermore, the additional scale information enables the model to perceive objects of various sizes more effectively, leading to more accurate risk object identification.
P-Adapter.
(1) The Number of P-Adapters: In Table 11, we study the effect of the number of P-Adapters. We uniformly incorporate P-Adapters into different layers of the CLIP-ViT. We find that the model accuracy saturates as the number grows. For example, as the number increases from 1 to 3, the mIoU improves by \(7.9\%\). However, adding more P-Adapters does not monotonically improve performance when the number exceeds 3. Therefore, we empirically set the number to 3 by default.
(2) Effect of \(\alpha \): In Table 12, we conduct experiments to verify the effect of \(\alpha \) in Eq. 2. Specifically, using \(\alpha \) achieves \(1.6\%\) and \(4.0\%\) improvements in terms of B4 and mIoU, compared with the model without \(\alpha \).
Types of the Query-Aware Detector (QAD) Table 13 reports experiments analyzing the effect of different designs of the QAD. Besides ours shown in Fig. 2b, we select two other designs for comparison, i.e., directly using the LLM to regress the bounding boxes as numerical values, as in Shikra (Chen et al., 2023), and a modified DETR decoder (Carion et al., 2020), detailed as follows:
LLM Specifically, ‘LLM’ indicates directly using the LLM to regress the bounding boxes, as in Shikra (Chen et al., 2023). To this end, we integrate the positions of bounding boxes into the captions in the following form:
where [x1, y1, x2, y2] indicates the top-left and bottom-right of the bounding box. For example, given the full caption:
And the corresponding bounding box for the pedestrian is [1264, 756, 1324, 939]. We first normalize the coordinates as \(\left[ 1264/W, 756/H, 1324/W, 939/H \right] \), where H and W are the height and width of the image (in this case \(H=1520\), \(W=2704\)). Hence, the normalized bounding box is \(\left[ 0.465, 0.495, 0.490, 0.614 \right] \). Then, the final caption can be:
DETR The implementation of the QAD in the DETR (Carion et al., 2020) manner, i.e., using the learned query embedding, is shown in Fig. 8.
The results show that directly using the LLM achieves limited detection performance, indicating that our method is more effective on this small-scale detection dataset. Moreover, ours performs better than DETR, e.g., \(59.6\%\) vs. \(56.3\%\) on the detection task, thanks to the prior knowledge carried by the LLM hidden states.
Frozen vs. LoRA-Finetuning Table 14 compares freezing the LLM with efficient fine-tuning. We use LoRA (Hu et al., 2021) to fine-tune the LLM efficiently. The results show that freezing the LLM achieves better performance with higher efficiency.
Different Position Representations for LLM To detect objects with an autoregressive model, several methods (Wang et al., 2023) introduce extra vocabularies (e.g., \(<bin0>\), \(\cdot \cdot \cdot \), \(<bin1000>\)) to represent coordinates for object detection in spatially discretized images. In contrast, Shikra (Chen et al., 2023) represents coordinates naturally and intuitively, using numbers directly. Here, we also conduct experiments to compare these two position representations for the LLM, when directly using the LLM to produce bounding boxes. As shown in Table 15, using numerical values directly achieves better performance compared with adding extra vocabularies. Moreover, using extra vocabularies degrades the captioning performance. This is because extra vocabularies require fine-tuning the LLM, thus impairing the capabilities of the original LLM.
4.5 Open-Loop Evaluation on Trajectory Prediction
In this section, we also conduct an open-loop evaluation on trajectory prediction to better demonstrate the effectiveness of our approach. Since ROLISP is primarily designed for language-based trajectory prediction rather than direct trajectory forecasting, we adapt the NuScenes dataset to incorporate the evaluation of these trajectory metrics. Specifically, we construct a language-based trajectory prediction task from NuScenes (Caesar et al., 2020) to better evaluate the performance of our method.
To clarify the task setup, the QA pairs used for trajectory prediction are as follows:
Here, \((x, y)\) represents the coordinates in bird’s-eye view (BEV) for future frames. We sample 5 consecutive keyframes from NuScenes, where the first two frames are used as input to predict the motion trajectories for the subsequent three frames. We train the model on a total of 1000 videos from the train split of NuScenes and evaluate it on 200 videos from the validation split. For each trajectory prediction, we calculate the ADE (Average Displacement Error) and FDE (Final Displacement Error) to assess the performance of our method. We follow the data processing and evaluation setup of UniAD (Hu et al., 2023) for this experiment.
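For reference, a minimal sketch of how ADE and FDE can be computed from predicted BEV waypoints is shown below (the tensor layout is an assumption).

```python
import torch

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (B, T, 2) future (x, y) BEV waypoints over T predicted frames."""
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (B, T) per-step L2 error
    ade = dist.mean().item()                      # average displacement error
    fde = dist[:, -1].mean().item()               # final displacement error
    return ade, fde
```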
The results in Table 17 show that our method improves the baseline models (MiniGPT-4 and Video-LLaMA) with better performance in both ADE and FDE metrics. This improvement demonstrates the fine-grained perception provided by our model.
We have not included closed-loop evaluation metrics because our approach does not directly involve trajectory control or real-time feedback mechanisms. The focus of our work is on improving the fine-grained perception capabilities of MLLMs. We plan to explore such settings in future experiments to further enhance the robustness and adaptability of our model.
4.6 Apply to Generalized Domain
To explore the effect of our spatial perception stream on a generalized domain, we also conduct experiments on Shikra-RD (Chen et al., 2023). Shikra-RD is a dataset for referential dialogue, constructed from Flickr30K Entities (Plummer et al., 2015) by GPT-4 (OpenAI, 2023). Specifically, besides responding to queries, the task involves providing bounding boxes for objects mentioned in the answers, which is very similar to our ROLISP. We select two MLLMs, i.e., MiniGPT-4 (Zhu et al., 2023) and InstructBLIP (Dai et al., 2023), as the baselines and augment them with our spatial perception stream. The input resolution for the spatial perception stream is \(448 \times 448\) here. Their performance on Shikra-RD is reported in Table 16. The results prove that our spatial perception stream benefits the baselines, especially in detection performance, e.g., outperforming InstructBLIP by \(3.8\%\).
5 Conclusion and Limitations
In this paper, we focus on Risk Object Localization and Intention and Suggestion Prediction (ROLISP), which involves identifying the most important traffic objects and their bounding boxes, providing explanations, and predicting the next intention for the ego car. Considering the limited visual perception ability of the pre-trained CLIP-ViT in existing multimodal large language models (MLLMs), we introduce HiLM-D, a two-stream framework designed to enhance visual information processing for ROLISP. Our framework captures comprehensive visual information, including temporal, multi-scale, and high-resolution details. The temporal reasoning stream uses a static visual encoder with trainable, lightweight ST-Adapters, enabling CLIP-ViT to effectively process dynamic video content. The spatial perception stream features a high-resolution spatial encoder and a P-Adapter, which captures multi-scale information and integrates it into the temporal stream, ensuring complete information for ROLISP. Notably, our spatial perception stream is lightweight, training-efficient, and easily integrates into existing MLLMs. Experiments on the ROLISP benchmark show HiLM-D’s significant advantages over leading MLLMs, with improvements of \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection.
Limitations In our dataset, each video contains only one risk object, which might not capture the complexity of real-world scenarios. The dataset also lacks extreme weather conditions like rain or fog and multi-view information, which are crucial for autonomous driving. Furthermore, our dataset contains few highly dynamic scenes with fast-moving or suddenly appearing objects, especially in the last few frames. We will curate a more diverse and challenging dataset to advance the field in the future.
Data Availability
Our dataset is constructed based on DRAMA: https://usa.honda-ri.com/drama.
References
Alletto, S., Palazzi, A., Solera, F., Calderara, S., & Cucchiara, R. (2016). Dr (eye) ve: A dataset for attention-based tasks with applications to autonomous and assisted driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 54–60).
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6836–6846).
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966
Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., & Taşırlar, S. (2023). Introducing our multimodal models. https://www.adept.ai/blog/fuyu-8b
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621–11631).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6299–6308).
Casas, S., Luo, W., & Urtasun, R. (2018). Intentnet: Learning to predict intention from raw sensor data. In Conference on robot learning (pp. 947–956). PMLR.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2022). Vision transformer adapter for dense predictions.
Chen, X., Ma, H., Wan, J., Li, B., & Xia, T. (2017). Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1907–1915).
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., & Zhao, R. (2023). Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195
Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., & Bing, L. (2024). Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476
Chiang, W. -L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org. Accessed 14 April 2023.
Dai, W., Li, J., Li, D., Huat, A., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). Instructblip: Towards general-purpose vision-language models with instruction tuning.
Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., & Moens, M. F. (2019). Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 2088–2098).
Dewangan, V., Choudhary, T., Chandhok, S., Priyadarshan, S., Jain, A., Singh, A.K., Srivastava, S., Jatavallabhula, K. M., & Krishna, K. M. (2023). Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. arXiv preprint arXiv:2310.02251
Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 203–213).
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202–6211).
Gao, M., Tawari, A., & Martin, S. (2019). Goal-oriented object importance estimation in on-road driving videos. In 2019 international conference on robotics and automation (ICRA) (pp. 5509–5515). IEEE.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., & Wang, W. (2023). Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17853–17862).
Jin, B., Liu, X., Zheng, Y., Li, P., Zhao, H., Zhang, T., Zheng, Y., Zhou, G., & Liu, J. (2023). Adapt: Action-aware driving caption transformer. arXiv preprint arXiv:2302.00673
Kim, J., & Canny, J. (2017). Interpretable learning for self-driving cars by visualizing causal attention. In Proceedings of the IEEE international conference on computer vision (pp. 2942–2950).
Kim, J., Misu, T., Chen, Y. -T., Tawari, A., & Canny, J. (2019). Grounding human-to-vehicle advice for self-driving vehicles. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/cvpr.2019.01084
Kim, J., Rohrbach, A., Darrell, T., Canny, J., & Akata, Z. (2018). Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV) (pp. 563–578).
Li, C., Chan, S. H., & Chen, Y. -T. (2020). Who make drivers stop? towards driver-centric risk assessment: Risk object identification via causal inference. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 10711–10718). IEEE.
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., & Qiao, Y. (2023). Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597
Li, J., Pan, K., Ge, Z., Gao, M., Ji, W., Zhang, W., Chua, T. -S., Tang, S., Zhang, H., & Zhuang, Y. (2023). Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The 12th international conference on learning representations.
Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., & Bai, X. (2023). Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607
Li, B., Zhang, P., Yang, J., Zhang, Y., Pu, F., & Liu, Z. (2023). Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
Luo, W., Yang, B., & Urtasun, R. (2018). Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3569–3577).
Malla, S., Choi, C., Dwivedi, I., Choi, J. H., & Li, J. (2023). Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1043–1052).
Malla, S., Dariush, B., & Choi, C. (2020). Titan: Future forecast using action priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11186–11196).
Ngiam, J., Caine, B., Vasudevan, V., Zhang, Z., Chiang, H. -T. L., Ling, J., Roelofs, R., Bewley, A., Liu, C., Venugopal, A., et al. (2021). Scene transformer: A unified multi-task model for behavior prediction and planning. arXiv preprint arXiv:2106.08417
OpenAI. (2023). Gpt-4 technical report.
OpenAI. (2024). Hello gpt-4o. https://openai.com/index/hello-gpt-4o/
Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P. -Y., Xu, H., Sharma, V., Li, S. -W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2023). DINOv2: Learning robust visual features without supervision.
Pan, J., Lin, Z., Zhu, X., Shao, J., & Li, H. (2022). St-adapter: Parameter-efficient image-to-video transfer learning for action recognition. arXiv preprint arXiv:2206.13559
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems (vol. 32).
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., & Wei, F. (2023). Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824
Petrovskaya, A., & Thrun, S. (2008). Model based vehicle tracking for autonomous driving in urban environments. In Proceedings of robotics: Science and systems IV, Zurich, Switzerland (vol. 34).
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision (pp. 2641–2649).
Qian, T., Chen, J., Zhuo, L., Jiao, Y., & Jiang, Y. -G. (2023). Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Shukor, M., Dancette, C., & Cord, M. (2023). ep-alm: Efficient perceptual augmentation of language models. arXiv preprint arXiv:2303.11403
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (vol. 27).
Singh, S., & Saini, B. S. (2021). Autonomous cars: Recent developments, challenges, and possible solutions. IOP Conference Series: Materials Science and Engineering, 1022, Article 012028. https://doi.org/10.1088/1757-899x/1022/1/012028
Tawari, A., Mallela, P., & Martin, S. (2018). Learning to attend to salient targets in driving videos using fully convolutional RNN. In 2018 21st international conference on intelligent transportation systems (ITSC) (pp. 3225–3232). IEEE.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. -A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. (2023). Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175
Wang, D., Devin, C., Cai, Q. -Z., Yu, F., & Darrell, T. (2019). Deep object-centric policies for autonomous driving. In 2019 international conference on robotics and automation (ICRA) (pp. 8853–8859). IEEE.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.
Xu, R., Yao, Y., Guo, Z., Cui, J., Ni, Z., Ge, C., Chua, T. -S., Liu, Z., Sun, M., & Huang, G. (2024). Llava-uhd: An lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703
Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K. K., Li, Z., & Zhao, H. (2023). Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412
Zang, Y., Li, W., Han, J., Zhou, K., & Loy, C. C. (2023). Contextual object detection with multimodal large language models. arXiv preprint arXiv:2305.18279
Zeng, K. -H., Chou, S. -H., Chan, F. -H., Carlos Niebles, J., & Sun, M. (2017). Agent-centric risk assessment: Accident anticipation and risky region localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2222–2230).
Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., & Qiao, Y. (2023). Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199
Zhang, H., Li, X., & Bing, L. (2023). Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., & Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592
Acknowledgements
This work was partially supported by a research grant from the National Natural Science Foundation of China (NSFC) and the Research Grants Council (RGC) of Hong Kong under the Joint Research Scheme (JRS), Project No. N_HKUST654/24.
Funding
Open access funding provided by Hong Kong University of Science and Technology
Additional information
Communicated by Liwei Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ding, X., Han, J., Xu, H. et al. HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving. Int J Comput Vis 133, 5379–5395 (2025). https://doi.org/10.1007/s11263-025-02433-3