Abstract
Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which targets interpretable risk object detection together with intention and suggestion prediction for ego-car motions. Accurate ROLISP requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders used in existing MLLMs struggle to capture essential visual perception information, e.g., high-resolution details, multi-scale features, and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream is built around a well-designed P-Adapter, making it lightweight, training-efficient, and easy to integrate into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D’s significant improvements over current MLLMs, with gains of \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection. Further tests on the Shikra-RD dataset confirm our method’s generalization capabilities. DRAMA-ROLISP is available at https://github.com/xmed-lab/HiLM-D.
1 Introduction
Over the past decade, there has been remarkable growth in the field of autonomous driving, encompassing both academia and industry (Singh & Saini, 2021). Generally, a modern autonomous driving system integrates a range of tasks including perception, prediction and planning (Caesar et al., 2020). With the success of deep learning (He et al., 2016), data-driven learning-based methods have become a widespread component of modern autonomous driving systems (Ngiam et al., 2021; Hu et al., 2023). Using sensor data as input, e.g., RGB or LiDAR data, end-to-end autonomous driving models can directly predict planning or controls for vehicles. However, as black boxes, these methods generally predict only a single score, lacking interpretability and making it hard for humans to interact with them (Deruyttere et al., 2019).
Recently, some researchers have explored using natural language as a unified interface for interpretable end-to-end autonomous driving (Deruyttere et al., 2019; Xu et al., 2023; Dewangan et al., 2023; Jin et al., 2023; Kim et al., 2019). However, these approaches mainly focus on planning tasks, e.g., predicting vehicle actions and giving the reasons behind them. To explore perception tasks, DRAMA (Malla et al., 2023) was proposed for joint risk localization and description, i.e., identifying the most important traffic object (e.g., a car or a traffic cone), explaining why it poses a risk, and producing its bounding box. In this paper, we go a step further and extend DRAMA to DRAMA-ROLISP (Risk Object Localization and Intention and Suggestion Prediction), adding a planning description. As shown in Table 1, our task is more comprehensive than current interpretable autonomous driving tasks.
To perform accurate ROLISP, it is imperative for the network to have strong reasoning capabilities, i.e., to analyze and discern which traffic object in the current driving scene has the greatest impact on the ego-vehicle, and then infer the next intention. Recently, multimodal large language models (MLLMs) have demonstrated remarkable reasoning abilities in addressing many multimodal tasks (Liu et al., 2023; Zhu et al., 2023; Dai et al., 2023). Specifically, they generally align the vision encoder to large language models (LLMs) by instruction-tuning on image/video-text pairs, thus enabling the LLMs to analyze the context of images or videos following human instructions (Zhang et al., 2023; Zhu et al., 2023; Li et al., 2023). In this paper, we aspire to leverage the powerful ability of MLLMs to address the ROLISP task in autonomous driving systems.
a Limited referring detection performance. Current MLLMs suffer from failing to detect small objects (left), inaccurate detection of large objects (middle), and over-attention to salient objects (right). b Our proposed HiLM-D. Pre-trained CLIP-ViT (Radford et al., 2021) only focuses on global salient objects. Using 1.36M trainable parameters, our method can inject additional information, including temporal, multi-scale, high-resolution, and visual-related biases, into existing MLLMs. c Performance vs. Resolutions. Our method outperforms the base model (MiniGPT-4 (Zhu et al., 2023)) by a clear margin with much less computation and memory cost. See Fig. 7 for more experiments
Despite their great comprehension ability, existing MLLMs suffer from the limited visual perception performance of their vision encoders, i.e., CLIP-ViT (Radford et al., 2021), leading to several problems in autonomous driving, e.g., missing small objects, generating imprecise bounding boxes and misidentifying risks (refer to Fig. 1 for examples). The reason is that CLIP-ViT only focuses on the primary visual content, which is just enough for contrastive training (Li et al., 2023). We argue that in order to process ROLISP, the vision encoder should capture the following information: (i) Temporal Cues. To accurately determine which object poses a risk, it is essential to consider the motion of each candidate object across consecutive frames. (ii) Multi-scale Information. In autonomous driving, the sizes of objects vary significantly, ranging from small traffic cones to large trucks; single-scale information would miss objects. (iii) Visual-Related Inductive Biases. Lacking vision-specific inductive biases (e.g., local connectivity and spatial invariance) would result in slower convergence and lower performance in object detection (Chen et al., 2022; Wang et al., 2022). Solutions that adopt advanced vision encoders, such as PVT (Wang et al., 2022), ViViT (Arnab et al., 2021) or DINOv2 (Oquab et al., 2023), would demand significant training resources and still face challenges in capturing all the required information, e.g., temporal, multi-scale and high-resolution information as well as visual-related inductive biases.
We introduce HiLM-D, a resource-efficient approach to capture the full visual information within existing MLLM frameworks for ROLISP. Instead of training a new vision encoder, HiLM-D utilizes several tailored modules to incorporate temporal, multi-scale information and visual-related inductive biases into CLIP-ViT (Radford et al., 2021), as illustrated in Fig. 1b. Our main motivation is that in autonomous driving scenarios, the primary variations occur in the motion trajectories rather than the semantic or appearance information of objects. For example, the shapes and colors of vehicles surrounding the ego vehicle remain relatively constant, while only their motion states change. Hence, HiLM-D adopts a two-stream paradigm: a temporal reasoning stream for temporal cues from low-resolution videos, and a spatial perception stream for multi-scale and visual-related inductive biases from high-resolution images. To this end, the temporal reasoning stream consists of a static visual encoder enhanced with trainable lightweight ST-Adapters (Pan et al., 2022), enabling a pre-trained CLIP-ViT without temporal knowledge to reason about dynamic video content efficiently. The spatial perception stream has the following parts: a high-resolution spatial encoder to extract fine-grained semantics with visual inductive biases from one high-resolution image, and a P-Adapter to capture the multi-scale information and inject it into the temporal reasoning stream, providing the complete information for the LLM to perform ROLISP. Note that our spatial perception stream is very lightweight and training-efficient, and can act as a plug-and-play module easily integrated into existing MLLMs.
We conduct experiments on the ROLISP benchmark to demonstrate the superiority of HiLM-D, e.g., outperforming state-of-the-art MLLMs by \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection with only 1.36M trainable parameters. Experiments on the generic dataset Shikra-RD (Chen et al., 2023) further demonstrate the generalization of our method.
In summary, the main contributions of our method are as follows:
-
We introduce ROLISP, a unified task comprising risk object localization and intention and suggestion prediction for interpretable autonomous driving.
-
We propose a spatial perception stream to capture and inject multi-scale fine-grained details with vision-specific biases into existing MLLMs for autonomous driving, without requiring additional pre-training to align the vision encoder with LLMs.
-
We introduce a query-aware detector that leverages LLM hidden states as queries to localize objects in high-resolution feature maps, significantly improving object detection accuracy for MLLMs.
-
We conduct extensive experiments on ROLISP, trajectory prediction and general tasks to demonstrate the superiority and generalization of our method.
2 Related Work
2.1 Multimodal Large Language Models (MLLMs)
Natural language processing has witnessed significant strides with the advent of Large Language Models (LLMs), e.g., the GPT series (OpenAI, 2023; Radford et al., 2019), T5 (Raffel et al., 2020) and LLaMA (Touvron et al., 2023). Motivated by the potential of LLMs, numerous multimodal LLMs (MLLMs), e.g., LLaVA (Liu et al., 2023), MiniGPT-4 (Zhu et al., 2023), Video-LLaMA (Zhang et al., 2023) and InstructBLIP (Dai et al., 2023), have been proposed to expand LLMs to the multimodal field, i.e., perceiving image/video input and conversing with users over multiple rounds. Pre-trained on massive image/video-text pairs, the above models can only handle image-level tasks, such as image captioning and question answering. Hence, several works, e.g., ContextDET (Zang et al., 2023), KOSMOS-2 (Peng et al., 2023) and Shikra (Chen et al., 2023), have been proposed to equip MLLMs with grounding ability to produce bounding boxes. However, all of the current MLLMs are trained on low-resolution image-text pairs, yielding limited perception results in high-resolution autonomous driving scenarios. To capture fine-grained visual details from high-resolution inputs, some Large Vision-Language Models (LVLMs) split images into patches and project them with linear layers (Bavishi et al., 2023; Li et al., 2023), avoiding an image encoder but often resulting in poor visual representation and higher training costs. Up-resize methods like Qwen-VL (Bai et al., 2023) adjust Vision Transformer (ViT) embeddings from \(224 \times 224\) to \(448 \times 448\), adding a training phase to fine-tune the ViT. However, this can degrade visual representation by altering the original position encoding (Radford et al., 2021). Slicing-based methods (Li et al., 2023; Xu et al., 2024) divide high-resolution images into patches matching a pre-trained vision encoder’s input size, maintaining efficiency while achieving competitive performance. However, they damage the original context and spatial continuity across patches, and lose visual-related inductive biases (Wang et al., 2022).
In contrast, our HiLM-D maintains the full high-resolution details while capturing multi-scale information and visual-related inductive biases.
a Overall pipeline of HiLM-D. The visual process consists of two streams, i.e., a temporal reasoning stream and a spatial perception stream, to capture temporal, multi-scale, fine-grained visual information with visual-related inductive biases, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\). b P-Adapter. Through the cross-attention mechanism, the P-Adapter incorporates the enhanced multi-scale information, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\), into the pre-trained vision encoder CLIP-ViT. c Query-Aware Detector. To produce precise bounding boxes of risk objects, the query-aware detector regards the hidden states of the LLM as prior knowledge to infer the bounding box from \(\{ \textbf{I}^i \}_{i=1}^3\)
2.2 Risk Object Identification
Risk object identification methodologies can be grouped into explicit and implicit learning. Explicit methods (Zeng et al., 2017; Gao et al., 2019) use binary classification for agent importance estimation, while other approaches (Alletto et al., 2016; Tawari et al., 2018) mimic human gaze as a risk proxy via pixel-level attention maps. Implicit methods (Malla et al., 2020; Kim & Canny, 2017; Wang et al., 2019; Li et al., 2020) focus on related tasks like trajectory prediction, with intermediate activations indicating perceived risk. However, these methods lack reasoning about model decisions or natural language descriptions, limiting interpretability in autonomous driving and driver-assistance systems. Recently, ADAPT (Jin et al., 2023) leverages a transformer to generate action narration and reasoning for the intention of self-driving vehicles. DRAMA (Malla et al., 2023) proposes a new direction of risk object identification that gives the risk object and an explanation in natural language. In this paper, we go one step further than DRAMA and ADAPT, i.e., risk object localization and intention and suggestion prediction (ROLISP), which aims to identify, explain and localize the risk object for the ego-vehicle while also predicting the ego-vehicle’s intention.
2.3 Multi-tasks in Autonomous Driving
Traditional autonomous driving algorithms process different tasks individually, e.g., detection (Chen et al., 2017), tracking (Petrovskaya & Thrun, 2008), reasoning (Kim et al., 2018), and prediction (Ngiam et al., 2021). To exploit richer inter-task information, researchers have explored integrating multiple tasks in end-to-end training frameworks. For instance, combined training for detection and tracking has been demonstrated in works like D & T (Petrovskaya & Thrun, 2008). FaF (Luo et al., 2018) takes this further by unifying a detector with a trajectory predictor, yielding notable results. IntentNet (Casas et al., 2018) expanded this paradigm by also integrating intention prediction for actors. UniAD (Hu et al., 2023) stands out by amalgamating full-stack driving tasks within a single framework, albeit still relying on distinct sub-networks for each task. A novel direction in this domain is the use of natural language as a unified output across tasks. For instance, ADAPT (Jin et al., 2023) predicts intention and gives explanations using a single caption, while DRAMA (Malla et al., 2023) targets risk object detection and explanation. In this paper, we go one step further than DRAMA and ADAPT, i.e., ROLISP, which aims to identify, explain and localize the risk object for the ego-vehicle while also predicting its intention and giving suggestions.
3 Method
This section details our HiLM-D approach designed to tackle ROLISP using video inputs, as shown in Fig. 2. The vision encoder of HiLM-D consists of two streams, a temporal reasoning stream for temporal cues (Sect. 3.1) and a spatial perception stream for fine-grained multi-scale information as well as visual inductive biases (Sect. 3.2). The features from the two streams are combined cooperatively to obtain an enhanced visual feature, which is then fed to the LLM to perform risk object identification, reasoning and planning. Additionally, we propose the query-aware detector to produce the bounding box of the identified risk object, illustrated in Sect. 3.4.
3.1 Temporal Reasoning Stream
As shown in Fig. 2a, the temporal reasoning stream is built on the well pre-trained vision encoder, i.e., CLIP-ViT (Radford et al., 2021), equipped with new learnable spatial-temporal adapters following (Pan et al., 2022) for video reasoning. Formally, given a video with L frames, the CLIP-ViT maps each frame to its kth layer feature, resulting in \(\textbf{V}^k = \{ \textbf{v}_i^k \}_{i=1}^L\), where \(\textbf{v}_i^k\) is the feature of the ith frame, \(\textbf{v}_i^k \in \mathbb {R}^{N_v \times D_v}\), \(N_v\) is the patch number and \(D_v\) is the dimension. We then reshape it to \(\textbf{V}^{k\prime } \in \mathbb {R}^{L \times H_v \times W_v \times D_v}\), where \(H_v \times W_v = N_v\). Then, to obtain the spatial-temporal information, a standard depth-wise 3D convolution layer (Feichtenhofer, 2020) is used, which can be formulated as:
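(reconstructed here from the definitions below, following the bottleneck form of the ST-Adapter (Pan et al., 2022))

\[ \text{ST-Adapter}(\textbf{V}^{k\prime}) = \textbf{V}^{k\prime} + f\!\left(\operatorname{DWConv3D}\!\left(\textbf{V}^{k\prime}\,\textbf{W}_{\text{down}}\right)\right)\textbf{W}_{\text{up}}, \]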
where \(\operatorname {DWConv3D}\) denotes the depth-wise 3D convolution, and f, \(\textbf{W}_{\text{down}}\) and \(\textbf{W}_\text {up}\) are the activation function, the down-sampling weight and the up-sampling weight, respectively. Then, we reshape \(\text {ST-Adapter}(\textbf{V}^{k\prime })\) back to \(\textbf{V}^{k}\).
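For concreteness, a minimal PyTorch sketch of such an adapter is given below. The bottleneck width, activation and tensor layout are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Minimal ST-Adapter sketch: a linear bottleneck around a depth-wise
    3D convolution that mixes information across frames and patches."""

    def __init__(self, dim: int = 1024, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)               # W_down
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel_size=3,
                              padding=1, groups=bottleneck)  # depth-wise 3D conv
        self.act = nn.GELU()                                 # f
        self.up = nn.Linear(bottleneck, dim)                 # W_up

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (L, H_v, W_v, D_v) reshaped frame tokens V^{k'}
        x = self.down(v)                                     # (L, H, W, C)
        x = x.permute(3, 0, 1, 2).unsqueeze(0)               # (1, C, L, H, W)
        x = self.act(self.conv(x))
        x = x.squeeze(0).permute(1, 2, 3, 0)                 # back to (L, H, W, C)
        return v + self.up(x)                                # residual connection
```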
3.2 Spatial Perception Stream
Achieving effective perception requires high-resolution inputs, visual-related inductive biases, and multi-scale, fine-grained information. However, integrating these elements into existing Multimodal Large Language Models (MLLMs) presents a significant challenge. The primary issue arises from the fact that current MLLMs typically employ CLIP-ViT (Radford et al., 2021) as the vision backbone, which has been pre-aligned with language embeddings for LLMs. CLIP-ViT is designed with a fixed input resolution, such as \(224 \times 224\) or \(336 \times 336\), and lacks visual-specific inductive biases that are crucial for tasks like object detection. Directly training a new high-resolution Vision Transformer (ViT) with visual-related inductive biases, such as Swin Transformers (Liu et al., 2021), for video-based tasks would require vast amounts of data, substantial memory, and significant computational resources, resulting in prohibitive costs in both training time and infrastructure.
In autonomous driving scenarios, the primary variations occur in the motion trajectories rather than the semantic or appearance information of objects. For example, the shapes and colors of vehicles surrounding the ego vehicle remain relatively constant, while only their motion states change, and low-resolution input is sufficient to capture these motion trajectories. Motivated by this, we sample a single high-resolution frame from the original video to extract the detailed perceptual information (in this paper, we use the last frame of the video; see Table 6 for analysis).
To this end, we propose a spatial perception stream, which captures and integrates multi-scale vision-specific information from the high-resolution image into the MLLM. As shown in Fig. 2a, the proposed stream consists of two main parts: a high-resolution spatial (HRS) encoder to obtain the multi-scale features with visual-specific biases from the high-resolution frame, and a P-Adapter to incorporate the extracted features into the temporal reasoning stream.
HRS Encoder. To capture multi-scale vision-specific information for object detection, the HRS encoder is modified from the classic convolutional network (CNN) ResNet (He et al., 2016). Compared with the plain ViT in current MLLMs, a CNN offers several advantages: it reduces memory and computation costs and brings vision-specific priors for detection tasks (e.g., local connectivity and spatial invariance). We verify this in Table 9. For multi-scale features, we use a stack of stride-2 \(3 \times 3\) convolution layers to double the number of channels and reduce the size of the feature maps, resulting in hierarchical features \(\{ \textbf{I}^i \}_{i=1}^3\) with resolutions of 1/8, 1/16, and 1/32. Then, we flatten and concatenate these feature maps into feature tokens \(\textbf{I} \in \mathbb {R}^{N_{sp}\times D_i}\), where \(N_{sp}\) and \(D_i\) are the token number and the dimension respectively.
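A minimal PyTorch sketch of this design is given below, assuming a ResNet-50 trunk truncated at its 1/8-resolution stage and a shared token dimension across scales (a simplification of the channel-doubling schedule described above).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HRSEncoder(nn.Module):
    """Sketch of the HRS encoder: a ResNet trunk plus stride-2 3x3 convolutions
    producing 1/8, 1/16 and 1/32 feature maps, flattened into tokens."""

    def __init__(self, dim: int = 256):
        super().__init__()
        trunk = resnet50(weights=None)
        # keep layers up to the 1/8-resolution stage (conv1 ... layer2)
        self.stem = nn.Sequential(trunk.conv1, trunk.bn1, trunk.relu,
                                  trunk.maxpool, trunk.layer1, trunk.layer2)
        self.proj1 = nn.Conv2d(512, dim, kernel_size=1)                        # 1/8 scale
        self.down2 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)   # 1/16 scale
        self.down3 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)   # 1/32 scale

    def forward(self, hr_frame: torch.Tensor):
        # hr_frame: (B, 3, H, W) single high-resolution frame
        i1 = self.proj1(self.stem(hr_frame))       # (B, dim, H/8,  W/8)
        i2 = self.down2(i1)                        # (B, dim, H/16, W/16)
        i3 = self.down3(i2)                        # (B, dim, H/32, W/32)
        # flatten each map and concatenate into N_sp tokens of dimension dim
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in (i1, i2, i3)], dim=1)
        return (i1, i2, i3), tokens                # multi-scale maps + tokens I
```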
P-Adapter. The P-Adapter aims to incorporate the multi-scale high-resolution spatial features into the features from CLIP-ViT (Radford et al., 2021), thus allowing the LLM to receive the complete visual information to compare and decide which object needs the most attention. As shown in Fig. 2b, for the kth block of the vision encoder, the P-Adapter takes the temporal low-resolution visual embedding \(\textbf{V}^k\) (see Sect. 3.1) as the query, and the extracted multi-scale spatial visual features \(\textbf{I}\) (see Sect. 3.2) as the key and value. Then, the P-Adapter process can be formulated as:
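(a plausible reconstruction, consistent with the gating described below)

\[ \textbf{V}^{k} \leftarrow \textbf{V}^{k} + \alpha \cdot \text{Cross-Attn}\!\left(\text{norm}(\textbf{V}^{k}),\ \text{norm}(\textbf{I})\right), \]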
where \(\text {norm}(\cdot )\) and Cross-Attn are the LayerNorm and cross-attention layer, respectively. \(\alpha \) is a learnable gating factor that adaptively controls the importance of the enhanced features relative to \(\textbf{V}^k\). We initialize \(\alpha \) to zero to ensure that the feature distribution of the pre-trained CLIP-ViT is not modified drastically, for better knowledge preservation and stable optimization (Zhang et al., 2023).
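A minimal PyTorch sketch of such a gated cross-attention adapter follows; the head count and dimensions are illustrative assumptions, and the high-resolution tokens are assumed to be already projected to the ViT dimension.

```python
import torch
import torch.nn as nn

class PAdapter(nn.Module):
    """Sketch of a P-Adapter block: low-resolution temporal tokens V^k query
    the multi-scale high-resolution tokens I, gated by a zero-initialised alpha."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))   # gate starts at zero

    def forward(self, v_k: torch.Tensor, hr_tokens: torch.Tensor) -> torch.Tensor:
        # v_k: (B, N_v, D) CLIP-ViT tokens; hr_tokens: (B, N_sp, D) HRS tokens I
        q = self.norm_q(v_k)
        kv = self.norm_kv(hr_tokens)
        enhanced, _ = self.cross_attn(q, kv, kv)
        # residual injection; alpha = 0 at init keeps the CLIP-ViT features intact
        return v_k + self.alpha * enhanced
```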
Note that although our method adopts a two-stream framework, the motivation and technical details differ from previous video two-stream frameworks (such as SlowFast (Feichtenhofer et al., 2019)). Existing two-stream frameworks like SlowFast focus on temporal dynamics for general video semantic classification, typically using low-resolution frames (224\(\times \)224) with varying temporal resolutions (e.g., SlowFast (Feichtenhofer et al., 2019) uses different frame rates and others (Simonyan & Zisserman, 2014; Carreira & Zisserman, 2017) use an additional optical-flow stream). In contrast, our method targets both temporal dynamics and fine-grained spatial perception, crucial for tasks like autonomous driving. Our framework achieves this by introducing additional lightweight modules to capture holistic perception semantics and efficiently inject them into the MLLM without massive pre-training or architecture modification, e.g., using only 1.36M trainable parameters to achieve an \(8.7\%\) improvement on the detection task.
3.3 Large Language Model
Given the visual tokens and text instruction tokens, the pre-trained LLM (e.g., Vicuna (Chiang et al., 2023)) is leveraged for description generation, covering the risk object with explanations as well as intentions and suggestions for the ego-car. The input to the LLM is the concatenated multimodal tokens \([{\textbf{Z}_v}, \textbf{Z}_t] \in \mathbb {R}^{(N_v+N_t) \times D_t}\), where \(\textbf{Z}_v = \text {project}(\textbf{V}^K)\) and \(\text {project}(\cdot )\) is the projection layer that transfers the visual embedding into tokens that the LLM can understand. \(\textbf{Z}_t \in \mathbb {R}^{N_t \times D_t}\) are the text embeddings, tokenized from text prompts, e.g., ‘Which object is at the highest risk? Then predict the motions and suggestions for the ego-car’. Then, the pre-trained LLM receives the multimodal tokens \([{\textbf{Z}_v}, \textbf{Z}_t]\) to generate language in an autoregressive way as follows:
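(a standard autoregressive factorization, reconstructed to match the notation below)

\[ p\!\left(\textbf{Z}_a \mid \textbf{Z}_v, \textbf{Z}_t\right) = \prod_{i=1}^{N_a} p_{\theta}\!\left(\textbf{z}_i \mid \textbf{Z}_v, \textbf{Z}_{t,<i}, \textbf{Z}_{a,<i}\right), \]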
where \(\theta \) denotes the trainable parameters, \(\textbf{Z}_a \in \mathbb {R}^{N_a \times D_t}\) is the generated answer, and \(\textbf{Z}_{t,<i}\) and \(\textbf{Z}_{a,<i}\) are the prompt and answer tokens before the current prediction token \(\textbf{z}_i\). The ST-Adapters and the linear layer are supervised by the caption loss \(\mathcal {L}_{cap} = CE(\textbf{Z}_a, \hat{\textbf{Z}}_a)\), where \(\hat{\textbf{Z}}_a\) is the ground-truth answer and CE is the cross-entropy loss.
3.4 Query-Aware Detector
To obtain the precise bounding box of the identified risk object, we devise a query-aware detector that regards the hidden states as prior knowledge to find the bounding box in the multi-scale visual features, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\). Normally, the last token of the hidden states fully perceives the whole multimodal context and contains comprehensive instruction-aware semantics (Li et al., 2023). Therefore, we use the hidden state of the last token, \(\textbf{H}\), as the query, retrieving visual cues from \(\{ \textbf{I}^i \}_{i=1}^3\) for bounding boxes. This process can be presented as follows:
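(reconstructed to be consistent with the baseline variant in Sect. 4.2, where \(\textbf{I}^i\) is replaced by \(\textbf{Z}_v\))

\[ \overline{\textbf{H}}_i = \text{Cross-Attn}\!\left(\textbf{H},\ \textbf{I}^{i}\right), \quad i = 1, 2, 3. \]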
Finally, for each scale, \(\overline{\textbf{H}}_i\) is fed to two MLP heads to generate the bounding box \(\textbf{B}_i\) and the activation score \(c_i\).
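A plausible reconstruction of these two heads is:

\[ \textbf{B}_i = \text{MLP}_{\text{box}}\!\left(\overline{\textbf{H}}_i\right), \qquad c_i = \text{MLP}_{\text{act}}\!\left(\overline{\textbf{H}}_i\right). \]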
During training, we define a multi-task loss to supervise the bounding box prediction as \(\mathcal {L}_\text {det} = \mathcal {L}_\text {act} + \mathcal {L}_\text {box}\), where \(\mathcal {L}_\text {act}\) is the cross-entropy loss and \(\mathcal {L}_\text {box}\) is the L1 loss. The overall loss is defined as follows:
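(reconstructed from the caption loss \(\mathcal{L}_{cap}\) in Sect. 3.3 and the weight \(\lambda_\text{det}\))

\[ \mathcal{L} = \mathcal{L}_{cap} + \lambda_{\text{det}}\,\mathcal{L}_{\text{det}}, \]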
where \(\lambda _\text {det}\) is a hyper-parameter balancing the two losses.
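A minimal PyTorch sketch of this detector is given below; the hidden sizes are illustrative assumptions, and the last-token hidden state and the scale tokens are assumed to be projected to a common dimension.

```python
import torch
import torch.nn as nn

class QueryAwareDetector(nn.Module):
    """Sketch of the query-aware detector: the last LLM hidden state queries
    each high-resolution scale; MLP heads regress a box and an activation score."""

    def __init__(self, dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 4))   # (x1, y1, x2, y2)
        self.act_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))   # activation score c

    def forward(self, h_last: torch.Tensor, scales):
        # h_last: (B, 1, D) last-token hidden state H; scales: list of (B, N_i, D)
        boxes, scores = [], []
        for tokens in scales:
            h_bar, _ = self.cross_attn(h_last, tokens, tokens)   # (B, 1, D)
            boxes.append(self.box_head(h_bar).sigmoid())          # normalised box
            scores.append(self.act_head(h_bar))
        # at inference, the box from the scale with the highest score is kept
        return torch.cat(boxes, dim=1), torch.cat(scores, dim=1)
```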
Examples of generated descriptions by GPT-4 (OpenAI, 2023)
4 Experiments
4.1 Experimental Settings
Architecture Details In this paper, we adopt the well-known MLLM, MiniGPT-4 (Zhu et al., 2023), as our backbone. Notably, our proposed P-Adapter is a plug-and-play module that can be seamlessly integrated into existing MLLMs (see Table 5). As described in Sect. 3.2, the HRS encoder is a modified version of the classic CNN-based ResNet (He et al., 2016). Specifically, it consists of an initial 7\(\times \)7 convolutional block followed by four residual stages (64\(\rightarrow \)128\(\rightarrow \)256\(\rightarrow \)512) to extract spatial features. The query-aware detector comprises a cross-attention head (a multi-head attention block) and three MLP layers, each followed by a ReLU activation, to map features to bounding boxes.
Implementation Details Our proposed method is implemented in PyTorch (Paszke et al., 2019) and trained on a single machine with 8 NVIDIA V100 GPUs. The input video frames are resized and cropped to a spatial size of \(224 \times 224\). We uniformly sample \(L=5\) frames from the entire video and ensure that the last frame is sampled for producing bounding boxes. We set \(\lambda _\text {det}\) in Eq. 6 to 2. We use AdamW (Loshchilov & Hutter, 2017) as the optimizer and a cosine annealing scheduler (Loshchilov & Hutter, 2016) as the learning rate scheduler, with an initial learning rate of \(1e-4\) for the MLLM branch and \(4e-4\) for the spatial perception stream, and a global batch size of 64.
Datasets DRAMA (Malla et al., 2023) is a benchmark evaluating visual reasoning in driving scenarios with 17,785 two-second interactive scenarios. The videos in the DRAMA (Malla et al., 2023) dataset are recorded at a 30 Hz frame rate with \(1928 \times 1280\) resolution or at a 60 Hz frame rate with \(2704 \times 1520\) resolution. We split the 17,785 videos into \(70\%\) train, \(15\%\) validation and \(15\%\) test. The original captions of DRAMA only contain the risk object and its explanation. To add the intention and suggestions for the ego-car, we use the answers from the video question answering (VQA) annotations of the dataset to complete the captions for ROLISP (Risk Object Localization and Intention and Suggestion Prediction). We obtain the ground truth of intentions and suggestions for the ego-car from the questions ‘Suggestions (to the ego car)’ and ‘Intention of Ego-vehicle’. However, the intention and suggestion answers are short and simple, e.g., ‘straight’ or ‘yield’. Hence, we resort to GPT (OpenAI, 2023) to generate more diverse descriptions. For example, we use the following prompt:
Figure 3 illustrates the newly generated descriptions. Similarly, we can also obtain the descriptions for ‘yield’. We randomly select one of the 20 generated descriptions and add it to the original risk object’s caption in the order of intention and suggestion, obtaining the full caption for ROLISP. The whole process is illustrated in Fig. 4.
Evaluation Metrics ROLISP comprises two tasks: (1) captioning to identify and explain risk objects while predicting ego-vehicle intentions and motions, and (2) risk object detection. The captioning evaluation follows standard metrics (Malla et al., 2023), i.e., BLEU-4 (B4), METEOR (M), CIDEr (C), and SPICE (S). We also utilize an LLM, i.e., GPT-4o (OpenAI, 2024), to assess the degree of semantic alignment between the predicted results and the ground truth. The degree of semantic alignment is rated on a scale from 0 to 5, with 0 representing the lowest and 5 representing the highest score. The final result is the average score over all predicted samples. The detailed prompts for GPT-4o are shown in Fig. 5. We employ the mean intersection over union (mIoU) for detection assessment. Additionally, we present IoU scores categorized by object size: small (IoU\(_{S}\)), medium (IoU\(_{M}\)), and large (IoU\(_{L}\)).
4.2 Comparison with the State-of-the-Art Methods
We conduct experiments on DRAMA-ROLISP to compare with both image-based and video-based MLLMs, including BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023), LLaVA-1.5 (Liu et al., 2023), InstructBLIP (Dai et al., 2023), Shikra (Chen et al., 2023), Monkey (Li et al., 2023), Qwen-VL (Bai et al., 2023), LLaVA-Next (Liu et al., 2024), eP-ALM (Shukor et al., 2023), Video-LLaMA (Zhang et al., 2023), and Video-LLaMA 2 (Cheng et al., 2024). For all models, we select the 7B-parameter version of the LLM. Note that except for Shikra and Qwen-VL, the other models have no ability to detect objects. Hence, we integrate the query-aware detector into them for bounding boxes, i.e., replacing \(\textbf{I}^i\) with \(\textbf{Z}_v\) in Eq. 4, which can be formulated as \( \overline{\textbf{H}}_i = \text {Cross-Attn}(\textbf{H},\textbf{Z}_v)\).
Table 2 demonstrates the superior performance of our method in both captioning and detection tasks. Video-based approaches generally surpass image-based ones in captioning, benefiting from temporal data for enhanced risk identification and intention/suggestion prediction. However, their detection capabilities tend to decline, possibly due to the redundant noise in videos affecting object detection. Uniquely, our video-based approach enhances captioning performance (e.g., the B4 score increases from 56.3 to 58.3) without compromising detection accuracy. Notably, we observe significant improvements for small objects, underscoring the advantages of high-resolution input. Besides the standard caption metrics, our method also improves over the baselines on LLM-based semantic correctness. Figure 6 offers a visual comparison of captioning and detection results from Shikra (Chen et al., 2023), Video-LLaMA (Zhang et al., 2023), and our method, further highlighting our superior performance in both domains.
Besides the MLLMs, we also compare our method with some specialist SOTAs, i.e., ADAPT (Jin et al., 2023) and Grounding-DINO (Liu et al., 2023). ADAPT is designed for captioning tasks in autonomous driving, and Grounding-DINO is a model for language-grounded detection tasks. Since Grounding-DINO cannot generate descriptions for risk objects, we directly use ‘The most risk object’ as the prompt and select the object with the highest confidence score as the detection result. Our approach demonstrates comparable performance to existing specialized models. Notably, while Grounding-DINO performs well in detecting large objects, its efficacy is significantly reduced for small objects. The primary reason is Grounding-DINO’s inability to accurately identify risk objects; it tends to classify larger objects as risk objects.
Visualization comparison with the state-of-the-art. For clarity, we only show some keywords of the full captions. Inaccurate bounding boxes and error-predicted words in captions are marked in red. Compared with the current MLLMs, our model can a detect the small object, i.e., the traffic light, b generate more precise bounding boxes and c accurately identify the highest-risk object (Color figure online)
Comparison of the baseline and ours across different resolution inputs. The baseline processes varying-resolution videos, while our method, thanks to the spatial perception stream, uses a single high-resolution image. "OOM" indicates out of memory. FLOPs and memory metrics are based on a V100 GPU with a batch size of one, and FLOPs values are scaled to \(10^{10}\) for clarity. a FLOPs vs. Resolutions. b Memory cost vs. Resolutions. c Captioning (B4) vs. Resolutions. d Detection (mIoU) vs. Resolutions
4.3 Ablation Study
In this section, we conduct the ablation study to evaluate and analyze the spatial perception stream (illustrated in Sect. 3.2), the key part of the proposed HiLM-D. Unless otherwise specified, the base MLLM model used here is MiniGPT-4 (Zhu et al., 2023) excluding the ST-Adapter to reduce training overhead.
The Effect of Each Module in the Spatial Perception Stream Table 3 reports the effect of our proposed modules in the spatial perception stream (SPS). From the table, we have the following observations:
-
(i)
Comparing lines (a) and (f), we find that the improvements mainly come from our proposed SPS.
-
(ii)
Comparing lines (a)-(c), the high-resolution spatial encoder (HRES) and the query-aware detector (QAD) are essential for detection performance, e.g., removing HRES or QAD drops the mIoU score by \(16.8\%\) and \(13.0\%\), respectively.
-
(iii)
Comparing lines (a) and (d), the P-Adapter benefits both captioning and detection performance. Specifically, the model without the P-Adapter degrades by \(3.2\%\) and \(6.7\%\) in terms of B4 and mIoU, since the captured high-resolution features cannot be integrated into the MLLM.
Performance Comparison of the Baseline and Ours Across Different Resolution Inputs An intuitive way to enhance the perception ability is to increase the resolution of the inputs. To further study the effectiveness of our proposed spatial perception stream, we compare the baseline and our method across different resolution inputs in Fig. 7. Specifically, inputs of varying resolutions are fed into the baseline and our method respectively. Note that the baseline model in this ablation includes the ST-Adapter and hence can receive high-resolution videos, while ours is fed only a single current high-resolution frame. From Fig. 7, we can see that as the input video resolution increases, the FLOPs of the baseline model grow proportionally. For example, as the input resolution escalates from \(224^2\) to \(1000^2\), the FLOPs of the baseline increase from 156.9 to 2040.2, a 13-fold growth. In contrast, benefiting from the spatial perception stream, our approach requires only a single high-resolution image, so its FLOPs increase by only a factor of 1.05. Furthermore, the baseline model encounters OOM issues when the resolution exceeds \(400^2\), whereas our approach remains stable up to a resolution of \(1000^2\). In summary, our method outperforms the baseline model while using much less computation and memory.
The Effect of LLM To prove the effect of the LLM for processing ROLISP, we compare two LLMs, i.e., OPT-1.3B (Zhang et al., 2022) and Vicuna-7B (Chiang et al., 2023), with two text decoders specifically designed for autonomous driving, i.e., ADAPT (Jin et al., 2023) and DRAMA (Malla et al., 2023). Table 4 shows that the LLMs outperform the general text decoders by a clear margin. For example, using Vicuna-7B (Chiang et al., 2023), the model achieves \(56.3\%\) in terms of B4, outperforming ADAPT (Jin et al., 2023) and DRAMA (Malla et al., 2023) by \(3.7\%\) and \(5.0\%\) respectively. We also observe that a more powerful LLM benefits ROLISP, i.e., Vicuna-7B (Chiang et al., 2023) outperforms OPT-1.3B (Zhang et al., 2022) by \(1.5\%\) and \(2.2\%\) in terms of B4 and mIoU.
The Versatility of Spatial Perception Stream Table 5 demonstrates that our proposed spatial perception stream can be applied to existing MLLMs to improve their performance on ROLISP. Specifically, with our spatial perception stream, BLIP-2 (Li et al., 2023) sees improvements of \(4.8\%\) in B4 and \(13.5\%\) in mIoU scores, while Video-LLaMA (Zhang et al., 2023) benefits with gains of \(2.0\%\) and \(16.5\%\) in captioning and detection results, respectively.
Effect of the Sampled Frame Table 6 analyzes the effect of frames selected at different positions for the spatial perception stream. ‘First’, ‘middle’ and ‘last’ mean sampling the first, the middle and the last frame of the video as the high-resolution frame fed into the spatial perception stream. We observe that sampling the last frame clearly outperforms the other sampling strategies, e.g., by \(4.4\%\) and \(2.0\%\) over the first and the middle frames respectively. The reason is that the assessment of risk objects largely depends on the positions of objects in the last frame.
Furthermore, to address the concern of potentially missing dynamic objects due to sampling a single frame, we conduct additional experiments on our proposed dataset to explore this issue further. Specifically, we consider alternative frame sampling strategies to evaluate whether more frames could help capture sudden changes in object status. In these additional experiments, we compare the following frame sampling strategies:
-
One Frame (The last frame, as used in the original setting),
-
Two Frames (Uniformly sampling two frames from the last 1/3 of the video),
-
Three Frames (Uniformly sampling three frames from the last 1/3 of the video).
Formally, denote the sampled frames as \( \{ \textbf{F}_{t_i} \}_{i=1}^N \), where \( N \in \{2, 3\} \) is the number of frames sampled from the last 1/3 of the video. These frames are then processed by the high-resolution stream encoder (HRSE) to extract individual feature vectors \( \{ \textbf{V}_{t_i} \}_{i=1}^N \). The features are then combined using average pooling to produce a unified feature vector \( \overline{\textbf{V}} \), which has the same dimensionality as the feature vector from a single frame. This pooled feature vector is then used in subsequent stages, which are consistent with the settings used in the original experiments. The experimental setup and training/inference configurations were kept identical for all sampling strategies to ensure a fair comparison.
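A minimal sketch of this pooling step (assuming per-frame features of identical shape) is:

```python
import torch

def pool_hr_features(frame_feats: list) -> torch.Tensor:
    """Average-pool per-frame HRS features {V_t_i} into a single feature
    with the same shape as a one-frame feature (multi-frame ablation sketch)."""
    return torch.stack(frame_feats, dim=0).mean(dim=0)
```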
From the results in Table 7, we observe that the performance difference between using one or multiple frames is negligible. This indicates that, in most cases of ROLISP, the last frame already contains sufficient information for risk object detection and captioning tasks, as the temporal reasoning stream has already captured the full temporal semantics. Hence, additional frames do not provide significant additional information for these tasks.
Effect of Joint Training on Detection and Captioning. Table 8 demonstrates that training detection and captioning tasks independently leads to a clear decline in performance for both tasks. In contrast, joint training significantly enhances the results across both metrics, indicating a strong mutual benefit. One possible explanation is that improved object detection enhances risk identification, which in turn contributes to more accurate intent recognition and suggestion generation in the captioning task. Conversely, better captioning performance implies a more effective identification of risk-related objects, thereby boosting detection accuracy. The substantial improvements in both B4 and mIoU scores (from 54.3 to 56.3 and from 52.5 to 60.5, respectively) further validate the advantage of joint training over independent training.
4.4 Effect of Different Designs
Here, we conduct the experiment to analyze the effect of different designs, i.e., the high-resolution spatial encoder, the P-Adapter and the query-aware detector.
High-Resolution Spatial Encoder (HRES)
(1) Ablation of different backbones: In Table 9, we conduct experiments to analyze the effect of different backbones for the HRES. To save resources, we use a resolution of \(400^2\) to train all models in the same setting. From the table, we can find that (1) the plain ViT without vision-specific priors achieves limited performance on detection; (2) using more advanced backbones benefits the detection performance while bringing more memory and computation costs; (3) pre-trained weights improve the performance. Since the performance gain obtained by the advanced backbones is limited, we use ResNet50 as the backbone for efficiency.
(2) Effect of Multi-scale Features: We conduct an ablation study in Table 10 to evaluate the impact of the multi-scale features, i.e., \(\{ \textbf{I}^i \}_{i=1}^3\) in Sect. 3.2. The results demonstrate that incorporating multi-scale features enhances the detection performance of our model, particularly for medium and small objects. For example, utilizing all scales of features, the model achieves \(33.5\%\) in IoU\(_S\), surpassing the single-scale model by \(6.8\%\). Furthermore, the additional scale information enables the model to perceive objects of various sizes more effectively, leading to more accurate risk object identification.
P-Adapter.
(1) The Number of P-Adapters: In Table 11, we study the effect of the number of P-Adapters. We uniformly incorporate P-Adapters into different layers of the CLIP-ViT. We find that the model accuracy saturates as the number grows. For example, as the number increases from 1 to 3, the mIoU improves by \(7.9\%\). However, adding more P-Adapters does not monotonically improve performance when the number exceeds 3. Therefore, we empirically set the number to 3 by default.
(2) Effect of \(\alpha \): In Table 12, we conduct experiments to verify the effect of \(\alpha \) in Eq. 2. Specifically, using \(\alpha \) achieves \(1.6\%\) and \(4.0\%\) improvements in terms of B4 and mIoU, compared with the model without \(\alpha \).
Types of the Query-Aware Detector (QAD) Table 13 reports experiments analyzing the effect of different designs of the QAD. Besides ours shown in Fig. 2b, we select two other designs for comparison, i.e., directly using the LLM to regress the bounding boxes as numerical values, as in Shikra (Chen et al., 2023), and a modified DETR decoder (Carion et al., 2020), detailed as follows:
LLM Specifically, ‘LLM’ indicates directly using the LLM to regress the bounding boxes, as in Shikra (Chen et al., 2023). To this end, we integrate the positions of bounding boxes into the captions in the following form:
where [x1, y1, x2, y2] indicates the top-left and bottom-right of the bounding box. For example, given the full caption:
And the corresponding bounding box for the pedestrian is [1264, 756, 1324, 939]. We first normalize the coordinates as \(\left[ 1264/W, 756/H, 1324/W, 939/H \right] \), where H and W are the height and width of the image (in this case \(H=1520\), \(W=2704\)). Hence, the normalized bounding box is \(\left[ 0.465, 0.495, 0.490, 0.614 \right] \). Then, the final caption can be:
DETR The implementation of the QAD in the DETR (Carion et al., 2020) manner, i.e., using the learned query embedding, is shown in Fig. 8.
The results show that directly using the LLM achieves limited detection performance, indicating that our method is more effective on this small-scale detection dataset. Moreover, ours performs better than DETR, e.g., \(59.6\%\) vs. \(56.3\%\) on the detection task, thanks to the prior knowledge carried by the LLM hidden states.
Frozen vs. LoRA-Finetuning Table 14 compares freezing the LLM with efficient fine-tuning. We use LoRA (Hu et al., 2021) to fine-tune the LLM efficiently. The results show that freezing the LLM achieves better performance with higher efficiency.
Different Position Representations for LLM To detect objects with an autoregressive model, several methods (Wang et al., 2023) introduce extra vocabularies (e.g., \(<bin0>\), \(\cdot \cdot \cdot \), \(<bin1000>\)) to represent coordinates for object detection in spatially discretized images. In contrast, Shikra (Chen et al., 2023) represents coordinates naturally and intuitively, using numbers directly. Here, we also conduct experiments to compare these two position representations for the LLM, when directly using the LLM to produce bounding boxes. As shown in Table 15, using numerical values directly achieves better performance compared with adding extra vocabularies. Moreover, using extra vocabularies degrades the captioning performance. This is because extra vocabularies require fine-tuning the LLM, thus impairing the capabilities of the original LLM.
4.5 Open-Loop Evaluation on Trajectory Prediction
In this section, we also conduct an open-loop evaluation on trajectory prediction to better demonstrate the effectiveness of our approach. Since ROLISP is primarily designed for language-based trajectory prediction rather than direct trajectory forecasting, we adapt the NuScenes dataset to incorporate the evaluation of these trajectory metrics. Specifically, we construct a language-based trajectory prediction task from NuScenes (Caesar et al., 2020) to better evaluate the performance of our method.
To clarify the task setup, the QA pairs used for trajectory prediction are as follows:
Here, \((x, y)\) represents the coordinates in bird’s-eye view (BEV) for future frames. We sample 5 consecutive keyframes from NuScenes, where the first two frames are used as input to predict the motion trajectories for the subsequent three frames. We train the model on a total of 1000 videos from the train split of NuScenes and evaluate it on 200 videos from the validation split. For each trajectory prediction, we calculate the ADE (Average Displacement Error) and FDE (Final Displacement Error) to assess the performance of our method. We follow the data processing and evaluation setup of UniAD (Hu et al., 2023) for this experiment.
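For reference, a minimal sketch of how ADE and FDE can be computed from predicted BEV waypoints is shown below (the tensor layout is an assumption).

```python
import torch

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (B, T, 2) future (x, y) BEV waypoints over T predicted frames."""
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (B, T) per-step L2 error
    ade = dist.mean().item()                      # average displacement error
    fde = dist[:, -1].mean().item()               # final displacement error
    return ade, fde
```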
The results in Table 17 show that our method improves the baseline models (MiniGPT-4 and Video-LLaMA) with better performance in both ADE and FDE metrics. This improvement demonstrates the fine-grained perception provided by our model.
We have not included closed-loop evaluation metrics because our approach does not directly involve trajectory control or real-time feedback mechanisms. The focus of our work is on improving the fine-grained perception capabilities of MLLMs. We plan to explore such settings in future experiments to further enhance the robustness and adaptability of our model.
4.6 Apply to Generalized Domain
To explore the effect of our spatial perception stream on a generalized domain, we also conduct experiments on Shikra-RD (Chen et al., 2023). Shikra-RD is a dataset for referential dialogue, constructed from Flickr30K Entities (Plummer et al., 2015) by GPT-4 (OpenAI, 2023). Specifically, besides responding to queries, the task involves providing bounding boxes for objects mentioned in the answers, which is very similar to our ROLISP. We select two MLLMs, i.e., MiniGPT-4 (Zhu et al., 2023) and InstructBLIP (Dai et al., 2023), as the baselines and augment them with our spatial perception stream. The input resolution for the spatial perception stream is \(448 \times 448\) here. Their performance on Shikra-RD is reported in Table 16. The results prove that our spatial perception stream benefits the baselines, especially in detection performance, e.g., outperforming InstructBLIP by \(3.8\%\).
5 Conclusion and Limitations
In this paper, we focus on Risk Object Localization and Intention and Suggestion Prediction (ROLISP), which involves identifying the most important traffic objects and their bounding boxes, providing explanations, and predicting the next intention for the ego car. Considering the limited visual perception ability of the pre-trained CLIP-ViT in existing multimodal large language models (MLLMs), we introduce HiLM-D, a two-stream framework designed to enhance visual information processing for ROLISP. Our framework captures comprehensive visual information, including temporal, multi-scale, and high-resolution details. The temporal reasoning stream uses a static visual encoder with trainable, lightweight ST-Adapters, enabling CLIP-ViT to effectively process dynamic video content. The spatial perception stream features a high-resolution spatial encoder and a P-Adapter, which captures multi-scale information and integrates it into the temporal stream, ensuring complete information for ROLISP. Notably, our spatial perception stream is lightweight, training-efficient, and easily integrates into existing MLLMs. Experiments on the ROLISP benchmark show HiLM-D’s significant advantages over leading MLLMs, with improvements of \(3.7\%\) in BLEU-4 for captioning and \(8.7\%\) in mIoU for detection.
Limitations In our dataset, each video contains only one risk object, which might not capture the complexity of real-world scenarios. The dataset also lacks extreme weather conditions like rain or fog and multi-view information, which are crucial for autonomous driving. Furthermore, our dataset contains few highly dynamic scenes with fast-moving or suddenly appearing objects, especially in the last few frames. We will curate a more diverse and challenging dataset to advance the field in the future.
Data Availability
Our dataset is constructed based on DRAMA: https://usa.honda-ri.com/drama.
References
Alletto, S., Palazzi, A., Solera, F., Calderara, S., & Cucchiara, R. (2016). Dr (eye) ve: A dataset for attention-based tasks with applications to autonomous and assisted driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 54–60).
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6836–6846).
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966
Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., & Taşırlar, S. (2023). Introducing our multimodal models. https://www.adept.ai/blog/fuyu-8b
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621–11631).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6299–6308).
Casas, S., Luo, W., & Urtasun, R. (2018). Intentnet: Learning to predict intention from raw sensor data. In Conference on robot learning (pp. 947–956). PMLR.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2022). Vision transformer adapter for dense predictions.
Chen, X., Ma, H., Wan, J., Li, B., & Xia, T. (2017). Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1907–1915).
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., & Zhao, R. (2023). Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195
Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., & Bing, L. (2024). Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476
Chiang, W. -L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org. Accessed 14 April 2023.
Dai, W., Li, J., Li, D., Huat, A., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). Instructblip: Towards general-purpose vision-language models with instruction tuning.
Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., & Moens, M. F. (2019). Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 2088–2098).
Dewangan, V., Choudhary, T., Chandhok, S., Priyadarshan, S., Jain, A., Singh, A.K., Srivastava, S., Jatavallabhula, K. M., & Krishna, K. M. (2023). Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. arXiv preprint arXiv:2310.02251
Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 203–213).
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202–6211).
Gao, M., Tawari, A., & Martin, S. (2019). Goal-oriented object importance estimation in on-road driving videos. In 2019 international conference on robotics and automation (ICRA) (pp. 5509–5515). IEEE.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., & Wang, W. (2023). Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17853–17862).
Jin, B., Liu, X., Zheng, Y., Li, P., Zhao, H., Zhang, T., Zheng, Y., Zhou, G., & Liu, J. (2023). Adapt: Action-aware driving caption transformer. arXiv preprint arXiv:2302.00673
Kim, J., & Canny, J. (2017). Interpretable learning for self-driving cars by visualizing causal attention. In Proceedings of the IEEE international conference on computer vision (pp. 2942–2950).
Kim, J., Misu, T., Chen, Y. -T., Tawari, A., & Canny, J. (2019). Grounding human-to-vehicle advice for self-driving vehicles. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/cvpr.2019.01084
Kim, J., Rohrbach, A., Darrell, T., Canny, J., & Akata, Z. (2018). Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV) (pp. 563–578).
Li, C., Chan, S. H., & Chen, Y. -T. (2020). Who make drivers stop? towards driver-centric risk assessment: Risk object identification via causal inference. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 10711–10718). IEEE.
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., & Qiao, Y. (2023). Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597
Li, J., Pan, K., Ge, Z., Gao, M., Ji, W., Zhang, W., Chua, T. -S., Tang, S., Zhang, H., & Zhuang, Y. (2023). Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The 12th international conference on learning representations.
Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., & Bai, X. (2023). Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607
Li, B., Zhang, P., Yang, J., Zhang, Y., Pu, F., & Liu, Z. (2023). Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
Luo, W., Yang, B., & Urtasun, R. (2018). Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3569–3577).
Malla, S., Choi, C., Dwivedi, I., Choi, J. H., & Li, J. (2023). Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1043–1052).
Malla, S., Dariush, B., & Choi, C. (2020). Titan: Future forecast using action priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11186–11196).
Ngiam, J., Caine, B., Vasudevan, V., Zhang, Z., Chiang, H. -T. L., Ling, J., Roelofs, R., Bewley, A., Liu, C., Venugopal, A., et al. (2021). Scene transformer: A unified multi-task model for behavior prediction and planning. arXiv preprint arXiv:2106.08417
OpenAI. (2023). Gpt-4 technical report.
OpenAI. (2024). Hello gpt-4o. https://openai.com/index/hello-gpt-4o/
Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P. -Y., Xu, H., Sharma, V., Li, S. -W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2023). DINOv2: Learning robust visual features without supervision.
Pan, J., Lin, Z., Zhu, X., Shao, J., & Li, H. (2022). St-adapter: Parameter-efficient image-to-video transfer learning for action recognition. arXiv preprint arXiv:2206.13559
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems (vol. 32).
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., & Wei, F. (2023). Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824
Petrovskaya, A., & Thrun, S. (2008). Model based vehicle tracking for autonomous driving in urban environments. In Proceedings of robotics: Science and systems IV, Zurich, Switzerland (vol. 34).
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision (pp. 2641–2649).
Qian, T., Chen, J., Zhuo, L., Jiao, Y., & Jiang, Y. -G. (2023). Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Shukor, M., Dancette, C., & Cord, M. (2023). ep-alm: Efficient perceptual augmentation of language models. arXiv preprint arXiv:2303.11403
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (vol. 27).
Singh, S., & Saini, B. S. (2021). Autonomous cars: Recent developments, challenges, and possible solutions. IOP Conference Series: Materials Science and Engineering, 1022, Article 012028. https://doi.org/10.1088/1757-899x/1022/1/012028
Tawari, A., Mallela, P., & Martin, S. (2018). Learning to attend to salient targets in driving videos using fully convolutional RNN. In 2018 21st international conference on intelligent transportation systems (ITSC) (pp. 3225–3232). IEEE.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. -A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. (2023). Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175
Wang, D., Devin, C., Cai, Q. -Z., Yu, F., & Darrell, T. (2019). Deep object-centric policies for autonomous driving. In 2019 international conference on robotics and automation (ICRA) (pp. 8853–8859). IEEE.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.
Xu, R., Yao, Y., Guo, Z., Cui, J., Ni, Z., Ge, C., Chua, T. -S., Liu, Z., Sun, M., & Huang, G. (2024). Llava-uhd: An lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703
Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K. K., Li, Z., & Zhao, H. (2023). Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412
Zang, Y., Li, W., Han, J., Zhou, K., & Loy, C. C. (2023). Contextual object detection with multimodal large language models. arXiv preprint arXiv:2305.18279
Zeng, K. -H., Chou, S. -H., Chan, F. -H., Carlos Niebles, J., & Sun, M. (2017). Agent-centric risk assessment: Accident anticipation and risky region localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2222–2230).
Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., & Qiao, Y. (2023). Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199
Zhang, H., Li, X., & Bing, L. (2023). Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., & Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592
Acknowledgements
This work was partially supported by a research grant from the National Natural Science Foundation of China (NSFC) and the Research Grants Council (RGC) of Hong Kong under the Joint Research Scheme (JRS), Project No. N_HKUST654/24.
Funding
Open access funding provided by Hong Kong University of Science and Technology
Additional information
Communicated by Liwei Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ding, X., Han, J., Xu, H. et al. HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving. Int J Comput Vis 133, 5379–5395 (2025). https://doi.org/10.1007/s11263-025-02433-3