-
ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video
Authors:
Rajan Das Gupta,
Md Yeasin Rahat,
Nafiz Fahad,
Abir Ahmed,
Liew Tze Hui
Abstract:
This study investigates how large language models (LLMs) can be used to understand human behavior from motion and video data. In contrast to recent models that concentrate on either motion data or videos alone, we argue that combining both modalities is essential to fully capture the nuanced movements and meanings of human actions. To this end, we present ViMoNet, a straightforward yet effective framework for understanding, describing, and reasoning about human behavior. ViMoNet employs a joint training strategy that exploits the complementary strengths of two data types: detailed motion-text data, which is more precise, and generic video-text data, which is broader but less detailed. This helps the model acquire rich spatial and temporal information about human behavior. We additionally contribute a new dataset named VIMOS that contains a variety of videos, motion sequences, instructions, and subtitles, and we develop ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our experiments show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.
Submitted 16 November, 2025; v1 submitted 13 August, 2025;
originally announced August 2025.
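The joint training strategy described in the ViMoNet abstract above mixes precise motion-text pairs with broader video-text pairs. Below is a minimal PyTorch-style sketch of such a mixed-batch training loop; the dataloader fields, the equal loss weighting, and the model interface are illustrative assumptions, not the authors' implementation.

import itertools

def joint_training_step(model, motion_batch, video_batch, optimizer):
    """One joint step over a motion-text batch and a video-text batch.

    The model is assumed to return a scalar language-modeling loss given
    (visual_features, text_tokens); this interface is a hypothetical stand-in
    for ViMoNet's actual components.
    """
    optimizer.zero_grad()
    loss_motion = model(motion_batch["features"], motion_batch["text"])
    loss_video = model(video_batch["features"], video_batch["text"])
    # Weight the precise (motion) and broad (video) supervision equally;
    # the real mixing ratio is a training hyperparameter not given in the abstract.
    loss = 0.5 * loss_motion + 0.5 * loss_video
    loss.backward()
    optimizer.step()
    return loss.item()

def train(model, motion_loader, video_loader, optimizer, steps):
    # Cycle both loaders so every step sees both data types.
    motion_iter = itertools.cycle(motion_loader)
    video_iter = itertools.cycle(video_loader)
    for _ in range(steps):
        joint_training_step(model, next(motion_iter), next(video_iter), optimizer)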
-
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
Authors:
Shaofei Huang,
Rui Ling,
Tianrui Hui,
Hongyu Li,
Xu Zhou,
Shifeng Zhang,
Si Liu,
Richang Hong,
Meng Wang
Abstract:
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.
Submitted 30 June, 2025;
originally announced June 2025.
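To make the vision-centric querying idea above concrete, here is a rough PyTorch sketch in which vision-derived queries iteratively fetch audio and then visual information through cross-attention. It is an illustration written for this listing, not the released VCT code; the dimensions, number of heads, and three fetching rounds are assumptions.

import torch
import torch.nn as nn

class VisionCentricDecoderLayer(nn.Module):
    """One iteration of vision-derived queries fetching audio, then visual, cues."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, audio_feats, visual_feats):
        # queries: (B, Nq, C) derived from visual features (e.g., via a PPQG-like module);
        # audio_feats: (B, Na, C); visual_feats: (B, HW, C).
        q = self.norm1(queries + self.audio_attn(queries, audio_feats, audio_feats)[0])
        q = self.norm2(q + self.visual_attn(q, visual_feats, visual_feats)[0])
        return self.norm3(q + self.ffn(q))

# Example: three iterative fetching rounds over toy tensors.
layer = VisionCentricDecoderLayer()
queries = torch.randn(2, 16, 256)      # vision-derived object queries
audio = torch.randn(2, 10, 256)        # audio tokens from a mixed signal
visual = torch.randn(2, 64 * 64, 256)  # flattened pixel features
for _ in range(3):
    queries = layer(queries, audio, visual)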
-
DecIF: Improving Instruction-Following through Meta-Decomposition
Authors:
Tingfeng Hui,
Pengyu Zhu,
Bowen Ping,
Ling Tang,
Guanting Dong,
Yaqi Zhang,
Sen Su
Abstract:
Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF's superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.
Submitted 10 June, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
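The meta-decomposition pipeline sketched in the DecIF abstract (meta-information generation, constraint combination, consistency checking, and atomic criterion verification) can be summarized in pseudocode form as follows. The call_llm helper and the prompt wording are hypothetical placeholders for whatever LLM interface is used; this is a sketch of the described workflow, not the authors' released pipeline.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def generate_instruction() -> str:
    # 1) Ask the LLM for meta-information (scenario, task type, audience, ...).
    meta = call_llm("List meta-information (scenario, task type, audience) for a new task.")
    # 2) Combine meta-information with explicit response constraints.
    instruction = call_llm(f"Write one well-structured instruction using: {meta}\n"
                           "Include concrete response constraints (format, length, style).")
    # 3) Let the LLM detect and repair internal inconsistencies.
    return call_llm(f"Check this instruction for contradictory constraints and rewrite it if needed:\n{instruction}")

def generate_verified_pair(instruction: str):
    # 4) Decompose the instruction into atomic evaluation criteria.
    criteria = call_llm(f"List atomic yes/no criteria a response to this instruction must satisfy:\n{instruction}")
    response = call_llm(instruction)
    # 5) Keep the pair only if every criterion is judged satisfied.
    verdict = call_llm(f"Criteria:\n{criteria}\nResponse:\n{response}\nAnswer PASS or FAIL.")
    return (instruction, response) if verdict.strip().upper().startswith("PASS") else None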
-
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
Authors:
Hongyu Li,
Jinyu Chen,
Ziyu Wei,
Shaofei Huang,
Tianrui Hui,
Jialin Gao,
Xiaoming Wei,
Si Liu
Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST.
Submitted 1 June, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
Smaller Language Models Are Better Instruction Evolvers
Authors:
Tingfeng Hui,
Lulu Zhao,
Guanting Dong,
Yaqi Zhang,
Hua Zhou,
Sen Su
Abstract:
Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: https://github.com/HypherX/Evolution-Analysis
Submitted 15 December, 2024;
originally announced December 2024.
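For context on the proposed metric: the underlying IFD score compares how well a model predicts a response with and without the instruction as context. The sketch below computes that ratio with Hugging Face transformers; the choice of GPT-2 as the scorer is arbitrary, and the instruction-complexity weighting that turns IFD into IC-IFD is not reproduced here since its exact form is defined in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative small scorer model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_nll(context: str, response: str) -> float:
    """Average negative log-likelihood of the response tokens, given optional context."""
    resp_ids = tok(response, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, resp_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100  # score only the response tokens
    else:
        input_ids = resp_ids
        labels = resp_ids.clone()
    return model(input_ids, labels=labels).loss.item()

def ifd(instruction: str, response: str) -> float:
    # IFD = loss(response | instruction) / loss(response): how little the instruction
    # helps the model predict the response. IC-IFD further weights this by an
    # instruction-complexity term (see the paper for its exact form).
    return avg_nll(instruction, response) / avg_nll("", response)

print(ifd("Translate to French: good morning", "bonjour"))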
-
Advancing Manipulation Capabilities of a UAV Featuring Dynamic Center-of-Mass Displacement
Authors:
Tong Hui,
Matteo Fumagalli
Abstract:
As aerial robots gain traction in industrial applications, there is growing interest in enhancing their physical interaction capabilities. Pushing tasks performed by aerial manipulators have been successfully demonstrated in contact-based inspections. However, more complex industrial applications require these systems to support higher-DoF (Degree of Freedom) manipulators and generate larger forces while pushing (e.g., drilling, grinding). This paper builds on our previous work, where we introduced an aerial vehicle that can dynamically vary its CoM (Center of Mass) location to improve force exertion during interactions. We propose a novel approach to further enhance this system's force generation by optimizing its CoM location during interactions. Additionally, we study the case of this aerial vehicle equipped with a 2-DoF manipulation arm to extend the system's functionality in tool-based tasks. The effectiveness of the proposed methods is validated through simulations, demonstrating the potential of this system for advanced aerial manipulation in practical settings.
Submitted 11 April, 2025; v1 submitted 30 October, 2024;
originally announced October 2024.
-
Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging
Authors:
Tingfeng Hui,
Zhenyu Zhang,
Shuohuan Wang,
Yu Sun,
Hua Wu,
Sen Su
Abstract:
Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods transforming LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to flexibly reach a desired number of experts, where a genetic algorithm and parameter merging are introduced to ensure sufficient diversity among the newly extended experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data on which each expert excels to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.
Submitted 2 October, 2024;
originally announced October 2024.
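The expert-expansion step above grows a pool of experts from intermediate dense checkpoints via parameter merging. A toy sketch of that idea follows; the linear interpolation and random parent selection are simplifications assumed for illustration, whereas UpIT uses a genetic algorithm to maintain expert diversity.

import copy
import random

def merge_state_dicts(sd_a, sd_b, alpha: float = 0.5):
    """Linearly interpolate two checkpoints' parameters (simple parameter merging)."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

def expand_experts(checkpoint_state_dicts, num_experts: int, seed: int = 0):
    """Grow a pool of expert initializations from dense instruction-tuning checkpoints.

    checkpoint_state_dicts holds the (e.g., FFN) weights of intermediate checkpoints
    saved while instruction-tuning the dense model.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(sd) for sd in checkpoint_state_dicts]
    while len(experts) < num_experts:
        # Pick two parents and merge them into a new expert; UpIT additionally
        # applies a genetic algorithm to keep the expanded experts diverse.
        a, b = rng.sample(experts, 2)
        experts.append(merge_state_dicts(a, b, alpha=rng.uniform(0.3, 0.7)))
    return experts[:num_experts]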
-
Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
Authors:
Hongyu Li,
Tianrui Hui,
Zihan Ding,
Jing Zhang,
Bin Ma,
Xiaoming Wei,
Jizhong Han,
Si Liu
Abstract:
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more fully. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
Submitted 12 September, 2024;
originally announced September 2024.
-
Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
Authors:
Shaofei Huang,
Rui Ling,
Hongyu Li,
Tianrui Hui,
Zongheng Tang,
Xiaoming Wei,
Jizhong Han,
Si Liu
Abstract:
In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
Submitted 23 December, 2024; v1 submitted 28 August, 2024;
originally announced August 2024.
-
AEROBULL: A Center-of-Mass Displacing Aerial Vehicle Enabling Efficient High-Force Interaction
Authors:
Tong Hui,
Esteban Zamora,
Simone D'Angelo,
Stefan Rucareanu,
Matteo Fumagalli
Abstract:
In various industrial sectors, inspection and maintenance tasks using UAVs (Unmanned Aerial Vehicles) require substantial force application to ensure effective adherence and stable contact, posing significant challenges to existing solutions. This paper addresses these industrial needs by introducing a novel lightweight aerial platform (3.12 kg) designed to exert high pushing forces on non-horizontal surfaces. To increase maneuverability, the proposed platform incorporates tiltable rotors with 5-DoF (Degree of Freedom) actuation. Moreover, it has an innovative shifting-mass mechanism that dynamically adjusts the system's CoM (Center of Mass) during contact-based task execution. A compliant EE (End-Effector) is used to ensure smooth interaction with the work surface. We provide a detailed study of the UAV's overall system design, the hardware integration of the developed physical prototype, and the software architecture of the proposed control algorithm. Physical experiments were conducted to validate the control design and explore the force generation capability of the designed platform via a pushing task. With a total mass of 3.12 kg, the UAV exerted a maximum pushing force of over 28 N, nearly equal to its own weight. Furthermore, the experiments illustrated the benefits of a displaced CoM by benchmarking against a fixed-CoM configuration.
Submitted 27 August, 2024;
originally announced August 2024.
-
FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network
Authors:
Weiying Xie,
Yusi Zhang,
Tianlin Hui,
Jiaqing Zhang,
Jie Lei,
Yunsong Li
Abstract:
Multimodal object detection offers a promising prospect to facilitate robust detection in various visual conditions. However, existing two-stream backbone networks are challenged by complex fusion and substantial parameter increments. This is primarily due to large data distribution biases of multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA), with a shared backbone. The shared parameters enhance the consistency of homogeneous information, while lightweight modal adaptors focus on modality-unique features. Furthermore, we design an adaptive rank allocation strategy to adapt to the varying heterogeneity at different feature levels. When applied to two multimodal object detection datasets, experiments validate the effectiveness of our method. Notably, on DroneVehicle, LMA attains a 10.4% accuracy improvement over the state-of-the-art method with a 149M-parameter reduction. The code is available at https://github.com/zyszxhy/FoRA.
Our work was submitted to ACM MM in April 2024 but was rejected. We will continue to refine the work and its presentation, mainly adding theoretical proofs and multi-task applications of FoRA.
Submitted 22 July, 2024;
originally announced July 2024.
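The shared-backbone-plus-low-rank-adaptor design can be illustrated with a small PyTorch module: a frozen shared projection models the homogeneous information, while per-modality low-rank branches capture modality-unique features. The layer choice, rank, and modality names ("rgb", "ir") are assumptions for illustration, not the released FoRA/LMA code.

import torch
import torch.nn as nn

class LowRankModalAdaptor(nn.Module):
    """A shared (frozen) projection plus per-modality low-rank adaptors."""

    def __init__(self, dim: int, rank: int = 8, modalities=("rgb", "ir")):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.shared.weight.requires_grad_(False)   # shared weights model homogeneous information
        self.shared.bias.requires_grad_(False)
        self.down = nn.ModuleDict({m: nn.Linear(dim, rank, bias=False) for m in modalities})
        self.up = nn.ModuleDict({m: nn.Linear(rank, dim, bias=False) for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)      # adaptors start as a zero perturbation

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Shared path + lightweight modality-unique low-rank correction.
        return self.shared(x) + self.up[modality](self.down[modality](x))

layer = LowRankModalAdaptor(dim=256)
rgb_tokens = torch.randn(2, 100, 256)
ir_tokens = torch.randn(2, 100, 256)
out_rgb = layer(rgb_tokens, "rgb")
out_ir = layer(ir_tokens, "ir")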
-
Enhancing Sliding Performance with Aerial Robots: Analysis and Solutions for Non-Actuated Multi-Wheel Configurations
Authors:
Tong Hui,
Jefferson Ghielmini,
Dimitrios Papageorgiou,
Marco Tognon,
Roland Siegwart,
Matteo Fumagalli
Abstract:
Sliding tasks performed by aerial robots are valuable for inspection and simple maintenance tasks at height, such as non-destructive testing and painting. Although various end-effector designs have been used for such tasks, non-actuated wheel configurations are more frequently applied thanks to their rolling capability for sliding motion, mechanical simplicity, and lightweight design. Moreover, a non-actuated multi-wheel (more than one wheel) configuration in the end-effector design allows additional equipment, e.g., sensors and tools, to be placed at the center of the end-effector tip. However, there is still a lack of studies on the crucial contact conditions during sliding using aerial robots with such an end-effector design. In this article, we investigate the key challenges associated with sliding operations using aerial robots equipped with multiple non-actuated wheels through in-depth analysis grounded in physical experiments. The experimental data is used to create a simulator that closely captures real-world conditions. We propose solutions from both mechanical design and control perspectives to improve the sliding performance of aerial robots. From a mechanical standpoint, design guidelines are derived from experimental data. From a control perspective, we introduce a novel pressure-sensing-based control framework that ensures reliable task execution, even during sliding maneuvers. The effectiveness and robustness of the proposed approaches are then validated and compared using the built simulator, particularly in high-risk scenarios.
Submitted 10 September, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
HFT: Half Fine-Tuning for Large Language Models
Authors:
Tingfeng Hui,
Zhenyu Zhang,
Shuohuan Wang,
Weiran Xu,
Yu Sun,
Hua Wu
Abstract:
One or more fine-tuning phases have become a necessary step for large language models (LLMs) to unlock various capabilities, enabling them to follow natural language instructions or align with human preferences. However, this carries the risk of catastrophic forgetting during sequential training: the parametric knowledge or abilities learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issue, where half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT can be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.
Submitted 29 April, 2024;
originally announced April 2024.
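The core mechanism of HFT, updating half of the parameters while freezing the other half, can be reproduced in a few lines of PyTorch. Selecting whole parameter tensors uniformly at random is a simplification assumed here; the paper defines its own selection scheme.

import random
import torch.nn as nn

def apply_half_fine_tuning(model: nn.Module, seed: int = 0, trainable_fraction: float = 0.5):
    """Freeze roughly half of the parameter tensors; the rest are fine-tuned."""
    rng = random.Random(seed)
    names = [n for n, _ in model.named_parameters()]
    rng.shuffle(names)
    trainable = set(names[: int(len(names) * trainable_fraction)])
    for name, param in model.named_parameters():
        # Frozen half retains previous-stage knowledge; trainable half learns the new task.
        param.requires_grad_(name in trainable)
    return trainable

# Usage sketch: freeze half of a toy model, then build the optimizer
# only over parameters that still require gradients.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
apply_half_fine_tuning(model)
trainable_params = [p for p in model.parameters() if p.requires_grad]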
-
Dynamic Center-of-Mass Displacement in Aerial Manipulation: An Innovative Platform Design
Authors:
Tong Hui,
Stefan Rucareanu,
Esteban Zamora,
Simone D'Angelo,
Haotian Liu,
Matteo Fumagalli
Abstract:
Aerial manipulators are increasingly used in contact-based industrial applications, where tasks like drilling and pushing require platforms to exert significant forces in multiple directions. To enhance force generation capabilities, various approaches, such as thrust vectoring and perching, have been explored. In this article, we introduce a novel approach by investigating the impact of varied CoM (Center of Mass) locations on an aerial manipulation system's force exertion. Our proposed platform features a design with a dynamically displacing CoM, enabling a smooth transition between free flight and high-force interactions supported by tilting back rotors. We provide detailed modeling and control strategies for this design and validate its feasibility through a series of physical experiments. In a pushing task, the proposed system, weighing 3.12 kg, was able to stably exert over 28 N of force on a work surface, nearly equivalent to its own weight, solely through the tilting of its back rotors. Additionally, we introduce a new factor to evaluate the force generation capabilities of aerial platforms, allowing for a quantitative comparison with state-of-the-art systems, which demonstrates the advantages of our proposed approach.
Submitted 13 September, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
Passive Aligning Physical Interaction of Fully-Actuated Aerial Vehicles for Pushing Tasks
Authors:
Tong Hui,
Eugenio Cuniato,
Michael Pantic,
Marco Tognon,
Matteo Fumagalli,
Roland Siegwart
Abstract:
Recently, the utilization of aerial manipulators for performing pushing tasks in non-destructive testing (NDT) applications has seen significant growth. Such operations entail physical interactions between the aerial robotic system and the environment. End-effectors with multiple contact points are often used for placing NDT sensors in contact with a surface to be inspected. Aligning the NDT sensor with the work surface while preserving contact requires that all available contact points at the end-effector tip are in contact with the work surface. With a standard full-pose controller, attitude errors often occur due to perturbations caused by modeling uncertainties, sensor noise, and environmental uncertainties. Even small attitude errors can cause a loss of contact points between the end-effector tip and the work surface. To preserve full alignment amidst these uncertainties, we propose a control strategy which selectively deactivates angular motion control and enables direct force control in specific directions. In particular, we derive two essential conditions under which the robot can passively align with flat work surfaces, achieving full alignment through rotation about the non-actively controlled axes. Additionally, these conditions serve as hardware design and control guidelines for effectively integrating the proposed control method for practical use. Real-world experiments are conducted to validate both the control design and the guidelines.
Submitted 27 February, 2024;
originally announced February 2024.
-
Safety-Conscious Pushing on Diverse Oriented Surfaces with Underactuated Aerial Vehicles
Authors:
Tong Hui,
Manuel J. Fernandez Gonzalez,
Matteo Fumagalli
Abstract:
Pushing tasks performed by aerial manipulators can be used for contact-based industrial inspections. Underactuated aerial vehicles are widely employed in aerial manipulation due to their widespread availability and relatively low cost. Industrial infrastructures often consist of diverse oriented work surfaces. When interacting with such surfaces, the coupled gravity compensation and interaction force generation of underactuated aerial vehicles can present the potential challenge of near-saturation operations. The blind utilization of these platforms for such tasks can lead to instability and accidents, creating unsafe operating conditions and potentially damaging the platform. In order to ensure safe pushing on these surfaces while managing platform saturation, this work establishes a safety assessment process. This process involves the prediction of the saturation level of each actuator during pushing across variable surface orientations. Furthermore, the assessment results are used to plan and execute physical experiments, ensuring safe operations and preventing platform damage.
Submitted 23 February, 2024;
originally announced February 2024.
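The saturation assessment described above can be approximated with a quasi-static force balance: for a given surface orientation, the thrust needed to compensate gravity and exert the pushing force is split across the rotors and compared with their limits. The even thrust split, the numbers, and the rotor limit below are placeholder assumptions, not the paper's model.

import numpy as np

def rotor_saturation(mass_kg, push_force_n, push_elevation_deg,
                     n_rotors=4, max_rotor_thrust_n=8.0, g=9.81):
    """Fraction of per-rotor thrust capacity used while pushing quasi-statically.

    push_elevation_deg: elevation of the pushing direction, 0 for a vertical
    wall (horizontal push), +90 for pushing up against a ceiling. The total
    thrust is assumed to be shared evenly by the rotors; attitude-dependent
    effects and controller margins from the paper are not modeled here.
    """
    phi = np.deg2rad(push_elevation_deg)
    push_dir = np.array([np.cos(phi), 0.0, np.sin(phi)])
    # In quasi-static equilibrium the total thrust must balance gravity plus
    # the force exerted on the surface (the surface reaction is its opposite).
    required = mass_kg * g * np.array([0.0, 0.0, 1.0]) + push_force_n * push_dir
    per_rotor = np.linalg.norm(required) / n_rotors
    return per_rotor / max_rotor_thrust_n   # > 1.0 means the rotors saturate

for angle in (0, 30, 60, 90):
    print(angle, round(rotor_saturation(2.0, 10.0, angle), 2))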
-
Noise-BERT: A Unified Perturbation-Robust Framework with Noise Alignment Pre-training for Noisy Slot Filling Task
Authors:
Jinxu Zhao,
Guanting Dong,
Yueyan Qiu,
Tingfeng Hui,
Xiaoshuai Song,
Daichi Guo,
Weiran Xu
Abstract:
In a realistic dialogue system, the input information from users is often subject to various types of input perturbations, which affects the slot-filling task. Although rule-based data augmentation methods have achieved satisfactory results, they fail to exhibit the desired generalization when faced with unknown noise disturbances. In this study, we address the challenges posed by input perturbations in slot filling by proposing Noise-BERT, a unified Perturbation-Robust Framework with Noise Alignment Pre-training. Our framework incorporates two Noise Alignment Pre-training tasks: Slot Masked Prediction and Sentence Noisiness Discrimination, aiming to guide the pre-trained language model in capturing accurate slot information and noise distribution. During fine-tuning, we employ a contrastive learning loss to enhance the semantic representation of entities and labels. Additionally, we introduce an adversarial attack training strategy to improve the model's robustness. Experimental results demonstrate the superiority of our proposed approach over state-of-the-art models, and further analysis confirms its effectiveness and generalization ability.
Submitted 6 March, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
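As an illustration of the contrastive fine-tuning objective mentioned in the Noise-BERT abstract, below is a generic InfoNCE-style loss that pulls an utterance's representation toward its noisy counterpart and away from other examples in the batch. The pairing of clean and perturbed inputs and the temperature are assumptions; the paper's exact positives, negatives, and loss form may differ.

import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07):
    """InfoNCE loss: anchor[i] should match positive[i] among all positives in the batch.

    anchor, positive: (B, D) sentence or entity embeddings (e.g., clean vs. perturbed input).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))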
-
Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
Authors:
Runze He,
Shaofei Huang,
Xuecheng Nie,
Tianrui Hui,
Luoqi Liu,
Jiao Dai,
Jizhong Han,
Guanbin Li,
Si Liu
Abstract:
In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining editing results that conform to the editing prompt is nontrivial, since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing, aimed at foreground-only manipulation while preserving the background. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings.
Submitted 4 December, 2023;
originally announced December 2023.
-
Versatile Airborne Ultrasonic NDT Technologies via Active Omni-Sliding with Over-Actuated Aerial Vehicles
Authors:
Tong Hui,
Florian Braun,
Nicolas Scheidt,
Marius Fehr,
Matteo Fumagalli
Abstract:
This paper presents the utilization of advanced methodologies in aerial manipulation to address meaningful industrial applications and develop versatile ultrasonic Non-Destructive Testing (NDT) technologies with aerial robots. The primary objectives of this work are to enable multi-point measurements through sliding without re-approaching the work surface, and to facilitate the representation of material thickness with B and C scans via dynamic scanning in arbitrary directions (i.e., omnidirectionally). To accomplish these objectives, a payload that can slide omnidirectionally (which we call the omni-sliding payload) is designed for an over-actuated aerial vehicle, ensuring truly omnidirectional sliding mobility while exerting consistent forces in contact with a flat work surface. The omni-sliding payload is equipped with an omniwheel-based active end-effector and an Electro Magnetic Acoustic Transducer (EMAT). Furthermore, to ensure successful development of the designed payload and its integration with the aerial vehicle, a comprehensive study of contact conditions and system dynamics during active sliding is presented, and the derived system constraints are later used as guidelines for the hardware development and control settings. The proposed methods are validated through experiments, encompassing both a wall-sliding task and dynamic scanning for Ultrasonic Testing (UT), employing the Voliro T aerial platform.
Submitted 8 November, 2023;
originally announced November 2023.
-
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
Authors:
Tianrui Hui,
Zihan Ding,
Junshi Huang,
Xiaoming Wei,
Xiaolin Wei,
Jiao Dai,
Jizhong Han,
Si Liu
Abstract:
Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a Phrase-Object Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.
Submitted 10 March, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
DemoNSF: A Multi-task Demonstration-based Generative Framework for Noisy Slot Filling Task
Authors:
Guanting Dong,
Tingfeng Hui,
Zhuoma GongQue,
Jinxu Zhao,
Daichi Guo,
Gang Zhao,
Keqing He,
Weiran Xu
Abstract:
Recently, prompt-based generative frameworks have shown impressive capabilities in sequence labeling tasks. However, in practical dialogue scenarios, relying solely on simplistic templates and traditional corpora presents a challenge for these methods in generalizing to unknown input perturbations. To address this gap, we propose a multi-task demonstration-based generative framework for noisy slot filling, named DemoNSF. Specifically, we introduce three noisy auxiliary tasks, namely noisy recovery (NR), random mask (RM), and hybrid discrimination (HD), to implicitly capture semantic structural information of input perturbations at different granularities. In the downstream main task, we design a noisy demonstration construction strategy for the generative framework, which explicitly incorporates task-specific information and the perturbed distribution during training and inference. Experiments on two benchmarks demonstrate that DemoNSF outperforms all baseline methods and achieves strong generalization. Further analysis provides empirical guidance for the practical application of generative frameworks. Our code is released at https://github.com/dongguanting/Demo-NSF.
Submitted 16 October, 2023;
originally announced October 2023.
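Two of the noisy signals used by the auxiliary tasks above (character-level perturbation and random masking) can be approximated with simple string operations such as the following. These are generic augmentations written for illustration; the paper's exact corruption functions and granularities may differ.

import random

rng = random.Random(0)

def char_perturb(text: str, p: float = 0.05) -> str:
    """Randomly drop or duplicate characters to mimic character-level input noise."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p:            # drop the character
            continue
        out.append(ch)
        if r > 1 - p:        # duplicate the character
            out.append(ch)
    return "".join(out)

def random_mask(text: str, mask_token: str = "[MASK]", p: float = 0.15) -> str:
    """Replace random whitespace-delimited tokens with a mask token (a 'random mask'-style input)."""
    return " ".join(mask_token if rng.random() < p else tok for tok in text.split())

print(char_perturb("book a flight from denver to boston"))
print(random_mask("book a flight from denver to boston"))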
-
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task
Authors:
Guanting Dong,
Jinxu Zhao,
Tingfeng Hui,
Daichi Guo,
Wenlong Wan,
Boqi Feng,
Yueyan Qiu,
Zhuoma Gongque,
Keqing He,
Zechen Wang,
Weiran Xu
Abstract:
With the increasing capabilities of large language models (LLMs), these high-performance models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, the models' performance on commonly-used benchmark datasets often fails to accurately reflect their reliability and robustness when applied to real-world noisy data. To address these challenges, we propose a unified robustness evaluation framework based on the slot-filling task to systematically evaluate the dialogue understanding capability of LLMs in diverse input perturbation scenarios. Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Furthermore, we utilize a multi-level data augmentation method (character, word, and sentence levels) to construct a candidate data pool, and carefully design two automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios. The experiments demonstrate that current open-source LLMs generally achieve limited robustness to perturbations. Based on these experimental observations, we make some forward-looking suggestions to fuel research in this direction.
Submitted 10 October, 2023;
originally announced October 2023.
-
Information Leakage from Data Updates in Machine Learning Models
Authors:
Tian Hui,
Farhad Farokhi,
Olga Ohrimenko
Abstract:
In this paper, we consider the setting where machine learning models are retrained on updated datasets in order to incorporate the most up-to-date information or reflect distribution shifts. We investigate whether one can infer information about these updates in the training data (e.g., changes to attribute values of records). Here, the adversary has access to snapshots of the machine learning model before and after the change in the dataset occurs. Contrary to the existing literature, we assume that an attribute of one or more training data points is changed, rather than entire data records being removed or added. We propose attacks based on the difference in the prediction confidence of the original model and the updated model. We evaluate our attack methods on two public datasets along with multi-layer perceptron and logistic regression models. We validate that two snapshots of the model can result in higher information leakage in comparison to having access to only the updated model. Moreover, we observe that data records with rare values are more vulnerable to attacks, which points to the disparate vulnerability of privacy attacks in the update setting. When multiple records with the same original attribute value are updated to the same new value (i.e., repeated changes), the attacker is more likely to correctly guess the updated values, since repeated changes leave a larger footprint on the trained model. These observations point to the vulnerability of machine learning models to attribute inference attacks in the update setting.
Submitted 19 September, 2023;
originally announced September 2023.
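The attack above compares prediction confidences of the model snapshots taken before and after the update. A minimal sketch against scikit-learn-style classifiers is shown below: for each candidate value of the changed attribute, the attacker scores how much the updated model's confidence shifts relative to the original model. The function name, the max-probability confidence measure, and the scoring rule are illustrative assumptions, not the paper's exact attack.

import numpy as np

def guess_updated_value(model_before, model_after, record, attr_index, candidate_values):
    """Rank candidate new attribute values by the confidence shift they explain.

    model_before / model_after: fitted classifiers exposing predict_proba
    (snapshots taken before and after the dataset update).
    record: 1-D feature vector of the targeted training point (original values).
    """
    base_conf = model_before.predict_proba(record.reshape(1, -1)).max()
    scores = {}
    for value in candidate_values:
        modified = record.copy()
        modified[attr_index] = value
        # A candidate the updated model is now much more confident about,
        # relative to the original model, is a likely new attribute value.
        new_conf = model_after.predict_proba(modified.reshape(1, -1)).max()
        scores[value] = new_conf - base_conf
    return max(scores, key=scores.get), scores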
-
A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER
Authors:
Guanting Dong,
Zechen Wang,
Jinxu Zhao,
Gang Zhao,
Daichi Guo,
Dayuan Fu,
Tingfeng Hui,
Chen Zeng,
Keqing He,
Xuefeng Li,
Liwen Wang,
Xinyue Cui,
Weiran Xu
Abstract:
The objective of few-shot named entity recognition is to identify named entities with limited labeled instances. Previous works have primarily focused on optimizing the traditional token-wise classification framework, while neglecting the exploration of information based on NER data characteristics. To address this issue, we propose a Multi-Task Semantic Decomposition Framework via Joint Task-specific Pre-training (MSDP) for few-shot NER. Drawing inspiration from demonstration-based and contrastive learning, we introduce two novel pre-training tasks: Demonstration-based Masked Language Modeling (MLM) and Class Contrastive Discrimination. These tasks effectively incorporate entity boundary information and enhance entity representation in Pre-trained Language Models (PLMs). In the downstream main task, we introduce a multi-task joint optimization framework with a semantic decomposing method, which facilitates the model in integrating two different kinds of semantic information for entity classification. Experimental results on two few-shot NER benchmarks demonstrate that MSDP consistently outperforms strong baselines by a large margin. Extensive analyses validate the effectiveness and generalization of MSDP.
Submitted 28 August, 2023;
originally announced August 2023.
-
Static-Equilibrium Oriented Interaction Force Modeling and Control of Aerial Manipulation with Uni-Directional Thrust Multirotors
Authors:
Tong Hui,
Matteo Fumagalli
Abstract:
This paper presents a static-equilibrium oriented interaction force modeling and control approach for aerial manipulation employing uni-directional thrust (UDT) multirotors interacting with variously defined environments. First, a simplified system model for a quadrotor-based aerial manipulator is introduced, considering parameterized work surfaces under simplifying assumptions, and then a range of meaningful manipulation tasks are utilized to explore the system properties in a quasi-static equilibrium state. An explicit interaction force model, relating the aerial manipulator's pose configuration to the environment parameters, is derived from the static equilibrium analysis, based on which a singularity condition is identified. A hybrid attitude/force interaction control strategy is then presented to verify the proposed interaction force model, which involves high-gain attitude control and feedforward-plus-feedback force control. This paper presents preliminary results. We study the properties of UDT-based aerial manipulators via specific tasks, and propose a novel framework for interaction force modeling and control aimed at maximizing the commercial value of UDT platforms for aerial manipulation purposes.
Submitted 21 June, 2023;
originally announced June 2023.
-
RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes
Authors:
Tak-Wai Hui
Abstract:
Unsupervised methods have shown promising results on monocular depth estimation. However, the training data must be captured in scenes without moving objects. To push the envelope of accuracy, recent methods tend to increase their model parameters. In this paper, an unsupervised learning framework is proposed to jointly predict monocular depth and complete 3D motion, including the motions of moving objects and the camera. (1) Recurrent modulation units are used to adaptively and iteratively fuse encoder and decoder features. This not only improves single-image depth inference but also avoids inflating the number of model parameters. (2) Instead of using a single set of filters for upsampling, multiple sets of filters are devised for the residual upsampling. This facilitates the learning of edge-preserving filters and leads to improved performance. (3) A warping-based network is used to estimate a motion field of moving objects without using semantic priors. This removes the requirement of scene rigidity and allows general videos to be used for unsupervised learning. The motion field is further regularized by an outlier-aware training loss. Although the depth model uses just a single image at test time and only 2.97M parameters, it achieves state-of-the-art results on the KITTI and Cityscapes benchmarks.
Submitted 8 March, 2023;
originally announced March 2023.
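The recurrent modulation idea, adaptively fusing an encoder feature into the decoder state with a small gated unit that is reused across iterations, might look like the convolutional GRU-style sketch below. Channel sizes, kernel sizes, and the exact gating are assumptions rather than the released RM-Depth code.

import torch
import torch.nn as nn

class RecurrentModulationUnit(nn.Module):
    """Gated fusion of an encoder feature into a recurrent decoder state."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.update = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, decoder_state, encoder_feat):
        x = torch.cat([decoder_state, encoder_feat], dim=1)
        z = torch.sigmoid(self.gate(x))               # how much to modulate
        h = torch.tanh(self.update(x))                # candidate fused feature
        return (1 - z) * decoder_state + z * h        # adaptive, parameter-light fusion

unit = RecurrentModulationUnit(64)
state = torch.randn(1, 64, 48, 160)
feat = torch.randn(1, 64, 48, 160)
for _ in range(3):  # iterative refinement across decoder passes
    state = unit(state, feat)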
-
A Prototypical Semantic Decoupling Method via Joint Contrastive Learning for Few-Shot Named Entity Recognition
Authors:
Guanting Dong,
Zechen Wang,
Liwen Wang,
Daichi Guo,
Dayuan Fu,
Yuxiang Wu,
Chen Zeng,
Xuefeng Li,
Tingfeng Hui,
Keqing He,
Xinyue Cui,
Qixiang Gao,
Weiran Xu
Abstract:
Few-shot named entity recognition (NER) aims at identifying named entities based on only a few labeled instances. Most existing prototype-based sequence labeling models tend to memorize entity mentions, which would be easily confused by close prototypes. In this paper, we propose a Prototypical Semantic Decoupling method via joint Contrastive learning (PSDC) for few-shot NER. Specifically, we decouple class-specific prototypes and contextual semantic prototypes by two masking strategies to lead the model to focus on two different kinds of semantic information for inference. In addition, we further introduce joint contrastive learning objectives to better integrate the two kinds of decoupled information and prevent semantic collapse. Experimental results on two few-shot NER benchmarks demonstrate that PSDC consistently outperforms the previous SOTA methods in terms of overall performance. Extensive analysis further validates the effectiveness and generalization of PSDC.
Submitted 12 April, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Revisit Out-Of-Vocabulary Problem for Slot Filling: A Unified Contrastive Framework with Multi-level Data Augmentations
Authors:
Daichi Guo,
Guanting Dong,
Dayuan Fu,
Yuxiang Wu,
Chen Zeng,
Tingfeng Hui,
Liwen Wang,
Xuefeng Li,
Zechen Wang,
Keqing He,
Xinyue Cui,
Weiran Xu
Abstract:
In real dialogue scenarios, existing slot filling models, which tend to memorize entity patterns, show significantly reduced generalization when facing Out-of-Vocabulary (OOV) problems. To address this issue, we propose an OOV-robust slot filling model based on multi-level data augmentations to solve the OOV problem from both word and slot perspectives. We present a unified contrastive learning framework, which pulls the representations of the original sample and its augmented samples together, to make the model resistant to OOV problems. We evaluate the performance of the model on specific slots and carefully design test data with OOV word perturbations to further demonstrate the effectiveness of our approach on OOV words. Experiments on two datasets show that our approach outperforms the previous SOTA methods in terms of both OOV slots and words.
Submitted 27 February, 2023;
originally announced February 2023.
-
Cross-Modality Domain Adaptation for Freespace Detection: A Simple yet Effective Baseline
Authors:
Yuanbin Wang,
Leyan Zhu,
Shaofei Huang,
Tianrui Hui,
Xiaojie Li,
Fei Wang,
Si Liu
Abstract:
As one of the fundamental functions of an autonomous driving system, freespace detection aims at classifying each pixel of the image captured by the camera as drivable or non-drivable. Current works on freespace detection heavily rely on a large amount of densely labeled training data for accuracy and robustness, which is time-consuming and laborious to collect and annotate. To the best of our knowledge, ours is the first work to explore unsupervised domain adaptation for freespace detection to alleviate the data limitation problem with synthetic data. We develop a cross-modality domain adaptation framework which exploits both RGB images and surface normal maps generated from depth images. A Collaborative Cross Guidance (CCG) module is proposed to leverage the context information of one modality to guide the other modality in a cross manner, thus realizing inter-modality intra-domain complement. To better bridge the domain gap between the source domain (synthetic data) and the target domain (real-world data), we also propose a Selective Feature Alignment (SFA) module which only aligns the features of consistent foreground areas between the two domains, thus realizing inter-domain intra-modality adaptation. Extensive experiments are conducted by adapting three different synthetic datasets to one real-world freespace detection dataset. Our method performs closely to fully supervised freespace detection methods (93.08 vs. 97.50 F1 score) and outperforms other general unsupervised domain adaptation methods for semantic segmentation by large margins, which shows the promising potential of domain adaptation for freespace detection.
Submitted 6 October, 2022;
originally announced October 2022.
-
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
Authors:
Zihan Ding,
Zi-han Ding,
Tianrui Hui,
Junshi Huang,
Xiaoming Wei,
Xiaolin Wei,
Si Liu
Abstract:
Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment visual objects of both thing and stuff categories described by the dense narrative captions of a still image. The previous two-stage approach first extracts segmentation region proposals with an off-the-shelf panoptic segmentation model, then conducts coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline usually suffers from the performance limitation of low-quality proposals in the first stage, the loss of spatial details caused by region feature pooling, and complicated strategies designed separately for thing and stuff categories. To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Thus, our model can exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. In addition, we also propose a Language-Compatible Pixel Aggregation (LCPA) module to further enhance the discriminative ability of phrase features through multi-round refinement, which selects the most compatible pixels for each phrase to adaptively aggregate the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark with a 4.0-point absolute Average Recall gain.
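As a rough illustration of the one-stage pixel-phrase matching described above, the sketch below correlates each noun-phrase embedding with every pixel feature to produce per-phrase mask logits; feature shapes and the plain dot-product matching are assumptions, not the PPMN implementation.

```python
# Toy pixel-phrase matching: each phrase embedding is matched against all pixel features,
# yielding a segmentation logit map per phrase, without any region proposals.
import torch

def pixel_phrase_matching(pixel_feats: torch.Tensor, phrase_feats: torch.Tensor):
    """pixel_feats: (B, C, H, W) visual features; phrase_feats: (B, P, C) phrase embeddings.
    Returns per-phrase segmentation logits of shape (B, P, H, W)."""
    return torch.einsum('bchw,bpc->bphw', pixel_feats, phrase_feats)

logits = pixel_phrase_matching(torch.randn(1, 256, 64, 64), torch.randn(1, 12, 256))
masks = logits.sigmoid() > 0.5   # simple combination of per-phrase masks into a panoptic output
```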
Submitted 11 August, 2022;
originally announced August 2022.
-
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Authors:
Zihan Ding,
Tianrui Hui,
Junshi Huang,
Xiaoming Wei,
Jizhong Han,
Si Liu
Abstract:
Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to delayed and implicit spatial-temporal interaction occurring in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, referring words and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we also propose a Bilateral Channel Activation (BCA) module in the decoding phase for further denoising and highlighting the spatial-temporal consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performances on four popular benchmarks with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming around 7x less computational overhead.
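A hypothetical sketch of using language as a bridge between a temporal and a spatial encoder, in the spirit of the duplex transfer described above: words first attend to clip features to collect language-relevant motion cues, and the enriched words are then attended to by target-frame features. The dimensions and the use of nn.MultiheadAttention are assumptions, not the authors' LBDT module.

```python
# Language-bridged transfer sketch: the referring words act as an intermediary that
# gathers motion information from the temporal path and injects it into the spatial path.
import torch
import torch.nn as nn

dim = 256
attn_collect = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
attn_transfer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

words = torch.randn(1, 10, dim)               # referring-expression word features
temporal = torch.randn(1, 4 * 32 * 32, dim)   # flattened clip (motion) features
spatial = torch.randn(1, 32 * 32, dim)        # flattened target-frame (appearance) features

motion_words, _ = attn_collect(words, temporal, temporal)             # words gather motion cues
spatial_out, _ = attn_transfer(spatial, motion_words, motion_words)   # cues transferred to spatial path
```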
Submitted 8 June, 2022;
originally announced June 2022.
-
Centroidal Aerodynamic Modeling and Control of Flying Multibody Robots
Authors:
Tong Hui,
Antonello Paolino,
Gabriele Nava,
Giuseppe L'Erario,
Fabio Di Natale,
Fabio Bergonti,
Francesco Braghin,
Daniele Pucci
Abstract:
This paper presents a modeling and control framework for multibody flying robots subject to non-negligible aerodynamic forces acting on the centroidal dynamics. First, aerodynamic forces are calculated during robot flight in different operating conditions by means of Computational Fluid Dynamics (CFD) analysis. Then, analytical models of the aerodynamic coefficients are generated from the dataset collected through CFD analysis. The resulting simplified aerodynamic model is also used to improve the flying robot's control design. We present two control strategies: compensating for the aerodynamic effects via feedback linearization and enforcing controller robustness through gain scheduling. Simulation results on the jet-powered humanoid robot iRonCub validate the proposed approach.
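An illustrative sketch (not the authors' implementation) of compensating a modeled centroidal aerodynamic force via feedback linearization: a placeholder analytical drag-coefficient model standing in for one fitted to CFD data gives the aerodynamic force, which is subtracted from the commanded thrust. The polynomial coefficient model, air density, and reference area are made-up values.

```python
# Feedback-linearization-style cancellation of a modeled centroidal aerodynamic force.
import numpy as np

RHO, AREA = 1.225, 0.5                      # air density [kg/m^3], reference area [m^2]

def drag_coefficient(alpha: float) -> float:
    # Placeholder analytical model C_D(alpha), standing in for a fit to CFD samples.
    return 0.4 + 1.2 * np.sin(alpha) ** 2

def aerodynamic_force(v_rel: np.ndarray, alpha: float) -> np.ndarray:
    speed = np.linalg.norm(v_rel)
    if speed < 1e-6:
        return np.zeros(3)
    # Drag acts opposite to the relative wind at the centroidal level.
    return -0.5 * RHO * AREA * drag_coefficient(alpha) * speed * v_rel

def compensated_thrust(f_desired: np.ndarray, v_rel: np.ndarray, alpha: float) -> np.ndarray:
    # Cancel the modeled aerodynamic force in the centroidal dynamics.
    return f_desired - aerodynamic_force(v_rel, alpha)

print(compensated_thrust(np.array([0.0, 0.0, 300.0]), np.array([5.0, 0.0, -1.0]), 0.2))
```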
Submitted 17 May, 2022;
originally announced May 2022.
-
A Keypoint-based Global Association Network for Lane Detection
Authors:
Jinsheng Wang,
Yinchao Ma,
Shaofei Huang,
Tianrui Hui,
Fei Wang,
Chen Qian,
Tianzhu Zhang
Abstract:
Lane detection is a challenging task that requires predicting the complex topology of lane lines and distinguishing different types of lanes simultaneously. Earlier works follow a top-down roadmap, regressing predefined anchors into various shapes of lane lines, which lacks the flexibility to fit complex lane shapes due to the fixed anchor shapes. Lately, some works propose to formulate lane detection as a keypoint estimation problem to describe the shapes of lane lines more flexibly, gradually grouping adjacent keypoints belonging to the same lane line in a point-by-point manner, which is inefficient and time-consuming during postprocessing. In this paper, we propose a Global Association Network (GANet) to formulate the lane detection problem from a new perspective, where each keypoint is directly regressed to the starting point of its lane line instead of being extended point by point. Concretely, keypoints are associated with the lane line they belong to by globally predicting their offsets to the corresponding lane starting points, independently of each other, which can be done in parallel to greatly improve efficiency. In addition, we further propose a Lane-aware Feature Aggregator (LFA), which adaptively captures the local correlations between adjacent keypoints to supplement local information to the global association. Extensive experiments on two popular lane detection benchmarks show that our method outperforms previous methods with F1 scores of 79.63% on CULane and 97.71% on TuSimple at high FPS. The code will be released at https://github.com/Wolfwjs/GANet.
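A toy sketch of the global keypoint-to-lane association idea described above: every detected keypoint predicts an offset to its lane's starting point, and keypoints whose predicted start points fall close together are grouped into the same lane. The distance threshold, shapes, and greedy grouping are illustrative assumptions, not the GANet postprocessing.

```python
# Group keypoints by the lane starting point they vote for.
import numpy as np

def associate_keypoints(keypoints: np.ndarray, offsets: np.ndarray, dist_thresh: float = 10.0):
    """keypoints, offsets: (N, 2) arrays of (x, y). Returns a lane id per keypoint."""
    starts = keypoints + offsets            # each keypoint votes for its lane's starting point
    lane_ids = -np.ones(len(keypoints), dtype=int)
    centers = []                            # running list of lane start-point centers
    for i, s in enumerate(starts):
        if centers:
            d = np.linalg.norm(np.asarray(centers) - s, axis=1)
            j = int(d.argmin())
            if d[j] < dist_thresh:
                lane_ids[i] = j
                continue
        centers.append(s)
        lane_ids[i] = len(centers) - 1
    return lane_ids

kps = np.array([[100, 400], [102, 350], [300, 410], [298, 360]], dtype=float)
offs = np.array([[0, 80], [-2, 130], [0, 70], [2, 120]], dtype=float)
print(associate_keypoints(kps, offs))   # -> [0 0 1 1]
```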
Submitted 15 April, 2022;
originally announced April 2022.
-
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding
Authors:
Dailan He,
Yusheng Zhao,
Junyu Luo,
Tianrui Hui,
Shaofei Huang,
Aixi Zhang,
Si Liu
Abstract:
The recently proposed fine-grained 3D visual grounding is an essential and challenging task, whose goal is to identify the 3D object referred to by a natural language sentence among other distractive objects of the same category. Existing works usually adopt dynamic graph networks to indirectly model the intra-/inter-modal interactions, making it difficult for the model to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic contents. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose a TransRefer3D network to extract entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by co-attention operations, our EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that our proposed model significantly outperforms existing approaches by up to 10.6% and sets the new state of the art. To the best of our knowledge, this is the first work investigating the Transformer architecture for the fine-grained 3D visual grounding task.
Submitted 11 August, 2021; v1 submitted 5 August, 2021;
originally announced August 2021.
-
Cross-Modal Progressive Comprehension for Referring Segmentation
Authors:
Si Liu,
Tianrui Hui,
Shaofei Huang,
Yunchao Wei,
Bo Li,
Guanbin Li
Abstract:
Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behavior and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the target entity as well as suppress other irrelevant ones by spatial graph reasoning. For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning. In addition to the CMPC, we also introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features corresponding to different levels in the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms our image or video referring segmentation framework, and our frameworks achieve new state-of-the-art performances on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively.
Submitted 15 May, 2021;
originally announced May 2021.
-
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation
Authors:
Tianrui Hui,
Shaofei Huang,
Si Liu,
Zihan Ding,
Guanbin Li,
Wenguan Wang,
Jizhong Han,
Fei Wang
Abstract:
Language-queried video actor segmentation aims to predict the pixel-level mask of the actor that performs the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses the features of the target frame and yields inaccurate segmentation. Therefore, we propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors. In the decoder, a Language-Guided Feature Selection (LGFS) module is proposed to flexibly integrate spatial and temporal features from the two encoders. We also propose a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine spatial- and temporal-relevant linguistic features for multimodal feature interaction in each stage of the two encoders. Our method achieves new state-of-the-art performance on two popular benchmarks with less computational overhead than previous approaches.
Submitted 14 May, 2021;
originally announced May 2021.
-
ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
Authors:
Shaofei Huang,
Si Liu,
Tianrui Hui,
Jizhong Han,
Bo Li,
Jiashi Feng,
Shuicheng Yan
Abstract:
Learning to capture dependencies between spatial positions is essential to many visual tasks, especially dense labeling problems like scene parsing. Existing methods can effectively capture long-range dependencies with the self-attention mechanism and short-range ones with local convolution. However, there is still a large gap between long-range and short-range dependencies, which greatly limits the models' flexibility in adapting to the diverse spatial scales and relationships in complicated natural scene images. To fill this gap, we develop a Middle-Range (MR) branch that captures middle-range dependencies by restricting self-attention to local patches. We also observe that spatial regions which have large correlations with others can be emphasized to exploit long-range dependencies more accurately, and thus propose a Reweighed Long-Range (RLR) branch. Based on the proposed MR and RLR branches, we build an Omni-Range Dependencies Network (ORDNet) which can effectively capture short-, middle- and long-range dependencies. Our ORDNet is able to extract more comprehensive context information and adapt well to the complex spatial variance in scene images. Extensive experiments show that our proposed ORDNet outperforms previous state-of-the-art methods on three scene parsing benchmarks, including PASCAL Context, COCO Stuff and ADE20K, demonstrating the superiority of capturing omni-range dependencies in deep models for the scene parsing task.
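A minimal sketch of a middle-range dependency branch in the spirit described above: the feature map is split into non-overlapping patches and self-attention is computed only within each patch, so dependencies are captured at a scale between local convolution and global attention. The patch size and the use of nn.MultiheadAttention are illustrative assumptions, not the MR branch implementation.

```python
# Self-attention restricted to local patches (middle-range dependencies).
import torch
import torch.nn as nn

def middle_range_attention(x: torch.Tensor, attn: nn.MultiheadAttention, patch: int = 8):
    b, c, h, w = x.shape
    # (B, C, H, W) -> (B * num_patches, patch*patch, C): tokens grouped per patch.
    tokens = (x.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
                .permute(0, 2, 3, 4, 5, 1)
                .reshape(-1, patch * patch, c))
    out, _ = attn(tokens, tokens, tokens)                          # attention inside each patch only
    out = (out.reshape(b, h // patch, w // patch, patch, patch, c)
              .permute(0, 5, 1, 3, 2, 4)
              .reshape(b, c, h, w))
    return out

x = torch.randn(2, 64, 32, 32)
y = middle_range_attention(x, nn.MultiheadAttention(64, num_heads=4, batch_first=True))
```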
Submitted 11 January, 2021;
originally announced January 2021.
-
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Authors:
Tianrui Hui,
Si Liu,
Shaofei Huang,
Guanbin Li,
Sansi Yu,
Faxi Zhang,
Jizhong Han
Abstract:
Referring image segmentation aims to predict the foreground mask of the object referred to by a natural language sentence. The multimodal context of the sentence is crucial for distinguishing the referent from the background. Existing methods model the multimodal context either insufficiently or redundantly. To tackle this problem, we propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence while excluding disturbing ones through three steps over the multimodal feature, i.e., gathering, constrained propagation and distributing. Extensive experiments on four benchmarks demonstrate that our method outperforms all previous state-of-the-art methods.
Submitted 5 October, 2020; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Authors:
Shaofei Huang,
Tianrui Hui,
Si Liu,
Guanbin Li,
Yunchao Wei,
Jizhong Han,
Luoqi Liu,
Bo Li
Abstract:
Referring image segmentation aims at segmenting the foreground masks of the entities that well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between the visual and linguistic modalities, but usually fail to explore informative words of the expression to align features from the two modalities well for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones by multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performances.
Submitted 1 October, 2020;
originally announced October 2020.
-
LiteFlowNet3: Resolving Correspondence Ambiguity for More Accurate Optical Flow Estimation
Authors:
Tak-Wai Hui,
Chen Change Loy
Abstract:
Deep learning approaches have achieved great success in addressing the problem of optical flow estimation. The keys to success lie in the use of cost volume and coarse-to-fine flow inference. However, the matching problem becomes ill-posed when partially occluded or homogeneous regions exist in images. This causes a cost volume to contain outliers and affects the flow decoding from it. Besides, the coarse-to-fine flow inference demands an accurate flow initialization. Ambiguous correspondence yields erroneous flow fields and affects the flow inferences in subsequent levels. In this paper, we introduce LiteFlowNet3, a deep network consisting of two specialized modules, to address the above challenges. (1) We ameliorate the issue of outliers in the cost volume by amending each cost vector through an adaptive modulation prior to the flow decoding. (2) We further improve the flow accuracy by exploring local flow consistency. To this end, each inaccurate optical flow is replaced with an accurate one from a nearby position through a novel warping of the flow field. LiteFlowNet3 not only achieves promising results on public benchmarks but also has a small model size and a fast runtime.
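As a toy illustration of the cost volume that the flow decoder consumes (discussed above), the sketch below computes matching costs between every pixel of the first feature map and a local window of candidate displacements in the second. The search radius and plain dot-product cost are assumptions; the paper's adaptive cost modulation and flow-field warping are not reproduced here.

```python
# Plain local-correlation cost volume between two feature maps.
import torch
import torch.nn.functional as F

def cost_volume(f1: torch.Tensor, f2: torch.Tensor, radius: int = 3):
    """f1, f2: (B, C, H, W). Returns costs of shape (B, (2r+1)^2, H, W)."""
    b, c, h, w = f1.shape
    f2_pad = F.pad(f2, [radius] * 4)
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = f2_pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((f1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(costs, dim=1)

cv = cost_volume(torch.randn(1, 64, 48, 48), torch.randn(1, 64, 48, 48))
print(cv.shape)   # torch.Size([1, 49, 48, 48])
```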
Submitted 17 July, 2020;
originally announced July 2020.
-
Inter-Region Affinity Distillation for Road Marking Segmentation
Authors:
Yuenan Hou,
Zheng Ma,
Chunxiao Liu,
Tak-Wai Hui,
Chen Change Loy
Abstract:
We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network for the task of road marking segmentation. In this work, we explore a novel knowledge distillation (KD) approach that can transfer 'knowledge' on scene structure more effectively from a teacher to a student model. Our method is known as Inter-Region Affinity KD (IntRA-KD). It decomposes a given road scene image into different regions and represents each region as a node in a graph. An inter-region affinity graph is then formed by establishing pairwise relationships between nodes based on their similarity in feature distribution. To learn structural knowledge from the teacher network, the student is required to match the graph generated by the teacher. The proposed method shows promising results on three large-scale road marking segmentation benchmarks, i.e., ApolloScape, CULane and LLAMAS, by taking various lightweight models as students and ResNet-101 as the teacher. IntRA-KD consistently brings higher performance gains on all lightweight models, compared to previous distillation methods. Our code is available at https://github.com/cardwing/Codes-for-IntRA-KD.
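A rough sketch of inter-region affinity distillation as described above: features are pooled into per-region nodes, pairwise cosine affinities form a graph for teacher and student, and the student is trained to match the teacher's affinity matrix. The region masks and the MSE matching loss are illustrative assumptions, not the IntRA-KD implementation.

```python
# Build per-region affinity graphs and distill the teacher's graph into the student.
import torch
import torch.nn.functional as F

def region_affinity(feats: torch.Tensor, region_masks: torch.Tensor):
    """feats: (B, C, H, W); region_masks: (B, R, H, W) soft/binary masks, one per region.
    Returns a (B, R, R) matrix of cosine similarities between region descriptors."""
    area = region_masks.sum(dim=(2, 3)).clamp(min=1e-6)                      # (B, R)
    nodes = torch.einsum('bchw,brhw->brc', feats, region_masks) / area.unsqueeze(-1)
    nodes = F.normalize(nodes, dim=-1)
    return nodes @ nodes.transpose(1, 2)

def affinity_distillation_loss(student_feats, teacher_feats, region_masks):
    a_s = region_affinity(student_feats, region_masks)
    a_t = region_affinity(teacher_feats, region_masks).detach()
    return F.mse_loss(a_s, a_t)

masks = torch.rand(1, 6, 32, 32)   # placeholder region masks of a road scene
loss = affinity_distillation_loss(torch.randn(1, 128, 32, 32), torch.randn(1, 256, 32, 32), masks)
```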
Submitted 11 April, 2020;
originally announced April 2020.
-
Learning to Synthesize Fashion Textures
Authors:
Wu Shi,
Tak-Wai Hui,
Ziwei Liu,
Dahua Lin,
Chen Change Loy
Abstract:
Existing unconditional generative models mainly focus on modeling general objects, such as faces and indoor scenes. Fashion textures, another important type of visual element around us, have not been extensively studied. In this work, we propose an effective generative model for fashion textures and also comprehensively investigate the key components involved: the internal representation, latent space sampling, and the generator architecture. We use the Gram matrix as a suitable internal representation for modeling realistic fashion textures, and further design two dedicated modules for modulating the Gram matrix into a low-dimensional vector. Since fashion textures are scale-dependent, we propose a recursive auto-encoder to capture the dependency between multiple granularity levels of texture features. Another important observation is that fashion textures are multi-modal. We fit and sample from a Gaussian mixture model in the latent space to improve the diversity of the generated textures. Extensive experiments demonstrate that our approach is capable of synthesizing more realistic and diverse fashion textures than other state-of-the-art methods.
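A small sketch of the Gram-matrix internal representation mentioned above, computed from a convolutional feature map; the paper additionally compresses it into a low-dimensional vector with dedicated modules, which is not shown here, and the shapes are illustrative.

```python
# Channel-correlation (Gram) matrix of a feature map, a standard texture descriptor.
import torch

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, C, H, W) -> (B, C, C) Gram matrix, normalized by spatial size."""
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

g = gram_matrix(torch.randn(2, 64, 32, 32))
print(g.shape)   # torch.Size([2, 64, 64])
```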
Submitted 18 November, 2019;
originally announced November 2019.
-
A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization
Authors:
Tak-Wai Hui,
Xiaoou Tang,
Chen Change Loy
Abstract:
For over four decades, the majority of works have addressed the problem of optical flow estimation using variational methods. With the advance of machine learning, some recent works have attempted to address the problem using convolutional neural networks (CNNs) and have shown promising results. FlowNet2, the state-of-the-art CNN, requires over 160M parameters to achieve accurate flow estimation. Our LiteFlowNet2 outperforms FlowNet2 on the Sintel and KITTI benchmarks, while being 25.3 times smaller in model size and 3.1 times faster in running speed. LiteFlowNet2 is built on the foundation laid by conventional methods, and its components play roles corresponding to data fidelity and regularization in variational methods. We compute optical flow in a spatial-pyramid formulation as in SPyNet, but through a novel lightweight cascaded flow inference. It provides high flow estimation accuracy through early correction with seamless incorporation of descriptor matching. Flow regularization is used to ameliorate the issue of outliers and vague flow boundaries through feature-driven local convolutions. Our network also owns an effective structure for pyramidal feature extraction and embraces feature warping rather than image warping as practiced in FlowNet2 and SPyNet. Compared to LiteFlowNet, LiteFlowNet2 improves the optical flow accuracy on Sintel Clean by 23.3%, Sintel Final by 12.8%, KITTI 2012 by 19.6%, and KITTI 2015 by 18.8%, while being 2.2 times faster. Our network protocol and trained models are made publicly available on https://github.com/twhui/LiteFlowNet2.
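An illustrative sketch of feature warping by an optical flow field, the operation referred to above (features of the second image are warped toward the first before matching). The grid normalization follows the standard PyTorch grid_sample convention; this is a generic sketch, not the authors' code.

```python
# Warp a feature map by a per-pixel flow field using bilinear sampling.
import torch
import torch.nn.functional as F

def warp_by_flow(feats: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feats: (B, C, H, W); flow: (B, 2, H, W) in pixels (dx, dy). Returns warped features."""
    b, _, h, w = feats.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(feats.device)         # (2, H, W) pixel coords
    coords = grid.unsqueeze(0) + flow                                     # sampling positions in pixels
    # Normalize to [-1, 1]; grid_sample expects (B, H, W, 2) ordered as (x, y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(feats, grid_norm, align_corners=True)

warped = warp_by_flow(torch.randn(1, 64, 48, 48), torch.zeros(1, 2, 48, 48))  # zero flow = identity
```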
Submitted 14 March, 2020; v1 submitted 15 March, 2019;
originally announced March 2019.
-
LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation
Authors:
Tak-Wai Hui,
Xiaoou Tang,
Chen Change Loy
Abstract:
FlowNet2, the state-of-the-art convolutional neural network (CNN) for optical flow estimation, requires over 160M parameters to achieve accurate flow estimation. In this paper we present an alternative network that outperforms FlowNet2 on the challenging Sintel final pass and KITTI benchmarks, while being 30 times smaller in the model size and 1.36 times faster in the running speed. This is made possible by drilling down to architectural details that might have been missed in the current frameworks: (1) We present a more effective flow inference approach at each pyramid level through a lightweight cascaded network. It not only improves flow estimation accuracy through early correction, but also permits seamless incorporation of descriptor matching in our network. (2) We present a novel flow regularization layer to ameliorate the issue of outliers and vague flow boundaries by using a feature-driven local convolution. (3) Our network owns an effective structure for pyramidal feature extraction and embraces feature warping rather than image warping as practiced in FlowNet2. Our code and trained models are available at https://github.com/twhui/LiteFlowNet .
Submitted 17 May, 2018;
originally announced May 2018.
-
Effect of Receive Spatial Diversity on the Degrees of Freedom Region in Multi-Cell Random Beamforming
Authors:
Hieu Duy Nguyen,
Rui Zhang,
Hon Tat Hui
Abstract:
The random beamforming (RBF) scheme, jointly applied with multi-user diversity based scheduling, is able to achieve virtually interference-free downlink transmissions with only partial channel state information (CSI) available at the transmitter. However, the impact of receive spatial diversity on the rate performance of RBF is not fully characterized yet even in a single-cell setup. In this paper, we study a multi-cell multiple-input multiple-output (MIMO) broadcast system with RBF applied at each base station (BS) and either the minimum-mean-square-error (MMSE), matched filter (MF), or antenna selection (AS) based spatial receiver employed at each mobile terminal. We investigate the effect of different spatial diversity receivers on the achievable sum-rate of multi-cell RBF systems subject to both the intra- and inter-cell interferences. We first derive closed-form expressions for the distributions of the receiver signal-to-interference-plus-noise ratio (SINR) with different spatial diversity techniques, based on which we compare their rate performances at finite signal-to-noise ratios (SNRs). We then investigate the asymptotically high-SNR regime and for a tractable analysis assume that the number of users in each cell scales in a certain order with the per-cell SNR as SNR goes to infinity. Under this setup, we characterize the degrees of freedom (DoF) region for multi-cell RBF systems with different types of spatial receivers, which consists of all the achievable DoF tuples for the individual sum-rate of all the cells. The DoF region analysis provides a succinct characterization of the interplays among the receive spatial diversity, multiuser diversity, spatial multiplexing gain, inter-/intra-cell interferences, and BSs' collaborative transmission.
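A small Monte Carlo sketch (not from the paper) of the per-beam SINRs whose distributions the analysis above characterizes: random orthonormal beams, Rayleigh channels, a matched-filter (MF) receive combiner, and greedy per-beam user scheduling. The antenna numbers, SNR, and the restriction to a single cell with no inter-cell interference are simplifying assumptions.

```python
# Simplified single-cell RBF simulation with an MF spatial receiver.
import numpy as np

rng = np.random.default_rng(0)
M, N, K, snr = 4, 2, 50, 10.0          # Tx beams, Rx antennas, users, per-cell SNR (linear)

# Random orthonormal beamforming vectors (columns of a random unitary matrix).
beams, _ = np.linalg.qr(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))

# One Rayleigh-fading channel matrix per user, shared across beams.
channels = [(rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
            for _ in range(K)]

sum_rate = 0.0
for m in range(M):
    best_sinr = 0.0
    for H in channels:
        h_m = H @ beams[:, m]
        w = h_m / np.linalg.norm(h_m)                     # MF combiner aligned with beam m
        signal = snr / M * np.abs(w.conj() @ h_m) ** 2
        interf = sum(snr / M * np.abs(w.conj() @ (H @ beams[:, j])) ** 2
                     for j in range(M) if j != m)
        best_sinr = max(best_sinr, signal / (interf + 1.0))  # unit-variance noise
    sum_rate += np.log2(1.0 + best_sinr)                     # schedule the best user per beam

print(f"simulated sum-rate: {sum_rate:.2f} bits/s/Hz")
```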
Submitted 19 June, 2013; v1 submitted 24 March, 2013;
originally announced March 2013.
-
Multi-Cell Random Beamforming: Achievable Rate and Degrees of Freedom Region
Authors:
Hieu Duy Nguyen,
Rui Zhang,
Hon Tat Hui
Abstract:
Random beamforming (RBF) is a practically favourable transmission scheme for multiuser multi-antenna downlink systems since it requires only partial channel state information (CSI) at the transmitter. Under the conventional single-cell setup, RBF is known to achieve the optimal sum-capacity scaling law as the number of users goes to infinity, thanks to the multiuser diversity enabled transmission scheduling that virtually eliminates the intra-cell interference. In this paper, we extend the study of RBF to a more practical multi-cell downlink system with single-antenna receivers subject to the additional inter-cell interference (ICI). First, we consider the case of finite signal-to-noise ratio (SNR) at each receiver. We derive a closed-form expression of the achievable sum-rate with the multi-cell RBF, based upon which we show an asymptotic sum-rate scaling law as the number of users goes to infinity. Next, we consider the high-SNR regime and for tractable analysis assume that the number of users in each cell scales in a certain order with the per-cell SNR. Under this setup, we characterize the achievable degrees of freedom (DoF) for the single-cell case with RBF. Then we extend the analysis to the multi-cell RBF case by characterizing the DoF region. It is shown that the DoF region characterization provides useful guideline on how to design a cooperative multi-cell RBF system to achieve optimal throughput tradeoffs among different cells. Furthermore, our results reveal that the multi-cell RBF scheme achieves the "interference-free DoF" region upper bound for the multi-cell system, provided that the per-cell number of users has a sufficiently large scaling order with the SNR. Our result thus confirms the optimality of multi-cell RBF in this regime even without the complete CSI at the transmitter, as compared to other full-CSI requiring transmission schemes such as interference alignment.
Submitted 8 May, 2013; v1 submitted 25 May, 2012;
originally announced May 2012.