
Showing 1–50 of 66 results for author: Chai, W

Searching in archive cs.
  1. arXiv:2507.02591  [pdf, ps, other]

    cs.CV

    AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

    Authors: Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang

    Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length wi…

    Submitted 23 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Comments: ICCV 2025 Camera Ready
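
    The efficiency claim here is concrete: a linear RNN carries a fixed-size state that is updated once per token, so memory stays constant in sequence length, whereas transformer attention grows quadratically. A minimal sketch of such a recurrence follows; the shapes and the diagonal-decay form are illustrative assumptions, not AuroraLong's actual architecture.

    ```python
    import torch

    def linear_rnn_step(state, x, A, B):
        """One recurrence step: state <- A * state + B @ x (diagonal decay)."""
        return A * state + B @ x

    d_state, d_in = 64, 32
    A = torch.rand(d_state) * 0.9          # decay factors in (0, 0.9) keep the state stable
    B = torch.randn(d_state, d_in) * 0.1
    state = torch.zeros(d_state)

    # Arbitrary-length input, yet the state (the "memory") never grows.
    for x in torch.randn(4096, d_in):
        state = linear_rnn_step(state, x, A, B)
    print(state.shape)                     # torch.Size([64]) regardless of sequence length
    ```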

  2. arXiv:2506.20066  [pdf, ps, other]

    cs.CV

    ToSA: Token Merging with Spatial Awareness

    Authors: Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang

    Abstract: Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens onl…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted by IROS 2025
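
    For a concrete picture of feature-similarity token merging and where a spatial criterion could slot in, here is a hedged sketch: greedily average the highest-scoring token pairs, scoring each pair by cosine similarity blended with grid proximity. The blending rule and all shapes are illustrative assumptions, not ToSA's actual algorithm.

    ```python
    import torch
    import torch.nn.functional as F

    def merge_tokens(tokens, coords, r, alpha=0.5):
        """Greedily merge r token pairs using feature plus spatial similarity.

        tokens: (n, d) ViT token features; coords: (n, 2) patch grid positions.
        """
        normed = F.normalize(tokens, dim=-1)
        feat_sim = normed @ normed.T                        # cosine similarity
        spatial_sim = 1.0 / (1.0 + torch.cdist(coords, coords))
        score = alpha * feat_sim + (1.0 - alpha) * spatial_sim
        score.fill_diagonal_(float("-inf"))

        merged = tokens.clone()
        keep = torch.ones(len(tokens), dtype=torch.bool)
        for _ in range(r):
            masked = score.clone()
            masked[~keep, :] = float("-inf")                # ignore already-merged tokens
            masked[:, ~keep] = float("-inf")
            i, j = divmod(int(torch.argmax(masked)), masked.size(1))
            merged[i] = (merged[i] + merged[j]) / 2         # average the pair into one token
            keep[j] = False                                 # scores are not recomputed; fine for a sketch
        return merged[keep], coords[keep]

    tokens, coords = torch.randn(16, 8), torch.rand(16, 2)
    out, _ = merge_tokens(tokens, coords, r=4)
    print(out.shape)                                        # torch.Size([12, 8])
    ```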

  3. arXiv:2506.14168  [pdf, ps, other]

    cs.CV cs.AI

    VideoMAR: Autoregressive Video Generation with Continuous Tokens

    Authors: Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, Feng Zhao

    Abstract: Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, composing temporal frame-by-frame and spatial masked generation. We first id…

    Submitted 18 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

  4. arXiv:2506.11928  [pdf, ps, other]

    cs.SE cs.AI cs.CL cs.LG

    LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

    Authors: Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie

    Abstract: Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI tha…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Project Page at https://livecodebenchpro.com/

  5. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan , et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  6. arXiv:2505.23606  [pdf, ps, other]

    cs.LG cs.CV

    Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

    Authors: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

    Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We in…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: The code and model are available at https://github.com/M-E-AGI-Lab/Muddit

  7. arXiv:2505.23399  [pdf]

    cs.AI

    GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

    Authors: Jusheng Zhang, Yijia Fan, Wenjun Lin, Ruiqi Chen, Haoyi Jiang, Wenhao Chai, Jian Wang, Keze Wang

    Abstract: We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents--each specializing in visual perception subtasks--and a critical agent that verifies logic consistency and factual correctness. Agents communicate via structured cl…

    Submitted 29 May, 2025; originally announced May 2025.

  8. arXiv:2505.02471  [pdf, ps, other]

    cs.CV

    Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang

    Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale repr…

    Submitted 12 June, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: https://github.com/inclusionAI/Ming/tree/Ming-Lite-Omni-Preview/Ming-unify

  9. arXiv:2505.01583  [pdf, ps, other]

    cs.CV cs.AI

    TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

    Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang

    Abstract: Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Pred…

    Submitted 2 May, 2025; originally announced May 2025.

  10. arXiv:2504.14693  [pdf, other]

    cs.CV cs.AI

    Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

    Authors: Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang

    Abstract: Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary…

    Submitted 2 May, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: Code, docs, and benchmark are all available at https://enxinsong.com/Video-MMLU-web/

  11. arXiv:2504.13129  [pdf, other]

    cs.CV cs.AI cs.LG

    Science-T2I: Addressing Scientific Illusions in Image Synthesis

    Authors: Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie

    Abstract: We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward m…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Accepted to CVPR 2025. Code, docs, weights, benchmark and training data are all available at https://jialuo-li.github.io/Science-T2I-Web

  12. arXiv:2504.05979  [pdf, other]

    cs.CV

    An Empirical Study of GPT-4o Image Generation Capabilities

    Authors: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi

    Abstract: The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, but their architectural design remains mysterious and unpublished. This…

    Submitted 10 April, 2025; v1 submitted 8 April, 2025; originally announced April 2025.

  13. arXiv:2504.02826  [pdf, other]

    cs.CV

    Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

    Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan

    Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (R…

    Submitted 27 May, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

  14. arXiv:2503.08604  [pdf, other]

    cs.RO cs.AI

    EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

    Authors: Dongping Li, Tielong Cai, Tianci Tang, Wenhao Chai, Katherine Rose Driggs-Campbell, Gaoang Wang

    Abstract: Developing autonomous home robots controlled by natural language has long been a pursuit of humanity. While advancements in large language models (LLMs) and embodied intelligence make this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, data incompatibility between LLMs and mobile manipulation trajectori…

    Submitted 14 May, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  15. arXiv:2503.04240  [pdf, other]

    cs.CL

    DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models

    Authors: Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu

    Abstract: Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution fo…

    Submitted 25 May, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: ACL 2025

  16. arXiv:2502.20172  [pdf, other]

    cs.CV cs.CL

    Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

    Authors: Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang

    Abstract: The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: 13 pages, 9 figures, codebase in https://github.com/chenllliang/DreamEngine

  17. arXiv:2502.16303  [pdf, other]

    cs.CV

    Pointmap Association and Piecewise-Plane Constraint for Consistent and Compact 3D Gaussian Segmentation Field

    Authors: Wenhao Hu, Wenhao Chai, Shengyu Hao, Xiaotong Cui, Xuexiang Wen, Jenq-Neng Hwang, Gaoang Wang

    Abstract: Achieving a consistent and compact 3D segmentation field is crucial for maintaining semantic coherence across views and accurately representing scene structures. Previous 3D scene segmentation methods rely on video segmentation models to address inconsistencies across views, but the absence of spatial information often leads to object misassociation when objects temporarily disappear and reappear.…

    Submitted 22 February, 2025; originally announced February 2025.

  18. arXiv:2501.16551  [pdf, other]

    cs.CV cs.AI cs.LG

    PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

    Authors: Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang

    Abstract: Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is esse…

    Submitted 27 January, 2025; originally announced January 2025.

  19. arXiv:2411.11922  [pdf, other]

    cs.CV

    SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

    Authors: Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

    Abstract: The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next…

    Submitted 30 November, 2024; v1 submitted 18 November, 2024; originally announced November 2024.

    Comments: Project page is available at https://yangchris11.github.io/samurai/
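
    The fixed-window limitation the abstract points to can be made concrete: a sliding window conditions on the most recent frames regardless of how reliable their masks were. Below is a hedged sketch of selecting memory frames by a quality score instead; the field names are hypothetical, and SAM 2's real memory bank stores learned embeddings while SAMURAI's actual selection also folds in motion cues.

    ```python
    def select_memory(history, k=7):
        """Keep the k highest-quality past frames, then restore temporal order.

        history: list of dicts like {"frame": t, "features": ..., "score": float}.
        """
        by_quality = sorted(history, key=lambda m: m["score"], reverse=True)[:k]
        return sorted(by_quality, key=lambda m: m["frame"])

    history = [{"frame": t, "features": None, "score": s}
               for t, s in enumerate([0.9, 0.4, 0.95, 0.2, 0.88, 0.7, 0.93, 0.5])]
    print([m["frame"] for m in select_memory(history, k=4)])  # -> [0, 2, 4, 6]
    ```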

  20. arXiv:2410.15074  [pdf, other]

    cs.CV cs.AI

    LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

    Authors: Xuechen Guo, Wenhao Chai, Shi-Yan Li, Gaoang Wang

    Abstract: Multimodal Large Language Models (MLLMs) have recently garnered attention as a prominent research focus. By harnessing powerful LLMs, they facilitate a transition of conversational generative AI from unimodal text to performing multimodal tasks. This boom has begun to significantly impact the medical field. However, general visual language models (VLMs) lack sophisticated comprehension for medical visual quest…

    Submitted 19 October, 2024; originally announced October 2024.

  21. arXiv:2410.08530  [pdf, other]

    cs.CV cs.MM

    Ego3DT: Tracking Every 3D Object in Ego-centric Videos

    Authors: Shengyu Hao, Wenhao Chai, Zhonghan Zhao, Meiqi Sun, Wendi Hu, Jieyang Zhou, Yixian Zhao, Qi Li, Yizhou Wang, Xi Li, Gaoang Wang

    Abstract: The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and track…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: Accepted by ACM Multimedia 2024

  22. arXiv:2410.04070  [pdf, other]

    cs.CL cs.AI

    PAD: Personalized Alignment of LLMs at Decoding-Time

    Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu

    Abstract: Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences…

    Submitted 13 March, 2025; v1 submitted 5 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  23. arXiv:2410.03051  [pdf, other]

    cs.CV

    AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

    Authors: Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning

    Abstract: Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by…

    Submitted 9 April, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: Accepted to ICLR 2025. Code, docs, weights, benchmark and training data are all available at https://rese1f.github.io/aurora-web/

  24. arXiv:2407.14900  [pdf, other]

    cs.CV

    AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

    Authors: Yunlong Lin, Tian Ye, Sixiang Chen, Zhenqi Fu, Yingying Wang, Wenhao Chai, Zhaohu Xing, Lei Zhu, Xinghao Ding

    Abstract: Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents…

    Submitted 23 July, 2024; v1 submitted 20 July, 2024; originally announced July 2024.

    Comments: 21 pages, 9 figures

  25. arXiv:2407.13937  [pdf, other]

    cs.CV

    Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

    Authors: Sheng-Yao Kuan, Jen-Hao Cheng, Hsiang-Wei Huang, Wenhao Chai, Cheng-Yen Yang, Hugo Latapie, Gaowen Liu, Bing-Fei Wu, Jenq-Neng Hwang

    Abstract: In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's E…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 2024 IEEE Intelligent Vehicles Symposium (IV)

  26. arXiv:2407.13930  [pdf, other]

    cs.CV cs.AI eess.SP

    RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

    Authors: Yuan-Hao Ho, Jen-Hao Cheng, Sheng Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang

    Abstract: Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy preservation, rendering the method m…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  27. arXiv:2406.11247  [pdf, other]

    cs.CV

    STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

    Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Ke Ma, Kewei Chen, Dongxu Guo, Tian Ye, Yanting Zhang, Hongwei Wang, Gaoang Wang

    Abstract: Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challengin…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: CVPR 2024 Embodied AI Workshop

  28. arXiv:2406.04983  [pdf, other]

    cs.CV

    CityCraft: A Real Crafter for 3D City Generation

    Authors: Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Qixuan Huang, Mingyan Gao, Jianshu Guo, Shengyu Hao, Wenhao Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang

    Abstract: City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neur…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 20 pages, 9 figures

  29. arXiv:2404.17176  [pdf, other]

    cs.CV

    MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

    Authors: Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang

    Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long…

    Submitted 26 April, 2024; originally announced April 2024.

  30. arXiv:2404.12871  [pdf, other]

    cs.SI math.CO physics.soc-ph

    Expanding the Katz Index for Link Prediction: A Case Study on a Live Fish Movement Network

    Authors: Michael-Sam Vidza, Marcin Budka, Wei Koong Chai, Mark Thrush, Mickael Teixeira Alves

    Abstract: In aquaculture, disease spread models often neglect the dynamic interactions between farms, hindering accuracy. This study enhances the Katz index (KI) to incorporate spatial and temporal patterns of fish movement, improving the prediction of farms susceptible to disease via live fish transfers. We modified the Katz index to create models like the Weighted Katz Index (WKI), Edge Weighted Katz Inde…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: 15 pages, 3 figures, submitted to Expert Systems with Applications
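
    For reference, the classic Katz index scores a node pair by summing all walks between the nodes, discounting a length-k walk by beta^k; in closed form K = (I - beta*A)^(-1) - I, which converges when beta < 1/lambda_max(A). A minimal sketch follows; feeding it a weighted movement matrix, as the WKI variant presumably does, is only an assumption about the variants' general shape.

    ```python
    import numpy as np

    def katz_index(A, beta=0.1):
        """Katz similarity matrix for an adjacency (or weighted movement) matrix A."""
        lam_max = np.max(np.abs(np.linalg.eigvals(A)))
        assert beta < 1.0 / lam_max, "beta must be below 1/lambda_max to converge"
        n = A.shape[0]
        return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

    # Toy movement network: entry (i, j) = number of live-fish transfers i -> j.
    A = np.array([[0, 3, 0, 1],
                  [1, 0, 2, 0],
                  [0, 1, 0, 4],
                  [0, 0, 1, 0]], dtype=float)
    print(np.round(katz_index(A), 3))  # higher score = stronger predicted link
    ```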

  31. arXiv:2404.04910  [pdf, other]

    cs.CV

    MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

    Authors: Hou-I Liu, Christine Wu, Jen-Hao Cheng, Wenhao Chai, Shian-Yun Wang, Gaowen Liu, Hugo Latapie, Jhih-Ciang Wu, Jenq-Neng Hwang, Hong-Han Shuai, Wen-Huang Cheng

    Abstract: Monocular 3D object detection (Mono3D) holds noteworthy promise for autonomous driving applications owing to the cost-effectiveness and rich visual context of monocular camera sensors. However, depth ambiguity poses a significant challenge, as it requires extracting precise 3D scene geometry from a single image, resulting in suboptimal performance when transferring knowledge from a LiDAR-based tea…

    Submitted 26 March, 2025; v1 submitted 7 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2025. Our code is available at https://github.com/hoiliu-0801/MonoTAKD

  32. arXiv:2404.04619  [pdf, other]

    cs.AI cs.CV

    Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

    Authors: Zhonghan Zhao, Ke Ma, Wenhao Chai, Xuan Wang, Kewei Chen, Dongxu Guo, Yanting Zhang, Hongwei Wang, Gaoang Wang

    Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks m…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: text overlap with arXiv:2403.08282

  33. arXiv:2403.18493  [pdf, other]

    cs.CV

    VersaT2I: Improving Text-to-Image Models with Versatile Reward

    Authors: Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang

    Abstract: Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of…

    Submitted 27 March, 2024; originally announced March 2024.

  34. arXiv:2403.10826  [pdf, other]

    cs.CV

    MambaMOT: State-Space Model as Motion Predictor for Multi-Object Tracking

    Authors: Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

    Abstract: In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibilities of…

    Submitted 20 January, 2025; v1 submitted 16 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP 2025. Previous version title: Exploring Learning-based Motion Models in Multi-Object Tracking
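
    The Kalman baseline being contrasted with learned motion models is the standard constant-velocity filter over object centers, as popularized by SORT. A minimal sketch with hand-picked noise levels follows; production trackers also track box scale and aspect ratio.

    ```python
    import numpy as np

    class ConstantVelocityKF:
        """Track [x, y, vx, vy]; observe only position, as in SORT-style MOT."""
        def __init__(self, dt=1.0):
            self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                               [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
            self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
            self.Q, self.R = np.eye(4) * 1e-2, np.eye(2) * 1e-1
            self.x, self.P = np.zeros(4), np.eye(4)

        def predict(self):                  # motion prediction used for association
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]

        def update(self, z):                # correct with a matched detection
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (z - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P

    kf = ConstantVelocityKF()
    for z in [[1.0, 1.0], [2.1, 1.9], [3.0, 3.1]]:   # roughly linear motion
        kf.predict()
        kf.update(np.array(z))
    print(kf.x.round(2))                              # position plus estimated velocity
    ```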

  35. arXiv:2403.08282  [pdf, other]

    cs.CV

    Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

    Authors: Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, Gaoang Wang

    Abstract: Due to the dynamic and unpredictable open-world setting, navigating complex environments in Minecraft poses significant challenges for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, crucial…

    Submitted 18 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: ICLR 2024 Workshop on LLM Agents

  36. arXiv:2402.09316  [pdf, other]

    cs.CV cs.LG

    Only My Model On My Data: A Privacy Preserving Approach Protecting one Model and Deceiving Unauthorized Black-Box Models

    Authors: Weiheng Chai, Brian Testa, Huantao Ren, Asif Salekin, Senem Velipasalar

    Abstract: Deep neural networks are extensively applied to real-world tasks, such as face recognition and medical image classification, where privacy and data protection are critical. Image data, if not protected, can be exploited to infer personal or contextual information. Existing privacy preservation methods, like encryption, generate perturbed images that are unrecognizable to even humans. Adversarial a…

    Submitted 14 February, 2024; originally announced February 2024.

  37. arXiv:2312.08887  [pdf, other]

    cs.CV cs.LG

    SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

    Authors: Weilong Chai, DanDan Zheng, Jiajiong Cao, Zhiquan Chen, Changbao Wang, Chenguang Ma

    Abstract: Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Existing acceleration methods usually require extensive training and are not universally applicable. LCM-LoRA, trainable once for diverse models, offers universality but rarely considers ensuring the consistency of generated content before and after acceleration. This paper propo…

    Submitted 1 October, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted to ECCV 2024

  38. User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

    Authors: Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, Gaoang Wang

    Abstract: Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning solves this problem by incorporating user prior knowledge into the model, such as writing styles and preferred vocabularies. Most existing me…

    Submitted 20 December, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

  39. arXiv:2312.01508  [pdf, other]

    cs.CV

    CityGen: Infinite and Controllable City Layout Generation

    Authors: Jie Deng, Wenhao Chai, Jianshu Guo, Qixuan Huang, Junsheng Huang, Wenhao Hu, Shengyu Hao, Jenq-Neng Hwang, Gaoang Wang

    Abstract: The recent surge in interest in city layout generation underscores its significance in urban planning and smart city development. The task involves procedurally or automatically generating spatial arrangements for urban elements such as roads, buildings, water, and vegetation. Previous methods, whether procedural modeling or deep learning-based approaches like VAEs and GANs, rely on complex priors…

    Submitted 11 April, 2025; v1 submitted 3 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2025 USM3D Workshop

  40. arXiv:2311.16477  [pdf, other]

    cs.CV

    UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning

    Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang

    Abstract: In recent times, there has been a growing interest in developing effective perception techniques for combining information from multiple modalities. This involves aligning features obtained from diverse sources to enable more efficient training with larger datasets and constraints, as well as leveraging the wealth of information contained in each modality. 2D and 3D Human Pose Estimation (HPE) are…

    Submitted 24 November, 2023; originally announced November 2023.

  41. arXiv:2311.15209  [pdf, other]

    cs.AI

    See and Think: Embodied Agent in Virtual Environment

    Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Gaoang Wang

    Abstract: Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpre…

    Submitted 9 July, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

    Comments: ECCV 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/STEVE/

  42. arXiv:2311.12043  [pdf, other]

    cs.CV cs.AI

    Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation

    Authors: Zhuoran Zhou, Zhongyu Jiang, Wenhao Chai, Cheng-Yen Yang, Lei Li, Jenq-Neng Hwang

    Abstract: Although 3D human pose estimation has seen impressive development in recent years, only a few works focus on infants, who have different bone lengths and also have limited data. Directly applying adult pose estimation models typically achieves low performance in the infant domain and suffers from out-of-distribution issues. Moreover, the limitation of infant pose data collection also heavily co…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: WACVW 2024

  43. arXiv:2309.13770  [pdf, other]

    cs.LG cs.CV

    Devil in the Number: Towards Robust Multi-modality Data Filter

    Authors: Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang

    Abstract: In order to appropriately filter multi-modality data sets at web scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, the LAION papers employ the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within i…

    Submitted 24 September, 2023; originally announced September 2023.

    Comments: ICCV 2023 Workshop: TNGCV-DataComp
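
    The CLIP-score filter mentioned here is simple to state: embed the image and the caption with CLIP, and keep the pair if their cosine similarity clears a threshold. A hedged sketch with open_clip follows; the model tag is one of the library's published checkpoints, and 0.28 is a commonly cited LAION-style cutoff, used here only as an illustration.

    ```python
    import torch
    from PIL import Image
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def clip_score(image: Image.Image, caption: str) -> float:
        """Cosine similarity between CLIP image and text embeddings."""
        with torch.no_grad():
            img = model.encode_image(preprocess(image).unsqueeze(0))
            txt = model.encode_text(tokenizer([caption]))
            img = img / img.norm(dim=-1, keepdim=True)
            txt = txt / txt.norm(dim=-1, keepdim=True)
            return (img @ txt.T).item()

    def keep_pair(image, caption, threshold=0.28):
        return clip_score(image, caption) >= threshold
    ```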

  44. arXiv:2309.03599  [pdf, other]

    cs.CV

    Chasing Consistency in Text-to-3D Generation from a Single Image

    Authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang

    Abstract: Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we p…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: 9 pages, 11 figures

  45. arXiv:2308.09953  [pdf, other]

    cs.CV

    UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

    Authors: Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang

    Abstract: Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals and the lack of data on rare spe…

    Submitted 19 August, 2023; originally announced August 2023.

  46. arXiv:2308.09678  [pdf, other]

    cs.CV cs.AI cs.MM cs.RO

    PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

    Authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie

    Abstract: Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to bridge this data disparity gap in the target domain. Typically, PoSynDA uses a diffusion-inspired structure to…

    Submitted 16 October, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA

  47. arXiv:2308.09592  [pdf, other]

    cs.CV

    StableVideo: Text-driven Consistency-aware Diffusion Video Editing

    Authors: Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu

    Abstract: Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  48. arXiv:2308.01555  [pdf, other]

    cs.RO

    Mani-GPT: A Generative Model for Interactive Robotic Manipulation

    Authors: Zhe Zhang, Wei Chai, Jiankun Wang

    Abstract: In real-world scenarios, human dialogues are multi-round and diverse. Furthermore, human instructions can be unclear and human responses are unrestricted. Interactive robots face difficulties in understanding human intents and generating suitable strategies for assisting individuals through manipulation. In this article, we propose Mani-GPT, a Generative Pre-trained Transformer (GPT) for interacti…

    Submitted 7 August, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

  49. arXiv:2308.01164  [pdf, other]

    cs.RO

    Virtual Reality Based Robot Teleoperation via Human-Scene Interaction

    Authors: Lingxiao Meng, Jiangshan Liu, Wei Chai, Jiankun Wang, Max Q. -H. Meng

    Abstract: Robot teleoperation gains great success in various situations, including chemical pollution rescue, disaster relief, and long-distance manipulation. In this article, we propose a virtual reality (VR) based robot teleoperation system to achieve more efficient and natural interaction with humans in different scenes. A user-friendly VR interface is designed to help users interact with a desktop scene…

    Submitted 2 August, 2023; originally announced August 2023.

  50. arXiv:2307.16449  [pdf, other]

    cs.CV

    MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

    Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

    Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-S…

    Submitted 9 March, 2024; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: CVPR 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/MovieChat/