
Showing 1–50 of 624 results for author: Cheng, H

Searching in archive cs.
  1. arXiv:2507.15677  [pdf, ps, other]

    cs.RO

    Data-Driven MPC with Data Selection for Flexible Cable-Driven Robotic Arms

    Authors: Huayue Liang, Yanbo Chen, Hongyang Cheng, Yanzhao Yu, Shoujie Li, Junbo Tan, Xueqian Wang, Long Zeng

    Abstract: Flexible cable-driven robotic arms (FCRAs) offer dexterous and compliant motion. Still, the inherent properties of cables, such as resilience, hysteresis, and friction, often lead to particular difficulties in modeling and control. This paper proposes a model predictive control (MPC) method that relies exclusively on input-output data, without a physical model, to improve the control accuracy of F… ▽ More

    Submitted 21 July, 2025; originally announced July 2025.

  2. Hierarchical Graph Information Bottleneck for Multi-Behavior Recommendation

    Authors: Hengyu Zhang, Chunxu Shen, Xiangguo Sun, Jie Tan, Yanchao Tan, Yu Rong, Hong Cheng, Lingling Yi

    Abstract: In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current st… ▽ More

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: Accepted by RecSys2025

  3. arXiv:2507.07961  [pdf, ps, other]

    quant-ph cs.IT math.FA math.OA

    Sharp estimates of quantum covering problems via a novel trace inequality

    Authors: Hao-Chung Cheng, Li Gao, Christoph Hirche, Hao-Wei Huang, Po-Chieh Liu

    Abstract: In this paper, we prove a novel trace inequality involving two operators. As applications, we sharpen the one-shot achievability bound on the relative entropy error in a wealth of quantum covering-type problems, such as soft covering, privacy amplification, convex splitting, quantum information decoupling, and quantum channel simulation by removing some dimension-dependent factors. Moreover, the e… ▽ More

    Submitted 10 July, 2025; originally announced July 2025.

  4. arXiv:2507.07694  [pdf, ps, other]

    cs.CL

    SAS: Simulated Attention Score

    Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao

    Abstract: The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficien… ▽ More

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: Tech Report

  5. Shuffling for Semantic Secrecy

    Authors: Fupei Chen, Liyao Xiang, Haoxiang Sun, Hei Victor Cheng, Kaiming Shen

    Abstract: Deep learning draws heavily on the latest progress in semantic communications. The present paper aims to examine the security aspect of this cutting-edge technique from a novel shuffling perspective. Our goal is to improve upon the conventional secure coding scheme to strike a desirable tradeoff between transmission rate and leakage rate. To be more specific, for a wiretap channel, we seek to maxi… ▽ More

    Submitted 9 July, 2025; originally announced July 2025.

    Journal ref: IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5240-5255, 2025

  6. arXiv:2507.07065  [pdf, ps, other]

    quant-ph cs.IT

    Layer Cake Representations for Quantum Divergences

    Authors: Po-Chieh Liu, Christoph Hirche, Hao-Chung Cheng

    Abstract: Defining suitable quantum extensions of classical divergences often poses a challenge due to the non-commutative nature of quantum information. In this work, we propose a new approach via what we call the layer cake representation. The resulting quantum Rényi and $f$-divergences are then proven to be equivalent to those recently defined via integral representations. Nevertheless, the approach can… ▽ More

    Submitted 10 July, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

    Comments: 2nd version: typo corrected

  7. arXiv:2507.06607  [pdf, ps, other]

    cs.CL cs.LG

    Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

    Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen

    Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we… ▽ More

    Submitted 16 July, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

  8. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde… ▽ More

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  9. arXiv:2507.06232  [pdf, ps, other]

    quant-ph cs.IT math-ph math.FA

    Error Exponents for Quantum Packing Problems via An Operator Layer Cake Theorem

    Authors: Hao-Chung Cheng, Po-Chieh Liu

    Abstract: In this work, we prove a one-shot random coding bound for classical-quantum channel coding, a problem conjectured by Burnashev and Holevo in 1998. By choosing the optimal input distribution, we recover the optimal error exponent (i.e., the reliability function) of classical-quantum channels for rates above the critical rate. Our result extends to various quantum packing-type problems, including cl… ▽ More

    Submitted 8 July, 2025; originally announced July 2025.

  10. arXiv:2507.06229  [pdf, ps, other]

    cs.CL cs.AI

    Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

    Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

    Abstract: Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-stud… ▽ More

    Submitted 21 July, 2025; v1 submitted 8 July, 2025; originally announced July 2025.

  11. arXiv:2507.05110  [pdf, ps, other]

    cs.AI

    Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift

    Authors: Shixuan Liu, Yue He, Yunfei Wang, Hao Zou, Haoxiang Cheng, Wenjing Yang, Peng Cui, Zhong Liu

    Abstract: Logical rule learning, a prominent category of knowledge graph (KG) reasoning methods, constitutes a critical research area aimed at learning explicit rules from observed facts to infer missing knowledge. However, like all KG reasoning methods, rule learning suffers from a critical weakness: its dependence on the I.I.D. assumption. This assumption can easily be violated due to selection bias during… ▽ More

    Submitted 10 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

  12. arXiv:2507.02645  [pdf, ps, other]

    cs.LG cs.CV

    Fair Deepfake Detectors Can Generalize

    Authors: Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

    Abstract: Deepfake detection models face two critical challenges: generalization to unseen manipulations and demographic fairness among population groups. However, existing approaches often demonstrate that these two objectives are inherently conflicting, revealing a trade-off between them. In this paper, we, for the first time, uncover and formally define a causal relationship between fairness and generali… ▽ More

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 14 pages, version 1

  13. arXiv:2506.23549  [pdf, ps, other]

    cs.AI cs.HC cs.LG

    CooT: Learning to Coordinate In-Context with Coordination Transformers

    Authors: Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

    Abstract: Effective coordination among artificial agents in dynamic and uncertain environments remains a significant challenge in multi-agent systems. Existing approaches, such as self-play and population-based methods, either generalize poorly to unseen partners or require extensive training. To overcome these limitations, we propose Coordination Transformers (CooT), a novel in-context coordination framewo… ▽ More

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 23 pages, 10 tables, 8 figures

  14. arXiv:2506.21712  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

    Authors: Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang

    Abstract: In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-… ▽ More

    Submitted 26 June, 2025; originally announced June 2025.

  15. arXiv:2506.19199  [pdf, ps, other]

    eess.SY cs.MA cs.RO

    Low-Cost Infrastructure-Free 3D Relative Localization with Sub-Meter Accuracy in Near Field

    Authors: Qiangsheng Gao, Ka Ho Cheng, Li Qiu, Zijun Gong

    Abstract: Relative localization in the near-field scenario is critically important for unmanned vehicle (UxV) applications. Although related works addressing 2D relative localization problem have been widely studied for unmanned ground vehicles (UGVs), the problem in 3D scenarios for unmanned aerial vehicles (UAVs) involves more uncertainties and remains to be investigated. Inspired by the phenomenon that a… ▽ More

    Submitted 23 June, 2025; originally announced June 2025.

  16. arXiv:2506.18729  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

    Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang

    Abstract: We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest… ▽ More

    Submitted 24 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: Accepted by the 42nd International Conference on Machine Learning (ICML 2025)

  17. arXiv:2506.17328  [pdf, ps, other]

    cs.RO

    Reflective VLM Planning for Dual-Arm Desktop Cleaning: Bridging Open-Vocabulary Perception and Precise Manipulation

    Authors: Yufan Liu, Yi Wu, Gweneth Ge, Haoliang Cheng, Rui Liu

    Abstract: Desktop cleaning demands open-vocabulary recognition and precise manipulation for heterogeneous debris. We propose a hierarchical framework integrating reflective Vision-Language Model (VLM) planning with dual-arm execution via structured scene representation. Grounded-SAM2 facilitates open-vocabulary detection, while a memory-augmented VLM generates, critiques, and revises manipulation sequences.… ▽ More

    Submitted 18 June, 2025; originally announced June 2025.

  18. arXiv:2506.16743  [pdf, ps, other]

    cs.CV

    Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention

    Authors: Weinan Guan, Wei Wang, Bo Peng, Ziwen He, Jing Dong, Haonan Cheng

    Abstract: With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for for… ▽ More

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Accepted by TIFS 2025. Our code is available at https://github.com/WeinanGuan/NASA-Swin

    Journal ref: IEEE Trans. Inf. Forensics Security, vol.20, pp. 5256-5268, 2025

  19. arXiv:2506.13695  [pdf, ps, other]

    cs.IR

    OneRec Technical Report

    Authors: Guorui Zhou, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Shiyao Wang, Weifeng Ding, Wuchao Li, Xinchen Luo, Xingmei Wang, Zexuan Cheng, Zixing Zhang, Bin Zhang, Boxuan Wang, Chaoyi Ma, Chengru Song, Chenhui Wang, Di Wang, Dongxue Meng, Fan Yang, Fangyu Zhang, et al. (40 additional authors not shown)

    Abstract: Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimizat… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Authors are listed alphabetically by their first name

  20. arXiv:2506.12606  [pdf, ps, other]

    cs.CL cs.AI

    An Exploration of Mamba for Speech Self-Supervised Models

    Authors: Tzu-Quan Lin, Heng-Cheng Kuo, Tzu-Chieh Wei, Hsi-Chun Cheng, Chun-Wei Chen, Hsien-Fu Hsiao, Yu Tsao, Hung-yi Lee

    Abstract: While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context… ▽ More

    Submitted 14 June, 2025; originally announced June 2025.

  21. arXiv:2506.11130  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

    Authors: Cheng-Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee

    Abstract: We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. W… ▽ More

    Submitted 16 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

  22. arXiv:2506.10145  [pdf, ps, other]

    cs.CV

    RoCA: Robust Cross-Domain End-to-End Autonomous Driving

    Authors: Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

    Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohib… ▽ More

    Submitted 17 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  23. arXiv:2506.08303  [pdf, ps, other]

    cs.HC

    EMG-Driven Stiffness-Modulating Palpation for Telerehabilitation

    Authors: Thomas M. Kwok, Hilary HY Cheng, Wai Tuck Chow

    Abstract: In this work, we introduce HJ-Pal, a lightweight wearable haptic device that leverages EMG-driven honeycomb jamming to render muscle activation as kinesthetic feedback, enabling remote palpation for small muscle assessment in telerehabilitation.

    Submitted 12 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted by the Workshop on Human-Robot Contact and Manipulation (HRCM 2025) at RSS Conference 2025

  24. arXiv:2506.07652  [pdf, ps, other]

    cs.CV cs.AI

    FMaMIL: Frequency-Driven Mamba Multi-Instance Learning for Weakly Supervised Lesion Segmentation in Medical Images

    Authors: Hangbei Cheng, Xiaorong Dong, Xueyu Liu, Jianan Zhang, Xuetao Ma, Mingqiang Wei, Liansheng Wang, Junxin Chen, Yongfei Wu

    Abstract: Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-… ▽ More

    Submitted 9 June, 2025; originally announced June 2025.

  25. arXiv:2506.07490  [pdf, other]

    cs.RO

    RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy

    Authors: Zhaoliang Wan, Zetong Bi, Zida Zhou, Hao Ren, Yiming Zeng, Yihan Li, Lu Qi, Xu Yang, Ming-Hsuan Yang, Hui Cheng

    Abstract: This paper addresses the scarcity of low-cost but high-dexterity platforms for collecting real-world multi-fingered robot manipulation data towards generalist robot autonomy. To achieve it, we propose the RAPID Hand, a co-optimized hardware and software platform where the compact 20-DoF hand, robust whole-hand perception, and high-DoF teleoperation interface are jointly designed. Specifically, RAP… ▽ More

    Submitted 9 June, 2025; originally announced June 2025.

  26. arXiv:2506.04499  [pdf, ps, other]

    cs.CV

    FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

    Authors: Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

    Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast… ▽ More

    Submitted 4 June, 2025; originally announced June 2025.

  27. arXiv:2506.03373  [pdf, ps, other]

    cs.CV cs.AI

    A Foundation Model for Spatial Proteomics

    Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, et al. (35 additional authors not shown)

    Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-superv… ▽ More

    Submitted 3 June, 2025; originally announced June 2025.

  28. arXiv:2506.02368  [pdf, ps, other]

    cs.IR

    NextQuill: Causal Preference Modeling for Enhancing LLM Personalization

    Authors: Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

    Abstract: Personalizing large language models (LLMs) for individual users has become increasingly important as they are progressively integrated into real-world applications to support users' daily lives. However, existing personalization approaches often fail to distinguish which components of model predictions and training data truly reflect user preferences, leading to superficial personalization alignme… ▽ More

    Submitted 2 June, 2025; originally announced June 2025.

  29. arXiv:2506.01947  [pdf, ps, other]

    eess.IV cs.CV

    RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

    Authors: Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao, et al. (8 additional authors not shown)

    Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW… ▽ More

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

  30. arXiv:2506.01758  [pdf, ps, other]

    cs.CV

    Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

    Authors: Tao Yang, Ruibin Li, Yangming Shi, Yuqi Zhang, Qide Dong, Haoran Cheng, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang

    Abstract: Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially, text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2… ▽ More

    Submitted 12 July, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

  31. arXiv:2506.01487  [pdf, ps, other]

    cs.CV

    FDSG: Forecasting Dynamic Scene Graphs

    Authors: Yi Yang, Yuren Cong, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

    Abstract: Dynamic scene graph generation extends scene graph generation from images to videos by modeling entity relationships and their temporal evolution. However, existing methods either generate scene graphs from observed frames without explicitly modeling temporal dynamics, or predict only relationships while assuming static entity labels and locations. These limitations hinder effective extrapolation… ▽ More

    Submitted 18 July, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: 16 pages, 8 figures, 12 tables

  32. arXiv:2506.00320  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

    Authors: Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

    Abstract: Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agent tasks. In this work, we pro… ▽ More

    Submitted 30 May, 2025; originally announced June 2025.

  33. arXiv:2505.23341  [pdf, ps, other]

    cs.CV

    DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

    Authors: Daoxi Cao, Hangbei Cheng, Yijin Li, Ruolin Zhou, Xuehan Zhang, Xinyi Li, Binwei Li, Xuancheng Gu, Jianan Zhang, Xueyu Liu, Yongfei Wu

    Abstract: Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a te… ▽ More

    Submitted 27 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  34. arXiv:2505.20828  [pdf, ps, other]

    cs.RO

    GET: Goal-directed Exploration and Targeting for Large-Scale Unknown Environments

    Authors: Lanxiang Zheng, Ruidong Mei, Mingxin Wei, Hao Ren, Hui Cheng

    Abstract: Object search in large-scale, unstructured environments remains a fundamental challenge in robotics, particularly in dynamic or expansive settings such as outdoor autonomous exploration. This task requires robust spatial reasoning and the ability to leverage prior experiences. While Large Language Models (LLMs) offer strong semantic capabilities, their application in embodied contexts is limited b… ▽ More

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  35. arXiv:2505.17389  [pdf, other]

    cs.RO cs.AI

    Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space

    Authors: Jinrong Yang, Kexun Chen, Zhuoling Li, Shengkai Wu, Yong Zhao, Liangliang Ren, Wenqiu Luo, Chaohui Shang, Meiyu Zhi, Linfeng Gao, Mingshan Sun, Hui Cheng

    Abstract: Imitation learning (IL) with human demonstrations is a promising method for robotic manipulation tasks. While minimal demonstrations enable robotic action execution, achieving high success rates and generalization requires high cost, e.g., continuously adding data or incrementally conducting human-in-loop processes with complex hardware/software systems. In this paper, we rethink the state/action… ▽ More

    Submitted 22 May, 2025; originally announced May 2025.

  36. arXiv:2505.16107  [pdf, ps, other]

    cs.CL

    MPL: Multiple Programming Languages with Large Language Models for Information Extraction

    Authors: Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng, Shikun Zhang

    Abstract: Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Pyt… ▽ More

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Findings of ACL2025

  37. arXiv:2505.13791  [pdf, ps, other]

    cs.LG physics.chem-ph

    Scalable Autoregressive 3D Molecule Generation

    Authors: Austin H. Cheng, Chong Sun, Alán Aspuru-Guzik

    Abstract: Generative models of 3D molecular structure play a rapidly growing role in the design and simulation of molecules. Diffusion models currently dominate the space of 3D molecule generation, while autoregressive models have trailed behind. In this work, we present Quetzal, a simple but scalable autoregressive model that builds molecules atom-by-atom in 3D. Treating each molecule as an ordered sequenc… ▽ More

    Submitted 19 May, 2025; originally announced May 2025.

  38. arXiv:2505.12499  [pdf, ps, other]

    cs.CV cs.IR cs.MM

    Contrastive Alignment with Semantic Gap-Aware Corrections in Text-Video Retrieval

    Authors: Jian Xiao, Zijie Song, Jialong Hu, Hao Cheng, Zhenzhen Hu, Jia Li, Richang Hong

    Abstract: Recent advances in text-video retrieval have been largely driven by contrastive learning frameworks. However, existing methods overlook a key source of optimization tension: the separation between text and video distributions in the representation space (referred to as the modality gap), and the prevalence of false negatives in batch sampling. These factors lead to conflicting gradients under the… ▽ More

    Submitted 2 June, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

  39. arXiv:2505.11636  [pdf, ps, other]

    cs.LG math.OC

    Generalization Guarantees for Learning Branch-and-Cut Policies in Integer Programming

    Authors: Hongyu Cheng, Amitabh Basu

    Abstract: Mixed-integer programming (MIP) provides a powerful framework for optimization problems, with Branch-and-Cut (B&C) being the predominant algorithm in state-of-the-art solvers. The efficiency of B&C critically depends on heuristic policies for making sequential decisions, including node selection, cut selection, and branching variable selection. While traditional solvers often employ heuristics wit… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

  40. arXiv:2505.11474  [pdf]

    cs.RO eess.SY

    REACT: Runtime-Enabled Active Collision-avoidance Technique for Autonomous Driving

    Authors: Heye Huang, Hao Cheng, Zhiyuan Zhou, Zijin Wang, Qichao Liu, Xiaopeng Li

    Abstract: Achieving rapid and effective active collision avoidance in dynamic interactive traffic remains a core challenge for autonomous driving. This paper proposes REACT (Runtime-Enabled Active Collision-avoidance Technique), a closed-loop framework that integrates risk assessment with active avoidance control. By leveraging energy transfer principles and human-vehicle-road interaction modeling, REACT dy… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: 22 pages, 11 figures

  41. arXiv:2505.05956  [pdf, other]

    eess.SP cs.LG cs.NI

    Multi-User Beamforming with Deep Reinforcement Learning in Sensing-Aided Communication

    Authors: Xiyu Wang, Gilberto Berardinelli, Hei Victor Cheng, Petar Popovski, Ramoni Adeogun

    Abstract: Mobile users are prone to experience beam failure due to beam drifting in millimeter wave (mmWave) communications. Sensing can help alleviate beam drifting with timely beam changes and low overhead since it does not need user feedback. This work studies the problem of optimizing sensing-aided communication by dynamically managing beams allocated to mobile users. A multi-beam scheme is introduced,… ▽ More

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted for Presentation at IEEE EuCNC & 6G Summit 2025

  42. arXiv:2505.05512  [pdf, other]

    cs.CV cs.RO

    Occupancy World Model for Robots

    Authors: Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Jingkai Sun, Jiahang Cao, Jiaxu Wang, Hao Cheng, Xiaozhu Ju, Zhengping Che, Renjing Xu, Jian Tang

    Abstract: Understanding and forecasting the scene evolutions deeply affect the exploration and decision of embodied agents. While traditional methods simulate scene evolutions through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods cluster on the outdoor structure…

    Submitted 7 May, 2025; originally announced May 2025.

  43. arXiv:2505.04460  [pdf, other]

    cs.CV

    Learning Real Facial Concepts for Independent Deepfake Detection

    Authors: Ming-Hui Liu, Harry Cheng, Tianyi Wang, Xin Luo, Xin-Shun Xu

    Abstract: Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach RealID to enhance generalization by learning a comprehensive concept of real faces whil…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI 2025

  44. arXiv:2505.03537  [pdf, ps, other]

    cs.RO

    Automated Action Generation based on Action Field for Robotic Garment Manipulation

    Authors: Hu Cheng, Fuyuki Tokuda, Kazuhiro Kosuge

    Abstract: Garment manipulation using robotic systems is a challenging task due to the diverse shapes and deformable nature of fabric. In this paper, we propose a novel method for robotic garment manipulation that significantly improves the accuracy while reducing computational time compared to previous approaches. Our method features an action generator that directly interprets scene images and generates pi…

    Submitted 6 May, 2025; originally announced May 2025.

  45. arXiv:2505.03522  [pdf, other]

    cs.CV cs.AI

    Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

    Authors: Haotong Cheng, Zhiqi Zhang, Hao Li, Xinshang Zhang

    Abstract: Deep learning has substantially advanced the Single Image Super-Resolution (SISR). However, existing researches have predominantly focused on raw performance gains, with little attention paid to quantifying the transferability of architectural components. In this paper, we introduce the concept of "Universality" and its associated definitions which extend the traditional notion of "Generalization"…

    Submitted 6 May, 2025; originally announced May 2025.

  46. arXiv:2505.03344  [pdf, other]

    cs.RO cs.LG

    RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation

    Authors: Keyu Chen, Wenchao Sun, Hao Cheng, Sifa Zheng

    Abstract: Achieving both realism and controllability in interactive closed-loop traffic simulation remains a key challenge in autonomous driving. Data-driven simulation methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and c…

    Submitted 6 May, 2025; originally announced May 2025.

  47. arXiv:2505.02484  [pdf, other]

    cs.AI cs.LG cs.MA physics.chem-ph

    El Agente: An Autonomous Agent for Quantum Chemistry

    Authors: Yunheng Zou, Austin H. Cheng, Abdulrahman Aldossary, Jiaru Bai, Shi Xuan Leong, Jorge Arturo Campos-Gonzalez-Angulo, Changhyeok Choi, Cher Tian Ser, Gary Tom, Andrew Wang, Zijian Zhang, Ilya Yakavets, Han Hao, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik

    Abstract: Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is bui…

    Submitted 5 May, 2025; originally announced May 2025.

  48. arXiv:2505.02331  [pdf, other]

    cs.CV cs.SD

    VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

    Authors: Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong

    Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced s…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Source code and pre-trained models will be available at https://github.com/MSA-LMC/VAEmo

  49. arXiv:2505.01207  [pdf, other]

    cs.CV

    T-Graph: Enhancing Sparse-view Camera Pose Estimation by Pairwise Translation Graph

    Authors: Qingyu Xian, Weiqin Jiao, Hao Cheng, Berend Jan van der Zwaag, Yanqiu Huang

    Abstract: Sparse-view camera pose estimation, which aims to estimate the 6-Degree-of-Freedom (6-DoF) poses from a limited number of images captured from different viewpoints, is a fundamental yet challenging problem in remote sensing applications. Existing methods often overlook the translation information between each pair of viewpoints, leading to suboptimal performance in sparse-view scenarios. To addres…

    Submitted 2 May, 2025; originally announced May 2025.

  50. arXiv:2504.20645  [pdf, other]

    cs.CV

    LDPoly: Latent Diffusion for Polygonal Road Outline Extraction in Large-Scale Topographic Mapping

    Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello

    Abstract: Polygonal road outline extraction from high-resolution aerial images is an important task in large-scale topographic mapping, where roads are represented as vectorized polygons, capturing essential geometric features with minimal vertex redundancy. Despite its importance, no existing method has been explicitly designed for this task. While polygonal building outline extraction has been extensively…

    Submitted 29 April, 2025; originally announced April 2025.