
Showing 1–50 of 2,399 results for author: Ma, Y

Searching in archive cs.
  1. arXiv:2507.17294  [pdf, ps, other]

    cs.RO cs.LG

    VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback

    Authors: Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, Harold Soh

    Abstract: Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: 19 pages, 5 figures

  2. arXiv:2507.16869  [pdf, ps, other]

    cs.GR cs.CV

    Controllable Video Generation: A Survey

    Authors: Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, Qifeng Chen

    Abstract: With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, wh…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: project page: https://github.com/mayuelala/Awesome-Controllable-Video-Generation

  3. arXiv:2507.16632  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Step-Audio 2 Technical Report

    Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, et al. (84 additional authors not shown)

    Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech convers…

    Submitted 22 July, 2025; originally announced July 2025.

  4. arXiv:2507.16429  [pdf, ps, other]

    cs.CV

    Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model

    Authors: Lin Xi, Yingliang Ma, Cheng Wang, Sandra Howell, Aldo Rinaldi, Kawal S. Rhode

    Abstract: Obtaining pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi-supervised methods often struggle to structure semantic…

    Submitted 22 July, 2025; originally announced July 2025.

  5. arXiv:2507.16424  [pdf, ps, other]

    cs.CL cs.LG

    PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

    Authors: Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma

    Abstract: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution of…

    Submitted 22 July, 2025; originally announced July 2025.

  6. arXiv:2507.16148  [pdf, ps, other]

    cs.LG q-bio.QM

    Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression

    Authors: Jindong Wang, Yutong Mao, Xiao Liu, Wenrui Hao

    Abstract: Alzheimer's disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progre…

    Submitted 21 July, 2025; originally announced July 2025.

  7. arXiv:2507.15759  [pdf, ps, other]

    cs.CL

    Interaction as Intelligence: Deep Research With Human-AI Partnership

    Authors: Lyumanshan Ye, Xiaojie Cai, Xinkai Wang, Junfei Wang, Xiangkun Hu, Jiadi Su, Yang Nan, Sihan Wang, Bohan Zhang, Xiaoze Fan, Jinbin Luo, Yuxiang Zheng, Tianze Xu, Dayuan Fu, Yunze Wu, Pengrui Lu, Zengzhi Wang, Yiwei Qin, Zhen Huang, Yan Ma, Zhulin Hu, Haoyang Zou, Tiantian Mi, Yixin Ye, Ethan Chern, et al. (1 additional author not shown)

    Abstract: This paper introduces the "Interaction as Intelligence" research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities, a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems e…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: 30 pages, 10 figures

  8. arXiv:2507.14694  [pdf, ps, other]

    cs.RO cs.CV

    Uncertainty-aware Probabilistic 3D Human Motion Forecasting via Invertible Networks

    Authors: Yue Ma, Kanglei Zhou, Fuyang Yu, Frederick W. B. Li, Xiaohui Liang

    Abstract: 3D human motion forecasting aims to enable autonomous applications. Estimating uncertainty for each prediction (i.e., confidence based on probability density or quantile) is essential for safety-critical contexts like human-robot collaboration to minimize risks. However, existing diverse motion forecasting approaches struggle with uncertainty quantification due to implicit probabilistic representa…

    Submitted 19 July, 2025; originally announced July 2025.

  9. arXiv:2507.14627  [pdf, ps, other]

    cs.NI eess.SY

    UAV-Enabled Wireless-Powered Underground Communication Networks: A Novel Time Allocation Approach

    Authors: Kaiqiang Lin, Yijie Mao, Onel Luis Alcaraz López, Mohamed-Slim Alouini

    Abstract: Wireless-powered underground communication networks (WPUCNs), which allow underground devices (UDs) to harvest energy from wireless signals for battery-free communication, offer a promising solution for sustainable underground monitoring. However, the severe wireless signal attenuation in challenging underground environments and the costly acquisition of channel state information (CSI) make large-…

    Submitted 19 July, 2025; originally announced July 2025.

    Comments: 14 pages, 8 figures, 3 tables, submitted to IEEE TGCN

  10. arXiv:2507.14447  [pdf, ps, other]

    cs.AI cs.CL

    Routine: A Structural Planning Framework for LLM Agent System in Enterprise

    Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

    Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter…

    Submitted 22 July, 2025; v1 submitted 18 July, 2025; originally announced July 2025.

    Comments: 26 pages, 8 figures, 5 tables

  11. arXiv:2507.14324  [pdf, ps, other]

    cs.CR quant-ph

    Quantum-Safe Identity Verification using Relativistic Zero-Knowledge Proof Systems

    Authors: Yao Ma, Wen Yu Kon, Jefferson Chu, Kevin Han Yong Loh, Kaushik Chakraborty, Charles Lim

    Abstract: Identity verification is the process of confirming an individual's claimed identity, which is essential in sectors like finance, healthcare, and online services to ensure security and prevent fraud. However, current password/PIN-based identity solutions are susceptible to phishing or skimming attacks, where malicious intermediaries attempt to steal credentials using fake identification portals. Al…

    Submitted 18 July, 2025; originally announced July 2025.

  12. arXiv:2507.13717  [pdf, ps, other]

    cs.NI

    ATRO: A Fast Solver-Free Algorithm for Topology and Routing Optimization of Reconfigurable Datacenter Networks

    Authors: Yingming Mao, Qiaozhu Zhai, Zhen Yao, Xia Zhu, Ximeng Liu, Xinchi Han

    Abstract: The growing scale and complexity of reconfigurable data center networks (DCNs) demand more scalable and efficient algorithms for computing logical topologies and routing. Reconfigurable DCNs typically operate in two modes: one-hop configurations that require frequent topology optimization (TO), and multi-hop scenarios that involve joint topology and routing optimization (TRO). In both cases, the c…

    Submitted 18 July, 2025; originally announced July 2025.

    ACM Class: C.2.3

  13. arXiv:2507.13575  [pdf, ps, other]

    cs.LG cs.AI

    Apple Intelligence Foundation Language Models: Tech Report 2025

    Authors: Hanzhi Zhou, Erik Hornberger, Pengsheng Guo, Xiyou Zhou, Saiwen Wang, Xin Wang, Yifei He, Xuankai Chang, Rene Rauch, Louis D'hauwe, John Peebles, Alec Doane, Kohen Chia, Jenna Thibodeau, Zi-Yi Dou, Yuanyang Zhang, Ruoming Pang, Reed Li, Zhifeng Chen, Jeremy Warner, Zhaoyang Xu, Sophy Lee, David Mizrahi, Ramsey Tantawi, Chris Chaney, et al. (370 additional authors not shown)

    Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transform…

    Submitted 17 July, 2025; originally announced July 2025.

  14. arXiv:2507.12883  [pdf, ps, other]

    cs.CV

    HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation

    Authors: Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji

    Abstract: The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, sim…

    Submitted 17 July, 2025; originally announced July 2025.

  15. arXiv:2507.12749  [pdf, ps, other]

    cs.HC

    PatternSight: A Perceptual Grouping Effectiveness Assessment Approach for Graphical Patterns in Charts

    Authors: Xumeng Wang, Xiangxuan Zhang, Zhiqi Gao, Shuangcheng Jiao, Yuxin Ma

    Abstract: The boom in visualization generation tools has significantly lowered the threshold for chart authoring. Nevertheless, chart authors with an insufficient understanding of perceptual theories may encounter difficulties in evaluating the effectiveness of chart representations, thereby struggling to identify the appropriate chart design to convey the intended data patterns. To address this issue, we p…

    Submitted 16 July, 2025; originally announced July 2025.

  16. arXiv:2507.12499  [pdf, ps, other]

    cs.RO cs.LG

    ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

    Authors: Yuhang Lu, Jiadong Tu, Yuexin Ma, Xinge Zhu

    Abstract: End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap,…

    Submitted 15 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  17. arXiv:2507.12301  [pdf, ps, other]

    eess.SP cs.IT

    Leveraging Bi-Directional Channel Reciprocity for Robust Ultra-Low-Rate Implicit CSI Feedback with Deep Learning

    Authors: Zhenyu Liu, Yi Ma, Rahim Tafazolli, Zhi Ding

    Abstract: Deep learning-based implicit channel state information (CSI) feedback has been introduced to enhance spectral efficiency in massive MIMO systems. Existing methods often show performance degradation in ultra-low-rate scenarios and inadaptability across diverse environments. In this paper, we propose Dual-ImRUNet, an efficient uplink-assisted deep implicit CSI feedback framework incorporating two no…

    Submitted 16 July, 2025; originally announced July 2025.

  18. arXiv:2507.11949  [pdf, ps, other]

    cs.GR cs.CV cs.RO

    MOSPA: Human Motion Generation Driven by Spatial Audio

    Authors: Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura

    Abstract: Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As…

    Submitted 16 July, 2025; originally announced July 2025.

  19. arXiv:2507.11500  [pdf, ps, other]

    cs.CR

    ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

    Authors: Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Chaowei Xiao

    Abstract: Large Language Models (LLMs) have demonstrated remarkable generative capabilities. However, their susceptibility to misuse has raised significant safety concerns. While post-training safety alignment methods have been widely adopted, LLMs remain vulnerable to malicious instructions that can bypass safety constraints. Recent efforts have introduced inference-time safety reasoning (system-2 alignmen…

    Submitted 14 July, 2025; originally announced July 2025.

  20. arXiv:2507.11334  [pdf, ps, other]

    cs.AI cs.RO

    CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking

    Authors: Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, Jiajun Lv

    Abstract: Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiti…

    Submitted 15 July, 2025; originally announced July 2025.

    Comments: Accepted by ACM MM 2025

  21. arXiv:2507.10496  [pdf, ps, other]

    cs.CV cs.AI

    Cameras as Relative Positional Encoding

    Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

    Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-leve…

    Submitted 14 July, 2025; originally announced July 2025.

    Comments: Project Page: https://www.liruilong.cn/prope/

  22. arXiv:2507.08416  [pdf, ps, other]

    cs.CV

    InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

    Authors: Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao

    Abstract: Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holisti…

    Submitted 21 July, 2025; v1 submitted 11 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025. Project page: https://zju3dv.github.io/instascene/

  23. arXiv:2507.08306  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG

    M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

    Authors: Inclusion AI, Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan, Yifan Mao, Yuting Xiao, Ziping Ma

    Abstract: Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model des…

    Submitted 11 July, 2025; originally announced July 2025.

    Comments: 31 pages, 14 figures

  24. arXiv:2507.08288  [pdf, ps, other]

    cs.CR cs.AI

    Invariant-based Robust Weights Watermark for Large Language Models

    Authors: Qingxiao Guo, Xinjie Zhu, Yilong Ma, Hui Jin, Yunhao Wang, Weifeng Zhang, Xiaobing Guo

    Abstract: Watermarking technology has gained significant attention due to the increasing importance of intellectual property (IP) rights, particularly with the growing deployment of large language models (LLMs) on billions of resource-constrained edge devices. To counter the potential threats of IP theft by malicious users, this paper introduces a robust watermarking scheme without retraining or fine-tuning fo…

    Submitted 10 July, 2025; originally announced July 2025.

  25. arXiv:2507.07418  [pdf, ps, other]

    cs.GT cs.AI

    Optimal Auction Design in the Joint Advertising

    Authors: Yang Li, Yuchao Ma, Qi Qi

    Abstract: Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of two advertisers in an ad slot instead of allocating a single advertiser, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to realize the optimality, as they tend to focus on indiv…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: Accepted by ICML 2025 (International Conference on Machine Learning). 17 pages, 4 figures

  26. arXiv:2507.07145  [pdf, ps, other]

    cs.LG

    CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs

    Authors: Zhaojing Zhou, Xunchao Li, Minghao Li, Handi Zhang, Haoshuang Wang, Wenbin Chang, Yiqun Liu, Qingqing Dang, Dianhai Yu, Yanjun Ma, Haifeng Wang

    Abstract: The rapid scaling of Large Language Models (LLMs) elevates inference costs and compounds substantial deployment barriers. While quantization to 8 or 4 bits mitigates this, sub-3-bit methods face severe accuracy, scalability, and efficiency degradation. We propose Convolutional Code Quantization (CCQ), an inference-optimized quantization approach compressing LLMs to 2.0-2.75 bits with minimal accur…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: 11 pages, 3 figures

  27. arXiv:2507.06892  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

    Authors: Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

    Abstract: Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a string…

    Submitted 11 July, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

    Comments: Preliminary version, v3, added the missing name of x-axis in the left part of Fig.1 and corrected a wrong number in Fig.3. Project page: https://anitaleungxx.github.io/ReMix

  28. arXiv:2507.06552  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    On the Hardness of Unsupervised Domain Adaptation: Optimal Learners and Information-Theoretic Perspective

    Authors: Zhiyi Dong, Zixuan Liu, Yongyi Mao

    Abstract: This paper studies the hardness of unsupervised domain adaptation (UDA) under covariate shift. We model the uncertainty that the learner faces by a distribution $\pi$ in the ground-truth triples $(p, q, f)$, which we call a UDA class, where $(p, q)$ is the source-target distribution pair and $f$ is the classifier. We define the performance of a learner as the overall target domain risk, avera…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: Accepted at the 4th Conference on Lifelong Learning Agents (CoLLAs 2025)

  29. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  30. arXiv:2507.06127  [pdf, ps, other]

    cs.AR cs.AI

    PrefixAgent: An LLM-Powered Design Framework for Efficient Prefix Adder Optimization

    Authors: Dongsheng Zuo, Jiadong Zhu, Yang Luo, Yuzhe Ma

    Abstract: Prefix adders are fundamental arithmetic circuits, but their design space grows exponentially with bit-width, posing significant optimization challenges. Previous works face limitations in performance, generalization, and scalability. To address these challenges, we propose PrefixAgent, a large language model (LLM)-powered framework that enables efficient prefix adder optimization. Specifically, P…

    Submitted 8 July, 2025; originally announced July 2025.

  31. arXiv:2507.06004  [pdf, ps, other]

    cs.MA

    From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination

    Authors: Chang Yao, Youfang Lin, Shoucheng Song, Hao Wu, Yuqing Ma, Shang Han, Kai Lv

    Abstract: Continual Multi-Agent Reinforcement Learning (Co-MARL) requires agents to address catastrophic forgetting issues while learning new coordination policies with the dynamics team. In this paper, we delve into the core of Co-MARL, namely Relation Patterns, which refer to agents' general understanding of interactions. In addition to generality, relation patterns exhibit task-specificity when mapped to…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: Accepted by IJCAI 2025

  32. arXiv:2507.05781  [pdf, ps, other]

    cs.IT

    Text-Guided Token Communication for Wireless Image Transmission

    Authors: Bole Liu, Li Qiao, Ye Wang, Zhen Gao, Yu Ma, Keke Ying, Tong Qin

    Abstract: With the emergence of 6G networks and proliferation of visual applications, efficient image transmission under adverse channel conditions is critical. We present a text-guided token communication system leveraging pre-trained foundation models for wireless image transmission with low bandwidth. Our approach converts images to discrete tokens, applies 5G NR polar coding, and employs text-guided tok…

    Submitted 8 July, 2025; originally announced July 2025.

  33. arXiv:2507.05595  [pdf, ps, other]

    cs.CV

    PaddleOCR 3.0 Technical Report

    Authors: Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

    Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information…

    Submitted 7 July, 2025; originally announced July 2025.

  34. arXiv:2507.05411  [pdf, ps, other]

    cs.LG

    AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

    Authors: Mark Lee, Tom Gunter, Chang Lan, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma, David Qiu, Vivek Rathod, Senyu Tong, et al. (12 additional authors not shown)

    Abstract: We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allow…

    Submitted 9 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

  35. arXiv:2507.05259  [pdf, ps, other]

    cs.CV

    Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

    Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh

    Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system th…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Project page: https://danielchyeh.github.io/x-planner/

  36. arXiv:2507.04766  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems

    Authors: Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao

    Abstract: Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multi…

    Submitted 7 July, 2025; originally announced July 2025.

  37. arXiv:2507.04632  [pdf, ps, other]

    cs.AI cs.LG

    Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

    Authors: Yun Qu, Qi Cheems Wang, Yixiu Mao, Vincent Tao Hu, Xiangyang Ji

    Abstract: Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy…

    Submitted 16 July, 2025; v1 submitted 6 July, 2025; originally announced July 2025.

  38. arXiv:2507.04599  [pdf, ps, other]

    cs.CV

    QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

    Authors: Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo

    Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: ICCV 2025, 30 pages, 26 figures

  39. arXiv:2507.04201  [pdf, ps, other]

    cs.IT

    An Efficient Max-Min Fair Resource Optimization Algorithm for Rate-Splitting Multiple Access

    Authors: Facheng Luo, Yijie Mao

    Abstract: The max-min fairness (MMF) problem in rate-splitting multiple access (RSMA) is known to be challenging due to its non-convex and non-smooth nature, as well as the coupled beamforming and common rate variables. Conventional algorithms to address this problem often incur high computational complexity or degraded MMF rate performance. To address these challenges, in this work, we propose a novel opti…

    Submitted 18 July, 2025; v1 submitted 5 July, 2025; originally announced July 2025.

    Comments: 16 pages, 10 figures

  40. arXiv:2507.04084  [pdf]

    cs.GR cs.CV

    Attention-Guided Multi-Scale Local Reconstruction for Point Clouds via Masked Autoencoder Self-Supervised Learning

    Authors: Xin Cao, Haoyu Wang, Yuzhu Mao, Xinda Liu, Linzhi Su, Kang Li

    Abstract: Self-supervised learning has emerged as a prominent research direction in point cloud processing. While existing models predominantly concentrate on reconstruction tasks at higher encoder layers, they often neglect the effective utilization of low-level local features, which are typically employed solely for activation computations rather than directly contributing to reconstruction tasks. To over…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: 22 pages

  41. arXiv:2507.02945  [pdf, ps, other]

    cs.NE cs.AI cs.LG

    SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning

    Authors: Hui Xie, Yuhe Liu, Shaoqi Yang, Jinyang Guo, Yufei Guo, Yuqing Ma, Jiaxin Chen, Jiaheng Liu, Xianglong Liu

    Abstract: While deep spiking neural networks (SNNs) demonstrate superior performance, their deployment on resource-constrained neuromorphic hardware still remains challenging. Network pruning offers a viable solution by reducing both parameters and synaptic operations (SynOps) to facilitate the edge deployment of SNNs, among which search-based pruning methods search for the SNNs structure after pruning. How…

    Submitted 28 June, 2025; originally announced July 2025.

  42. arXiv:2507.02843  [pdf, ps, other]

    cs.LG

    LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

    Authors: Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

    Abstract: Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descri…

    Submitted 3 July, 2025; originally announced July 2025.

  43. arXiv:2507.02698  [pdf, ps, other]

    cs.LG econ.EM

    Multi-Agent Reinforcement Learning for Dynamic Pricing in Supply Chains: Benchmarking Strategic Agent Behaviours under Realistically Simulated Market Conditions

    Authors: Thomas Hazenberg, Yao Ma, Seyed Sahand Mohammadi Ziabari, Marijn van Rijswijk

    Abstract: This study investigates how Multi-Agent Reinforcement Learning (MARL) can improve dynamic pricing strategies in supply chains, particularly in contexts where traditional ERP systems rely on static, rule-based approaches that overlook strategic interactions among market actors. While recent research has applied reinforcement learning to pricing, most implementations remain single-agent and fail to…

    Submitted 3 July, 2025; originally announced July 2025.

  44. arXiv:2507.02294  [pdf, ps, other]

    cs.CV

    ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

    Authors: Hanbo Bi, Yulong Xu, Ya Li, Yongqiang Mao, Boyuan Tong, Chongyang Li, Chunbo Lang, Wenhui Diao, Hongqi Wang, Yingchao Feng, Xian Sun

    Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially…

    Submitted 3 July, 2025; originally announced July 2025.

  45. arXiv:2507.02020  [pdf, ps, other]

    cs.DB

    Template-Based Schema Matching of Multi-Layout Tenancy Schedules: A Comparative Study of a Template-Based Hybrid Matcher and the ALITE Full Disjunction Model

    Authors: Tim Uilkema, Yao Ma, Seyed Sahand Mohammadi Ziabari, Joep van Vliet

    Abstract: The lack of standardized tabular formats for tenancy schedules across real estate firms creates significant inefficiencies in data integration. Existing automated integration methods, such as Full Disjunction (FD)-based models like ALITE, prioritize completeness but result in schema bloat, sparse attributes and limited business usability. We propose a novel hybrid, template-based schema matcher th…

    Submitted 2 July, 2025; originally announced July 2025.

  46. arXiv:2507.01924  [pdf, ps, other]

    cs.LG cs.AI

    Exploring a Hybrid Deep Learning Approach for Anomaly Detection in Mental Healthcare Provider Billing: Addressing Label Scarcity through Semi-Supervised Anomaly Detection

    Authors: Samirah Bakker, Yao Ma, Seyed Sahand Mohammadi Ziabari

    Abstract: The complexity of mental healthcare billing enables anomalies, including fraud. While machine learning methods have been applied to anomaly detection, they often struggle with class imbalance, label scarcity, and complex sequential patterns. This study explores a hybrid deep learning approach combining Long Short-Term Memory (LSTM) networks and Transformers, with pseudo-labeling via Isolation Fore…

    Submitted 2 July, 2025; originally announced July 2025.

  47. arXiv:2507.01876  [pdf, ps, other]

    cs.IT eess.SP

    Joint Power Control and Precoding for Cell-Free Massive MIMO Systems With Sparse Multi-Dimensional Graph Neural Networks

    Authors: Yukun Ma, Jiayi Zhang, Ziheng Liu, Guowei Shi, Bo Ai

    Abstract: Cell-free massive multiple-input multiple-output (CF mMIMO) has emerged as a prominent candidate for future networks due to its ability to significantly enhance spectral efficiency by eliminating inter-cell interference. However, its practical deployment faces considerable challenges, such as high computational complexity and the optimization of its complex processing. To address these challenges,…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 5 pages, 5 figures

  48. arXiv:2507.01652  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective

    Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai

    Abstract: Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity o…

    Submitted 2 July, 2025; originally announced July 2025.
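    The abstract identifies the quadratic cost of transformer attention in sequence length as the bottleneck for AR image generation. The sketch below shows the general idea behind linear-complexity alternatives. It assumes plain 1D causal linear attention with a single scalar decay `gamma` (a RetNet-style exponential decay), not the paper's spatial-aware decay. The function names are hypothetical. The quadratic reference is included only to show that both forms compute the same outputs.

```python
from typing import List

Vec = List[float]


def linear_attention_decay(qs: List[Vec], ks: List[Vec], vs: List[Vec],
                           gamma: float) -> List[Vec]:
    """Causal linear attention with scalar decay, recurrent form (hypothetical sketch).

    The state S (d_k x d_v) is updated as S_t = gamma * S_{t-1} + k_t^T v_t,
    and the output is o_t = q_t S_t. Each step costs O(d_k * d_v), so the
    total cost grows linearly with the number of tokens.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[Vec] = []
    for q, k, v in zip(qs, ks, vs):
        # Decay the running state, then add the outer product k^T v
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k[i] * v[j]
        # Read out with the current query: o_t = q_t S_t
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def linear_attention_decay_naive(qs: List[Vec], ks: List[Vec], vs: List[Vec],
                                 gamma: float) -> List[Vec]:
    """Quadratic reference: o_t = sum_{s<=t} gamma^(t-s) * (q_t . k_s) * v_s."""
    outputs: List[Vec] = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            # Decayed dot-product weight between query t and key s
            w = gamma ** (t - s) * sum(a * b for a, b in zip(q, ks[s]))
            for j, vj in enumerate(vs[s]):
                out[j] += w * vj
        outputs.append(out)
    return outputs
```

    The recurrent form is what makes the cost linear. Its state stays d_k × d_v no matter how many tokens have been processed, while the reference loop revisits every earlier token at each step.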

  49. arXiv:2507.01037  [pdf, ps, other]

    cs.LG cs.AI cs.RO

    Learning to Segment for Vehicle Routing Problems

    Authors: Wenbin Ouyang, Sirui Li, Yining Ma, Cathy Wu

    Abstract: Iterative search heuristics are widely recognized as state-of-the-art for solving Vehicle Routing Problems (VRPs). In this work, we identify and exploit a critical observation: within these solvers, a large portion of the solution remains stable, i.e., unchanged across search iterations, causing redundant computations, especially for large-scale VRPs with long subtours. To address this, we pioneer…

    Submitted 22 June, 2025; originally announced July 2025.

  50. WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers

    Authors: Mugeng Liu, Siqi Zhong, Qi Yang, Yudong Han, Xuanzhe Liu, Yun Ma

    Abstract: Approximate nearest neighbor search (ANNS) has become vital to modern AI infrastructure, particularly in retrieval-augmented generation (RAG) applications. Numerous in-browser ANNS engines have emerged to seamlessly integrate with popular LLM-based web applications, while addressing privacy protection and challenges of heterogeneous device deployments. However, web browsers present unique challeng…

    Submitted 1 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: SIGIR 2025