-
Group Relative Attention Guidance for Image Editing
Authors:
Xuanpu Zhang,
Xuesong Niu,
Ruidong Chen,
Dan Song,
Jianhao Zeng,
Penghui Du,
Haoxiang Cao,
Kai Wu,
An-an Liu
Abstract:
Recently, image editing based on Diffusion-in-Transformer (DiT) models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
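The reweighting GRAG describes can be pictured in a few lines. Below is a minimal, hypothetical sketch, not the authors' released code: `q`/`k` are assumed MM-Attention Query/Key tensors of shape (batch, heads, tokens, dim), the shared bias is approximated by the mean token, and an `image_mask` marking the input-image tokens is an assumed input.

```python
import torch

def grag_reweight(x: torch.Tensor, image_mask: torch.Tensor, scale: float) -> torch.Tensor:
    """Rescale token deltas around the shared, layer-dependent bias.

    `x` is a Query or Key tensor (batch, heads, tokens, dim); the bias is
    approximated here by the mean token, and only the input-image tokens'
    deltas are rescaled, shifting the model's focus between the input image
    and the editing instruction.
    """
    bias = x.mean(dim=-2, keepdim=True)           # shared bias per layer/head
    delta = x - bias                              # content-specific editing signal
    w = image_mask.view(1, 1, -1, 1).float() * (scale - 1.0) + 1.0
    return bias + w * delta                       # scale > 1 leans toward the input image

# The few lines an existing attention block would gain, before the score matmul:
# q = grag_reweight(q, image_mask, scale)
# k = grag_reweight(k, image_mask, scale)
# Sweeping `scale` then gives continuous control over editing intensity.
```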
Submitted 28 October, 2025;
originally announced October 2025.
-
Learned Inertial Odometry for Cycling Based on Mixture of Experts Algorithm
Authors:
Hao Qiao,
Yan Wang,
Shuo Yang,
Xiaoyao Yu,
Jian kuang,
Xiaoji Niu
Abstract:
With the rapid growth of bike sharing and the increasing diversity of cycling applications, accurate bicycle localization has become essential. Traditional GNSS-based methods suffer from multipath effects, while existing inertial navigation approaches rely on precise modeling and show limited robustness. Tight Learned Inertial Odometry (TLIO) achieves low position drift by combining raw IMU data with displacements predicted by neural networks, but its high computational cost restricts deployment on mobile devices. To overcome this, we extend TLIO to bicycle localization and introduce an improved Mixture-of-Experts (MoE) model that reduces both training and inference costs. Experiments show that, compared to the state-of-the-art LLIO framework, our method achieves comparable accuracy while reducing parameters by 64.7% and computational cost by 81.8%.
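As a rough illustration of how a Mixture-of-Experts regressor can cut compute relative to one large displacement network, here is a minimal sketch; the layer sizes, window length, and dense (rather than sparsely routed) gating are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MoEDisplacementNet(nn.Module):
    """Gated mixture of small experts mapping an IMU window to a 3-D displacement."""

    def __init__(self, imu_dim: int = 6, window: int = 200,
                 hidden: int = 64, n_experts: int = 4):
        super().__init__()
        in_dim = imu_dim * window                 # flattened accel + gyro window
        self.gate = nn.Sequential(nn.Linear(in_dim, n_experts), nn.Softmax(dim=-1))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))
            for _ in range(n_experts)])

    def forward(self, imu_window: torch.Tensor) -> torch.Tensor:
        x = imu_window.flatten(1)                 # (batch, imu_dim * window)
        weights = self.gate(x)                    # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, 3)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

# Routing each sample to only its top-scoring expert(s) at inference is what
# would cut the per-step compute relative to one large network.
```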
Submitted 20 October, 2025;
originally announced October 2025.
-
R-WoM: Retrieval-augmented World Model For Computer-use Agents
Authors:
Kai Mei,
Jiang Guo,
Shuaichen Chang,
Mingwen Dong,
Dongkyu Lee,
Xing Niu,
Jiarong Jiang
Abstract:
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.
Submitted 13 October, 2025;
originally announced October 2025.
-
PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System
Authors:
Huayi Wang,
Wentao Zhang,
Runyi Yu,
Tao Huang,
Junli Ren,
Feiyu Jia,
Zirui Wang,
Xiaojie Niu,
Xiao Chen,
Jiahe Chen,
Qifeng Chen,
Jingbo Wang,
Jiangmiao Pang
Abstract:
Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
Submitted 13 October, 2025;
originally announced October 2025.
-
A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
Authors:
Fengji Zhang,
Xinyao Niu,
Chengyang Ying,
Guancheng Lin,
Zhongkai Hao,
Zhou Fan,
Chengen Huang,
Jacky Keung,
Bei Chen,
Junyang Lin
Abstract:
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
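A set-level $\mathrm{AnsF1}$ reward naturally accommodates questions with several valid answers. A minimal sketch is below; the exact normalization and matching rules used by A$^2$Search are assumptions here.

```python
def ans_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-level F1 between predicted answers and the gold answer set."""
    pred = {p.casefold().strip() for p in predicted}
    ref = {g.casefold().strip() for g in gold}
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A prediction covering one of two valid answers earns partial credit:
# ans_f1({"Paris"}, {"paris", "paris, france"}) ~= 0.67
```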
Submitted 9 October, 2025;
originally announced October 2025.
-
Understanding DeepResearch via Reports
Authors:
Tianyu Fan,
Xinyao Niu,
Yuxiang Zheng,
Fengji Zhang,
Chengen Huang,
Bei Chen,
Junyang Lin,
Chao Huang
Abstract:
DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
Submitted 9 October, 2025;
originally announced October 2025.
-
Single-Cell Universal Logic-in-Memory Using 2T-nC FeRAM: An Area and Energy-Efficient Approach for Bulk Bitwise Computation
Authors:
Rudra Biswas,
Jiahui Duan,
Shan Deng,
Xuezhong Niu,
Yixin Qin,
Prapti Panigrahi,
Varun Parekh,
Rajiv Joshi,
Kai Ni,
Vijaykrishnan Narayanan
Abstract:
This work presents a novel approach to configure 2T-nC ferroelectric RAM (FeRAM) for performing single cell logic-in-memory operations, highlighting its advantages in energy-efficient computation over conventional DRAM-based approaches. Unlike conventional 1T-1C dynamic RAM (DRAM), which incurs refresh overhead, 2T-nC FeRAM offers a promising alternative as a non-volatile memory solution with low energy consumption. Our key findings include the potential of quasi-nondestructive readout (QNRO) sensing in 2T-nC FeRAM for logic-in-memory (LiM) applications, demonstrating its inherent capability to perform inverting logic without requiring external modifications, a feature absent in traditional 1T-1C DRAM. We successfully implement the MINORITY function within a single cell of 2T-nC FeRAM, enabling universal NAND and NOR logic, validated through SPICE simulations and experimental data. Additionally, the research investigates the feasibility of 3D integration with 2T-nC FeRAM, showing substantial improvements in storage and computational density, facilitating bulk-bitwise computation. Our evaluation of eight real-world, data-intensive applications reveals that 2T-nC FeRAM achieves 2x higher performance and 2.5x lower energy consumption compared to DRAM. Furthermore, the thermal stability of stacked 2T-nC FeRAM is validated, confirming its reliable operation when integrated on a compute die. These findings emphasize the advantages of 2T-nC FeRAM for LiM, offering superior performance and energy efficiency over conventional DRAM.
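Why an in-cell MINORITY function yields universal logic can be checked with a small truth-table script: tying one input of a 3-input MINORITY to a constant produces NAND or NOR, each of which is universal on its own.

```python
def minority(a: int, b: int, c: int) -> int:
    """Return 1 iff fewer than half of the three inputs are 1."""
    return 1 if (a + b + c) < 2 else 0

def nand(a, b): return minority(a, b, 0)   # third input tied to 0
def nor(a, b):  return minority(a, b, 1)   # third input tied to 1

# Exhaustive check that the tied-input MINORITY reproduces NAND and NOR:
for a in (0, 1):
    for b in (0, 1):
        assert nand(a, b) == 1 - (a & b)
        assert nor(a, b) == 1 - (a | b)
```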
Submitted 22 September, 2025;
originally announced September 2025.
-
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
Authors:
Xinlei Niu,
Jianbo Ma,
Dylan Harper-Harris,
Xiangyu Zhang,
Charles Patrick Martin,
Jing Zhang
Abstract:
The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this paper, we propose Beyond Video-to-SFX (BVS), a method to generate synchronized audio with environmentally aware intelligible speech for given videos. We introduce a two-stage modeling method: (1) stage one is a video-guided audio semantic (V2AS) model to predict unified audio semantic tokens conditioned on phonetic cues; (2) stage two is a video-conditioned semantic-to-acoustic (VS2A) model that refines semantic tokens into detailed acoustic tokens. Experiments demonstrate the effectiveness of BVS in scenarios such as video-to-context-aware speech synthesis and immersive audio background conversion, with ablation studies further validating our design. Our demonstration is available at https://xinleiniu.github.io/BVS-demo/.
Submitted 18 September, 2025;
originally announced September 2025.
-
Behavior Foundation Model for Humanoid Robots
Authors:
Weishuai Zeng,
Shunlin Lu,
Kangning Yin,
Xiaojie Niu,
Minyue Dai,
Jingbo Wang,
Jiangmiao Pang
Abstract:
Whole-body control (WBC) of humanoid robots has witnessed remarkable progress in skill versatility, enabling a wide range of applications such as locomotion, teleoperation, and motion tracking. Despite these achievements, existing WBC frameworks remain largely task-specific, relying heavily on labor-intensive reward engineering and demonstrating limited generalization across tasks and skills. These limitations hinder their response to arbitrary control modes and restrict their deployment in complex, real-world scenarios. To address these challenges, we revisit existing WBC systems and identify a shared objective across diverse tasks: the generation of appropriate behaviors that guide the robot toward desired goal states. Building on this insight, we propose the Behavior Foundation Model (BFM), a generative model pretrained on large-scale behavioral datasets to capture broad, reusable behavioral knowledge for humanoid robots. BFM integrates a masked online distillation framework with a Conditional Variational Autoencoder (CVAE) to model behavioral distributions, thereby enabling flexible operation across diverse control modes and efficient acquisition of novel behaviors without retraining from scratch. Extensive experiments in both simulation and on a physical humanoid platform demonstrate that BFM generalizes robustly across diverse WBC tasks while rapidly adapting to new behaviors. These results establish BFM as a promising step toward a foundation model for general-purpose humanoid control.
Submitted 17 September, 2025;
originally announced September 2025.
-
BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation
Authors:
Rajatsubhra Chakraborty,
Xujun Che,
Depeng Xu,
Cori Faklaris,
Xi Niu,
Shuhan Yuan
Abstract:
Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we further utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies the latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
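The entanglement measure reduces to an IoU between two attribution maps. A minimal sketch follows; the soft formulation and the 0.5 threshold are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np

def soft_iou(attr_a: np.ndarray, attr_b: np.ndarray, eps: float = 1e-8) -> float:
    """Soft IoU between two non-negative attribution maps of the same shape."""
    intersection = np.minimum(attr_a, attr_b).sum()
    union = np.maximum(attr_a, attr_b).sum() + eps
    return float(intersection / union)

def hard_iou(attr_a: np.ndarray, attr_b: np.ndarray, thresh: float = 0.5) -> float:
    """Binary IoU after thresholding the maps."""
    a, b = attr_a >= thresh, attr_b >= thresh
    return float((a & b).sum() / max((a | b).sum(), 1))

# A high IoU between, say, the "gender" and "nurse" attribution maps signals
# that the demographic and semantic concepts are spatially entangled.
```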
Submitted 16 September, 2025;
originally announced September 2025.
-
Uncovering Scaling Laws for Large Language Models via Inverse Problems
Authors:
Arun Verma,
Zhaoxuan Wu,
Zijian Zhou,
Xiaoqiang Lin,
Zhiliang Chen,
Rachael Hwee Ling Sim,
Rui Qiao,
Jingtan Wang,
Nhung Bui,
Xinyuan Niu,
Wenyang Hu,
Gregory Kang Ruey Lau,
Zi-Yu Khoo,
Zitong Zhao,
Xinyi Xu,
Apivich Hemachandra,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computations. However, due to the high costs of training such models, brute-force trial-and-error approaches to improve LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also efficiently uncover scaling laws that guide the building of LLMs to achieve the desirable performance with significantly better cost-effectiveness.
Submitted 9 September, 2025;
originally announced September 2025.
-
Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals
Authors:
Huan Ni,
Qingshan Liu,
Xiaonan Niu,
Danfeng Hong,
Lingli Zhao,
Haiyan Guan
Abstract:
Cross-domain few-shot segmentation (CD-FSS) not only enables the segmentation of unseen categories with very limited samples, but also improves cross-domain generalization ability within the few-shot segmentation framework. Currently, existing CD-FSS studies typically design multiple independent modules to enhance the cross-domain generalization ability of feature representations. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations and Fourier transform, resulting in a structurally concise method--Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs assumes the existence of an ODE relationship between the spectra (including amplitude and phase spectra) of domain-specific features and domain-agnostic features. This ODE formulation yields an iterative transformation process along a sequence of time intervals, while simultaneously applying affine transformations with randomized perturbations to the spectra. In doing so, the exploration of domain-agnostic feature representation spaces and the simulation of diverse potential target-domain distributions are reformulated as an optimization process over the intrinsic parameters of the ODE. Moreover, we strictly constrain the support-sample selection during target-domain fine-tuning so that it is consistent with the requirements of real-world few-shot segmentation tasks. For evaluation, we introduce five datasets from substantially different domains and define two sets of cross-domain few-shot segmentation tasks to comprehensively analyze the performance of FSS-TIs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.
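One way to picture the ODE-over-spectra idea is an Euler step on the amplitude and phase spectra of a feature map, with randomly perturbed affine coefficients simulating unseen target domains. The sketch below is only an interpretation; the ODE right-hand side, parameter shapes, and perturbation model are assumptions.

```python
import torch

def spectral_ode_step(feat: torch.Tensor, gamma_a, beta_a, gamma_p, beta_p,
                      dt: float = 0.1, noise: float = 0.05) -> torch.Tensor:
    """One time-interval Euler update on the Fourier spectra of `feat` (B, C, H, W)."""
    spec = torch.fft.fft2(feat)
    amp, phase = spec.abs(), spec.angle()
    # Affine coefficients with randomized perturbations, simulating a diverse
    # family of potential target-domain distributions.
    g_a = gamma_a * (1 + noise * torch.randn_like(gamma_a))
    g_p = gamma_p * (1 + noise * torch.randn_like(gamma_p))
    amp = amp + dt * (g_a * amp + beta_a)        # Euler step on the amplitude spectrum
    phase = phase + dt * (g_p * phase + beta_p)  # Euler step on the phase spectrum
    return torch.fft.ifft2(torch.polar(amp, phase)).real
```

Iterating this step over a sequence of time intervals, and optimizing the intrinsic parameters (`gamma_*`, `beta_*`), mirrors the exploration of domain-agnostic feature spaces described above.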
Submitted 1 September, 2025;
originally announced September 2025.
-
A Convergent Primal-Dual Algorithm for Computing Rate-Distortion-Perception Functions
Authors:
Chunhui Chen,
Linyi Chen,
Xueyan Niu,
Hao Wu
Abstract:
Recent advances in Rate-Distortion-Perception (RDP) theory highlight the importance of balancing compression level, reconstruction quality, and perceptual fidelity. While previous work has explored numerical approaches to approximate the information RDP function, the lack of theoretical guarantees remains a major limitation, especially in the presence of complex perceptual constraints that introduce non-convexity and computational intractability. Inspired by our previous constrained Blahut-Arimoto algorithm for solving the rate-distortion function, in this paper, we present a new theoretical framework for computing the information RDP function by relaxing the constraint on the reconstruction distribution and replacing it with an alternative optimization approach over the reconstruction distribution itself. This reformulation significantly simplifies the optimization and enables a rigorous proof of convergence. Based on this formulation, we develop a novel primal-dual algorithm with provable convergence guarantees. Our analysis establishes, for the first time, a rigorous convergence rate of $O(1/n)$ for the computation of RDP functions. The proposed method not only bridges a key theoretical gap in the existing literature but also achieves competitive empirical performance in representative settings. These results lay the groundwork for more reliable and interpretable optimization in RDP-constrained compression systems. Experimental results demonstrate the efficiency and accuracy of the proposed algorithm.
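For reference, the information RDP function being computed takes the following constrained form (following the standard rate-distortion-perception formulation; the particular perceptual divergence $d_p$ and the relaxation studied in the paper may differ):

```latex
R(D, P) \;=\; \min_{p_{\hat{X} \mid X}} \; I(X; \hat{X})
\quad \text{s.t.} \quad
\mathbb{E}\big[\Delta(X, \hat{X})\big] \le D ,
\qquad
d_p\big(p_X, p_{\hat{X}}\big) \le P .
```

The perception constraint on the marginal $p_{\hat{X}}$ is what introduces the non-convexity that the proposed relaxation and primal-dual algorithm address.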
Submitted 18 August, 2025;
originally announced August 2025.
-
i2Nav-Robot: A Large-Scale Indoor-Outdoor Robot Dataset for Multi-Sensor Fusion Navigation and Mapping
Authors:
Hailiang Tang,
Tisheng Zhang,
Liqiang Wang,
Xin Ding,
Man Yuan,
Zhiyu Xiang,
Jujin Chen,
Yuhan Bian,
Shuangyan Liu,
Yuqing Wang,
Guan Wang,
Xiaoji Niu
Abstract:
Accurate and reliable navigation is crucial for autonomous unmanned ground vehicles (UGVs). However, current UGV datasets fall short in meeting the demands for advancing navigation and mapping techniques due to limitations in sensor configuration, time synchronization, ground truth, and scenario diversity. To address these challenges, we present i2Nav-Robot, a large-scale dataset designed for multi-sensor fusion navigation and mapping in indoor-outdoor environments. We integrate multi-modal sensors, including the newest front-view and 360-degree solid-state LiDARs, 4-dimensional (4D) radar, stereo cameras, odometer, global navigation satellite system (GNSS) receiver, and inertial measurement units (IMU) on an omnidirectional wheeled robot. Accurate timestamps are obtained through both online hardware synchronization and offline calibration for all sensors. The dataset includes ten large-scale sequences covering diverse UGV operating scenarios, such as outdoor streets and indoor parking lots, with a total length of about 17060 meters. High-frequency ground truth, with centimeter-level accuracy for position, is derived from post-processing integrated navigation methods using a navigation-grade IMU. The proposed i2Nav-Robot dataset has been evaluated with more than ten open-source multi-sensor fusion systems, demonstrating its superior data quality.
Submitted 27 August, 2025; v1 submitted 15 August, 2025;
originally announced August 2025.
-
ReNiL: Event-Driven Pedestrian Bayesian Localization Using IMU for Real-World Applications
Authors:
Kaixuan Wu,
Yuanzhuo Xu,
Zejun Zhang,
Weiping Zhu,
Jian Zhang,
Steve Drew,
Xiaoguang Niu
Abstract:
Pedestrian inertial localization is key for mobile and IoT services because it provides infrastructure-free positioning. Yet most learning-based methods depend on fixed sliding-window integration, struggle to adapt to diverse motion scales and cadences, and yield inconsistent uncertainty, limiting real-world use. We present ReNiL, a Bayesian deep-learning framework for accurate, efficient, and uncertainty-aware pedestrian localization. ReNiL introduces Inertial Positioning Demand Points (IPDPs) to estimate motion at contextually meaningful waypoints instead of dense tracking, and supports inference on IMU sequences at any scale so cadence can match application needs. It couples a motion-aware orientation filter with an Any-Scale Laplace Estimator (ASLE), a dual-task network that blends patch-based self-supervision with Bayesian regression. By modeling displacements with a Laplace distribution, ReNiL provides homogeneous Euclidean uncertainty that integrates cleanly with other sensors. A Bayesian inference chain links successive IPDPs into consistent trajectories. On RoNIN-ds and a new WUDataset covering indoor and outdoor motion from 28 participants, ReNiL achieves state-of-the-art displacement accuracy and uncertainty consistency, outperforming TLIO, CTIN, iMoT, and RoNIN variants while reducing computation. Application studies further show robustness and practicality for mobile and IoT localization, making ReNiL a scalable, uncertainty-aware foundation for next-generation positioning.
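Modeling displacements with a Laplace distribution amounts to training with the Laplace negative log-likelihood, so the predicted scale doubles as a calibrated uncertainty. A minimal sketch of such a loss (the tensor shapes and per-axis independence are assumptions):

```python
import torch

def laplace_nll(pred_disp: torch.Tensor, log_scale: torch.Tensor,
                target_disp: torch.Tensor) -> torch.Tensor:
    """Per-axis Laplace NLL: log(2b) + |d - mu| / b, summed over axes."""
    b = log_scale.exp()                       # predicted Laplace scale, b > 0
    nll = torch.log(2 * b) + (target_disp - pred_disp).abs() / b
    return nll.sum(dim=-1).mean()

# The learned scale b is the homogeneous Euclidean uncertainty that the
# Bayesian inference chain and downstream sensor fusion can consume directly.
```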
Submitted 6 November, 2025; v1 submitted 8 August, 2025;
originally announced August 2025.
-
A Survey on Task Scheduling in Carbon-Aware Container Orchestration
Authors:
Jialin Yang,
Zainab Saad,
Jiajun Wu,
Xiaoguang Niu,
Henry Leung,
Steve Drew
Abstract:
The soaring energy demands of large-scale software ecosystems and cloud data centers, accelerated by the intensive training and deployment of large language models, have driven energy consumption and carbon footprint to unprecedented levels. In response, both industry and academia are increasing efforts to reduce the carbon emissions associated with cloud computing through more efficient task scheduling and infrastructure orchestration. In this work, we present a systematic review of various Kubernetes scheduling strategies, categorizing them into hardware-centric and software-centric, annotating each with its sustainability objectives, and grouping them according to the algorithms they use. We propose a comprehensive taxonomy for cloud task scheduling studies, with a particular focus on the environmental sustainability aspect. We analyze emerging research trends and open challenges, and our findings provide critical insight into the design of sustainable scheduling solutions for next-generation cloud computing systems.
Submitted 7 August, 2025;
originally announced August 2025.
-
NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Authors:
Weizhi Fei,
Hao Shi,
Jing Xu,
Jingchen Peng,
Jiazheng Li,
Jingzhao Zhang,
Bo Bai,
Wei Han,
Zhenyuan Chen,
Xueyan Niu
Abstract:
Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).
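The neural-KV-database view can be sketched in a few lines: edited facts live as key-value pairs, and a non-linear gate only fires when the running hidden state clearly matches an edited fact, otherwise passing the original layer output through untouched. The cosine gate and threshold below are assumptions, not the paper's exact retrieval module.

```python
import torch
import torch.nn.functional as F

class NeuralKVDB:
    """Edited facts stored as (key, value) vectors with a gated retrieval rule."""

    def __init__(self, keys: torch.Tensor, values: torch.Tensor, tau: float = 0.8):
        self.keys, self.values, self.tau = keys, values, tau   # (N, d), (N, d)

    def __call__(self, hidden: torch.Tensor, original_out: torch.Tensor) -> torch.Tensor:
        sims = F.cosine_similarity(hidden.unsqueeze(0), self.keys, dim=-1)  # (N,)
        score, idx = sims.max(dim=0)
        # Gate: substitute the stored value only when the query clearly hits
        # an edited fact; otherwise the layer's original output passes through,
        # which is what preserves the model's general abilities.
        return self.values[idx] if score > self.tau else original_out
```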
Submitted 23 July, 2025;
originally announced July 2025.
-
A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization
Authors:
Yifan Zhang,
Xiaoyu Niu,
Hongzheng Tian,
Yanjun Zhang,
Bo Yu,
Shaoshan Liu,
Sitao Huang
Abstract:
Path planning is critical for autonomous driving, generating smooth, collision-free, feasible paths based on perception and localization inputs. However, its computationally intensive nature poses significant challenges for resource-constrained autonomous driving hardware. This paper presents an end-to-end FPGA-based acceleration framework targeting quadratic programming (QP), the core of optimization-based path planning. We employ a hardware-friendly alternating direction method of multipliers (ADMM) for QP solving and a parallelizable preconditioned conjugate gradient (PCG) method for linear systems. By analyzing sparse matrix patterns, we propose customized storage schemes and efficient sparse matrix multiplication units, significantly reducing resource usage and accelerating matrix operations. Our multi-level dataflow optimization strategy incorporates intra-operator parallelization and pipelining, inter-operator fine-grained pipelining, and CPU-FPGA system-level task mapping. Implemented on the AMD ZCU102 platform, our framework achieves state-of-the-art latency and energy efficiency, including 1.48x faster performance than the best FPGA-based design, 2.89x over an Intel i7-11800H CPU, 5.62x over an ARM Cortex-A57 embedded CPU, and 1.56x over a state-of-the-art GPU solution, along with a 2.05x throughput improvement over existing FPGA-based designs.
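For reference, the PCG iteration being accelerated is the standard one below (a Jacobi preconditioner is chosen purely for illustration); the sparse matrix-vector product `A @ p` is the operation that the customized storage schemes and multiplication units target.

```python
import numpy as np

def pcg(A: np.ndarray, b: np.ndarray, tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Jacobi-preconditioned conjugate gradient for symmetric positive-definite A."""
    M_inv = 1.0 / np.diag(A)          # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                    # sparse mat-vec: the dominant cost on hardware
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```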
Submitted 21 July, 2025;
originally announced July 2025.
-
Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement
Authors:
Xinyue Niu,
Akira Furui
Abstract:
Cross-subject electromyography (EMG) pattern recognition faces significant challenges due to inter-subject variability in muscle anatomy, electrode placement, and signal characteristics. Traditional methods rely on subject-specific calibration data to adapt models to new users, an approach that is both time-consuming and impractical for large-scale, real-world deployment. This paper presents an approach to eliminate calibration requirements through feature disentanglement, enabling effective cross-subject generalization. We propose an end-to-end dual-branch adversarial neural network that simultaneously performs pattern recognition and individual identification by disentangling EMG features into pattern-specific and subject-specific components. The pattern-specific components facilitate robust pattern recognition for new users without model calibration, while the subject-specific components enable downstream applications such as task-invariant biometric identification. Experimental results demonstrate that the proposed model achieves robust performance on data from unseen users, outperforming various baseline methods in cross-subject scenarios. Overall, this study offers a new perspective for cross-subject EMG pattern recognition without model calibration and highlights the proposed model's potential for broader applications, such as task-independent biometric systems.
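A common way to realize this kind of dual-branch adversarial disentanglement is a gradient-reversal layer: an adversary tries to read subject identity from the pattern branch, and the reversed gradient pushes the encoder to strip that information out. The sketch below is a generic construction under that assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; negated, lambda-scaled gradient backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DualBranchEMG(nn.Module):
    def __init__(self, in_dim=64, feat=32, n_patterns=8, n_subjects=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * feat))
        self.pattern_head = nn.Linear(feat, n_patterns)   # pattern-specific branch
        self.subject_head = nn.Linear(feat, n_subjects)   # subject-specific branch
        self.adv_subject = nn.Linear(feat, n_subjects)    # adversary on pattern feats

    def forward(self, x: torch.Tensor, lam: float = 1.0):
        zp, zs = self.encoder(x).chunk(2, dim=-1)
        # The adversary tries to read subject identity from the pattern features;
        # the reversed gradient pushes the encoder to remove it, so the pattern
        # branch generalizes to unseen users without calibration.
        adv = self.adv_subject(GradReverse.apply(zp, lam))
        return self.pattern_head(zp), self.subject_head(zs), adv
```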
Submitted 17 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
In-memory Incremental Maintenance of Provenance Sketches [extended version]
Authors:
Pengyuan Li,
Boris Glavic,
Dieter Gawlick,
Vasudha Krishnaswamy,
Zhen Hua Liu,
Danica Porobic,
Xing Niu
Abstract:
Provenance-based data skipping compactly over-approximates the provenance of a query using so-called provenance sketches and utilizes such sketches to speed up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations. We experimentally demonstrate that IMP significantly reduces the cost of sketch maintenance, thereby enabling the use of provenance sketches for a broad range of workloads that involve updates.
Submitted 26 May, 2025;
originally announced May 2025.
-
Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter
Authors:
Weizhi Zhong,
Huan Yang,
Zheng Liu,
Huiguo He,
Zijian He,
Xuesong Niu,
Di Zhang,
Guanbin Li
Abstract:
Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation directions for the modulation process of concept-related text tokens. It introduces vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
Submitted 27 September, 2025; v1 submitted 24 May, 2025;
originally announced May 2025.
-
MSCEKF-MIO: Magnetic-Inertial Odometry Based on Multi-State Constraint Extended Kalman Filter
Authors:
Jiazhu Li,
Jian Kuang,
Xiaoji Niu
Abstract:
To overcome the limitation of existing indoor odometry technologies, which often cannot simultaneously meet requirements for accuracy, cost-effectiveness, and robustness, this paper proposes a novel magnetometer array-aided inertial odometry approach, MSCEKF-MIO (Multi-State Constraint Extended Kalman Filter-based Magnetic-Inertial Odometry). We construct a magnetic field model by fitting measurements from the magnetometer array and then use temporal variations in this model--extracted from continuous observations--to estimate the carrier's absolute velocity. Furthermore, we implement the MSCEKF framework to fuse observed magnetic field variations with position and attitude estimates from inertial navigation system (INS) integration, thereby enabling autonomous, high-precision indoor relative positioning. Experimental results demonstrate that the proposed algorithm achieves superior velocity estimation accuracy and horizontal positioning precision relative to state-of-the-art magnetic array-aided INS algorithms (MAINS). On datasets with trajectory lengths of 150-250m, the proposed method yields an average horizontal position RMSE of approximately 2.5m. In areas with distinctive magnetic features, the magneto-inertial odometry achieves a velocity estimation accuracy of 0.07m/s. Consequently, the proposed method offers a novel positioning solution characterized by low power consumption, cost-effectiveness, and high reliability in complex indoor environments.
Submitted 20 May, 2025; v1 submitted 18 May, 2025;
originally announced May 2025.
-
JAEGER: Dual-Level Humanoid Whole-Body Controller
Authors:
Ziluo Ding,
Haobin Jiang,
Yuxuan Wang,
Zhenguo Sun,
Yu Zhang,
Xiaojie Niu,
Ming Yang,
Weishuai Zeng,
Xinrun Xu,
Zongqing Lu
Abstract:
This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy. Unlike traditional single-controller approaches, JAEGER separates the control of the upper and lower bodies into two independent controllers, so that they can better focus on their distinct tasks. This separation alleviates the dimensionality curse and improves fault tolerance. JAEGER supports both root velocity tracking (coarse-grained control) and local joint angle tracking (fine-grained control), enabling versatile and stable movements. To train the controller, we utilize a human motion dataset (AMASS), retargeting human poses to humanoid poses through an efficient retargeting network, and employ a curriculum learning approach. This method performs supervised learning for initialization, followed by reinforcement learning for further exploration. We conduct our experiments on two humanoid platforms and demonstrate the superiority of our approach against state-of-the-art methods in both simulation and real environments.
Submitted 16 June, 2025; v1 submitted 10 May, 2025;
originally announced May 2025.
-
MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising
Authors:
Xiaolong Niu,
Zanting Ye,
Xu Han,
Yanchao Huang,
Hao Sun,
Hubing Wu,
Lijun Lu
Abstract:
Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response caused by inter-patient variability, and complementary anatomical constraints derived from CT images. In this work, we propose a novel CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for multi-dose PET denoising. Our approach integrates anatomical guidance and dose-level adaptation to achieve superior denoising performance under low-dose conditions. Specifically, this approach incorporates a CT-Guided High-frequency Wavelet Attention (HWA) module, which uses wavelet transforms to separate high-frequency anatomical boundary features from CT images. These extracted features are then incorporated into PET imaging through an adaptive weighted fusion mechanism to enhance edge details. Additionally, we propose the Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism that dynamically integrates dose levels into channel-spatial attention weight calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets demonstrate that MDAA-Diff outperforms state-of-the-art approaches in preserving diagnostic quality under reduced-dose conditions. Our code is publicly available.
Submitted 21 June, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
WaterDrum: Watermarking for Data-centric Unlearning Metric
Authors:
Xinyang Lu,
Xinyuan Niu,
Gregory Kang Ruey Lau,
Bui Thi Cam Nhung,
Rachael Hwee Ling Sim,
Fanyu Wen,
Chuan-Sheng Foo,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
Submitted 8 May, 2025;
originally announced May 2025.
-
Dendritic Computing with Multi-Gate Ferroelectric Field-Effect Transistors
Authors:
A N M Nafiul Islam,
Xuezhong Niu,
Jiahui Duan,
Shubham Kumar,
Kai Ni,
Abhronil Sengupta
Abstract:
Although inspired by neuronal systems in the brain, artificial neural networks generally employ point-neurons, which offer far less computational complexity than their biological counterparts. Neurons have dendritic arbors that connect to different sets of synapses and offer local non-linear accumulation - playing a pivotal role in processing and learning. Inspired by this, we propose a novel neuron design based on a multi-gate ferroelectric field-effect transistor that mimics dendrites. It leverages ferroelectric nonlinearity for local computations within dendritic branches, while utilizing the transistor action to generate the final neuronal output. The branched architecture paves the way for utilizing smaller crossbar arrays in hardware integration, leading to greater efficiency. Using an experimentally calibrated device-circuit-algorithm co-simulation framework, we demonstrate that networks incorporating our dendritic neurons achieve superior performance in comparison to much larger networks without dendrites ($\sim$17$\times$ fewer trainable weight parameters). These findings suggest that dendritic hardware can significantly improve computational efficiency, and learning capacity of neuromorphic systems optimized for edge applications.
Submitted 20 October, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
MIMIC-IV-Ext-22MCTS: A 22-Million-Event Temporal Clinical Time-Series Dataset with Relative Timestamps for Risk Prediction
Authors:
Jing Wang,
Xing Niu,
Tong Zhang,
Jie Shen,
Juyong Kim,
Jeremy C. Weiss
Abstract:
A crucial component of developing a reliable clinical risk prediction model is collecting high-quality time-series clinical events. In this work, we release such a dataset consisting of 22,588,586 clinical time-series events, which we term MIMIC-IV-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note (Johnson et al., 2023). The general-purpose MIMIC-IV-Note poses specific challenges for our work: the discharge summaries are too lengthy for typical natural language models to process, and the clinical events of interest are often not accompanied by explicit timestamps. Therefore, we propose a new framework that works as follows: 1) we break each discharge summary into manageably small text chunks; 2) we apply contextual BM25 and contextual semantic search to retrieve chunks that have a high potential of containing clinical events; and 3) we carefully design prompts to teach the recently released Llama-3.1-8B (Touvron et al., 2023) model to identify or infer temporal information of the chunks. The obtained dataset is informative and transparent: standard models fine-tuned on it achieve significant improvements in healthcare applications. In particular, a BERT model fine-tuned on our dataset achieves a 10% improvement in accuracy on a medical question answering task, and a 3% improvement on a clinical trial matching task, compared with classic BERT. The dataset is available at https://physionet.org/content/mimic-iv-ext-22mcts/1.0.0. The codebase is released at https://github.com/JingWang-RU/MIMIC-IV-Ext-22MCTS-Temporal-Clinical-Time-Series-Dataset.
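Steps 1)-2) of the framework can be sketched with an off-the-shelf BM25 implementation; the chunk size, query terms, and file name below are illustrative assumptions, and the paper's contextual BM25 plus semantic search would replace the plain scoring shown here.

```python
from rank_bm25 import BM25Okapi

def chunk(text: str, size: int = 120) -> list[str]:
    """Split a long discharge summary into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

summary = open("discharge_summary.txt").read()   # hypothetical input file
chunks = chunk(summary)
bm25 = BM25Okapi([c.lower().split() for c in chunks])
query = "admission discharge medication procedure day".split()  # illustrative query
scores = bm25.get_scores(query)
top_chunks = [chunks[i] for i in
              sorted(range(len(chunks)), key=lambda i: -scores[i])[:20]]
# `top_chunks` would then be prompted to Llama-3.1-8B to identify events and
# infer their relative timestamps (step 3 of the framework).
```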
Submitted 17 November, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
Interpreting Network Differential Privacy
Authors:
Jonathan Hehir,
Xiaoyue Niu,
Aleksandra Slavkovic
Abstract:
How do we interpret the differential privacy (DP) guarantee for network data? We take a deep dive into a popular form of network DP ($\varepsilon$--edge DP) to find that many of its common interpretations are flawed. Drawing on prior work for privacy with correlated data, we interpret DP through the lens of adversarial hypothesis testing and demonstrate a gap between the pairs of hypotheses actually protected under DP (tests of complete networks) and the sorts of hypotheses implied to be protected by common claims (tests of individual edges). We demonstrate some conditions under which this gap can be bridged, while leaving some questions open. While some discussion is specific to edge DP, we offer selected results in terms of abstract DP definitions and provide discussion of the implications for other forms of network DP.
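For concreteness, $\varepsilon$--edge DP is the standard guarantee below, where $G$ and $G'$ are any two networks differing in a single edge; the paper's point is that this protects hypothesis tests between such complete networks, not the per-edge hypotheses often claimed:

```latex
\Pr[\mathcal{M}(G) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(G') \in S]
\qquad \text{for all measurable output sets } S .
```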
Submitted 16 April, 2025;
originally announced April 2025.
-
SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing
Authors:
Xinlei Niu,
Kin Wai Cheuk,
Jing Zhang,
Naoki Murata,
Chieh-Hsin Lai,
Michele Mancusi,
Woosung Choi,
Giorgio Fabbro,
Wei-Hsiang Liao,
Charles Patrick Martin,
Yuki Mitsufuji
Abstract:
Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided editing methods rely on pretrained diffusion models by involving forward-backward diffusion processes. However, these methods often struggle to preserve the musical content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that improve the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows for the editing of music into user-defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.
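For readers unfamiliar with score distillation, here is a rough sketch of one delta denoising score (DDS) update, the mechanism the abstract says these methods build on; the dummy noise predictor, latent shapes, and learning rate are placeholder assumptions.

```python
import torch

def dds_step(z_edit, z_src, eps_model, src_emb, tgt_emb, alphas_cumprod, lr=0.1):
    t = torch.randint(1, len(alphas_cumprod), (1,))
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(z_edit)
    # Noise the edited latent and the frozen source latent identically.
    zt_edit = a_t.sqrt() * z_edit + (1 - a_t).sqrt() * noise
    zt_src = a_t.sqrt() * z_src + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        # DDS: the source branch cancels the prompt-agnostic component of the
        # score, leaving only the edit direction, which preserves content.
        grad = eps_model(zt_edit, t, tgt_emb) - eps_model(zt_src, t, src_emb)
    return z_edit - lr * grad

eps_model = lambda z, t, emb: torch.zeros_like(z)   # dummy stand-in predictor
alphas = torch.linspace(0.9999, 0.01, 1000)
z_src = torch.randn(1, 64, 128)                     # e.g., an audio latent
z_edit = dds_step(z_src.clone(), z_src, eps_model, None, None, alphas)
```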
△ Less
Submitted 11 November, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
SELC: Self-Supervised Efficient Local Correspondence Learning for Low Quality Images
Authors:
Yuqing Wang,
Yan Wang,
Hailiang Tang,
Xiaoji Niu
Abstract:
Accurate and stable feature matching is critical for computer vision tasks, particularly in applications such as Simultaneous Localization and Mapping (SLAM). While recent learning-based feature matching methods have demonstrated promising performance in challenging spatiotemporal scenarios, they still face inherent trade-offs between accuracy and computational efficiency in specific settings. In…
▽ More
Accurate and stable feature matching is critical for computer vision tasks, particularly in applications such as Simultaneous Localization and Mapping (SLAM). While recent learning-based feature matching methods have demonstrated promising performance in challenging spatiotemporal scenarios, they still face inherent trade-offs between accuracy and computational efficiency in specific settings. In this paper, we propose a lightweight feature matching network designed to establish sparse, stable, and consistent correspondence between multiple frames. The proposed method eliminates the dependency on manual annotations during training and mitigates feature drift through a hybrid self-supervised paradigm. Extensive experiments validate three key advantages: (1) Our method operates without dependency on external prior knowledge and seamlessly incorporates its hybrid training mechanism into original datasets. (2) Benchmarked against state-of-the-art deep learning-based methods, our approach maintains equivalent computational efficiency at low-resolution scales while achieving a 2-10x improvement in computational efficiency for high-resolution inputs. (3) Comparative evaluations demonstrate that the proposed hybrid self-supervised scheme effectively mitigates feature drift in long-term tracking while maintaining consistent representation across image sequences.
△ Less
Submitted 6 April, 2025;
originally announced April 2025.
-
Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation
Authors:
Zanting Ye,
Xiaolong Niu,
Xuanbin Wu,
Wenxiang Yi,
Yuan Chang,
Lijun Lu
Abstract:
Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling…
▽ More
Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled data, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.
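The selective-placement idea can be pictured with a simplified KAN-style layer, sketched below with learnable univariate activations built from a radial-basis expansion; this is a stand-in under assumed dimensions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    """Simplified KAN-style layer: each channel passes through a learnable
    univariate function (an RBF expansion plus SiLU) before a linear mix."""
    def __init__(self, dim: int, n_basis: int = 8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        self.coef = nn.Parameter(torch.zeros(dim, n_basis))
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, tokens, dim)
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)   # RBF basis
        act = torch.einsum("btdk,dk->btd", phi, self.coef)
        return self.mix(act + torch.nn.functional.silu(x))

# Applied only at the bottleneck of a U-Net-style pipeline, mirroring the
# selective placement described above:
bottleneck = TinyKANLayer(dim=256)
tokens = torch.randn(2, 196, 256)     # tokenized intermediate representation
out = bottleneck(tokens)
```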
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space
Authors:
Zhiliang Chen,
Xinyuan Niu,
Chuan-Sheng Foo,
Bryan Kian Hsiang Low
Abstract:
Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the…
▽ More
Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This allows us to select the optimal LLM response at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, while achieving a higher reward within a practical planning budget. Our code can be found at: https://github.com/chenzhiliang94/convo-plan-SCOPE.
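A minimal sketch of what planning entirely in semantic space can look like follows: a learned transition model and reward model act on conversation embeddings, so candidate responses are scored by rollouts without any LLM queries. The `transition` and `reward` functions here are toy stand-ins assumed to be learned elsewhere.

```python
import numpy as np

def rollout_value(emb, transition, reward, horizon=5, n_samples=16, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros(n_samples)
    state = np.repeat(emb[None, :], n_samples, axis=0)
    for _ in range(horizon):
        state = transition(state, rng)    # stochastic step in semantic space
        total += reward(state)            # accumulate predicted turn rewards
    return total.mean()

def select_response(candidate_embs, transition, reward):
    # Pick the candidate whose simulated future conversation scores highest.
    return int(np.argmax([rollout_value(e, transition, reward)
                          for e in candidate_embs]))

transition = lambda s, rng: s + 0.1 * rng.normal(size=s.shape)  # toy dynamics
reward = lambda s: -np.linalg.norm(s, axis=1)                   # toy reward
best = select_response([np.zeros(8), np.ones(8)], transition, reward)
```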
△ Less
Submitted 7 June, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Incentive Analysis for Agent Participation in Federated Learning
Authors:
Lihui Yi,
Xiaochun Niu,
Ermin Wei
Abstract:
Federated learning offers a decentralized approach to machine learning, where multiple agents collaboratively train a model while preserving data privacy. In this paper, we investigate the decision-making and equilibrium behavior in federated learning systems, where agents choose between participating in global training or conducting independent local training. The problem is first modeled as a st…
▽ More
Federated learning offers a decentralized approach to machine learning, where multiple agents collaboratively train a model while preserving data privacy. In this paper, we investigate the decision-making and equilibrium behavior in federated learning systems, where agents choose between participating in global training or conducting independent local training. The problem is first modeled as a stage game and then extended to a repeated game to analyze the long-term dynamics of agent participation. For the stage game, we characterize the participation patterns and identify Nash equilibrium, revealing how data heterogeneity influences the equilibrium behavior: specifically, agents with similar data qualities will participate in FL as a group. We also derive the optimal social welfare and show that it coincides with Nash equilibrium under mild assumptions. In the repeated game, we propose a privacy-preserving, computationally efficient myopic strategy. This strategy enables agents to make practical decisions under bounded rationality and converges to a neighborhood of Nash equilibrium of the stage game in finite time. By combining theoretical insights with practical strategy design, this work provides a realistic and effective framework for guiding and analyzing agent behaviors in federated learning systems.
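The sketch below shows one plausible shape for such a myopic participation rule; the benefit functions are invented for illustration and are not the paper's utilities.

```python
def myopic_decision(my_quality, last_participants, local_benefit, global_benefit):
    """Join FL this round iff the predicted benefit of joining the pool seen
    last round exceeds the benefit of independent local training."""
    pool = last_participants + [my_quality]
    return global_benefit(my_quality, pool) >= local_benefit(my_quality)

# Illustrative utilities: global benefit grows with pool size but shrinks
# with heterogeneity between my data quality and the pool average.
def global_benefit(q, pool):
    avg = sum(pool) / len(pool)
    return len(pool) ** 0.5 - 5.0 * abs(q - avg)

def local_benefit(q):
    return 1.0

join = myopic_decision(0.8, [0.75, 0.85], local_benefit, global_benefit)
# Agents with similar qualities see a small heterogeneity penalty and join as
# a group, echoing the stage-game equilibrium structure described above.
```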
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Beyond Overfitting: Doubly Adaptive Dropout for Generalizable AU Detection
Authors:
Yong Li,
Yi Ren,
Xuesong Niu,
Yi Ding,
Xiu-Shen Wei,
Cuntai Guan
Abstract:
Facial Action Units (AUs) are essential for conveying psychological states and emotional expressions. While automatic AU detection systems leveraging deep learning have progressed, they often overfit to specific datasets and individual features, limiting their cross-domain applicability. To overcome these limitations, we propose a doubly adaptive dropout approach for cross-domain AU detection, whi…
▽ More
Facial Action Units (AUs) are essential for conveying psychological states and emotional expressions. While automatic AU detection systems leveraging deep learning have progressed, they often overfit to specific datasets and individual features, limiting their cross-domain applicability. To overcome these limitations, we propose a doubly adaptive dropout approach for cross-domain AU detection, which enhances the robustness of convolutional feature maps and spatial tokens against domain shifts. This approach includes a Channel Drop Unit (CD-Unit) and a Token Drop Unit (TD-Unit), which work together to reduce domain-specific noise at both the channel and token levels. The CD-Unit preserves domain-agnostic local patterns in feature maps, while the TD-Unit helps the model identify AU relationships generalizable across domains. An auxiliary domain classifier, integrated at each layer, guides the selective omission of domain-sensitive features. To prevent excessive feature dropout, a progressive training strategy is used, allowing for selective exclusion of sensitive features at any model layer. Our method consistently outperforms existing techniques in cross-domain AU detection, as demonstrated by extensive experimental evaluations. Visualizations of attention maps also highlight clear and meaningful patterns related to both individual and combined AUs, further validating the approach's effectiveness.
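To make the channel-level half of the idea concrete, here is a sketch of a CD-Unit-like module under assumed details: an auxiliary domain head supplies a per-channel domain-sensitivity proxy, and the most sensitive channels are zeroed during training.

```python
import torch
import torch.nn as nn

class ChannelDropUnit(nn.Module):
    """Sketch of a CD-Unit-like block; the sensitivity proxy (domain-head
    weight magnitude per channel) is an illustrative simplification."""
    def __init__(self, channels: int, n_domains: int, drop_ratio: float = 0.1):
        super().__init__()
        self.domain_head = nn.Linear(channels, n_domains)
        self.drop_ratio = drop_ratio

    def forward(self, feat):                       # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))             # (B, C)
        sensitivity = self.domain_head.weight.abs().sum(dim=0)   # (C,)
        k = int(self.drop_ratio * feat.size(1))
        if self.training and k > 0:
            mask = torch.ones(feat.size(1), device=feat.device)
            mask[sensitivity.topk(k).indices] = 0.0   # drop sensitive channels
            feat = feat * mask.view(1, -1, 1, 1)
        # The returned logits train the auxiliary domain classifier.
        return feat, self.domain_head(pooled)

cd = ChannelDropUnit(channels=64, n_domains=3)
out, domain_logits = cd(torch.randn(4, 64, 14, 14))
```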
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
DogLegs: Robust Proprioceptive State Estimation for Legged Robots Using Multiple Leg-Mounted IMUs
Authors:
Yibin Wu,
Jian Kuang,
Shahram Khorshidi,
Xiaoji Niu,
Lasse Klingbeil,
Maren Bennewitz,
Heiner Kuhlmann
Abstract:
Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders,…
▽ More
Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders, and multiple leg-mounted IMUs (Leg-IMU) using an extended Kalman filter (EKF). The filter system contains the error states of all IMU frames. The Leg-IMUs are used to detect foot contact, thereby providing zero-velocity measurements to update the state of the Leg-IMU frames. Additionally, we compute the relative position constraints between the Body-IMU and Leg-IMUs by the leg kinematics and use them to update the main body state and reduce the error drift of the individual IMU frames. Field experimental results have shown that our proposed DogLegs system achieves better state estimation accuracy compared to the traditional leg odometry method (using only Body-IMU and joint encoders) across various terrains. We make our datasets publicly available to benefit the research community (https://github.com/YibinWu/leg-odometry).
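The zero-velocity update at detected foot contact is the heart of the Leg-IMU correction; the sketch below shows that EKF step on a deliberately simplified [position, velocity] state, whereas the paper's filter jointly tracks the error states of all IMU frames.

```python
import numpy as np

def zupt_update(x, P, R=None):
    """EKF update with the pseudo-measurement v = 0 during foot contact."""
    R = 1e-4 * np.eye(3) if R is None else R
    H = np.hstack([np.zeros((3, 3)), np.eye(3)])   # observe the velocity block
    y = np.zeros(3) - H @ x                        # innovation (stance: v = 0)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

x = np.array([0.0, 0.0, 0.0, 0.05, -0.02, 0.01])   # drifted velocity estimate
P = 0.1 * np.eye(6)
x, P = zupt_update(x, P)    # velocity pulled toward zero, covariance shrinks
```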
△ Less
Submitted 25 July, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning
Authors:
Zanting Ye,
Xiaolong Niu,
Xu Han,
Xuanbin Wu,
Wantong Lu,
Yijun Lu,
Hao Sun,
Yanchao Huang,
Hubing Wu,
Lijun Lu
Abstract:
Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on coregistered Computed Tomography (CT) annotations, overlookin…
▽ More
Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on coregistered Computed Tomography (CT) annotations, overlooking the problem of modality mismatch. In this study, we propose LDOS, a novel CT-free ultra-LDPET organ segmentation pipeline. Inspired by Masked Autoencoders (MAE), we reinterpret LDPET as a naturally masked version of Full-Dose PET (FDPET). LDOS adopts a simple yet effective architecture: a shared encoder extracts generalized features, while task-specific decoders independently refine outputs for denoising and segmentation. By integrating CT-derived organ annotations into the denoising process, LDOS improves anatomical boundary recognition and alleviates the PET/CT misalignments. Experiments demonstrate that LDOS achieves state-of-the-art performance with mean Dice scores of 73.11% (18F-FDG) and 73.97% (68Ga-FAPI) across 18 organs in 5% dose PET. Our code will be available at https://github.com/yezanting/LDOS.
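The shared-encoder, task-specific-decoder layout reads roughly as below; layer widths and the toy volume size are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LDOSLike(nn.Module):
    """Toy shared encoder with independent denoising and segmentation heads."""
    def __init__(self, ch=16, n_organs=18):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.denoise_head = nn.Conv3d(ch, 1, 1)          # predicts full-dose PET
        self.seg_head = nn.Conv3d(ch, n_organs + 1, 1)   # organs + background

    def forward(self, ldpet):
        feats = self.encoder(ldpet)   # generalized features shared by both tasks
        return self.denoise_head(feats), self.seg_head(feats)

model = LDOSLike()
denoised, seg_logits = model(torch.randn(1, 1, 32, 64, 64))
# Training couples the tasks: an image loss on `denoised` against full-dose
# PET and a segmentation loss on `seg_logits` against CT-derived labels.
```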
△ Less
Submitted 26 June, 2025; v1 submitted 4 March, 2025;
originally announced March 2025.
-
SleepGMUformer: A gated multimodal temporal neural network for sleep staging
Authors:
Chenjun Zhao,
Xuesen Niu,
Xinglin Yu,
Long Chen,
Na Lv,
Huiyu Zhou,
Aite Zhao
Abstract:
Sleep staging is a key method for assessing sleep quality and diagnosing sleep disorders. However, current deep learning methods face challenges: 1) postfusion techniques ignore the varying contributions of different modalities; 2) unprocessed sleep data can interfere with frequency-domain information. To tackle these issues, this paper proposes a gated multimodal temporal neural network for multi…
▽ More
Sleep staging is a key method for assessing sleep quality and diagnosing sleep disorders. However, current deep learning methods face challenges: 1) postfusion techniques ignore the varying contributions of different modalities; 2) unprocessed sleep data can interfere with frequency-domain information. To tackle these issues, this paper proposes a gated multimodal temporal neural network for multidomain sleep data, including heart rate, motion, steps, EEG (Fpz-Cz, Pz-Oz), and EOG from WristHR-Motion-Sleep and SleepEDF-78. The model integrates: 1) a pre-processing module for feature alignment, missing value handling, and EEG de-trending; 2) a feature extraction module for complex sleep features in the time dimension; and 3) a dynamic fusion module for real-time modality weighting. Experiments show classification accuracies of 85.03% on SleepEDF-78 and 94.54% on WristHR-Motion-Sleep datasets. The model handles heterogeneous datasets and outperforms state-of-the-art models by 1.00%-4.00%.
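The dynamic fusion module can be pictured as a learned per-sample gate over modality features, as in the sketch below; dimensions and the two-modality setup are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Weights each modality per sample with a learned softmax gate."""
    def __init__(self, dim: int, n_modalities: int):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):                    # list of (B, dim) tensors
        stacked = torch.stack(feats, dim=1)      # (B, M, dim)
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)    # (B, dim)

fusion = GatedFusion(dim=64, n_modalities=2)
eeg, heart_rate = torch.randn(8, 64), torch.randn(8, 64)
fused = fusion([eeg, heart_rate])   # gate adapts modality weights per sample
```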
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
Authors:
Weizhi Fei,
Xueyan Niu,
Guoqing Xie,
Yingqing Liu,
Bo Bai,
Wei Han
Abstract:
Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transforme…
▽ More
Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through" input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.
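The core mechanism can be sketched as follows, under the assumption that the evaluator heads have already been identified offline: their attention maps score input tokens, and only the top fraction is forwarded for full inference.

```python
import torch

def compress_prompt(attn_maps, keep_ratio=0.3, evaluator_heads=((0, 3), (1, 7))):
    """attn_maps: per-layer tensors of shape (heads, seq, seq) from the first
    few layers; evaluator_heads: (layer, head) pairs assumed found offline."""
    seq_len = attn_maps[0].size(-1)
    scores = torch.zeros(seq_len)
    for layer, head in evaluator_heads:
        # Importance of token j = attention it receives, summed over queries.
        scores += attn_maps[layer][head].sum(dim=0)
    k = max(1, int(keep_ratio * seq_len))
    return scores.topk(k).indices.sort().values   # keep original token order

attn = [torch.rand(12, 100, 100) for _ in range(2)]   # toy 2-layer attention
kept_token_idx = compress_prompt(attn)   # pass only these tokens onward
```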
△ Less
Submitted 5 February, 2025; v1 submitted 22 January, 2025;
originally announced January 2025.
-
Wheel-GINS: A GNSS/INS Integrated Navigation System with a Wheel-mounted IMU
Authors:
Yibin Wu,
Jian Kuang,
Xiaoji Niu,
Cyrill Stachniss,
Lasse Klingbeil,
Heiner Kuhlmann
Abstract:
A long-term accurate and robust localization system is essential for mobile robots to operate efficiently outdoors. Recent studies have shown the significant advantages of the wheel-mounted inertial measurement unit (Wheel-IMU)-based dead reckoning system. However, it still drifts over extended periods because of the absence of external correction signals. To achieve the goal of long-term accurate…
▽ More
A long-term accurate and robust localization system is essential for mobile robots to operate efficiently outdoors. Recent studies have shown the significant advantages of the wheel-mounted inertial measurement unit (Wheel-IMU)-based dead reckoning system. However, it still drifts over extended periods because of the absence of external correction signals. To achieve the goal of long-term accurate localization, we propose Wheel-GINS, a Global Navigation Satellite System (GNSS)/inertial navigation system (INS) integrated navigation system using a Wheel-IMU. Wheel-GINS fuses the GNSS position measurement with the Wheel-IMU via an extended Kalman filter to limit the long-term error drift and provide continuous state estimation when the GNSS signal is blocked. Considering the specificities of the GNSS/Wheel-IMU integration, we conduct detailed modeling and online estimation of the Wheel-IMU installation parameters, including the Wheel-IMU lever arm and mounting angle and the wheel radius error. Experimental results have shown that Wheel-GINS outperforms the traditional GNSS/Odometer/INS integrated navigation system during GNSS outages. At the same time, Wheel-GINS can effectively estimate the Wheel-IMU installation parameters online and, consequently, improve the localization accuracy and practicality of the system. The source code of our implementation is publicly available (https://github.com/i2Nav-WHU/Wheel-GINS).
△ Less
Submitted 6 January, 2025;
originally announced January 2025.
-
Zero-resource Speech Translation and Recognition with LLMs
Authors:
Karel Mundnich,
Xing Niu,
Prashant Mathur,
Srikanth Ronanki,
Brady Houston,
Veera Raghavendra Elluru,
Nilaksh Das,
Zejiang Hou,
Goeric Huybrechts,
Anshu Bhatia,
Daniel Garcia-Romero,
Kyu J. Han,
Katrin Kirchhoff
Abstract:
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a m…
▽ More
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments both in ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2\%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
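The lightweight adaptation module can be sketched as a strided downsampler plus an MLP projecting into the LLM's token-embedding space; the dimensions and downsampling factor below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps frozen speech-encoder features to LLM embedding space so the LLM
    can consume audio as a soft-token prefix."""
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.down = nn.Conv1d(speech_dim, speech_dim,
                              kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(speech_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, speech_feats):               # (B, T, speech_dim)
        x = self.down(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                        # (B, T/stride, llm_dim)

adapter = SpeechAdapter()
soft_tokens = adapter(torch.randn(1, 400, 1024))   # prepend to LLM input embeds
```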
△ Less
Submitted 30 December, 2024; v1 submitted 24 December, 2024;
originally announced December 2024.
-
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Authors:
Lucas Goncalves,
Prashant Mathur,
Xing Niu,
Brady Houston,
Chandrashekhar Lavania,
Srikanth Vishnubhotla,
Lijia Sun,
Anthony Ferritto
Abstract:
Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony-ensuring that the movements of the lips match the spoken content-essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been lar…
▽ More
Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony: ensuring that the movements of the lips match the spoken content is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
△ Less
Submitted 21 December, 2024;
originally announced December 2024.
-
Yi-Lightning Technical Report
Authors:
Alan Wake,
Bei Chen,
C. X. Lv,
Chao Li,
Chengen Huang,
Chenglin Cai,
Chujie Zheng,
Daniel Cooper,
Fan Zhou,
Feng Hu,
Ge Zhang,
Guoyin Wang,
Heng Ji,
Howard Qiu,
Jiangcheng Zhu,
Jun Tian,
Katherine Su,
Lihuan Zhang,
Liying Li,
Ming Song,
Mou Li,
Peng Liu,
Qicheng Hu,
Shawn Wang,
Shijun Zhou
, et al. (19 additional authors not shown)
Abstract:
This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert seg…
▽ More
This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at https://platform.lingyiwanwu.com.
△ Less
Submitted 22 January, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback
Authors:
Xuexiang Niu,
Jinping Tang,
Lei Wang,
Ge Zhu
Abstract:
Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, due to the lack of focus in feedback content, especially regarding the object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-turning method…
▽ More
Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, due to the lack of focus in feedback content, especially regarding the object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, comprising three stages. First, images generated by the diffusion model are passed through an object detector to obtain object categories and quantities. Meanwhile, confidences for category and quantity can be derived from the detection results and the given prompts. Next, we define a novel matching score, based on these confidences, to measure text-image alignment; it guides the model's feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients to generate semantically related images. Unlike previous feedback approaches that focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. In addition, we construct a text-to-image dataset for studying compositional generation, including 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. Our model can also serve as a metric for evaluating text-image alignment in other models. All code and the dataset are available at https://github.com/kingniu0329/Visions.
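One plausible shape for the detection-based matching score is sketched below; the functional form and the detector interface are assumptions, not the paper's exact formulation.

```python
def matching_score(detections, prompt_spec):
    """detections: {category: (count, confidence)} from an object detector;
    prompt_spec: {category: requested_count} parsed from the prompt."""
    score, total = 0.0, max(len(prompt_spec), 1)
    for cat, want in prompt_spec.items():
        got, conf = detections.get(cat, (0, 0.0))
        count_term = 1.0 - min(abs(got - want) / max(want, 1), 1.0)
        score += conf * count_term      # rewards both presence and quantity
    return score / total                # in [0, 1]; usable as an RL reward

r = matching_score({"cat": (2, 0.9), "dog": (1, 0.8)}, {"cat": 2, "dog": 2})
# The diffusion model is then fine-tuned by backpropagating gradients of a
# reward built from this score, emphasizing entity categories and quantities.
```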
△ Less
Submitted 28 November, 2024;
originally announced December 2024.
-
Reconfigurable Intelligent Surface-Assisted Multiple-Antenna Coded Caching
Authors:
Xiaofan Niu,
Minquan Cheng,
Kai Wan,
Robert Caiming Qiu,
Giuseppe Caire
Abstract:
Reconfigurable Intelligent Surface (RIS) has emerged as a promising technology to enhance the wireless propagation environment for next-generation wireless communication systems. This paper introduces a new RIS-assisted multiple-antenna coded caching problem. Unlike the existing multi-antenna coded caching models, our considered model incorporates a passive RIS with a limited number of elements ai…
▽ More
Reconfigurable Intelligent Surface (RIS) has emerged as a promising technology to enhance the wireless propagation environment for next-generation wireless communication systems. This paper introduces a new RIS-assisted multiple-antenna coded caching problem. Unlike the existing multi-antenna coded caching models, our considered model incorporates a passive RIS with a limited number of elements aimed at enhancing the multicast gain (i.e., Degrees of Freedom (DoF)). The system consists of a server equipped with multiple antennas and several single-antenna users. The RIS, which functions as a passive and configurable relay, improves communication by selectively erasing certain transmission paths between transmit and receive antennas, thereby reducing interference. We first propose a new RIS-assisted interference nulling algorithm to determine the phase-shift coefficients of the RIS. This algorithm achieves faster convergence compared to the existing approach. By strategically nulling certain interference paths in each time slot, the transmission process is divided into multiple interference-free groups. Each group consists of a set of transmit antennas that serve a corresponding set of users without any interference from other groups. The optimal grouping strategy to maximize the DoF is formulated as a combinatorial optimization problem. To efficiently solve this, we design a low-complexity algorithm that identifies the optimal solution and develops a corresponding coded caching scheme to achieve the maximum DoF. Building on the optimal grouping strategy, we introduce a new framework, referred to as RIS-assisted Multiple-Antenna Placement Delivery Array (RMAPDA), to construct the cache placement and delivery phases. Then we propose a general RMAPDA design to achieve the maximum DoF under the optimal grouping strategy.
△ Less
Submitted 13 November, 2025; v1 submitted 28 November, 2024;
originally announced November 2024.
-
Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning for Imbalanced Multiclassification of Whole Slide Image
Authors:
Yonghuang Wu,
Xuan Xie,
Xinyuan Niu,
Chengqian Zhao,
Jinhua Yu
Abstract:
Pathology computing has dramatically improved pathologists' workflow and diagnostic decision-making processes. Although computer-aided diagnostic systems have shown considerable value in whole slide image (WSI) analysis, the problem of multi-classification under sample imbalance remains an intractable challenge. To address this, we propose learning fine-grained information by generating sub-bags w…
▽ More
Pathology computing has dramatically improved pathologists' workflow and diagnostic decision-making processes. Although computer-aided diagnostic systems have shown considerable value in whole slide image (WSI) analysis, the problem of multi-classification under sample imbalance remains an intractable challenge. To address this, we propose learning fine-grained information by generating sub-bags with feature distributions similar to the original WSIs. Additionally, we utilize a pseudo-bag generation algorithm to further leverage the abundant and redundant information in WSIs, allowing efficient training in unbalanced-sample multi-classification tasks. Furthermore, we introduce an affinity-based sample selection and curriculum contrastive learning strategy to enhance the stability of model representation learning. Unlike previous approaches, our framework transitions from learning bag-level representations to understanding and exploiting the feature distribution of multi-instance bags. Our method demonstrates significant performance improvements on three datasets, including tumor classification and lymph node metastasis. On average, it achieves a 4.39-point improvement in F1 score compared to the second-best method across the three tasks, underscoring its superior performance.
△ Less
Submitted 17 November, 2024;
originally announced November 2024.
-
Findings of the IWSLT 2024 Evaluation Campaign
Authors:
Ibrahim Said Ahmad,
Antonios Anastasopoulos,
Ondřej Bojar,
Claudia Borg,
Marine Carpuat,
Roldano Cattoni,
Mauro Cettolo,
William Chen,
Qianqian Dong,
Marcello Federico,
Barry Haddow,
Dávid Javorský,
Mateusz Krubiński,
Tsz Kin Lam,
Xutai Ma,
Prashant Mathur,
Evgeny Matusov,
Chandresh Maurya,
John McCrae,
Kenton Murray,
Satoshi Nakamura,
Matteo Negri,
Jan Niehues,
Xing Niu,
Atul Kr. Ojha
, et al. (20 additional authors not shown)
Abstract:
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in…
▽ More
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest in spoken language translation is also evidenced by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.
△ Less
Submitted 7 November, 2024;
originally announced November 2024.
-
Leveraging Label Semantics and Meta-Label Refinement for Multi-Label Question Classification
Authors:
Shi Dong,
Xiaobei Niu,
Rui Zhong,
Zhifeng Wang,
Mingzhang Zuo
Abstract:
Accurate annotation of educational resources is crucial for effective personalized learning and resource recommendation in online education. However, fine-grained knowledge labels often overlap or share similarities, making it difficult for existing multi-label classification methods to differentiate them. The label distribution imbalance due to sparsity of human annotations further intensifies th…
▽ More
Accurate annotation of educational resources is crucial for effective personalized learning and resource recommendation in online education. However, fine-grained knowledge labels often overlap or share similarities, making it difficult for existing multi-label classification methods to differentiate them. The label distribution imbalance due to the sparsity of human annotations further intensifies these challenges. To address these issues, this paper introduces RR2QC, a novel Retrieval Reranking method for multi-label Question Classification by leveraging label semantics and meta-label refinement. First, RR2QC improves the pre-training strategy by utilizing semantic relationships within and across label groups. Second, it introduces a class center learning task to align questions with label semantics during downstream training. Finally, this method decomposes labels into meta-labels and uses a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels that frequently appear in other labels. Additionally, a mathematical LLM is used to generate solutions for questions, extracting latent information to further refine the model's insights. Experimental results show that RR2QC outperforms existing methods in Precision@K and F1 scores across multiple educational datasets, demonstrating its effectiveness for online education applications. The code and datasets are available at https://github.com/78Erii/RR2QC.
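The final rerank step might look like the sketch below, where retrieval scores are blended with meta-label classifier probabilities; the blending rule and the underscore-based decomposition are illustrative assumptions.

```python
def rerank(retrieved_labels, meta_probs, decompose, alpha=0.5):
    """retrieved_labels: [(label, retrieval_score)];
    meta_probs: {meta_label: probability from the meta-label classifier};
    decompose: maps a label to its list of meta-labels."""
    def final_score(label, ret_score):
        metas = decompose(label)
        meta_score = sum(meta_probs.get(m, 0.0) for m in metas) / max(len(metas), 1)
        # Long-tail labels gain from meta-labels shared with frequent labels.
        return alpha * ret_score + (1 - alpha) * meta_score
    return sorted(retrieved_labels, key=lambda ls: final_score(*ls), reverse=True)

ranked = rerank(
    [("quadratic-function_roots", 0.62), ("linear-equation_roots", 0.58)],
    {"roots": 0.9, "quadratic-function": 0.4, "linear-equation": 0.7},
    lambda label: label.split("_"),
)
```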
△ Less
Submitted 25 April, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Learning predictable and robust neural representations by straightening image sequences
Authors:
Xueyan Niu,
Cristina Savin,
Eero P. Simoncelli
Abstract:
Prediction is a fundamental capability of all living organisms, and has been proposed as an objective for learning sensory representations. Recent work demonstrates that in primate visual systems, prediction is facilitated by neural representations that follow straighter temporal trajectories than their initial photoreceptor encoding, which allows for prediction by linear extrapolation. Inspired b…
▽ More
Prediction is a fundamental capability of all living organisms, and has been proposed as an objective for learning sensory representations. Recent work demonstrates that in primate visual systems, prediction is facilitated by neural representations that follow straighter temporal trajectories than their initial photoreceptor encoding, which allows for prediction by linear extrapolation. Inspired by these experimental findings, we develop a self-supervised learning (SSL) objective that explicitly quantifies and promotes straightening. We demonstrate the power of this objective in training deep feedforward neural networks on smoothly-rendered synthetic image sequences that mimic commonly-occurring properties of natural videos. The learned model contains neural embeddings that are predictive, but also factorize the geometric, photometric, and semantic attributes of objects. The representations also prove more robust to noise and adversarial attacks compared to previous SSL methods that optimize for invariance to random augmentations. Moreover, these beneficial properties can be transferred to other training procedures by using the straightening objective as a regularizer, suggesting a broader utility for straightening as a principle for robust unsupervised learning.
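Straightening has a simple operational reading: successive displacement vectors of a representation trajectory should point in nearly the same direction. The sketch below scores a trajectory this way and uses the score as a differentiable SSL term; the paper's exact objective may differ.

```python
import torch

def straightness(z):                    # z: (T, dim) embeddings over time
    d = z[1:] - z[:-1]                  # successive displacement vectors
    d = d / (d.norm(dim=1, keepdim=True) + 1e-8)
    cos = (d[1:] * d[:-1]).sum(dim=1)   # cosine between adjacent steps
    return cos.mean()                   # 1.0 means a perfectly straight path

z = torch.randn(10, 128, requires_grad=True)   # toy trajectory of embeddings
loss = -straightness(z)                 # maximize straightness as an SSL term
loss.backward()                         # also usable as a plug-in regularizer
```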
△ Less
Submitted 20 January, 2025; v1 submitted 3 November, 2024;
originally announced November 2024.
-
Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
Authors:
Ognjen Rudovic,
Pranay Dighe,
Yi Su,
Vineet Garg,
Sameer Dharur,
Xiaochuan Niu,
Ahmed H. Abdelaziz,
Saurabh Adya,
Ahmed Tewfik
Abstract:
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and mode…
▽ More
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the use of Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either via prompting of a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) thanks to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
△ Less
Submitted 4 November, 2024; v1 submitted 28 October, 2024;
originally announced November 2024.
-
Efficient Diversity-based Experience Replay for Deep Reinforcement Learning
Authors:
Kaiyan Zhao,
Yiming Wang,
Yuyang Chen,
Yan Li,
Leong Hou U,
Xiaoguang Niu
Abstract:
Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity…
▽ More
Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
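The flavor of diversity-prioritized sampling is sketched below: an RBF kernel over candidate transitions gives a DPP-style log-determinant diversity score (via a Cholesky factor), and the most diverse of several candidate batches is kept as a simple stand-in for the paper's rejection-sampling rule; kernel and acceptance details are assumptions.

```python
import numpy as np

def diversity_logdet(states, bandwidth=1.0):
    """log det of an RBF kernel matrix: higher means a more diverse batch."""
    sq = ((states[:, None, :] - states[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth ** 2)) + 1e-6 * np.eye(len(states))
    L = np.linalg.cholesky(K)                  # stable O(n^3) factorization
    return 2.0 * np.log(np.diag(L)).sum()

def sample_diverse_batch(buffer, batch_size, rng, n_draws=10):
    best_score, best_idx = -np.inf, None
    for _ in range(n_draws):
        idx = rng.choice(len(buffer), batch_size, replace=False)
        score = diversity_logdet(buffer[idx])
        if score > best_score:
            best_score, best_idx = score, idx  # keep the most diverse draw
    return best_idx

rng = np.random.default_rng(0)
buffer = rng.normal(size=(1000, 8))            # toy replay buffer of states
batch_idx = sample_diverse_batch(buffer, 32, rng)
```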
△ Less
Submitted 18 May, 2025; v1 submitted 27 October, 2024;
originally announced October 2024.