Showing 1–50 of 723 results for author: Wu, B

Searching in archive cs.
  1. arXiv:2507.16632  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Step-Audio 2 Technical Report

    Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, et al. (84 additional authors not shown)

    Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech convers…

    Submitted 22 July, 2025; originally announced July 2025.

  2. arXiv:2507.13575  [pdf, ps, other]

    cs.LG cs.AI

    Apple Intelligence Foundation Language Models: Tech Report 2025

    Authors: Hanzhi Zhou, Erik Hornberger, Pengsheng Guo, Xiyou Zhou, Saiwen Wang, Xin Wang, Yifei He, Xuankai Chang, Rene Rauch, Louis D'hauwe, John Peebles, Alec Doane, Kohen Chia, Jenna Thibodeau, Zi-Yi Dou, Yuanyang Zhang, Ruoming Pang, Reed Li, Zhifeng Chen, Jeremy Warner, Zhaoyang Xu, Sophy Lee, David Mizrahi, Ramsey Tantawi, Chris Chaney, et al. (370 additional authors not shown)

    Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transform…

    Submitted 17 July, 2025; originally announced July 2025.

  3. arXiv:2507.10606  [pdf, ps, other]

    cs.LG cs.AI cs.AR

    DALI-PD: Diffusion-based Synthetic Layout Heatmap Generation for ML in Physical Design

    Authors: Bing-Yue Wu, Vidya A. Chhabria

    Abstract: Machine learning (ML) has demonstrated significant promise in various physical design (PD) tasks. However, model generalizability remains limited by the availability of high-quality, large-scale training datasets. Creating such datasets is often computationally expensive and constrained by IP. While very few public datasets are available, they are typically static, slow to generate, and require fr…

    Submitted 13 July, 2025; originally announced July 2025.

    Comments: Under review at Asia and South Pacific Design Automation Conference (ASP-DAC'26)

  4. arXiv:2507.09315  [pdf, ps, other]

    cs.SE cs.AI

    Enhancing Interpretability in Software Change Management with Chain-of-Thought Reasoning

    Authors: Yongqian Sun, Weihua Kuang, Chao Shen, Xidao Wen, Tinghua Zheng, Heng Liu, Shenglin Zhang, Bo Wu, Dan Pei

    Abstract: In modern online services, frequent software changes introduce significant risks. To tackle this challenge, we propose SCELM (Software Change Evaluation and Lifecycle Management), an end-to-end automated framework for software change management. SCELM aims to manage software changes efficiently and precisely, significantly reducing service failures and economic losses.

    Submitted 12 July, 2025; originally announced July 2025.

    Comments: 22 pages, 19 figures

  5. arXiv:2507.08854  [pdf, ps, other]

    physics.chem-ph cs.LG

    DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation

    Authors: Qingsong Yang, Binglan Wu, Xuwei Liu, Bo Chen, Wei Li, Gen Long, Xin Chen, Mingjun Xiao

    Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy is a central characterization method for molecular structure elucidation, yet interpreting NMR spectra to deduce molecular structures remains challenging due to the complexity of spectral data and the vastness of the chemical space. In this work, we introduce DiffNMR, a novel end-to-end framework that leverages a conditional discrete diffusion model fo…

    Submitted 9 July, 2025; originally announced July 2025.

  6. arXiv:2507.04036  [pdf, ps, other]

    cs.CV

    PresentAgent: Multimodal Agent for Presentation Video Generation

    Authors: Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao

    Abstract: We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs…

    Submitted 5 July, 2025; originally announced July 2025.

  7. arXiv:2507.00856  [pdf, ps, other]

    cs.NI eess.SP

    Enhancing Vehicular Platooning with Wireless Federated Learning: A Resource-Aware Control Framework

    Authors: Beining Wu, Jun Huang, Qiang Duan, Liang Dong, Zhipeng Cai

    Abstract: This paper aims to enhance the performance of Vehicular Platooning (VP) systems integrated with Wireless Federated Learning (WFL). In highly dynamic environments, vehicular platoons experience frequent communication changes and resource constraints, which significantly affect information exchange and learning model synchronization. To address these challenges, we first formulate WFL in VP as a joi…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Under review at IEEE Transactions on Networking

  8. arXiv:2506.23056  [pdf, ps, other]

    cs.CL

    Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

    Authors: Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen

    Abstract: Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges large…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Main

  9. arXiv:2506.22930  [pdf, ps, other]

    cs.CV

    Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

    Authors: Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng

    Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual mu…

    Submitted 28 June, 2025; originally announced June 2025.

  10. arXiv:2506.20179  [pdf, ps, other]

    cs.CV cs.AI eess.IV

    Progressive Alignment Degradation Learning for Pansharpening

    Authors: Enzhe Zhao, Zhichang Guo, Yao Li, Fanghui Song, Boying Wu

    Abstract: Deep learning-based pansharpening has been shown to effectively generate high-resolution multispectral (HRMS) images. To create supervised ground-truth HRMS images, synthetic data generated using the Wald protocol is commonly employed. This protocol assumes that networks trained on artificial low-resolution data will perform equally well on high-resolution data. However, well-trained models typica…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 13 pages, 9 figures

  11. arXiv:2506.17368  [pdf, ps, other]

    cs.LG cs.AI cs.CR

    SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

    Authors: Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu

    Abstract: Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE model's…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 9 pages, 7 figures

  12. arXiv:2506.15679  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Dense SAE Latents Are Features, Not Bugs

    Authors: Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark

    Abstract: Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are dense), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we sy…

    Submitted 18 June, 2025; originally announced June 2025.

  13. arXiv:2506.13192  [pdf, ps, other]

    cs.CL cs.AI

    Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs

    Authors: Xintong Tang, Meiru Zhang, Shang Xiao, Junzhao Jin, Zihan Zhao, Liwei Li, Yang Zheng, Bangyi Wu

    Abstract: Large language models (LLMs) are often constrained by rigid reasoning processes, limiting their ability to generate creative and diverse responses. To address this, a novel framework called LADDER is proposed, combining Chain-of-Thought (CoT) reasoning, Mixture of Experts (MoE) models, and multi-dimensional up/down-sampling strategies which breaks the limitations of traditional LLMs. First, CoT re…

    Submitted 16 June, 2025; originally announced June 2025.

  14. arXiv:2506.11991  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    VGR: Visual Grounded Reasoning

    Authors: Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

    Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations,…

    Submitted 16 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

    Comments: 9 pages, 4 figures

  15. arXiv:2506.09507  [pdf, ps, other]

    cs.CL cs.AI

    TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

    Authors: Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo

    Abstract: Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit R…

    Submitted 18 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  16. arXiv:2506.09226  [pdf, ps, other]

    cs.DB cs.DC cs.PF

    Terabyte-Scale Analytics in the Blink of an Eye

    Authors: Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, Rathijit Sen

    Abstract: For the past two decades, the DB community has devoted substantial research to take advantage of cheap clusters of machines for distributed data analytics -- we believe that we are at the beginning of a paradigm shift. The scaling laws and popularity of AI models lead to the deployment of incredibly powerful GPU clusters in commercial data centers. Compared to CPU-only solutions, these clusters de…

    Submitted 10 June, 2025; originally announced June 2025.

  17. arXiv:2506.08967  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  18. arXiv:2506.05484  [pdf, ps, other]

    cs.LG physics.geo-ph

    Initial Model Incorporation for Deep Learning FWI: Pretraining or Denormalization?

    Authors: Ruihua Chen, Bangyu Wu, Meng Li, Kai Yang

    Abstract: Subsurface property neural network reparameterized full waveform inversion (FWI) has emerged as an effective unsupervised learning framework, which can invert stably with an inaccurate starting model. It updates the trainable neural network parameters instead of fine-tuning on the subsurface model directly. There are primarily two ways to embed the prior knowledge of the initial model into neural…

    Submitted 5 June, 2025; originally announced June 2025.

  19. Scaling Transformers for Discriminative Recommendation via Generative Pretraining

    Authors: Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, Xiaoyi Zeng

    Abstract: Discriminative recommendation tasks, such as CTR (click-through rate) and CVR (conversion rate) prediction, play critical roles in the ranking stage of large-scale industrial recommender systems. However, training a discriminative model encounters a significant overfitting issue induced by data sparsity. Moreover, this overfitting issue worsens with larger models, causing them to underperform smal…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: KDD'25

  20. arXiv:2506.03197  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

    Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Yanjie Liang, Zuming Huang, Haozhe Wang, Jun Huang, Ling Chen, Wei Chu, Yuan Qi

    Abstract: Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 16 pages, 12 figures

    Report number: INF-CS-TR-2025-02

  21. arXiv:2506.00975  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

    Authors: Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao

    Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures t…

    Submitted 11 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted by ICML 2025

  22. arXiv:2506.00932  [pdf, other]

    cs.LG

    Addressing the Collaboration Dilemma in Low-Data Federated Learning via Transient Sparsity

    Authors: Qiao Xiao, Boqian Wu, Andrey Poddubnyy, Elena Mocanu, Phuong H. Nguyen, Mykola Pechenizkiy, Decebal Constantin Mocanu

    Abstract: Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy, leveraging aggregated updates to build robust global models. However, this training paradigm faces significant challenges due to data heterogeneity and limited local datasets, which often impede effective collaboration. In such scenarios, we identify the Layer-wise Inertia Pheno…

    Submitted 1 June, 2025; originally announced June 2025.

  23. arXiv:2505.24037  [pdf, other]

    cs.AI

    Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

    Authors: Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu

    Abstract: Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce the model size, they struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods,…

    Submitted 29 May, 2025; originally announced May 2025.

  24. arXiv:2505.24034  [pdf, ps, other]

    cs.LG cs.AI

    LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

    Authors: Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, Rui Hou

    Abstract: Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present Llama…

    Submitted 1 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  25. arXiv:2505.22506  [pdf, ps, other]

    cs.LG

    Sparsification and Reconstruction from the Perspective of Representation Geometry

    Authors: Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu

    Abstract: Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vector from language models? What is the relationship between this organizational paradigm and feature disentanglement as well as reconstruction performance? To address the…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 24 pages, 5 figures

    MSC Class: 22-08

    ACM Class: I.2.4; I.2.7

  26. arXiv:2505.21593  [pdf, ps, other]

    cs.CV cs.AI

    Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion

    Authors: Yang Yang, Siming Zheng, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

    Abstract: Recent advances in diffusion based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control focus planes or adjust bokeh intensity, limiting their applicability for controllable optical effects. Moreover, naively extending image-based bokeh methods to video often results in tem…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: project page: https://vivocameraresearch.github.io/any2bokeh/

  27. arXiv:2505.19716  [pdf, ps, other]

    cs.AI

    Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting

    Authors: Yifan Wu, Jingze Shi, Bingheng Wu, Jiayi Zhang, Xiaotian Lin, Nan Tang, Yuyu Luo

    Abstract: Existing chain-of-thought (CoT) distillation methods can effectively transfer reasoning abilities to base models but suffer from two major limitations: excessive verbosity of reasoning traces and inadequate adaptability to problem difficulty. Long reasoning traces significantly increase inference costs, and uniform-length solutions prevent base models from learning adaptive reasoning strategies. T…

    Submitted 26 May, 2025; originally announced May 2025.

  28. arXiv:2505.18327  [pdf, ps, other]

    stat.ML cs.LG math.NA math.OC math.ST stat.CO

    Online Statistical Inference of Constrained Stochastic Optimization via Random Scaling

    Authors: Xinchen Du, Wanrong Zhu, Wei Biao Wu, Sen Na

    Abstract: Constrained stochastic nonlinear optimization problems have attracted significant attention for their ability to model complex real-world scenarios in physics, economics, and biology. As datasets continue to grow, online inference methods have become crucial for enabling real-time decision-making without the need to store historical data. In this work, we develop an online inference procedure for…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 43 pages, 1 figure, 8 tables

  29. arXiv:2505.17910  [pdf, ps, other]

    cs.CV cs.AI

    DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

    Authors: Bin Wu, Wei Wang, Yahui Liu, Zixiang Li, Yao Zhao

    Abstract: Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 22 pages, 13 figures, 5 tables

  30. arXiv:2505.17909  [pdf, ps, other]

    cs.LG cs.AI

    NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling

    Authors: Bram Grooten, Farid Hasanov, Chenxiang Zhang, Qiao Xiao, Boqian Wu, Zahra Atashgahi, Ghada Sokar, Shiwei Liu, Lu Yin, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu

    Abstract: Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Our open-source code is available at https://github.com/bramgrooten/neurotrails

  31. arXiv:2505.16227  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

    Authors: Bohao Wu, Qingyun Wang, Yue Guo

    Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing o…

    Submitted 22 May, 2025; originally announced May 2025.

  32. arXiv:2505.15431  [pdf, ps, other]

    cs.CL

    Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

    Authors: Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, et al. (230 additional authors not shown)

    Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid response…

    Submitted 4 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  33. Physics-Guided Learning of Meteorological Dynamics for Weather Downscaling and Forecasting

    Authors: Yingtao Luo, Shikai Fang, Binqing Wu, Qingsong Wen, Liang Sun

    Abstract: Weather forecasting is essential but remains computationally intensive and physically incomplete in traditional numerical weather prediction (NWP) methods. Deep learning (DL) models offer efficiency and accuracy but often ignore physical laws, limiting interpretability and generalization. We propose PhyDL-NWP, a physics-guided deep learning framework that integrates physical equations with latent…

    Submitted 23 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: Published/Accepted in ACM SIGKDD 2025

  34. arXiv:2505.14299  [pdf, ps, other]

    cs.MA

    Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

    Authors: Zihao Feng, Xiaoxue Wang, Bowen Wu, Weihong Zhong, Zhen Xu, Hailong Cao, Tiejun Zhao, Ying Li, Baoxun Wang

    Abstract: Task-oriented dialogue systems based on Large Language Models (LLMs) have gained increasing attention across various industries and achieved significant results. Current approaches condense complex procedural workflows into a single agent to achieve satisfactory performance on large-scale LLMs. However, these approaches face challenges to achieve comparable performance on fine-tuned lightweight LL…

    Submitted 20 May, 2025; originally announced May 2025.

  35. arXiv:2505.14059  [pdf, ps, other]

    cs.CV

    Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

    Authors: Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang

    Abstract: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitati…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025

  36. arXiv:2505.13299  [pdf, other]

    stat.ML cs.LG math.ST

    Smoothed SGD for quantiles: Bahadur representation and Gaussian approximation

    Authors: Likai Chen, Georg Keilbar, Wei Biao Wu

    Abstract: This paper considers the estimation of quantiles via a smoothed version of the stochastic gradient descent (SGD) algorithm. By smoothing the score function in the conventional SGD quantile algorithm, we achieve monotonicity in the quantile level in that the estimated quantile curves do not cross. We derive non-asymptotic tail probability bounds for the smoothed SGD quantile estimate both for the c…

    Submitted 19 May, 2025; originally announced May 2025.

  37. arXiv:2505.12620  [pdf, ps, other]

    cs.CV

    BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

    Authors: Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

    Abstract: Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of larg…

    Submitted 1 July, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

  38. arXiv:2505.12396  [pdf, ps, other]

    cs.IR

    LLM-CoT Enhanced Graph Neural Recommendation with Harmonized Group Policy Optimization

    Authors: Hailong Luo, Bin Wu, Hongyong Jia, Qingqing Zhu, Lianlei Shan

    Abstract: Graph neural networks (GNNs) have advanced recommender systems by modeling interaction relationships. However, existing graph-based recommenders rely on sparse ID features and do not fully exploit textual information, resulting in low information density within representations. Furthermore, graph contrastive learning faces challenges. Random negative sampling can introduce false negative samples,…

    Submitted 18 May, 2025; originally announced May 2025.

  39. arXiv:2505.11541  [pdf, other]

    cs.CR

    MorphMark: Flexible Adaptive Watermarking for Large Language Models

    Authors: Zongqi Wang, Tianle Gu, Baoyuan Wu, Yujiu Yang

    Abstract: Watermarking by altering token sampling probabilities based on red-green list is a promising method for tracing the origin of text generated by large language models (LLMs). However, existing watermark methods often struggle with a fundamental dilemma: improving watermark effectiveness (the detectability of the watermark) often comes at the cost of reduced text quality. This trade-off limits their…

    Submitted 19 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Main

  40. arXiv:2505.10218  [pdf, ps, other]

    cs.CL

    RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

    Authors: Zongsheng Wang, Kaili Sun, Bowen Wu, Qun Yu, Ying Li, Baoxun Wang

    Abstract: Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construc…

    Submitted 15 May, 2025; originally announced May 2025.

  41. arXiv:2505.08294  [pdf, ps, other]

    cs.CV

    FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units

    Authors: Jian Wang, Baoyuan Wu, Li Liu, Qingshan Liu

    Abstract: The rapid evolution of generative AI has increased the threat of realistic audio-visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework cal…

    Submitted 14 June, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

  42. arXiv:2505.08125  [pdf, ps, other]

    stat.ML cs.LG math.ST

    Sharp Gaussian approximations for Decentralized Federated Learning

    Authors: Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu

    Abstract: Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implica…

    Submitted 12 May, 2025; originally announced May 2025.

  43. arXiv:2505.07360  [pdf, ps, other]

    cs.SE

    BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models

    Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu

    Abstract: Binary analysis remains pivotal in software security, offering insights into compiled programs without source code access. As large language models (LLMs) continue to excel in diverse language understanding and generation tasks, their potential in decoding complex binary data structures becomes evident. However, the lack of standardized benchmarks in this domain limits the assessment and compariso…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 23 pages, 5 figures, to be published in IJCAI 2025

  44. arXiv:2505.04254  [pdf, other]

    cs.SE

    CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System

    Authors: Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, Nenghai Yu

    Abstract: With open-source projects growing in size and complexity, manual compilation becomes tedious and error-prone, highlighting the need for automation to improve efficiency and accuracy. However, the complexity of compilation instruction search and error resolution makes automatic compilation challenging. Inspired by the success of LLM-based agents in various fields, we propose CompileAgent, the first…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 12 pages, 4 figures

  45. arXiv:2505.04147  [pdf, other]

    cs.CV cs.AI

    R^3-VQA: "Read the Room" by Video Social Reasoning

    Authors: Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, Lifeng Fan

    Abstract: "Read the room" is a significant social reasoning capability in human daily life. Humans can infer others' mental states from subtle social cues. Previous social reasoning tasks and datasets lack complexity (e.g., simple scenes, basic interactions, incomplete mental state variables, single-step reasoning, etc.) and fall far short of the challenges present in real-life social interactions. In this…

    Submitted 7 May, 2025; originally announced May 2025.

  46. arXiv:2505.02704  [pdf, ps, other]

    cs.CV

    VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

    Authors: Bojin Wu, Jing Chen

    Abstract: Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to…

    Submitted 13 July, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: 19 pages, conference

  47. arXiv:2505.02156  [pdf, other]

    cs.CL cs.AI cs.LG

    Adaptive Thinking via Mode Policy Optimization for Social Language Agents

    Authors: Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao

    Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an…

    Submitted 22 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

    Comments: Work in Progress. The code and data are available, see https://github.com/MozerWang/AMPO

  48. arXiv:2504.21278  [pdf, other]

    cs.MA

    Robust Multi-agent Communication Based on Decentralization-Oriented Adversarial Training

    Authors: Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, Qing Wang

    Abstract: In typical multi-agent reinforcement learning (MARL) problems, communication is important for agents to share information and make the right decisions. However, due to the complexity of training multi-agent communication, existing methods often fall into the dilemma of local optimization, which leads to the concentration of communication in a limited number of channels and presents an unbalanced s…

    Submitted 29 April, 2025; originally announced April 2025.

  49. arXiv:2504.20801  [pdf, other]

    cs.CR cs.SE

    Unlocking User-oriented Pages: Intention-driven Black-box Scanner for Real-world Web Applications

    Authors: Weizhe Wang, Yao Zhang, Kaitai Liang, Guangquan Xu, Hongpeng Bai, Qingyang Yan, Xi Zheng, Bin Wu

    Abstract: Black-box scanners have played a significant role in detecting vulnerabilities for web applications. A key focus in current black-box scanning is increasing test coverage (i.e., accessing more web pages). However, since many web applications are user-oriented, some deep pages can only be accessed through complex user interactions, which are difficult to reach by existing black-box scanners. To fil…

    Submitted 30 April, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

  50. arXiv:2504.18870  [pdf, other]

    cs.CV cs.RO eess.IV

    WLTCL: Wide Field-of-View 3-D LiDAR Truck Compartment Automatic Localization System

    Authors: Guodong Sun, Mingjing Li, Dingjie Liu, Mingxuan Liu, Bo Wu, Yang Zhang

    Abstract: As an essential component of logistics automation, the automated loading system is becoming a critical technology for enhancing operational efficiency and safety. Precise automatic positioning of the truck compartment, which serves as the loading area, is the primary step in automated loading. However, existing methods have difficulty adapting to truck compartments of various sizes, do not establi…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: To appear in IEEE TIM