Showing 1–50 of 13,649 results for author: Wang, Y

Searching in archive cs.
  1. arXiv:2507.17745  [pdf, ps, other]

    cs.CV cs.AI

    Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

    Authors: Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin

    Abstract: Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D g…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: Project Page: https://buaacyw.github.io/ultra3d/

  2. arXiv:2507.17735  [pdf, ps, other]

    eess.AS cs.SD

    Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data

    Authors: Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li

    Abstract: Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a f…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: Accepted to INTERSPEECH 2025

  3. arXiv:2507.17728  [pdf, ps, other]

    cs.CL

    Megrez2 Technical Report

    Authors: Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang

    Abstract: We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enablin…

    Submitted 23 July, 2025; originally announced July 2025.

  4. arXiv:2507.17659  [pdf, ps, other]

    cs.CV

    See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering

    Authors: Junjie Wang, Yunhan Tang, Yijie Wang, Zhihao Yuan, Huan Wang, Yangfan He, Bin Li

    Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This "seeing only the trees, but not the forest" approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos…

    Submitted 23 July, 2025; originally announced July 2025.

  5. arXiv:2507.17594  [pdf, ps, other]

    cs.CV

    RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction

    Authors: Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang, Renjiao Yi, Yijie Wang, Kai Xu

    Abstract: The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of n…

    Submitted 23 July, 2025; originally announced July 2025.

  6. arXiv:2507.17530  [pdf, ps, other]

    cs.LG cs.RO

    Generalized Advantage Estimation for Distributional Policy Gradients

    Authors: Shahil Shaik, Jonathon M. Smereka, Yue Wang

    Abstract: Generalized Advantage Estimation (GAE) has been used to mitigate the computational complexity of reinforcement learning (RL) by employing an exponentially weighted estimation of the advantage function to reduce the variance in policy gradient estimates. Despite its effectiveness, GAE is not designed to handle value distributions integral to distributional RL, which can capture the inherent stochas…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: 6 pages, 3 figures, published at ACC 2025 Conference

  7. arXiv:2507.17528  [pdf, ps, other]

    cs.LG

    Generalized Low-Rank Matrix Contextual Bandits with Graph Information

    Authors: Yao Wang, Jiannan Li, Yue Kang, Shanxing Gao, Zhenxin Xiao

    Abstract: The matrix contextual bandit (CB), as an extension of the well-known multi-armed bandit, is a powerful framework that has been widely applied in sequential decision-making scenarios involving low-rank structure. In many real-world scenarios, such as online advertising and recommender systems, additional graph information often exists beyond the low-rank structure, that is, the similar relationship…

    Submitted 23 July, 2025; originally announced July 2025.

  8. arXiv:2507.17527  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

    Authors: Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Ting Han, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu

    Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveI…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: Seed-LiveInterpret 2.0 Technical Report

  9. arXiv:2507.17311  [pdf, ps, other]

    cs.LG cs.AI physics.ao-ph

    EarthLink: Interpreting Climate Signals with Self-Evolving AI Agents

    Authors: Zijie Guo, Jiong Wang, Xiaoyu Yue, Wangxu Wei, Zhe Jiang, Wanghan Xu, Ben Fei, Wenlong Zhang, Xinyu Gu, Lijing Cheng, Jing-Jia Luo, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Fenghua Ling, Lei Bai

    Abstract: Modern Earth science is at an inflection point. The vast, fragmented, and complex nature of Earth system data, coupled with increasingly sophisticated analytical demands, creates a significant bottleneck for rapid scientific discovery. Here we introduce EarthLink, the first AI agent designed as an interactive copilot for Earth scientists. It automates the end-to-end research workflow, from plannin…

    Submitted 23 July, 2025; originally announced July 2025.

  10. arXiv:2507.17303  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model

    Authors: Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen

    Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in…

    Submitted 23 July, 2025; originally announced July 2025.

  11. arXiv:2507.17265  [pdf, ps, other]

    cs.GR cs.HC

    Visualization-Driven Illumination for Density Plots

    Authors: Xin Chen, Yunhai Wang, Huaiwei Bao, Kecheng Lu, Jaemin Jo, Chi-Wing Fu, Jean-Daniel Fekete

    Abstract: We present a novel visualization-driven illumination model for density plots, a new technique to enhance density plots by effectively revealing the detailed structures in high- and medium-density regions and outliers in low-density regions, while avoiding artifacts in the density field's colors. When visualizing large and dense discrete point samples, scatterplots and dot density maps often suffer…

    Submitted 23 July, 2025; originally announced July 2025.

  12. arXiv:2507.17242  [pdf]

    cs.HC eess.SP q-bio.NC

    High-Density EEG Enables the Fastest Visual Brain-Computer Interfaces

    Authors: Gege Ming, Weihua Pei, Sen Tian, Xiaogang Chen, Xiaorong Gao, Yijun Wang

    Abstract: Brain-computer interface (BCI) technology establishes a direct communication pathway between the brain and external devices. Current visual BCI systems suffer from insufficient information transfer rates (ITRs) for practical use. Spatial information, a critical component of visual perception, remains underexploited in existing systems because the limited spatial resolution of recording methods hin…

    Submitted 23 July, 2025; originally announced July 2025.

  13. arXiv:2507.17135  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    SADA: Stability-guided Adaptive Diffusion Acceleration

    Authors: Ting Jiang, Yixiao Wang, Hancheng Ye, Zishan Shao, Jingwei Sun, Jingyang Zhang, Zekai Chen, Jianyi Zhang, Yiran Chen, Hai Li

    Abstract: Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies that reduce per-step computation cost, while effectively reducing sampling time, demonstrate low faithfulness compared to the original baseline. We hypothesize that this f…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: Accepted and published by ICML 2025. Code is available at: https://github.com/Ting-Justin-Jiang/sada-icml

  14. Enhancing Transferability and Consistency in Cross-Domain Recommendations via Supervised Disentanglement

    Authors: Yuhan Wang, Qing Xie, Zhifeng Bao, Mengzi Tang, Lin Li, Yongjian Liu

    Abstract: Cross-domain recommendation (CDR) aims to alleviate the data sparsity by transferring knowledge across domains. Disentangled representation learning provides an effective solution to model complex user preferences by separating intra-domain features (domain-shared and domain-specific features), thereby enhancing robustness and interpretability. However, disentanglement-based CDR methods employing…

    Submitted 22 July, 2025; originally announced July 2025.

  15. arXiv:2507.16884  [pdf, ps, other]

    cs.LG cs.AI

    SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling

    Authors: Yi Guo, Wei Wang, Zhihang Yuan, Rong Cao, Kuan Chen, Zhengyang Chen, Yuanyuan Huo, Yang Zhang, Yuping Wang, Shouda Liu, Yuxuan Wang

    Abstract: Generative models like Flow Matching have achieved state-of-the-art performance but are often hindered by a computationally expensive iterative sampling process. To address this, recent work has focused on few-step or one-step generation by learning the average velocity field, which directly maps noise to data. MeanFlow, a leading method in this area, learns this field by enforcing a differential…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: Tech Report

  16. arXiv:2507.16869  [pdf, ps, other]

    cs.GR cs.CV

    Controllable Video Generation: A Survey

    Authors: Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, Qifeng Chen

    Abstract: With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, wh…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: project page: https://github.com/mayuelala/Awesome-Controllable-Video-Generation

  17. arXiv:2507.16815  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.RO

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang

    Abstract: Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: Project page: https://jasper0314-huang.github.io/thinkact-vla/

  18. arXiv:2507.16727   

    cs.AI

    Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

    Authors: Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang

    Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforce…

    Submitted 22 July, 2025; v1 submitted 22 July, 2025; originally announced July 2025.

    Comments: The inconsistency of base models undermines the fairness of evaluation comparisons and affects the validity of the paper's conclusions

  19. arXiv:2507.16632  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Step-Audio 2 Technical Report

    Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, et al. (84 additional authors not shown)

    Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech convers…

    Submitted 22 July, 2025; originally announced July 2025.

  20. arXiv:2507.16559  [pdf, ps, other]

    cs.CV

    Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge

    Authors: Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, et al. (36 additional authors not shown)

    Abstract: Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical con…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: A challenge report pre-print containing 36 pages, 15 figures, and 13 tables

  21. arXiv:2507.16391  [pdf, ps, other]

    cs.AR

    Ironman: Accelerating Oblivious Transfer Extension for Privacy-Preserving AI with Near-Memory Processing

    Authors: Chenqi Lin, Kang Yang, Tianshi Xu, Ling Liang, Yufei Wang, Zhaohui Chen, Runsheng Wang, Mingyu Gao, Meng Li

    Abstract: With the wide application of machine learning (ML), privacy concerns arise with user data as they may contain sensitive information. Privacy-preserving ML (PPML) based on cryptographic primitives has emerged as a promising solution in which an ML model is directly computed on the encrypted data to provide a formal privacy guarantee. However, PPML frameworks heavily rely on the oblivious transfer (…

    Submitted 23 July, 2025; v1 submitted 22 July, 2025; originally announced July 2025.

  22. arXiv:2507.16360  [pdf, ps, other]

    eess.IV cs.CV

    A High Magnifications Histopathology Image Dataset for Oral Squamous Cell Carcinoma Diagnosis and Prognosis

    Authors: Jinquan Guan, Junhong Guo, Qi Chen, Jian Chen, Yongkang Cai, Yilin He, Zhiquan Huang, Yan Wang, Yutong Xie

    Abstract: Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical assessments. However, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models.…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: 12 pages, 11 tables, 4 figures

  23. arXiv:2507.16347  [pdf, ps, other]

    cs.LG cs.AI

    Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks

    Authors: Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao

    Abstract: Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particul…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: 10 pages, 5 figures, accepted at IJCAI 2025

    ACM Class: I.2.6

  24. arXiv:2507.16306  [pdf, ps, other]

    cs.MA cs.RO

    COMPASS: Cooperative Multi-Agent Persistent Monitoring using Spatio-Temporal Attention Network

    Authors: Xingjian Zhang, Yizhuo Wang, Guillaume Sartoretti

    Abstract: Persistent monitoring of dynamic targets is essential in real-world applications such as disaster response, environmental sensing, and wildlife conservation, where mobile agents must continuously gather information under uncertainty. We propose COMPASS, a multi-agent reinforcement learning (MARL) framework that enables decentralized agents to persistently monitor multiple moving targets efficientl…

    Submitted 22 July, 2025; originally announced July 2025.

  25. arXiv:2507.16274  [pdf, ps, other]

    cs.LG cs.AI cs.DC cs.PF

    Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training

    Authors: Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang

    Abstract: The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowle…

    Submitted 22 July, 2025; originally announced July 2025.

  26. arXiv:2507.16251  [pdf, ps, other]

    cs.CV cs.AI

    HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

    Authors: Yu Wang, Bo Dang, Wanchun Li, Wei Chen, Yansheng Li

    Abstract: With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these, this paper introduces HoliTracer, the…

    Submitted 22 July, 2025; originally announced July 2025.

  27. arXiv:2507.16242  [pdf, ps, other]

    cs.DS cs.LG

    Toward a Lightweight and Robust Design for Caching

    Authors: Peng Chen, Hailiang Zhao, Jiaji Zhang, Xueyan Tang, Yixuan Wang, Shuiguang Deng

    Abstract: The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce significant computational overhead. In this paper, we introduce Guard, a lightweight r…

    Submitted 23 July, 2025; v1 submitted 22 July, 2025; originally announced July 2025.

    Comments: preprint

  28. arXiv:2507.16227  [pdf, ps, other]

    physics.plasm-ph cs.AI

    Predictive Hydrodynamic Simulations for Laser Direct-drive Implosion Experiments via Artificial Intelligence

    Authors: Zixu Wang, Yuhan Wang, Junfei Ma, Fuyuan Wu, Junchi Yan, Xiaohui Yuan, Zhe Zhang, Jie Zhang

    Abstract: This work presents predictive hydrodynamic simulations empowered by artificial intelligence (AI) for laser driven implosion experiments, taking the double-cone ignition (DCI) scheme as an example. A Transformer-based deep learning model MULTI-Net is established to predict implosion features according to laser waveforms and target radius. A Physics-Informed Decoder (PID) is proposed for high-dimens…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: 7 pages, 7 figures

  29. arXiv:2507.16178  [pdf, ps, other]

    cs.LG cs.AI

    LLM Data Selection and Utilization via Dynamic Bi-level Optimization

    Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao

    Abstract: While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data intera…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: The 42nd International Conference on Machine Learning (ICML 2025)

  30. arXiv:2507.15958  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices

    Authors: Haitian Wang, Xinyu Wang, Yiren Wang, Karen Lee, Zichen Geng, Xian Zhang, Kehkashan Kiran, Yu Zhang, Bo Miao

    Abstract: Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel at…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: This manuscript is under review for IEEE BIBM 2025

  31. arXiv:2507.15899  [pdf]

    stat.ML cs.LG

    Structural DID with ML: Theory, Simulation, and a Roadmap for Applied Research

    Authors: Yile Yu, Anzhi Xu, Yi Wang

    Abstract: Causal inference in observational panel data has become a central concern in economics, policy analysis, and the broader social sciences. To address the core contradiction where traditional difference-in-differences (DID) struggles with high-dimensional confounding variables in observational panel data, while machine learning (ML) lacks causal structure interpretability, this paper proposes an innovati…

    Submitted 20 July, 2025; originally announced July 2025.

    Comments: 45 pages, 29 figures

    MSC Class: 91-01

  32. arXiv:2507.15856  [pdf, ps, other]

    cs.CV

    Latent Denoising Makes Good Visual Tokenizers

    Authors: Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang

    Abstract: Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenize…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: Code is available at: https://github.com/Jiawei-Yang/DeTok

  33. arXiv:2507.15851  [pdf, ps, other]

    cs.AI

    The Other Mind: How Language Models Exhibit Human Temporal Cognition

    Authors: Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, Yingchun Wang

    Abstract: As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: 12 pages, 9 figures, 4 tables

  34. arXiv:2507.15620  [pdf, ps, other]

    cs.CG q-bio.QM

    TrajLens: Visual Analysis for Constructing Cell Developmental Trajectories in Cross-Sample Exploration

    Authors: Qipeng Wang, Shaolun Ruan, Rui Sheng, Yong Wang, Min Zhu, Huamin Qu

    Abstract: Constructing cell developmental trajectories is a critical task in single-cell RNA sequencing (scRNA-seq) analysis, enabling the inference of potential cellular progression paths. However, current automated methods are limited to establishing cell developmental trajectories within individual samples, necessitating biologists to manually link cells across samples to construct complete cross-sample…

    Submitted 21 July, 2025; originally announced July 2025.

  35. arXiv:2507.15607  [pdf, ps, other]

    cs.RO

    A Universal Vehicle-Trailer Navigation System with Neural Kinematics and Online Residual Learning

    Authors: Yanbo Chen, Yunzhe Tan, Yaojia Wang, Zhengzhe Xu, Junbo Tan, Xueqian Wang

    Abstract: Autonomous navigation of vehicle-trailer systems is crucial in environments like airports, supermarkets, and concert venues, where various types of trailers are needed to navigate with different payloads and conditions. However, accurately modeling such systems remains challenging, especially for trailers with castor wheels. In this work, we propose a novel universal vehicle-trailer navigation sys…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: 8 pages, 10 figures

  36. arXiv:2507.15597  [pdf, ps, other]

    cs.CV cs.LG cs.RO

    Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

    Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu

    Abstract: We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this d…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: 37 pages

  37. arXiv:2507.15496  [pdf, ps, other]

    cs.CV cs.LG cs.RO

    Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images

    Authors: JunYing Huang, Ao Xu, DongSun Yong, KeRen Li, YuanFeng Wang, Qi Qin

    Abstract: Odometry is a critical task for autonomous systems for self-localization and navigation. We propose a novel LiDAR-Visual odometry framework that integrates LiDAR point clouds and images for accurate and robust pose estimation. Our method utilizes a dense-depth map estimated from point clouds and images through depth completion, and incorporates a multi-scale feature extraction network with attenti…

    Submitted 21 July, 2025; originally announced July 2025.

  38. arXiv:2507.15244  [pdf, ps, other]

    cs.HC

    How Does Empirical Research Facilitate Creation Tool Design? A Data Video Perspective

    Authors: Leixian Shen, Leni Yang, Haotian Li, Yun Wang, Yuyu Luo, Huamin Qu

    Abstract: Empirical research in creative design deepens our theoretical understanding of design principles and perceptual effects, offering valuable guidance for innovating creation tools. However, how these empirical insights currently influence the development of creation tools, and how their integration can be enhanced in the future, remains insufficiently understood. In this paper, we aim to unveil the…

    Submitted 21 July, 2025; originally announced July 2025.

  39. arXiv:2507.15219  [pdf, ps, other]

    cs.CR cs.AI

    PromptArmor: Simple yet Effective Prompt Injection Defenses

    Authors: Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song

    Abstract: Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, Pr…

    Submitted 20 July, 2025; originally announced July 2025.

  40. arXiv:2507.14897  [pdf, ps, other]

    cs.AI

    AgentFly: Extensible and Scalable Reinforcement Learning for LM Agents

    Authors: Renxi Wang, Rifo Ahmad Genadi, Bilal El Bouardi, Yongxin Wang, Fajri Koto, Zhengzhong Liu, Timothy Baldwin, Haonan Li

    Abstract: Language model (LM) agents have gained significant attention for their ability to autonomously complete tasks through interactions with environments, tools, and APIs. LM agents are primarily built with prompt engineering or supervised finetuning. At the same time, reinforcement learning (RL) has been explored to enhance LM's capabilities, such as reasoning and factuality. However, the combination…

    Submitted 20 July, 2025; originally announced July 2025.

    ACM Class: I.2.5

  41. arXiv:2507.14849  [pdf, ps, other]

    cs.CL

    Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding

    Authors: Yifei Wang

    Abstract: Reasoning distillation has emerged as an effective approach to enhance the reasoning capabilities of smaller language models. However, the impact of large-scale reasoning distillation on other critical abilities, particularly in-context retrieval and reasoning, remains unexplored. This gap in understanding is particularly significant given the increasing importance of Retrieval-Augmented Generatio…

    Submitted 20 July, 2025; originally announced July 2025.

  42. arXiv:2507.14823  [pdf, ps, other]

    cs.CV

    FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models

    Authors: Dong Shu, Haoyang Yuan, Yuchen Wang, Yanguang Liu, Huopu Zhang, Haiyan Zhao, Mengnan Du

    Abstract: Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 20…

    Submitted 20 July, 2025; originally announced July 2025.

    Comments: 20 Pages, 18 Figures

  43. Understanding How Visually Impaired Players Socialize in Mobile Games

    Authors: Zihe Ran, Xiyu Li, Qing Xiao, Yanyun Wang, Franklin Mingzhe Li, Zhicong Lu

    Abstract: Mobile games are becoming a vital medium for social interaction, offering a platform that transcends geographical boundaries. An increasing number of visually impaired individuals are engaging in mobile gaming to connect, collaborate, compete, and build friendships. In China, visually impaired communities face significant social challenges in offline settings, making mobile games a crucial avenue…

    Submitted 20 July, 2025; originally announced July 2025.

    Comments: 16 pages, 1 table, accepted by ASSETS25

  44. arXiv:2507.14784  [pdf, ps, other]

    cs.CV cs.AI

    LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

    Authors: Xinxin Dong, Baoyun Peng, Haokai Ma, Yufei Wang, Zixuan Dong, Fei Hu, Xiaodong Wang

    Abstract: Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately pro…

    Submitted 19 July, 2025; originally announced July 2025.

  45. arXiv:2507.14686  [pdf, ps, other]

    cs.CV

    From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition

    Authors: Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew

    Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to…

    Submitted 19 July, 2025; originally announced July 2025.

  46. arXiv:2507.14633  [pdf, ps, other]

    cs.NI cs.LG

    Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks: A Survey on Generative Approaches

    Authors: Xiaozheng Gao, Yichen Wang, Bosen Liu, Xiao Zhou, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour, Chau Yuen, Jianping An, Kai Yang

    Abstract: The development of satellite-augmented low-altitude economy and terrestrial networks (SLAETNs) demands intelligent and autonomous systems that can operate reliably across heterogeneous, dynamic, and mission-critical environments. To address these challenges, this survey focuses on enabling agentic artificial intelligence (AI), that is, artificial agents capable of perceiving, reasoning, and acting…

    Submitted 19 July, 2025; originally announced July 2025.

  47. arXiv:2507.14482  [pdf, ps, other]

    cs.HC

    Conch: Competitive Debate Analysis via Visualizing Clash Points and Hierarchical Strategies

    Authors: Qianhe Chen, Yong Wang, Yixin Yu, Xiyuan Zhu, Xuerou Yu, Ran Wang

    Abstract: In-depth analysis of competitive debates is essential for participants to develop argumentative skills and refine strategies, and further improve their debating performance. However, manual analysis of unstructured and unlabeled textual records of debating is time-consuming and ineffective, as it is challenging to reconstruct contextual semantics and track logical connections from raw data. To add…

    Submitted 19 July, 2025; originally announced July 2025.

  48. arXiv:2507.14447  [pdf, ps, other]

    cs.AI cs.CL

    Routine: A Structural Planning Framework for LLM Agent System in Enterprise

    Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

    Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter…

    Submitted 22 July, 2025; v1 submitted 18 July, 2025; originally announced July 2025.

    Comments: 26 pages, 8 figures, 5 tables

  49. arXiv:2507.13676  [pdf, ps, other]

    cs.NI eess.SP

    CARTS: Cooperative and Adaptive Resource Triggering and Stitching for 5G ISAC

    Authors: Cheng Jiang, Yihe Yan, Yanxiang Wang, Jiawei Hu, Chun Tung Chou, Wen Hu

    Abstract: This paper presents CARTS, an adaptive 5G uplink sensing scheme designed to provide Integrated Sensing and Communication (ISAC) services. The performance of both communication and sensing fundamentally depends on the availability of accurate and up-to-date channel state information (CSI). In modern 5G networks, uplink CSI is derived from two reference signals: the demodulation reference signal (DM…

    Submitted 18 July, 2025; originally announced July 2025.

  50. arXiv:2507.13659  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.NE

    When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework

    Authors: Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo

    Abstract: Researchers have recently proposed using event cameras for person re-identification (ReID) due to their promising performance and better balance in terms of privacy protection, and event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although si…

    Submitted 18 July, 2025; originally announced July 2025.