
Showing 1–50 of 1,346 results for author: Xu, S

Searching in archive cs.
  1. arXiv:2507.15700  [pdf, ps, other]

    cs.IT

    Estimating Rate-Distortion Functions Using the Energy-Based Model

    Authors: Shitong Wu, Sicheng Xu, Lingyi Chen, Huihui Wu, Wenyi Zhang

    Abstract: The rate-distortion (RD) theory is one of the key concepts in information theory, providing theoretical limits for compression performance and guiding the source coding design, with both theoretical and practical significance. The Blahut-Arimoto (BA) algorithm, as a classical algorithm to compute RD functions, encounters computational challenges when applied to high-dimensional scenarios. In recen…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: 6 pages, 6 figures

  2. arXiv:2507.14914  [pdf, ps, other]

    cs.RO cs.AI

    One Step Beyond: Feedthrough & Placement-Aware Rectilinear Floorplanner

    Authors: Zhexuan Xu, Jie Wang, Siyuan Xu, Zijie Geng, Mingxuan Yuan, Feng Wu

    Abstract: Floorplanning determines the shapes and locations of modules on a chip canvas and plays a critical role in optimizing the chip's Power, Performance, and Area (PPA) metrics. However, existing floorplanning approaches often fail to integrate with subsequent physical design stages, leading to suboptimal in-module component placement and excessive inter-module feedthrough. To tackle this challenge, we…

    Submitted 20 July, 2025; originally announced July 2025.

  3. arXiv:2507.14683  [pdf, ps, other]

    cs.CL

    MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

    Authors: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing

    Abstract: Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate im…

    Submitted 19 July, 2025; originally announced July 2025.

    Comments: Technical report

  4. arXiv:2507.14190  [pdf, ps, other]

    eess.SP cs.RO

    Traffic Signal Phase and Timing Estimation with Large-Scale Floating Car Data

    Authors: Mingcheng Liao, Zebang Feng, Miao Fan, Shengtong Xu, Haoyi Xiong

    Abstract: Effective modern transportation systems depend critically on accurate Signal Phase and Timing (SPaT) estimation. However, acquiring ground-truth SPaT information faces significant hurdles due to communication challenges with transportation departments and signal installers. As a result, Floating Car Data (FCD) has become the primary source for large-scale SPaT analyses. Current FCD approaches ofte…

    Submitted 13 July, 2025; originally announced July 2025.

    Comments: Accepted by ITSC'25

  5. arXiv:2507.14088  [pdf, ps, other]

    cs.LG

    DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

    Authors: Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu

    Abstract: Real-time human-artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large language model (LLM) agents often fail to accurately model the complex human mental characteristics such as domain intentions, especially in the absence of direct communication. To address this limitat…

    Submitted 18 July, 2025; originally announced July 2025.

    Journal ref: cogsci-2025

  6. arXiv:2507.13575  [pdf, ps, other]

    cs.LG cs.AI

    Apple Intelligence Foundation Language Models: Tech Report 2025

    Authors: Hanzhi Zhou, Erik Hornberger, Pengsheng Guo, Xiyou Zhou, Saiwen Wang, Xin Wang, Yifei He, Xuankai Chang, Rene Rauch, Louis D'hauwe, John Peebles, Alec Doane, Kohen Chia, Jenna Thibodeau, Zi-Yi Dou, Yuanyang Zhang, Ruoming Pang, Reed Li, Zhifeng Chen, Jeremy Warner, Zhaoyang Xu, Sophy Lee, David Mizrahi, Ramsey Tantawi, Chris Chaney, et al. (370 additional authors not shown)

    Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transform…

    Submitted 17 July, 2025; originally announced July 2025.

  7. arXiv:2507.12938  [pdf, ps, other]

    eess.IV cs.CV

    Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion

    Authors: Caixia Dong, Duwei Dai, Xinyi Han, Fan Liu, Xu Yang, Zongfang Li, Songhua Xu

    Abstract: Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Sp…

    Submitted 17 July, 2025; originally announced July 2025.

    Journal ref: MICCAI2025

  8. arXiv:2507.12930  [pdf, ps, other]

    cs.CL cs.AI

    Making Language Model a Hierarchical Classifier and Generator

    Authors: Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji

    Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder…

    Submitted 17 July, 2025; originally announced July 2025.

  9. arXiv:2507.12026  [pdf, ps, other]

    cs.CV

    3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

    Authors: Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu

    Abstract: With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to p…

    Submitted 16 July, 2025; originally announced July 2025.

    Comments: Accepted by IROS 2025

  10. arXiv:2507.11949  [pdf, ps, other]

    cs.GR cs.CV cs.RO

    MOSPA: Human Motion Generation Driven by Spatial Audio

    Authors: Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura

    Abstract: Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As…

    Submitted 16 July, 2025; originally announced July 2025.

  11. arXiv:2507.11847  [pdf, ps, other]

    cs.LG stat.ML

    Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update

    Authors: Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, Masashi Sugiyama

    Abstract: We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both…

    Submitted 15 July, 2025; originally announced July 2025.

  12. Space Cybersecurity Testbed: Fidelity Framework, Example Implementation, and Characterization

    Authors: Jose Luis Castanon Remy, Caleb Chang, Ekzhin Ear, Shouhuai Xu

    Abstract: Cyber threats against space infrastructures, including satellites and systems on the ground, have not been adequately understood. Testbeds are important to deepen our understanding and validate space cybersecurity studies. The state of the art is that there are very few studies on building testbeds, and there are few characterizations of testbeds. In this paper, we propose a framework for characte…

    Submitted 15 July, 2025; originally announced July 2025.

    Journal ref: Workshop on Security of Space and Satellite Systems (SpaceSec) 2025, 24 February 2025, San Diego, CA, USA

  13. arXiv:2507.10626  [pdf, ps, other]

    cs.LG cs.AI

    Player-Team Heterogeneous Interaction Graph Transformer for Soccer Outcome Prediction

    Authors: Lintao Wang, Shiwen Xu, Michael Horton, Joachim Gudmundsson, Zhiyong Wang

    Abstract: Predicting soccer match outcomes is a challenging task due to the inherently unpredictable nature of the game and the numerous dynamic factors influencing results. While it conventionally relies on meticulous feature engineering, deep learning techniques have recently shown great promise in learning effective player and team representations directly for soccer outcome prediction. However, existi…

    Submitted 14 July, 2025; originally announced July 2025.

  14. Branch Explorer: Leveraging Branching Narratives to Support Interactive 360° Video Viewing for Blind and Low Vision Users

    Authors: Shuchang Xu, Xiaofu Jin, Wenshuo Zhang, Huamin Qu, Yukang Yan

    Abstract: 360° videos enable users to freely choose their viewing paths, but blind and low vision (BLV) users are often excluded from this interactive experience. To bridge this gap, we present Branch Explorer, a system that transforms 360° videos into branching narratives -- stories that dynamically unfold based on viewer choices -- to support interactive viewing for BLV audiences. Our formative study iden…

    Submitted 14 July, 2025; originally announced July 2025.

  15. arXiv:2507.09214  [pdf, ps, other]

    cs.CV

    Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline

    Authors: Shiyi Mu, Zichong Gu, Hanqi Lyu, Yilin Gao, Shugong Xu

    Abstract: 3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability…

    Submitted 12 July, 2025; originally announced July 2025.

    Comments: under review

  16. arXiv:2507.08903  [pdf, ps, other]

    cs.RO cs.CV

    Multimodal HD Mapping for Intersections by Intelligent Roadside Units

    Authors: Zhongzhang Chen, Miao Fan, Shengtong Xu, Mengmeng Yang, Kun Jiang, Xiangzeng Liu, Haoyi Xiong

    Abstract: High-definition (HD) semantic mapping of complex intersections poses significant challenges for traditional vehicle-based approaches due to occlusions and limited perspectives. This paper introduces a novel camera-LiDAR fusion framework that leverages elevated intelligent roadside units (IRUs). Additionally, we present RS-seq, a comprehensive dataset developed through the systematic enhancement an…

    Submitted 11 July, 2025; originally announced July 2025.

    Comments: Accepted by ITSC'25

  17. arXiv:2507.08901  [pdf, ps, other]

    cs.RO

    End-to-End Generation of City-Scale Vectorized Maps by Crowdsourced Vehicles

    Authors: Zebang Feng, Miao Fan, Bao Liu, Shengtong Xu, Haoyi Xiong

    Abstract: High-precision vectorized maps are indispensable for autonomous driving, yet traditional LiDAR-based creation is costly and slow, while single-vehicle perception methods lack accuracy and robustness, particularly in adverse conditions. This paper introduces EGC-VMAP, an end-to-end framework that overcomes these limitations by generating accurate, city-scale vectorized maps through the aggregation…

    Submitted 11 July, 2025; originally announced July 2025.

    Comments: Accepted by ITSC'25

  18. arXiv:2507.08716  [pdf, ps, other]

    cs.CV

    Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine

    Authors: Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu

    Abstract: Scaling laws have achieved success in LLMs and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Ra…

    Submitted 22 July, 2025; v1 submitted 11 July, 2025; originally announced July 2025.

  19. arXiv:2507.06465  [pdf, ps, other]

    cs.SI physics.soc-ph

    Temporal Motif Participation Profiles for Analyzing Node Similarity in Temporal Networks

    Authors: Maxwell C. Lee, Kevin S. Xu

    Abstract: Temporal networks consisting of timestamped interactions between a set of nodes provide a useful representation for analyzing complex networked systems that evolve over time. Beyond pairwise interactions between nodes, temporal motifs capture patterns of higher-order interactions such as directed triangles over short time periods. We propose temporal motif participation profiles (TMPPs) to capture…

    Submitted 8 July, 2025; originally announced July 2025.

  20. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  21. arXiv:2507.05201  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    MedGemma Technical Report

    Authors: Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, et al. (56 additional authors not shown)

    Abstract: Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce Me…

    Submitted 12 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

  22. arXiv:2507.04681  [pdf, ps, other]

    cs.CV

    Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge

    Authors: Alper Bahcekapili, Duygu Arslan, Umut Ozdemir, Berkay Ozkirli, Emre Akbas, Ahmet Acar, Gozde B. Akar, Bingdou He, Shuoyu Xu, Umit Mert Caglar, Alptekin Temizel, Guillaume Picaud, Marc Chaumont, Gérard Subsol, Luc Téot, Fahad Alsharekh, Shahad Alghannam, Hexiang Mao, Wenhua Zhang

    Abstract: Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the…

    Submitted 12 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: Accepted Grand Challenge Paper ICIP 2025

  23. arXiv:2507.04049  [pdf, ps, other]

    cs.CV cs.RO

    Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

    Authors: Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

    Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories.…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: 16 pages, 6 figures

  24. arXiv:2507.03908  [pdf, ps, other]

    cs.CV

    Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

    Authors: Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun

    Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively…

    Submitted 5 July, 2025; originally announced July 2025.

  25. arXiv:2507.03585  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

    Authors: Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu

    Abstract: The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Lang…

    Submitted 4 July, 2025; originally announced July 2025.

  26. arXiv:2507.02899  [pdf, ps, other]

    cs.CV

    Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras

    Authors: Quanxin Zheng, Miao Fan, Shengtong Xu, Linghe Kong, Haoyi Xiong

    Abstract: Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at co…

    Submitted 11 July, 2025; v1 submitted 23 June, 2025; originally announced July 2025.

    Comments: Accepted by IROS'25

  27. arXiv:2507.02546  [pdf, ps, other]

    cs.CV

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Authors: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang

    Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative g…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Project page: https://wangrc.site/MoGe2Page/

  28. arXiv:2507.01735  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

    Authors: Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung

    Abstract: In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from both academia and industry were invited to share their latest progress and opinions. We collect research papers and…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ECCV 2024. Workshop page: https://coda-dataset.github.io/w-coda2024/

  29. arXiv:2507.01613  [pdf, ps, other]

    stat.ML cs.LG

    When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery

    Authors: Shirong Xu, Jingnan Zhang, Junhui Wang

    Abstract: Paired comparison data, where users evaluate items in pairs, play a central role in ranking and preference learning tasks. While ordinal comparison data intuitively offer richer information than binary comparisons, this paper challenges that conventional wisdom. We propose a general parametric framework for modeling ordinal paired comparisons without ties. The model adopts a generalized additive s…

    Submitted 2 July, 2025; originally announced July 2025.

  30. arXiv:2507.01006  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Authors: GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

    Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the fi…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  31. arXiv:2506.24102  [pdf, ps, other]

    cs.CV

    DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

    Authors: Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi

    Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill t…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Datasets and Models: https://github.com/lxtGH/DenseWorld-1M

  32. arXiv:2506.23351  [pdf, ps, other]

    cs.RO cs.AI cs.LG cs.MA

    Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

    Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, et al. (74 additional authors not shown)

    Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad…

    Submitted 2 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

    Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

  33. arXiv:2506.21945  [pdf]

    cs.CV cs.AI

    SDRNET: Stacked Deep Residual Network for Accurate Semantic Segmentation of Fine-Resolution Remotely Sensed Images

    Authors: Naftaly Wambugu, Ruisheng Wang, Bo Guo, Tianshu Yu, Sheng Xu, Mohammed Elhassan

    Abstract: Land cover maps generated from semantic segmentation of high-resolution remotely sensed images have drawn much attention in the photogrammetry and remote sensing research community. Currently, massive numbers of fine-resolution remotely sensed (FRRS) images acquired with improved sensing and imaging technologies have become available. However, accurate semantic segmentation of such FRRS images is greatly affected by substa…

    Submitted 27 June, 2025; originally announced June 2025.

  34. arXiv:2506.21077  [pdf, ps, other]

    cs.RO

    CURL-SLAM: Continuous and Compact LiDAR Mapping

    Authors: Kaicheng Zhang, Shida Xu, Yining Ding, Xianwen Kong, Sen Wang

    Abstract: This paper studies 3D LiDAR mapping with a focus on developing an updatable and localizable map representation that enables continuity, compactness and consistency in 3D maps. Traditional LiDAR Simultaneous Localization and Mapping (SLAM) systems often rely on 3D point cloud maps, which typically require extensive storage to preserve structural details in large-scale environments. In this paper, w…

    Submitted 26 June, 2025; originally announced June 2025.

  35. Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search

    Authors: Zhigong Zhou, Ning Ding, Xiaochuan Fan, Yue Shang, Yiming Qiu, Jingwei Zhuo, Zhiwei Ge, Songlin Wang, Lin Liu, Sulong Xu, Han Zhang

    Abstract: Semantic retrieval, which retrieves semantically matched items given a textual query, has been an essential component to enhance system effectiveness in e-commerce search. In this paper, we study the multimodal retrieval problem, where the visual information (e.g., image) of an item is leveraged as a supplement to textual information to enrich item representation and further improve retrieval perform…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Published in SIGIR 2023

  36. arXiv:2506.19935  [pdf, ps, other]

    cs.LG cs.CV stat.ML

    Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

    Authors: Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma

    Abstract: Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultane…

    Submitted 24 June, 2025; originally announced June 2025.

  37. arXiv:2506.19883  [pdf, ps, other]

    cs.LG cs.AI

    STIMULUS: Achieving Fast Convergence and Low Sample Complexity in Stochastic Multi-Objective Learning

    Authors: Zhuqing Liu, Chaosheng Dong, Michinari Momma, Simone Shao, Shaoyuan Xu, Yan Gao, Haibo Yang, Jia Liu

    Abstract: Recently, multi-objective optimization (MOO) has gained attention for its broad applications in ML, operations research, and engineering. However, MOO algorithm design remains in its infancy and many existing MOO methods suffer from unsatisfactory convergence rate and sample complexity performance. To address this challenge, in this paper, we propose an algorithm called STIMULUS (stochastic path-i…

    Submitted 23 June, 2025; originally announced June 2025.

  38. arXiv:2506.19741  [pdf, ps, other]

    cs.LG cs.CV stat.ML

    Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls

    Authors: Yihong Luo, Shuchen Xue, Tianyang Hu, Jing Tang

    Abstract: The pursuit of efficient and controllable high-quality content generation remains a central challenge in artificial intelligence-generated content (AIGC). While one-step generators, enabled by diffusion distillation techniques, offer excellent generation quality and computational efficiency, adapting them to new control conditions--such as structural constraints, semantic guidelines, or external i…

    Submitted 24 June, 2025; originally announced June 2025.

  39. arXiv:2506.19660  [pdf, ps, other]

    cs.DC

    PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

    Authors: Shuhang Xu, Yunfei Gu, Linhui Liu, Chentao Wu

    Abstract: As flash-based Solid State Drive (SSD) arrays become essential to modern data centers, scaling these arrays to meet explosive data growth is a frequent and critical operation. However, the conventional wear-leveling (WL) paradigm applied during scaling suffers from a fundamental flaw: it ignores the non-linear relationship between wear and failure probability, potentially pushing the most vulnerab…

    Submitted 3 July, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

  40. arXiv:2506.18890  [pdf, ps, other]

    cs.CV

    4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

    Authors: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan

    Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimizati…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project page: https://4dlrm.github.io/

  41. arXiv:2506.18882  [pdf, ps, other]

    cs.CV

    Light of Normals: Unified Feature Representation for Universal Photometric Stereo

    Authors: Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

    Abstract: Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it diff…

    Submitted 24 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: Home: https://houyuanchen111.github.io/lino.github.io Github: https://github.com/houyuanchen111/LINO_UniPS HuggingFace Demo: https://huggingface.co/spaces/houyuanchen/lino

  42. arXiv:2506.18866  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

    Authors: Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi

    Abstract: Significant progress has been made in audio-driven human animation, but most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video g…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project page: https://omni-avatar.github.io/

  43. arXiv:2506.18737  [pdf, ps, other]

    cs.CV cs.RO

    USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways

    Authors: Shanliang Yao, Runwei Guan, Yi Ni, Sen Xu, Yong Yue, Xiaohui Zhu, Ryan Wen Liu

    Abstract: Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensor…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Accepted by IROS

  44. arXiv:2506.18560  [pdf, ps, other]

    cs.ET cs.LG

    Efficient Beam Selection for ISAC in Cell-Free Massive MIMO via Digital Twin-Assisted Deep Reinforcement Learning

    Authors: Jiexin Zhang, Shu Xu, Chunguo Li, Yongming Huang, Luxi Yang

    Abstract: Beamforming enhances signal strength and quality by focusing energy in specific directions. This capability is particularly crucial in cell-free integrated sensing and communication (ISAC) systems, where multiple distributed access points (APs) collaborate to provide both communication and sensing services. In this work, we first derive the distribution of joint target detection probabilities acro…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE Transactions on Wireless Communications

  45. arXiv:2506.18096  [pdf, ps, other]

    cs.AI

    Deep Research Agents: A Systematic Examination And Roadmap

    Authors: Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang

    Abstract: The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of struct…

    Submitted 22 June, 2025; originally announced June 2025.

  46. arXiv:2506.15492  [pdf, ps, other]

    cs.LG stat.ML

    LIT-LVM: Structured Regularization for Interaction Terms in Linear Predictors using Latent Variable Models

    Authors: Mohammadreza Nemati, Zhipeng Huang, Kevin S. Xu

    Abstract: Some of the simplest, yet most frequently used predictors in statistics and machine learning use weighted linear combinations of features. Such linear predictors can model non-linear relationships between features by adding interaction terms corresponding to the products of all pairs of features. We consider the problem of accurately estimating coefficients for interaction terms in linear predicto…

    Submitted 18 June, 2025; originally announced June 2025.

  47. arXiv:2506.14121  [pdf, ps, other]

    cs.CV

    FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

    Authors: Siyu Xu, Wenjie Li, Guangwei Gao, Jian Yang, Guo-Jun Qi, Chia-Wen Lin

    Abstract: Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNNs are relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-freque…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 12 pages, 11 figures, 6 tables

  48. arXiv:2506.13485  [pdf, ps, other]

    q-bio.BM cs.LG

    Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing

    Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun

    Abstract: Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through…

    Submitted 16 June, 2025; originally announced June 2025.

  49. arXiv:2506.12713  [pdf, ps, other]

    cs.SE cs.CL

    Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?

    Authors: Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, Ruiming Tang

    Abstract: Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPs and LiveCodeBench) contain questions with medium-level difficulty and pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising 235 of the most challenging problems from the International Col…

    Submitted 15 June, 2025; originally announced June 2025.

  50. arXiv:2506.11697  [pdf, ps, other]

    cs.SE

    SoK: Automated Vulnerability Repair: Methods, Tools, and Assessments

    Authors: Yiwei Hu, Zhen Li, Kedie Shu, Shenghua Guan, Deqing Zou, Shouhuai Xu, Bin Yuan, Hai Jin

    Abstract: The increasing complexity of software has led to the steady growth of vulnerabilities. Vulnerability repair investigates how to fix software vulnerabilities. Manual vulnerability repair is labor-intensive and time-consuming because it relies on human experts, highlighting the importance of Automated Vulnerability Repair (AVR). In this SoK, we present the systematization of AVR methods through the…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: The full version of "SoK: Automated Vulnerability Repair: Methods, Tools, and Assessments" accepted by the 34th USENIX Security Symposium (USENIX Security 2025)