Showing 1–50 of 383 results for author: Han, W

Searching in archive cs.
  1. Asymmetric Lesion Detection with Geometric Patterns and CNN-SVM Classification

    Authors: M. A. Rasel, Sameem Abdul Kareem, Zhenli Kwan, Nik Aimee Azizah Faheem, Winn Hui Han, Rebecca Kai Jan Choong, Shin Shen Yong, Unaizah Obaidellah

    Abstract: In dermoscopic images, which allow visualization of surface skin structures not visible to the naked eye, lesion shape offers vital insights into skin diseases. In clinically practiced methods, asymmetric lesion shape is one of the criteria for diagnosing melanoma. Initially, we labeled data for a non-annotated dataset with symmetrical information based on clinical assessments. Subsequently, we pr…

    Submitted 23 July, 2025; originally announced July 2025.

    Comments: Accepted version. Published in Computers in Biology and Medicine, Volume 179, 2024. DOI: 10.1016/j.compbiomed.2024.108851

    Journal ref: Computers in Biology and Medicine, Volume 179, 2024, Article 108851

  2. arXiv:2507.14447  [pdf, ps, other]

    cs.AI cs.CL

    Routine: A Structural Planning Framework for LLM Agent System in Enterprise

    Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

    Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter…

    Submitted 22 July, 2025; v1 submitted 18 July, 2025; originally announced July 2025.

    Comments: 26 pages, 8 figures, 5 tables

  3. arXiv:2507.10293  [pdf, ps, other]

    cs.CV

    Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration

    Authors: Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, Jingyuan Chen

    Abstract: Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a…

    Submitted 14 July, 2025; originally announced July 2025.

    Comments: Accepted by MM 2025

  4. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  5. arXiv:2507.03302  [pdf, ps, other]

    cs.CV cs.AI

    Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model

    Authors: Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han

    Abstract: In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale dataset…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: 19 pages, 8 figures

  6. arXiv:2507.02395  [pdf, ps, other]

    cs.CV

    Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

    Authors: Byung Hyun Lee, Wongi Jeong, Woojae Han, Kyoungbun Lee, Se Young Chun

    Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for…

    Submitted 8 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted at ICCV 2025

  7. arXiv:2507.01785  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

    Authors: Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang

    Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-qualit…

    Submitted 2 July, 2025; originally announced July 2025.

  8. arXiv:2506.19468  [pdf, ps, other]

    cs.CL cs.AI

    MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

    Authors: Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, Yin Zheng

    Abstract: Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 language…

    Submitted 24 June, 2025; originally announced June 2025.

  9. arXiv:2506.17680  [pdf, ps, other]

    cs.LG cond-mat.mtrl-sci cs.AI

    Enhancing Stress-Strain Predictions with Seq2Seq and Cross-Attention based on Small Punch Test

    Authors: Zhengni Yang, Rui Yang, Weijian Han, Qixin Liu

    Abstract: This paper introduces a novel deep-learning approach to predict true stress-strain curves of high-strength steels from small punch test (SPT) load-displacement data. The proposed approach uses Gramian Angular Field (GAF) to transform load-displacement sequences into images, capturing spatial-temporal features and employs a Sequence-to-Sequence (Seq2Seq) model with an LSTM-based encoder-decoder arc…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: accepted by IJCNN2025

  10. arXiv:2506.16741  [pdf, ps, other]

    eess.AS cs.AI

    RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

    Authors: Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song

    Abstract: We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, Ra…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025

  11. arXiv:2506.16073  [pdf, ps, other]

    cs.CV

    TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

    Authors: Byung Hoon Lee, Wooseok Shin, Sung Won Han

    Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 15 pages, 6 figures

    ACM Class: I.4.8; I.5.4; I.2.10

  12. arXiv:2506.15835  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    MoNetV2: Enhanced Motion Network for Freehand 3D Ultrasound Reconstruction

    Authors: Mingyuan Luo, Xin Yang, Zhongnuo Yan, Yan Cao, Yuanji Zhang, Xindi Hu, Jin Wang, Haoxuan Ding, Wei Han, Litao Sun, Dong Ni

    Abstract: Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties…

    Submitted 16 June, 2025; originally announced June 2025.

  13. Bounded Memory in Distributed Networks

    Authors: Ran Ben Basat, Keren Censor-Hillel, Yi-Jun Chang, Wenchen Han, Dean Leitersdorf, Gregory Schwartzman

    Abstract: The recent advent of programmable switches makes distributed algorithms readily deployable in real-world datacenter networks. However, there are still gaps between theory and practice that prevent the smooth adaptation of CONGEST algorithms to these environments. In this paper, we focus on the memory restrictions that arise in real-world deployments. We introduce the $μ$-CONGEST model where on top…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Accepted at The 37th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '25). 22 pages

  14. arXiv:2506.11127  [pdf, ps, other]

    cs.CL cs.AI

    GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions

    Authors: Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

    Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screensho…

    Submitted 10 June, 2025; originally announced June 2025.

  15. arXiv:2506.10778  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    SlotPi: Physics-informed Object-centric Reasoning Models

    Authors: Jian Li, Wan Han, Ning Lin, Yu-Liang Zhan, Ruizhi Chengze, Haining Wang, Yi Zhang, Hongsheng Liu, Zidong Wang, Fan Yu, Hao Sun

    Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical in…

    Submitted 12 June, 2025; originally announced June 2025.

  16. arXiv:2506.01072  [pdf, ps, other]

    cs.CR

    IDCloak: A Practical Secure Multi-party Dataset Join Framework for Vertical Privacy-preserving Machine Learning

    Authors: Shuyu Chen, Guopeng Lin, Haoyu Niu, Lushan Song, Chengxun Hong, Weili Han

    Abstract: Vertical privacy-preserving machine learning (vPPML) enables multiple parties to train models on their vertically distributed datasets while keeping datasets private. In vPPML, it is critical to perform the secure dataset join, which aligns features corresponding to intersection IDs across datasets and forms a secret-shared and joint training dataset. However, existing methods for this step could…

    Submitted 1 June, 2025; originally announced June 2025.

  17. arXiv:2505.22087  [pdf, ps, other]

    cs.AI

    Cognitively-Inspired Emergent Communication via Knowledge Graphs for Assisting the Visually Impaired

    Authors: Ruxiao Chen, Dezheng Han, Wenjie Han, Shuaishuai Guo

    Abstract: Assistive systems for visually impaired individuals must deliver rapid, interpretable, and adaptive feedback to facilitate real-time navigation. Current approaches face a trade-off between latency and semantic richness: natural language-based systems provide detailed guidance but are too slow for dynamic scenarios, while emergent communication frameworks offer low-latency symbolic languages but la…

    Submitted 28 May, 2025; originally announced May 2025.

  18. arXiv:2505.19955  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

    Authors: Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi

    Abstract: Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge…

    Submitted 1 July, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 42 pages, 9 figures

  19. arXiv:2505.19939  [pdf, ps, other]

    cs.RO

    Uncertainty-Aware Safety-Critical Decision and Control for Autonomous Vehicles at Unsignalized Intersections

    Authors: Ran Yu, Zhuoren Li, Lu Xiong, Wei Han, Bo Leng

    Abstract: Reinforcement learning (RL) has demonstrated potential in autonomous driving (AD) decision tasks. However, applying RL to urban AD, particularly in intersection scenarios, still faces significant challenges. The lack of safety constraints makes RL vulnerable to risks. Additionally, cognitive limitations and environmental randomness can lead to unreliable decisions in safety-critical scenarios. The…

    Submitted 14 July, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 7 pages, 4 figures

  20. arXiv:2505.18917  [pdf, ps, other]

    cs.LG cs.AI

    Behavior Injection: Preparing Language Models for Reinforcement Learning

    Authors: Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao

    Abstract: Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key condition…

    Submitted 24 May, 2025; originally announced May 2025.

  21. arXiv:2505.18847  [pdf, other]

    cs.AI cs.CL

    Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework

    Authors: William Han, Chaojing Duan, Zhepeng Cen, Yihang Yao, Xiaoyu Song, Atharva Mhaskar, Dylan Leong, Michael A. Rosenberg, Emerson Liu, Ding Zhao

    Abstract: Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, an…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 29 pages, 2 figures, 8 tables

  22. arXiv:2505.17540  [pdf, ps, other]

    cs.CV cs.AI

    RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

    Authors: Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji

    Abstract: Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. I…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Code is available at: https://github.com/microsoft/DKI_LLM/tree/main/RePrompt

  23. arXiv:2505.16763  [pdf, ps, other]

    cs.CV

    Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

    Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen

    Abstract: Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases…

    Submitted 22 May, 2025; originally announced May 2025.

  24. arXiv:2505.16735  [pdf, ps, other]

    eess.AS cs.AI

    Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting

    Authors: Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho

    Abstract: For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modali…

    Submitted 22 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, Accepted at Interspeech 2025

  25. arXiv:2505.15431  [pdf, ps, other]

    cs.CL

    Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

    Authors: Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, et al. (230 additional authors not shown)

    Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid response…

    Submitted 4 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  26. arXiv:2505.13489  [pdf, ps, other]

    cs.AI cs.CL

    Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer

    Authors: Wenkang Han, Wang Lin, Liya Hu, Zhenlong Dai, Yiyun Zhou, Mengze Li, Zemin Liu, Chang Yao, Jingyuan Chen

    Abstract: Knowledge tracing (KT) aims to predict learners' future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners' knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI 2025

  27. arXiv:2505.13441  [pdf, ps, other]

    cs.RO

    GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

    Authors: Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, Rose Hendrix

    Abstract: We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplis…

    Submitted 22 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  28. arXiv:2505.12233  [pdf, ps, other]

    eess.IV cs.CV

    PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning

    Authors: Yeonkyung Lee, Woojung Han, Youngjun Jun, Hyeonmin Kim, Jungkyung Cho, Seong Jae Hwang

    Abstract: Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely availab…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: MICCAI2025 early accept

  29. arXiv:2505.08292  [pdf, ps, other]

    cs.CR

    On the Account Security Risks Posed by Password Strength Meters

    Authors: Ming Xu, Weili Han, Jitao Yu, Jing Liu, Xinyi Zhang, Yun Lin, Jin Song Dong

    Abstract: Password strength meters (PSMs) have been widely used by websites to gauge password strength, encouraging users to create stronger passwords. Popular data-driven PSMs, e.g., based on Markov, Probabilistic Context-free Grammar (PCFG) and neural networks, alarm strength based on a model learned from real passwords. Despite their proven effectiveness, the secure utility that arises from the leakage o…

    Submitted 13 May, 2025; originally announced May 2025.

  30. arXiv:2505.06055  [pdf, ps, other]

    cs.CV

    Towards Better Cephalometric Landmark Detection with Diffusion Data Generation

    Authors: Dongqian Guo, Wencheng Han, Pang Lyu, Yuxi Zhou, Jianbing Shen

    Abstract: Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large…

    Submitted 9 May, 2025; originally announced May 2025.

  31. arXiv:2505.04201  [pdf, other]

    cs.CV

    SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios

    Authors: Ning Cheng, Jinan Xu, Jialing Chen, Wenjuan Han

    Abstract: This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing large touch-language models often treat touch as a mere sub-modality of language, and open-ended tactile data scarcity, where curr…

    Submitted 7 May, 2025; originally announced May 2025.

  32. arXiv:2505.03835  [pdf]

    cs.DL cs.AI cs.CY

    The Shift Towards Preprints in AI Policy Research: A Comparative Study of Preprint Trends in the U.S., Europe, and South Korea

    Authors: Simon Suh, Jihyuk Bang, Ji Woo Han

    Abstract: The adoption of open science has quickly changed how artificial intelligence (AI) policy research is distributed globally. This study examines the regional trends in the citation of preprints, specifically focusing on the impact of two major disruptive events: the COVID-19 pandemic and the release of ChatGPT, on research dissemination patterns in the United States, Europe, and South Korea from 201…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: 22 pages, 6 figures, 3 tables. Uses cross-regional analysis to evaluate how preprint citation trends in AI policy research have shifted over time in response to two major global events: the COVID-19 pandemic and the release of ChatGPT. Compares the United States, Europe, and South Korea

    ACM Class: I.2.0; K.4.0

  33. arXiv:2505.01255  [pdf, other]

    cs.CL cs.IR cs.MM

    PREMISE: Matching-based Prediction for Accurate Review Recommendation

    Authors: Wei Han, Hui Chen, Soujanya Poria

    Abstract: We present PREMISE (PREdict with Matching ScorEs), a new architecture for matching-based learning in multimodal fields for the multimodal review helpfulness (MRHP) task. Unlike previous fusion-based methods, which obtain multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semant…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: 19 pages, 16 figures

  34. arXiv:2505.00942  [pdf]

    physics.optics cs.AR

    Enhancing Realism in Holographic Augmented Reality Displays through Occlusion Handling

    Authors: Woongseob Han, Chanseul Lee, Jae-Hyeung Park

    Abstract: In this paper, an occlusion-capable holographic augmented-reality (AR) display is proposed, and its ability to enhance AR imagery through occlusion is demonstrated. Holographic displays can generate ideal three-dimensional (3D) virtual images and have recently shown rapid advancements, particularly in noise reduction through learning-based approaches. However, these displays still face challenges…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 24 pages, 10 figures

  35. arXiv:2505.00416  [pdf, ps, other]

    cs.AI

    ScaleTrack: Scaling and back-tracking Automated GUI Agents

    Authors: Jing Huang, Zhixiong Zeng, Wenkang Han, Yufeng Zhong, Liming Zheng, Shuai Fu, Jingyuan Chen, Lin Ma

    Abstract: Automated GUI agents aim to facilitate user interaction by automatically performing complex tasks in digital environments, such as web, mobile, and desktop devices. They receive a textual task instruction and a GUI description to generate executable actions (e.g., click) and operation boxes step by step. Training a GUI agent mainly involves grounding and planning stages, in which the GUI grounding…

    Submitted 1 May, 2025; originally announced May 2025.

  36. arXiv:2505.00017  [pdf, other]

    cs.CL cs.AI cs.DB cs.LG

    ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation

    Authors: Dezheng Han, Yibin Jia, Ruxiao Chen, Wenjie Han, Shuaishuai Guo, Jianbo Wang

    Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi task workflow to optimize the annotation process. Compared to general purpose LLMs, our method improves human evaluation scores by up to 0.21 and s…

    Submitted 23 April, 2025; originally announced May 2025.

  37. arXiv:2504.20525  [pdf, other]

    cs.CV

    Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection

    Authors: Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen

    Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations: inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim…

    Submitted 29 April, 2025; originally announced April 2025.

  38. arXiv:2504.06271  [pdf, other]

    cs.IR cs.AI cs.CL

    ER-RAG: Enhance RAG with ER-Based Unified Modeling of Heterogeneous Data Sources

    Authors: Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao

    Abstract: Large language models (LLMs) excel in question-answering (QA) tasks, and retrieval-augmented generation (RAG) enhances their precision by incorporating external evidence from diverse sources like web pages, databases, and knowledge graphs. However, current RAG methods rely on agent-specific strategies for individual data sources, posing challenges in low-resource or black-box environments and complic…

    Submitted 2 March, 2025; originally announced April 2025.

  39. arXiv:2504.02222  [pdf, other]

    eess.IV cs.CV

    APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification

    Authors: Liying Xu, Hongliang He, Wei Han, Hanbin Huang, Siwei Feng, Guohong Fu

    Abstract: Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class-agnostic design renders its classification results entir…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: 10 pages, 3 figures

  40. arXiv:2503.24067  [pdf, other]

    cs.LG

    TransMamba: Flexibly Switching between Transformer and Mamba

    Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang

    Abstract: Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Preprint. Under review

  41. arXiv:2503.23650  [pdf, other]

    cs.LG cs.RO

    A Survey of Reinforcement Learning-Based Motion Planning for Autonomous Driving: Lessons Learned from a Driving Task Perspective

    Authors: Zhuoren Li, Guizhe Jin, Ran Yu, Zhiwen Chen, Nan Li, Wei Han, Lu Xiong, Bo Leng, Jia Hu, Ilya Kolmanovsky, Dimitar Filev

    Abstract: Reinforcement learning (RL), with its ability to explore and optimize policies in complex, dynamic decision-making tasks, has emerged as a promising approach to addressing motion planning (MoP) challenges in autonomous driving (AD). Despite rapid advancements in RL and AD, a systematic description and interpretation of the RL design process tailored to diverse driving tasks remains underdeveloped.…

    Submitted 30 March, 2025; originally announced March 2025.

    Comments: 21 pages, 5 figures

  42. arXiv:2503.22168  [pdf, other]

    cs.CV

    Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

    Authors: Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang

    Abstract: Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocate…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: CVPR2025

  43. arXiv:2503.21259  [pdf, other]

    cs.CV

    Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data

    Authors: Wencheng Han, Dongqian Guo, Xiao Chen, Pang Lyu, Yi Jin, Jianbing Shen

    Abstract: Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding…

    Submitted 27 March, 2025; originally announced March 2025.

  44. arXiv:2503.19786  [pdf, other]

    cs.CL cs.AI

    Gemma 3 Technical Report

    Authors: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, et al. (191 additional authors not shown)

    Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achie…

    Submitted 25 March, 2025; originally announced March 2025.

  45. arXiv:2503.19690  [pdf, other]

    cs.RO

    Risk-Aware Reinforcement Learning for Autonomous Driving: Improving Safety When Driving through Intersection

    Authors: Bo Leng, Ran Yu, Wei Han, Lu Xiong, Zhuoren Li, Hailong Huang

    Abstract: Applying reinforcement learning to autonomous driving has garnered widespread attention. However, classical reinforcement learning methods optimize policies by maximizing expected rewards but lack sufficient safety considerations, often putting agents in hazardous situations. This paper proposes a risk-aware reinforcement learning approach for autonomous driving to improve the safety performance w…

    Submitted 27 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: 11 pages, 10 figures

  46. arXiv:2503.17675  [pdf, other]

    cs.CV

    Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

    Authors: Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao

    Abstract: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Tra…

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  47. arXiv:2503.17013  [pdf]

    cs.HC cs.AI

    Developing Critical Thinking in Second Language Learners: Exploring Generative AI like ChatGPT as a Tool for Argumentative Essay Writing

    Authors: Simon Suh, Jihyuk Bang, Ji Woo Han

    Abstract: This study employs the Paul-Elder Critical Thinking Model and Tan's argumentative writing framework to create a structured methodology. This methodology, ChatGPT Guideline for Critical Argumentative Writing (CGCAW) framework, integrates the models with ChatGPT's capabilities to guide L2 learners in utilizing ChatGPT to enhance their critical thinking skills. A quantitative experiment was conducted…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 12 pages, 3 figures. Uses Paul-Elder Critical Thinking Model and Tan's argumentative writing framework. Includes an experimental study with 10 participants

    ACM Class: I.2.7; K.3.1

  48. arXiv:2503.13542  [pdf, other]

    cs.LG cs.AI

    HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU Datasets

    Authors: Lulu Ban, Tao Zhu, Xiangqing Lu, Qi Qiu, Wenyong Han, Shuangjian Li, Liming Chen, Kevin I-Kai Wang, Mingxing Nie, Yaping Wan

    Abstract: Cross-dataset Human Activity Recognition (HAR) suffers from limited model generalization, hindering its practical deployment. To address this critical challenge, inspired by the success of DoReMi in Large Language Models (LLMs), we introduce a data mixture optimization strategy for pre-training HAR models, aiming to improve the recognition performance across heterogeneous datasets. However, direct…

    Submitted 16 March, 2025; originally announced March 2025.

  49. arXiv:2503.08346  [pdf, other]

    cs.CV

    Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis

    Authors: Chanyoung Kim, Dayun Ju, Jinyeong Kim, Woojung Han, Roberto Alcover-Couso, Seong Jae Hwang

    Abstract: As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques…

    Submitted 11 March, 2025; originally announced March 2025.

  50. arXiv:2503.06469  [pdf, other]

    cs.CV

    Vector Quantized Feature Fields for Fast 3D Semantic Lifting

    Authors: George Tang, Aditya Agarwal, Weiqiao Han, Trevor Darrell, Yutong Bai

    Abstract: We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is…

    Submitted 9 March, 2025; originally announced March 2025.