-
BenchRL-QAS: Benchmarking reinforcement learning algorithms for quantum architecture search
Authors:
Azhar Ikhtiarudin,
Aditi Das,
Param Thakkar,
Akash Kundu
Abstract:
We introduce BenchRL-QAS, a unified benchmarking framework for systematically evaluating reinforcement learning (RL) algorithms in quantum architecture search (QAS) across diverse variational quantum algorithm tasks and system sizes ranging from 2 to 8 qubits. Our study benchmarks nine RL agents, including both value-based and policy-gradient methods, on representative quantum problems such as the variational quantum eigensolver, variational quantum state diagonalization, quantum classification, and state preparation, spanning both noiseless and realistic noisy regimes. We propose a weighted ranking metric that balances accuracy, circuit depth, gate count, and computational efficiency, enabling fair and comprehensive comparison. Our results first reveal that the RL-based quantum classifier outperforms baseline variational classifiers. We then conclude that no single RL algorithm is universally optimal across QAS tasks; algorithmic performance is highly context-dependent, varying with task structure, qubit count, and noise. This empirical finding provides strong evidence for the "no free lunch" principle in RL-based quantum circuit design and highlights the necessity of tailored algorithm selection and systematic benchmarking for advancing quantum circuit synthesis. This work represents the most comprehensive RL-QAS benchmarking effort to date, and BenchRL-QAS along with all experimental data are made publicly available to support reproducibility and future research at https://github.com/azhar-ikhtiarudin/bench-rlqas.
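The abstract does not spell out the weighted ranking metric, so the sketch below is only a minimal illustration of how such a composite ranking could be computed: per-agent metrics (error, depth, gate count, wall time) are min-max normalized and combined with hypothetical weights. The actual weights and normalization used in BenchRL-QAS may differ.

```python
# Illustrative sketch only: the abstract does not give the exact formula, so the
# weights, normalization, and score direction below are assumptions.
from dataclasses import dataclass

@dataclass
class RunResult:
    error: float        # task error (e.g., energy error for VQE); lower is better
    depth: int          # circuit depth; lower is better
    gate_count: int     # total gate count; lower is better
    wall_time: float    # training time in seconds; lower is better

def weighted_rank_scores(results, weights=(0.4, 0.2, 0.2, 0.2)):
    """Return one composite score per RL agent: lower composite = better rank."""
    metrics = [
        [r.error for r in results],
        [r.depth for r in results],
        [r.gate_count for r in results],
        [r.wall_time for r in results],
    ]
    # Min-max normalize each metric to [0, 1] so they are comparable.
    def normalize(col):
        lo, hi = min(col), max(col)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]
    norm = [normalize(col) for col in metrics]
    return [sum(w * norm[m][i] for m, w in enumerate(weights))
            for i in range(len(results))]

# Example: rank three hypothetical agents on a 4-qubit VQE task.
agents = [RunResult(1e-3, 12, 30, 120.0),
          RunResult(5e-4, 20, 55, 300.0),
          RunResult(2e-3, 8, 18, 60.0)]
print(weighted_rank_scores(agents))
```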
Submitted 16 July, 2025;
originally announced July 2025.
-
Secure and Efficient Quantum Signature Scheme Based on the Controlled Unitary Operations Encryption
Authors:
Debnath Ghosh,
Soumit Roy,
Prithwi Bagchi,
Indranil Chakrabarty,
Ashok Kumar Das
Abstract:
Quantum digital signatures ensure unforgeable message authenticity and integrity using quantum principles, offering unconditional security against both classical and quantum attacks. They are crucial for secure communication in high-stakes environments, ensuring trust and long-term protection in the quantum era. Nowadays, the majority of arbitrated quantum signature (AQS) protocols encrypt data qubit by qubit using the quantum one-time pad (QOTP). Despite providing robust data encryption, QOTP is not a good fit for AQS because of its susceptibility to many types of attacks. In this work, we present an efficient AQS protocol to encrypt quantum message ensembles using a distinct encryption technique, the chained controlled unitary operations. In contrast to existing protocols, our approach successfully prevents disavowal and forgery attacks. We hope this contributes to advancing future investigations into the development of AQS protocols.
Submitted 14 July, 2025;
originally announced July 2025.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3284 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Submitted 22 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence
Authors:
Vipula Rawte,
Rajarshi Roy,
Gurpreet Singh,
Danush Khanna,
Yaswanth Narsupalli,
Basab Ghosh,
Abhay Gupta,
Argha Kamal Samanta,
Aditya Shingote,
Aadi Krishna Vikram,
Vinija Jain,
Aman Chadha,
Amit Sheth,
Amitava Das
Abstract:
As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
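The abstract defines ECD only informally, so the toy function below merely illustrates the underlying intuition: measure how many entities from the retrieved context survive in the generated answer. The regex "entity extractor" and the 1-minus-recall formulation are assumptions for demonstration, not the paper's metric.

```python
# Toy illustration of the idea behind an entity-context divergence score.
# This is NOT the paper's definition; the entity extractor and the 1-recall
# formulation below are assumptions for demonstration only.
import re

def extract_entities(text):
    """Hypothetical stand-in extractor: capitalized token spans count as 'entities'."""
    return set(re.findall(r"\b[A-Z][a-zA-Z0-9-]+(?:\s[A-Z][a-zA-Z0-9-]+)*", text))

def entity_context_divergence(retrieved_context, generated_answer):
    """1 - fraction of context entities preserved in the answer (0 = fully faithful)."""
    ctx_entities = extract_entities(retrieved_context)
    if not ctx_entities:
        return 0.0
    preserved = ctx_entities & extract_entities(generated_answer)
    return 1.0 - len(preserved) / len(ctx_entities)

context = "Marie Curie won the Nobel Prize in Physics in 1903 with Pierre Curie."
answer = "Marie Curie received the Nobel Prize in Physics."
print(entity_context_divergence(context, answer))  # some entities dropped -> > 0
```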
Submitted 28 June, 2025;
originally announced July 2025.
-
Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions
Authors:
Vineet Kumar Rakesh,
Soumya Mazumdar,
Research Pratim Maity,
Sarbajit Pal,
Amitabha Das,
Tapas Samanta
Abstract:
Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.
Submitted 23 June, 2025;
originally announced July 2025.
-
PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration
Authors:
Ayantika Das,
Moitreya Chaudhuri,
Koushik Bhat,
Keerthi Ram,
Mihail Bota,
Mohanasankar Sivaprakasam
Abstract:
Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applicabilities, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, we first devise a mechanism to structure the latent space of a diffusion auto-encoding model towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference-time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.
Submitted 3 July, 2025;
originally announced July 2025.
-
UMA: A Family of Universal Models for Atoms
Authors:
Brandon M. Wood,
Misko Dzamba,
Xiang Fu,
Meng Gao,
Muhammed Shuaibi,
Luis Barroso-Luque,
Kareem Abdelmaqsoud,
Vahe Gharakhanyan,
John R. Kitchin,
Daniel S. Levine,
Kyle Michel,
Anuroop Sriram,
Taco Cohen,
Abhishek Das,
Ammar Rizvi,
Sushree Jagriti Sahoo,
Zachary W. Ulissi,
C. Lawrence Zitnick
Abstract:
The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.
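The abstract names a "mixture of linear experts" design but gives no details, so the following PyTorch sketch only illustrates the generic idea: many linear experts exist, but a gate routes each input to a few of them, so active parameters per input stay small relative to total parameters. The routing scheme, dimensions, and top-k value are assumptions, not the UMA architecture.

```python
# Minimal sketch of a generic mixture-of-linear-experts layer (PyTorch).
# The abstract names the idea but not its details; gating, top-k routing, and
# dimensions here are assumptions, not the UMA implementation.
import torch
import torch.nn as nn

class MixtureOfLinearExperts(nn.Module):
    def __init__(self, dim=128, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, dim)
        scores = self.gate(x)                    # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top-k experts are "active"
            for b in range(x.size(0)):
                e = int(idx[b, slot])
                out[b] += weights[b, slot] * self.experts[e](x[b])
        return out

layer = MixtureOfLinearExperts()
print(layer(torch.randn(4, 128)).shape)          # torch.Size([4, 128])
```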
Submitted 30 June, 2025;
originally announced June 2025.
-
Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images
Authors:
Shreyas Dixit,
Ashhar Aziz,
Shashwat Bajpai,
Vasu Sharma,
Aman Chadha,
Vinija Jain,
Amitava Das
Abstract:
A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.
Submitted 28 June, 2025;
originally announced June 2025.
-
QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
Authors:
Danush Khanna,
Aditya Kumar Guru,
Srivarshinee Sridhar,
Zidan Ahmed,
Rubhav Bahirwani,
Meetu Malhotra,
Vinija Jain,
Aman Chadha,
Amitava Das,
Kripabandhu Ghosh
Abstract:
Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms:
(i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length; and (iv) Adaptive Matryoshka Quantization, which adaptively reduces numerical precision during decoding.
Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<=0.2).
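The halting criterion is not specified in the abstract, so the toy sketch below only illustrates the general idea of dynamic token halting: a token whose hidden state barely changes between consecutive layers is frozen and skips further computation. The cosine-similarity test and threshold are assumptions, not QuickSilver's actual rule.

```python
# Toy sketch of halting tokens whose representations have converged across layers.
# The cosine-similarity criterion and threshold are illustrative assumptions,
# not QuickSilver's actual halting rule.
import torch

def update_halted(prev_hidden, curr_hidden, halted, threshold=0.999):
    """Mark tokens as halted when their hidden state stops changing between layers.

    prev_hidden, curr_hidden: (seq_len, dim) hidden states at consecutive layers.
    halted: (seq_len,) boolean mask of already-halted tokens (kept halted).
    """
    cos = torch.nn.functional.cosine_similarity(prev_hidden, curr_hidden, dim=-1)
    return halted | (cos > threshold)

seq_len, dim = 6, 16
prev = torch.randn(seq_len, dim)
curr = prev.clone()
curr[3:] += 0.5 * torch.randn(seq_len - 3, dim)   # last tokens are still changing
halted = torch.zeros(seq_len, dtype=torch.bool)
halted = update_halted(prev, curr, halted)
print(halted)   # converged tokens would skip the remaining layers
```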
Submitted 27 June, 2025;
originally announced June 2025.
-
TRMAC: A Time-Reversal-based MAC Protocol for Wireless Networks within Computing Packages
Authors:
Ama Bandara,
Abhijit Das,
Fatima Rodriguez-Galan,
Eduard Alarcon,
Sergi Abadal
Abstract:
As chiplet-based integration and many-core architectures become the norm in high-performance computing, on-chip wireless communication has emerged as a compelling alternative to traditional interconnects. However, scalable Medium Access Control (MAC) remains a fundamental challenge, particularly under dense traffic and limited spectral resources. This paper presents TRMAC, a novel cross-layer MAC protocol that exploits the spatial focusing capability of Time Reversal (TR) to enable multiple parallel transmissions over a shared frequency channel. By leveraging the quasi-deterministic nature of on-chip wireless channels, TRMAC pre-characterizes channel impulse responses to coordinate access using energy-based thresholds, eliminating the need for orthogonal resource allocation or centralized arbitration. Through detailed physical-layer simulation and system-level evaluation on diverse traffic, TRMAC demonstrates comparable or superior performance to existing multi-channel MAC protocols, achieving low latency, high throughput, and strong scalability across hundreds of cores. TRMAC provides a low-complexity, high-efficiency solution for future Wireless Networks-on-Chip (WNoCs), particularly in chiplet-based systems where spatial reuse and modularity are critical. Our simulations further show that TRMAC supports parallel transmissions over a single frequency channel with throughput and latency comparable to those achieved with multiple frequency bands, eliminating the need for complex multi-band transceivers. This work establishes a new design direction for MAC protocols that are tightly integrated with the underlying channel physics to meet the demands of next-generation computing platforms.
Submitted 24 June, 2025;
originally announced June 2025.
-
Distributed Butterfly Analysis using Mobile Agents
Authors:
Prabhat Kumar Chand,
Apurba Das,
Anisur Rahaman Molla
Abstract:
Butterflies, or 4-cycles in bipartite graphs, are crucial for identifying cohesive structures and dense subgraphs. While agent-based data mining is gaining prominence, its application to bipartite networks remains relatively unexplored. We propose distributed, agent-based algorithms for \emph{Butterfly Counting} in a bipartite graph $G((A,B),E)$. Agents first determine their respective partitions and collaboratively construct a spanning tree, electing a leader within $O(n \log \lambda)$ rounds using only $O(\log \lambda)$ bits per agent. A novel meeting mechanism between adjacent agents improves efficiency and eliminates the need for prior knowledge of the graph, requiring only the highest agent ID $\lambda$ among the $n$ agents. Notably, our techniques naturally extend to general graphs, where leader election and spanning tree construction maintain the same round and memory complexities. Building on these foundations, agents count butterflies per node in $O(\Delta)$ rounds and compute the total butterfly count of $G$ in $O(\Delta+\min\{|A|,|B|\})$ rounds.
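For context, the object being counted can be illustrated with a compact sequential (single-machine) sketch: for every pair of vertices on one side of the bipartition, each pair of their common neighbors closes one butterfly. This is only background for the combinatorics; the paper's contribution is the distributed, mobile-agent computation.

```python
# Sequential (non-distributed) sketch of butterfly counting in a bipartite graph,
# shown only to illustrate the combinatorial object; the paper's contribution is
# the distributed, mobile-agent version.
from itertools import combinations
from math import comb

def count_butterflies(A, adj):
    """adj maps each vertex in partition A to its set of neighbors in B."""
    total = 0
    for u, v in combinations(A, 2):
        c = len(adj[u] & adj[v])        # common neighbors in B
        total += comb(c, 2)             # each pair of common neighbors closes a 4-cycle
    return total

A = ["a1", "a2", "a3"]
adj = {"a1": {"b1", "b2"}, "a2": {"b1", "b2", "b3"}, "a3": {"b3"}}
print(count_butterflies(A, adj))        # a1 and a2 share {b1, b2} -> 1 butterfly
```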
Submitted 21 June, 2025;
originally announced June 2025.
-
Cross-Modality Learning for Predicting IHC Biomarkers from H&E-Stained Whole-Slide Images
Authors:
Amit Das,
Naofumi Tomita,
Kyle J. Syme,
Weijie Ma,
Paige O'Connor,
Kristin N. Corbett,
Bing Ren,
Xiaoying Liu,
Saeed Hassanpour
Abstract:
Hematoxylin and Eosin (H&E) staining is a cornerstone of pathological analysis, offering reliable visualization of cellular morphology and tissue architecture for cancer diagnosis, subtyping, and grading. Immunohistochemistry (IHC) staining provides molecular insights by detecting specific proteins within tissues, enhancing diagnostic accuracy, and improving treatment planning. However, IHC staining is costly, time-consuming, and resource-intensive, requiring specialized expertise. To address these limitations, this study proposes HistoStainAlign, a novel deep learning framework that predicts IHC staining patterns directly from H&E whole-slide images (WSIs) by learning joint representations of morphological and molecular features. The framework integrates paired H&E and IHC embeddings through a contrastive training strategy, capturing complementary features across staining modalities without patch-level annotations or tissue registration. The model was evaluated on gastrointestinal and lung tissue WSIs with three commonly used IHC stains: P53, PD-L1, and Ki-67. HistoStainAlign achieved weighted F1 scores of 0.735 [95% Confidence Interval (CI): 0.670-0.799], 0.830 [95% CI: 0.772-0.886], and 0.723 [95% CI: 0.607-0.836], respectively for these three IHC stains. Embedding analyses demonstrated the robustness of the contrastive alignment in capturing meaningful cross-stain relationships. Comparisons with a baseline model further highlight the advantage of incorporating contrastive learning for improved stain pattern prediction. This study demonstrates the potential of computational approaches to serve as a pre-screening tool, helping prioritize cases for IHC staining and improving workflow efficiency.
Submitted 18 June, 2025;
originally announced June 2025.
-
Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation
Authors:
Hussam Al Daas,
Grey Ballard,
Laura Grigori,
Suraj Kumar,
Kathryn Rouse,
Mathieu Vérité
Abstract:
In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a $3$-dimensional symmetric tensor and in gradient-based methods for computing a symmetric CP decomposition. We establish communication lower bounds that determine how much data movement is required to perform the specified computation in parallel. The core idea of the proof relies on extending a key geometric inequality for $3$-dimensional symmetric computations. We demonstrate that the communication lower bounds are tight by presenting an optimal algorithm where the data distribution is a natural extension of the triangle block partition scheme for symmetric matrices to 3-dimensional symmetric tensors.
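The local computation being distributed is easy to state in NumPy: contract the same vector along two modes of a symmetric 3-dimensional tensor, $y_i = \sum_{j,k} T_{ijk} x_j x_k$. The sketch below shows only this sequential operation; the parallel data distribution and communication schedule are the article's subject.

```python
# The local computation being parallelized: contract the same vector x along two
# modes of a symmetric 3-D tensor T, i.e. y_i = sum_{j,k} T_{ijk} x_j x_k.
import numpy as np

n = 4
T = np.random.rand(n, n, n)
T = (T + T.transpose(0, 2, 1) + T.transpose(1, 0, 2)
       + T.transpose(1, 2, 0) + T.transpose(2, 0, 1) + T.transpose(2, 1, 0)) / 6  # symmetrize
x = np.random.rand(n)

y = np.einsum("ijk,j,k->i", T, x, x)
# By symmetry the result is the same whichever two modes are contracted:
assert np.allclose(y, np.einsum("ijk,i,k->j", T, x, x))
print(y)
```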
Submitted 18 June, 2025;
originally announced June 2025.
-
DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Authors:
Renjith Prasad,
Abhilekh Borah,
Hasnat Md Abdullah,
Chathurangi Shyalika,
Gurpreet Singh,
Ritvik Garimella,
Rajarshi Roy,
Harshul Surana,
Nasrin Imanpour,
Suranjana Trivedy,
Amit Sheth,
Amitava Das
Abstract:
Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
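The RBF and polynomial kernels mentioned above have standard textbook forms; a quick NumPy sketch of both is given below for reference. How these kernels are combined with the hybrid loss inside DPO-Kernels is specific to the paper and not reproduced here.

```python
# Standard RBF and polynomial kernels as used for kernelized representations.
# How these plug into the DPO-Kernels loss is paper-specific and not shown here.
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def polynomial_kernel(x, y, degree=3, c=1.0):
    return (np.dot(x, y) + c) ** degree

x = np.random.rand(8)   # e.g., an embedding of a chosen image
y = np.random.rand(8)   # e.g., an embedding of a rejected image
print(rbf_kernel(x, y), polynomial_kernel(x, y))
```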
Submitted 17 June, 2025;
originally announced June 2025.
-
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
Authors:
Aswin Shanmugam Subramanian,
Amit Das,
Naoyuki Kanda,
Jinyu Li,
Xiaofei Wang,
Yifan Gong
Abstract:
We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models -- Conformer Transducer for streaming and Sequence-to-Sequence for offline -- or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline scenarios while also enhancing readability of multi-talker transcriptions.
Submitted 17 June, 2025;
originally announced June 2025.
-
Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations
Authors:
Abhilekh Borah,
Chhavi Sharma,
Danush Khanna,
Utkarsh Bhatt,
Gurpreet Singh,
Hasnat Md Abdullah,
Raghav Kaushik Ravi,
Vinija Jain,
Jyoti Patel,
Shubham Singh,
Vasu Sharma,
Arpita Vats,
Rahul Raja,
Aman Chadha,
Amitava Das
Abstract:
Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking.
To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing.
Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
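Two of the clustering-quality ingredients named above (Davies-Bouldin and Calinski-Harabasz) are available in scikit-learn; the toy sketch below computes them on synthetic "safe" and "unsafe" activations. The final combination into AQI is the paper's own; the simple ratio at the end is illustrative only.

```python
# Sketch of the clustering-quality ingredients behind AQI on toy "activations".
# The way the components are weighted/combined into AQI is the paper's; the
# simple combination below is illustrative only.
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
safe = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # activations for safe prompts
unsafe = rng.normal(loc=4.0, scale=1.0, size=(200, 2))   # activations for unsafe prompts
X = np.vstack([safe, unsafe])
labels = np.array([0] * 200 + [1] * 200)

dbs = davies_bouldin_score(X, labels)      # lower = better separated clusters
chi = calinski_harabasz_score(X, labels)   # higher = better separated clusters
print(f"Davies-Bouldin: {dbs:.3f}, Calinski-Harabasz: {chi:.1f}")

# Illustrative combination: reward separation, penalize overlap (NOT the paper's AQI).
aqi_like = chi / (1.0 + dbs)
print(f"toy separability score: {aqi_like:.1f}")
```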
Submitted 16 June, 2025;
originally announced June 2025.
-
Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis
Authors:
Zongli Ye,
Jiachen Lian,
Xuanru Zhou,
Jinming Zhang,
Haodong Li,
Shuhe Li,
Chenxu Guo,
Anaisha Das,
Peter Park,
Zoe Ezzes,
Jet Vonk,
Brittany Morin,
Rian Bogley,
Lisa Wauters,
Zachary Miller,
Maria Gorno-Tempini,
Gopala Anumanchipalli
Abstract:
Accurate alignment of dysfluent speech with intended text is crucial for automating the diagnosis of neurodegenerative speech disorders. Traditional methods often fail to model phoneme similarities effectively, limiting their performance. In this work, we propose Neural LCS, a novel approach for dysfluent text-text and speech-text alignment. Neural LCS addresses key challenges, including partial alignment and context-aware similarity mapping, by leveraging robust phoneme-level modeling. We evaluate our method on a large-scale simulated dataset, generated using advanced data simulation techniques, and real PPA data. Neural LCS significantly outperforms state-of-the-art models in both alignment accuracy and dysfluent speech segmentation. Our results demonstrate the potential of Neural LCS to enhance automated systems for diagnosing and analyzing speech disorders, offering a more accurate and linguistically grounded solution for dysfluent speech alignment.
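As background for the approach, the classical longest-common-subsequence dynamic program over phoneme sequences is sketched below; Neural LCS replaces the exact-match comparison with learned, similarity-aware phoneme matching, which is not shown here.

```python
# Classical LCS dynamic program over phoneme sequences, the baseline notion that
# Neural LCS extends with learned, similarity-aware phoneme matching (not shown).
def lcs_length(ref, hyp):
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

intended = ["P", "L", "IY", "Z"]              # "please"
produced = ["P", "P", "L", "IY", "IY", "Z"]   # dysfluent repetition
print(lcs_length(intended, produced))          # 4: all intended phonemes are recoverable
```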
Submitted 4 June, 2025;
originally announced June 2025.
-
Mapping and Scheduling Spiking Neural Networks On Segmented Ladder Bus Architectures
Authors:
Phu Khanh Huynh,
Francky Catthoor,
Anup Das
Abstract:
Large-scale neuromorphic architectures consist of computing tiles that communicate spikes using a shared interconnect. The communication patterns in these systems are inherently sparse, asynchronous, and localized, as neural activity is characterized by temporal sparsity with occasional bursts of high traffic. These characteristics require optimized interconnects to handle high-activity bursts while consuming minimal power during idle periods. Among the proposed interconnect solutions, the dynamic segmented bus has gained attention due to its structural simplicity, scalability, and energy efficiency. Since the benefits of a dynamic segmented bus stem from its simplicity, it is essential to develop a streamlined control plane that can scale efficiently with the network. In this paper, we present a design methodology for a scenario-aware control plane tailored to a segmented ladder bus, with the aim of minimizing control overhead and optimizing energy and area utilization. We evaluated our approach using a combination of FPGA implementation and software simulation to assess scalability. The results demonstrated that our design process effectively reduces the control plane's area footprint compared to the data plane while maintaining scalability with network size.
Submitted 12 June, 2025;
originally announced June 2025.
-
Do Concept Replacement Techniques Really Erase Unacceptable Concepts?
Authors:
Anudeep Das,
Gurjot Singh,
Prach Chantasantitam,
N. Asokan
Abstract:
Generative models, particularly diffusion-based text-to-image (T2I) models, have demonstrated astounding success. However, aligning them to avoid generating content with unacceptable concepts (e.g., offensive or copyrighted content, or celebrity likenesses) remains a significant challenge. Concept replacement techniques (CRTs) aim to address this challenge, often by trying to "erase" unacceptable concepts from models. Recently, model providers have started offering image editing services which accept an image and a text prompt as input, to produce an image altered as specified by the prompt. These are known as image-to-image (I2I) models. In this paper, we first use an I2I model to empirically demonstrate that today's state-of-the-art CRTs do not in fact erase unacceptable concepts. Existing CRTs are thus likely to be ineffective in emerging I2I scenarios, despite their proven ability to remove unwanted concepts in T2I pipelines, highlighting the need to understand this discrepancy between T2I and I2I settings. Next, we argue that a good CRT, while replacing unacceptable concepts, should preserve other concepts specified in the inputs to generative models. We call this fidelity. Prior work on CRTs have neglected fidelity in the case of unacceptable concepts. Finally, we propose the use of targeted image-editing techniques to achieve both effectiveness and fidelity. We present such a technique, AntiMirror, and demonstrate its viability.
Submitted 10 June, 2025;
originally announced June 2025.
-
AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement -- Introducing Adversarial Vulnerability Quality Index (AVQI)
Authors:
Danush Khanna,
Krishna Kumar,
Basab Ghosh,
Vinija Jain,
Vasu Sharma,
Aman Chadha,
Amitava Das
Abstract:
Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.
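The two GRACE constraints named above (separation between safe and adversarial completions, cohesion among unsafe ones) can be illustrated with a toy PyTorch regularizer over pooled embeddings. The margin, pooling, and weighting below are assumptions for illustration, not the paper's exact loss.

```python
# Toy version of a separation + cohesion regularizer over pooled embeddings.
# Margins, pooling, and weights are assumptions, not GRACE's exact formulation.
import torch
import torch.nn.functional as F

def separation_cohesion_loss(safe_emb, unsafe_emb, margin=1.0, w_cohesion=0.5):
    """safe_emb, unsafe_emb: (n, d) pooled embeddings of completions."""
    safe_centroid = safe_emb.mean(dim=0)
    unsafe_centroid = unsafe_emb.mean(dim=0)
    # Separation: push the two centroids at least `margin` apart.
    sep = F.relu(margin - torch.norm(safe_centroid - unsafe_centroid))
    # Cohesion: pull unsafe/jailbreak completions toward their own centroid.
    coh = ((unsafe_emb - unsafe_centroid) ** 2).sum(dim=-1).mean()
    return sep + w_cohesion * coh

safe = torch.randn(16, 64)
unsafe = torch.randn(16, 64) + 0.1
print(separation_cohesion_loss(safe, unsafe))
```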
Submitted 11 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Improving Performance of Spike-based Deep Q-Learning using Ternary Neurons
Authors:
Aref Ghoreishee,
Abhishek Mishra,
John Walsh,
Anup Das,
Nagarajan Kandasamy
Abstract:
We propose a new ternary spiking neuron model to improve the representation capacity of binary spiking neurons in deep Q-learning. Although a ternary neuron model has recently been introduced to overcome the limited representation capacity offered by the binary spiking neurons, we show that its performance is worse than that of binary models in deep Q-learning tasks. We hypothesize gradient estimation bias during the training process as the underlying potential cause through mathematical and empirical analysis. We propose a novel ternary spiking neuron model to mitigate this issue by reducing the estimation bias. We use the proposed ternary spiking neuron as the fundamental computing unit in a deep spiking Q-learning network (DSQN) and evaluate the network's performance in seven Atari games from the Gym environment. Results show that the proposed ternary spiking neuron mitigates the drastic performance degradation of ternary neurons in Q-learning tasks and improves the network performance compared to the existing binary neurons, making DSQN a more practical solution for on-board autonomous decision-making tasks.
Submitted 3 June, 2025;
originally announced June 2025.
-
Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models
Authors:
Femi Bello,
Anubrata Das,
Fanzhi Zeng,
Fangcong Yin,
Liu Leqi
Abstract:
It has been hypothesized that neural networks with similar architectures trained on similar data learn shared representations relevant to the learning task. We build on this idea by extending the conceptual framework where representations learned across models trained on the same data can be expressed as linear combinations of a \emph{universal} set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale. From this framework, we propose the \textbf{Linear Representation Transferability (LRT)} Hypothesis -- that there exists an affine transformation between the representation spaces of different models. To test this hypothesis, we learn affine mappings between the hidden states of models of different sizes and evaluate whether steering vectors -- directions in hidden state space associated with specific model behaviors -- retain their semantic effect when transferred from small to large language models using the learned mappings. We find strong empirical evidence that such affine mappings can preserve steering behaviors. These findings suggest that representations learned by small models can be used to guide the behavior of large models, and that the LRT hypothesis may be a promising direction on understanding representation alignment across model scales.
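Testing the LRT hypothesis essentially amounts to fitting an affine map between paired hidden states of two models and pushing a steering direction through it. The NumPy sketch below uses random stand-in data and ordinary least squares; the models, layers, and fitting procedure in the paper may differ.

```python
# Least-squares sketch of fitting an affine map (W, b) from small-model hidden
# states to large-model hidden states, then transferring a steering vector.
# Dimensions and data are random stand-ins, not the paper's models.
import numpy as np

rng = np.random.default_rng(0)
d_small, d_large, n = 64, 256, 2000
H_small = rng.normal(size=(n, d_small))                  # paired hidden states
W_true = rng.normal(size=(d_small, d_large)) / np.sqrt(d_small)
H_large = H_small @ W_true + 0.01 * rng.normal(size=(n, d_large))

# Fit [W; b] jointly by augmenting inputs with a bias column.
X = np.hstack([H_small, np.ones((n, 1))])
Wb, *_ = np.linalg.lstsq(X, H_large, rcond=None)
W, b = Wb[:-1], Wb[-1]

steer_small = rng.normal(size=(d_small,))                # a small-model steering direction
steer_large = steer_small @ W                            # mapped into the large model's space
print(steer_large.shape)                                 # (256,)
```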
Submitted 4 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
-
Can Large Language Models Challenge CNNs in Medical Image Analysis?
Authors:
Shibbir Ahmed,
Shahnewaz Karim Sakib,
Anindya Bijoy Das
Abstract:
This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated $CO_2$ emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.
Submitted 3 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Optimizing Deep Learning for Skin Cancer Classification: A Computationally Efficient CNN with Minimal Accuracy Trade-Off
Authors:
Abdullah Al Mamun,
Pollob Chandra Ray,
Md Rahat Ul Nasib,
Akash Das,
Jia Uddin,
Md Nurul Absur
Abstract:
The rapid advancement of deep learning in medical image analysis has greatly enhanced the accuracy of skin cancer classification. However, current state-of-the-art models, especially those based on transfer learning like ResNet50, come with significant computational overhead, rendering them impractical for deployment in resource-constrained environments. This study proposes a custom CNN model that achieves a 96.7% reduction in parameters (from 23.9 million in ResNet50 to 692,000) while maintaining a classification accuracy deviation of less than 0.022%. Our empirical analysis of the HAM10000 dataset reveals that although transfer learning models provide a marginal accuracy improvement of approximately 0.022%, they result in a staggering 13,216.76% increase in FLOPs, considerably raising computational costs and inference latency. In contrast, our lightweight CNN architecture, which encompasses only 30.04 million FLOPs compared to ResNet50's 4.00 billion, significantly reduces energy consumption, memory footprint, and inference time. These findings underscore the trade-off between the complexity of deep models and their real-world feasibility, positioning our optimized CNN as a practical solution for mobile and edge-based skin cancer diagnostics.
Submitted 27 May, 2025;
originally announced May 2025.
-
Private kNN-VC: Interpretable Anonymization of Converted Speech
Authors:
Carlos Franzreb,
Arnab Das,
Tim Polzehl,
Sebastian Möller
Abstract:
Speaker anonymization seeks to conceal a speaker's identity while preserving the utility of their speech. The achieved privacy is commonly evaluated with a speaker recognition model trained on anonymized speech. Although this represents a strong attack, it is unclear which aspects of speech are exploited to identify the speakers. Our research sets out to unveil these aspects. It starts with kNN-VC, a powerful voice conversion model that performs poorly as an anonymization system, presumably because of prosody leakage. To test this hypothesis, we extend kNN-VC with two interpretable components that anonymize the duration and variation of phones. These components increase privacy significantly, proving that the studied prosodic factors encode speaker identity and are exploited by the privacy attack. Additionally, we show that changes in the target selection algorithm considerably influence the outcome of the privacy attack.
Submitted 23 May, 2025;
originally announced May 2025.
-
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Authors:
Hitesh Laxmichand Patel,
Amit Agarwal,
Arion Das,
Bhargava Kumar,
Srikant Panda,
Priyaranjan Pattnayak,
Taki Hasan Rafi,
Tejaswini Kumar,
Dong-Kyu Chae
Abstract:
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
Submitted 22 May, 2025;
originally announced May 2025.
-
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
Authors:
Amartya Chakraborty,
Paresh Dashore,
Nadia Bathaee,
Anmol Jain,
Anirban Das,
Shi-Xiong Zhang,
Sambit Sahu,
Milind Naphade,
Genta Indra Winata
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls-particularly in multi-turn conversations-remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning-such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.
Submitted 22 May, 2025;
originally announced May 2025.
-
Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection
Authors:
Chenxu Guo,
Jiachen Lian,
Xuanru Zhou,
Jinming Zhang,
Shuhe Li,
Zongli Ye,
Hwi Joo Park,
Anaisha Das,
Zoe Ezzes,
Jet Vonk,
Brittany Morin,
Rian Bogley,
Lisa Wauters,
Zachary Miller,
Maria Gorno-Tempini,
Gopala Anumanchipalli
Abstract:
Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
Submitted 24 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models
Authors:
Aryan Das,
Tanishq Rachamalla,
Pravendra Singh,
Koushik Biswas,
Vinay Kumar Verma,
Swalpa Kumar Roy
Abstract:
We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) datasets that focus solely on classification tasks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding of hyperspectral imagery. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field.
Submitted 17 May, 2025;
originally announced May 2025.
-
An algebraic theory of ω-regular languages, via μν-expressions
Authors:
Anupam Das,
Abhishek De
Abstract:
Alternating parity automata (APAs) provide a robust formalism for modelling infinite behaviours and play a central role in formal verification. Despite their widespread use, the algebraic theory underlying APAs has remained largely unexplored. In recent work, a notation for non-deterministic finite automata (NFAs) was introduced, along with a sound and complete axiomatisation of their equational theory via right-linear algebras. In this paper, we extend that line of work, in particular to the setting of infinite words. We present a dualised syntax, yielding a notation for APAs based on right-linear lattice expressions, and provide a natural axiomatisation of their equational theory with respect to the standard language model of ω-regular languages. The design of this axiomatisation is guided by the theory of fixed point logics; in fact, the completeness factors cleanly through the completeness of the linear-time μ-calculus.
Submitted 15 May, 2025;
originally announced May 2025.
-
Cyclic system for an algebraic theory of alternating parity automata
Authors:
Anupam Das,
Abhishek De
Abstract:
$ω$-regular languages are a natural extension of the regular languages to the setting of infinite words. Likewise, they are recognised by a host of automata models, one of the most important being Alternating Parity Automata (APAs), a generalisation of Büchi automata that symmetrises both the transitions (with universal as well as existential branching) and the acceptance condition (by a parity condition).
In this work we develop a cyclic proof system manipulating APAs, represented by an algebraic notation of Right Linear Lattice expressions. This syntax dualises that of previously introduced Right Linear Algebras, which comprised a notation for non-deterministic finite automata (NFAs). This dualisation induces a symmetry in the proof systems we design, with lattice operations behaving dually on each side of the sequent. Our main result is the soundness and completeness of our system for $ω$-language inclusion, heavily exploiting game theoretic techniques from the theory of $ω$-regular languages.
Submitted 13 May, 2025;
originally announced May 2025.
-
Behind Maya: Building a Multilingual Vision Language Model
Authors:
Nahid Alam,
Karthik Reddy Kanjula,
Surya Guthikonda,
Timothy Chung,
Bala Krishna S Vegesna,
Abhipsha Das,
Anthony Susevski,
Ryan Sze-Yin Chan,
S M Iftekhar Uddin,
Shayekh Bin Islam,
Roshan Santhosh,
Snegha A,
Drishti Sharma,
Chen Liu,
Isha Chaturvedi,
Genta Indra Winata,
Ashvanth. S,
Snehanshu Mukherjee,
Alham Fikri Aji
Abstract:
In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
Submitted 15 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Privacy Challenges In Image Processing Applications
Authors:
Maneesha,
Bharat Gupta,
Rishabh Sethi,
Charvi Adita Das
Abstract:
As image processing systems proliferate, privacy concerns intensify given the sensitive personal information contained in images. This paper examines privacy challenges in image processing and surveys emerging privacy-preserving techniques including differential privacy, secure multiparty computation, homomorphic encryption, and anonymization. Key applications with heightened privacy risks include healthcare, where medical images contain patient health data, and surveillance systems that can enable unwarranted tracking. Differential privacy offers rigorous privacy guarantees by injecting controlled noise, while MPC facilitates collaborative analytics without exposing raw data inputs. Homomorphic encryption enables computations on encrypted data and anonymization directly removes identifying elements. However, balancing privacy protections and utility remains an open challenge. Promising future directions identified include quantum-resilient cryptography, federated learning, dedicated hardware, and conceptual innovations like privacy by design. Ultimately, a holistic effort combining technological innovations, ethical considerations, and policy frameworks is necessary to uphold the fundamental right to privacy as image processing capabilities continue advancing rapidly.
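As a concrete instance of the "controlled noise" of differential privacy mentioned above, the Laplace mechanism adds noise scaled to a query's sensitivity. The sketch below applies it to an aggregate image statistic; the data, sensitivity bound, and privacy budget are illustrative and are not drawn from any system surveyed here.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a noisy answer satisfying epsilon-differential privacy (Laplace mechanism)."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: privately release the mean pixel intensity of a batch of images.
# With intensities in [0, 255], replacing one of n images shifts the mean by at most 255 / n.
images = np.random.randint(0, 256, size=(100, 64, 64))
noisy_mean = laplace_mechanism(images.mean(), sensitivity=255 / len(images), epsilon=0.5)
```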
Submitted 7 May, 2025;
originally announced May 2025.
-
Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation
Authors:
Hengyuan Hu,
Aniket Das,
Dorsa Sadigh,
Nima Anari
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce \emph{Autospeculative Decoding} (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O} (K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
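For context, the sequential bottleneck referred to above stems from the standard DDPM ancestral sampling update, which must be applied for $t = K, \dots, 1$ in order (written here in the usual DDPM notation, not the paper's reparametrization):

```latex
x_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z_t,
\qquad z_t \sim \mathcal{N}(0, I), \quad \bar{\alpha}_t = \prod_{s \le t} \alpha_s .
```

Each step consumes the output of the previous one, which is exactly the $K$-step chain that autospeculative decoding parallelizes.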
Submitted 6 May, 2025;
originally announced May 2025.
-
Real-Time Wayfinding Assistant for Blind and Low-Vision Users
Authors:
Dabbrata Das,
Argho Deb Das,
Farhan Sadaf
Abstract:
Navigating unfamiliar places continues to be one of the most persistent and essential everyday obstacles for those who are blind or have limited vision (BLV). Existing assistive technologies, such as GPS-based navigation systems, AI-powered smart glasses, and sonar-equipped canes, often face limitations in real-time obstacle avoidance, precise localization, and adaptability to dynamic surroundings. To investigate potential solutions, we introduced PathFinder, a novel map-less navigation system that explores different models for understanding 2D images, including Vision Language Models (VLMs), Large Language Models (LLMs), and employs monocular depth estimation for free-path detection. Our approach integrates a Depth-First Search (DFS) algorithm on depth images to determine the longest obstacle-free path, ensuring optimal route selection while maintaining computational efficiency. We conducted comparative evaluations against existing AI-powered navigation methods and performed a usability study with BLV participants. The results demonstrate that PathFinder achieves a favorable balance between accuracy, computational efficiency, and real-time responsiveness. Notably, it reduces mean absolute error (MAE) and improves decision-making speed in outdoor navigation compared to AI-based alternatives. Participant feedback emphasizes the system's usability and effectiveness in outside situations, but also identifies issues in complicated indoor locations and low-light conditions. Usability testing revealed that 73% of participants understood how to use the app in about a minute, and 80% praised its balance of accuracy, quick response, and overall convenience.
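The longest obstacle-free path search can be sketched as follows; this is a dynamic-programming equivalent of the DFS described above, and the depth threshold, connectivity, and array shapes are illustrative assumptions rather than PathFinder's actual implementation.

```python
import numpy as np

def longest_free_path(depth: np.ndarray, min_clearance: float = 2.0) -> int:
    """Length (in rows) of the longest run of 'free' pixels reaching the bottom (camera) row,
    where a pixel is free if its estimated depth exceeds min_clearance metres and each step
    moves one row down to one of the three nearest columns."""
    free = depth > min_clearance
    rows, cols = free.shape
    best = np.zeros(free.shape, dtype=int)   # longest free run ending at each pixel

    for r in range(rows):                    # top-down, so the row above is already filled in
        for c in range(cols):
            if not free[r, c]:
                continue
            above = best[r - 1, max(c - 1, 0):min(c + 2, cols)].max() if r > 0 else 0
            best[r, c] = 1 + above

    return int(best[-1].max())               # best run that reaches the bottom row

# Synthetic 480x640 depth map in metres; a larger value indicates a clearer forward route.
steps = longest_free_path(np.random.uniform(0.5, 5.0, size=(480, 640)))
```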
Submitted 29 April, 2025;
originally announced April 2025.
-
Enhancing short-term traffic prediction by integrating trends and fluctuations with attention mechanism
Authors:
Adway Das,
Agnimitra Sengupta,
S. Ilgin Guler
Abstract:
Traffic flow prediction is a critical component of intelligent transportation systems, yet accurately forecasting traffic remains challenging due to the interaction between long-term trends and short-term fluctuations. Standard deep learning models often struggle with these challenges because their architectures inherently smooth over fine-grained fluctuations while focusing on general trends. This limitation arises from low-pass filtering effects, gate biases favoring stability, and memory update mechanisms that prioritize long-term information retention. To address these shortcomings, this study introduces a hybrid deep learning framework that integrates both long-term trend and short-term fluctuation information using two input features processed in parallel, designed to capture complementary aspects of traffic flow dynamics. Further, our approach leverages attention mechanisms, specifically Bahdanau attention, to selectively focus on critical time steps within traffic data, enhancing the model's ability to predict congestion and other transient phenomena. Experimental results demonstrate that features learned from both branches are complementary, significantly improving the goodness-of-fit statistics across multiple prediction horizons compared to a baseline model. Notably, the attention mechanism enhances short-term forecast accuracy by directly targeting immediate fluctuations, though challenges remain in fully integrating long-term trends. This framework can contribute to more effective congestion mitigation and urban mobility planning by advancing the robustness and precision of traffic prediction models.
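Bahdanau (additive) attention scores each encoder time step against the current decoder state as $\mathrm{score}(q, k_i) = v^{\top}\tanh(W_q q + W_k k_i)$; a minimal NumPy sketch follows, with dimensions and random weights that are purely illustrative rather than the paper's trained model.

```python
import numpy as np

def bahdanau_attention(query, keys, W_q, W_k, v):
    """Additive attention: returns softmax weights over time steps and the context vector."""
    scores = np.tanh(query @ W_q + keys @ W_k) @ v   # shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the T time steps
    context = weights @ keys                         # weighted sum of encoder states
    return weights, context

# Illustrative shapes: T = 12 time steps of traffic observations, hidden size d = 32.
T, d = 12, 32
rng = np.random.default_rng(0)
w, ctx = bahdanau_attention(rng.normal(size=d), rng.normal(size=(T, d)),
                            rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                            rng.normal(size=d))
```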
Submitted 28 April, 2025;
originally announced April 2025.
-
Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
Authors:
Anindya Bijoy Das,
Shibbir Ahmed,
Shahnewaz Karim Sakib
Abstract:
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.
Submitted 26 April, 2025;
originally announced April 2025.
-
Closed-Form Expressions for I/O Relation in Zak-OTFS with Different Delay-Doppler Filters
Authors:
Arpan Das,
Fathima Jesbin,
Ananthanarayanan Chockalingam
Abstract:
The transceiver operations in the delay-Doppler (DD) domain in Zak-OTFS modulation, including DD domain filtering at the transmitter and receiver, involve twisted convolution operation. The twisted convolution operations give rise to multiple integrals in the end-to-end DD domain input-output (I/O) relation. The I/O relation plays a crucial role in performance evaluation and algorithm development for transceiver implementation. In this paper, we derive discrete DD domain closed-form expressions for the I/O relation and noise covariance in Zak-OTFS. We derive these expressions for sinc and Gaussian pulse shaping DD filters at the transmitter (Tx). On the receiver (Rx) side, three types of DD filters are considered, viz., $(i)$ Rx filter identical to Tx filter (referred to as `identical filtering'), $(ii)$ Rx filter matched to the Tx filter (referred to as `matched filtering'), and $(iii)$ Rx filter matched to both Tx filter and channel response (referred to as `channel matched filtering'). For all the above cases, except for the case of sinc identical filtering, we derive exact I/O relation and noise covariance expressions in closed-form. For the sinc identical filtering case, we derive approximate closed-form expressions which are shown to be accurate. Using the derived closed-form expressions, we evaluate the bit error performance of Zak-OTFS for different Tx/Rx filter configurations. Our results using Vehicular-A (Veh-A) channel model with fractional DDs show that, while matched filtering achieves slightly better or almost same performance as identical filtering, channel matched filtering achieves the best performance among the three.
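For readers unfamiliar with the operation behind these multiple integrals, the twisted convolution of two DD-domain functions $a(τ, ν)$ and $b(τ, ν)$ is commonly written as

```latex
(a \ast_{\sigma} b)(\tau, \nu) \;=\; \iint a(\tau', \nu')\, b(\tau - \tau', \nu - \nu')\,
e^{\,j 2\pi \nu' (\tau - \tau')}\, d\tau'\, d\nu' ,
```

so cascading the Tx filter, channel, and Rx filter nests such integrals, which is what the closed-form expressions derived in the paper collapse for the sinc and Gaussian filter choices.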
Submitted 26 April, 2025;
originally announced April 2025.
-
NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding
Authors:
Aniket Pal,
Sanket Biswas,
Alloy Das,
Ayush Lodh,
Priyanka Banerjee,
Soumitri Chattopadhyay,
Dimosthenis Karatzas,
Josep Llados,
C. V. Jawahar
Abstract:
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing their limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
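For reference, two of the reported ranking metrics have standard definitions (stated generically here, independent of NoTeS-Bank's specific evaluation protocol):

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},
\quad
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\,rel_i} - 1}{\log_2(i + 1)} .
```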
Submitted 12 April, 2025;
originally announced April 2025.
-
Towards deployment-centric multimodal AI beyond vision and language
Authors:
Xianyuan Liu,
Jiayang Zhang,
Shuo Zhou,
Thijs L. van der Plas,
Avish Vijayaraghavan,
Anastasiia Grishina,
Mengdie Zhuang,
Daniel Schofield,
Christopher Tomlinson,
Yuhan Wang,
Ruizhe Li,
Louisa van Zeeland,
Sina Tabakhi,
Cyndie Demeocq,
Xiang Li,
Arunav Das,
Orlando Timmerman,
Thomas Baldwin-McDonald,
Jinge Wu,
Peizhen Bai,
Zahraa Al Sahili,
Omnia Alwazzan,
Thao N. Do,
Mohammod N. I. Suvon,
Angeline Wang
, et al. (23 additional authors not shown)
Abstract:
Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
Submitted 4 April, 2025;
originally announced April 2025.
-
LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems
Authors:
Zishuo Liu,
Carlos Rabat Villarreal,
Mostafa Rahgouy,
Amit Das,
Zheng Zhang,
Chang Ren,
Dongji Feng
Abstract:
Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved an fp_score (ranging from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
Submitted 3 April, 2025;
originally announced April 2025.
-
Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
Authors:
Sakhinana Sagar Srinivas,
Akash Das,
Shivam Gupta,
Venkataramana Runkana
Abstract:
We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
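CRITIC itself is not specified in enough detail in this abstract to reproduce, but the general pattern of compressing a key-value cache by token importance can be sketched as below; the importance proxy (accumulated attention mass), shapes, and keep ratio are illustrative assumptions only.

```python
import numpy as np

def compress_kv_cache(keys, values, importance, keep_ratio=0.5):
    """Keep only the top-scoring fraction of cached tokens.
    keys, values: (T, d) cached projections; importance: (T,) score per cached token,
    e.g. the attention mass each token has accumulated so far."""
    k = max(1, int(len(keys) * keep_ratio))
    keep = np.argsort(importance)[-k:]     # indices of the k most important tokens
    keep.sort()                            # preserve the original token order
    return keys[keep], values[keep], keep

# Example: halve a 1024-token cache while retaining the most-attended entries.
T, d = 1024, 64
rng = np.random.default_rng(1)
K, V, kept = compress_kv_cache(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                               rng.random(T), keep_ratio=0.5)
```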
Submitted 20 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Agentic Multimodal AI for Hyperpersonalized B2B and B2C Advertising in Competitive Markets: An AI-Driven Competitive Advertising Framework
Authors:
Sakhinana Sagar Srinivas,
Akash Das,
Shivam Gupta,
Venkataramana Runkana
Abstract:
The growing use of foundation models (FMs) in real-world applications demands adaptive, reliable, and efficient strategies for dynamic markets. In the chemical industry, AI-discovered materials drive innovation, but commercial success hinges on market adoption, requiring FM-driven advertising frameworks that operate in-the-wild. We present a multilingual, multimodal AI framework for autonomous, hyper-personalized advertising in B2B and B2C markets. By integrating retrieval-augmented generation (RAG), multimodal reasoning, and adaptive persona-based targeting, our system generates culturally relevant, market-aware ads tailored to shifting consumer behaviors and competition. Validation combines real-world product experiments with a Simulated Humanistic Colony of Agents to model consumer personas, optimize strategies at scale, and ensure privacy compliance. Synthetic experiments mirror real-world scenarios, enabling cost-effective testing of ad strategies without risky A/B tests. Combining structured retrieval-augmented reasoning with in-context learning (ICL), the framework boosts engagement, prevents market cannibalization, and maximizes ROAS. This work bridges AI-driven innovation and market adoption, advancing multimodal FM deployment for high-stakes decision-making in commercial marketing.
Submitted 31 March, 2025;
originally announced April 2025.
-
Gemma 3 Technical Report
Authors:
Gemma Team,
Aishwarya Kamath,
Johan Ferret,
Shreya Pathak,
Nino Vieillard,
Ramona Merhej,
Sarah Perrin,
Tatiana Matejovicova,
Alexandre Ramé,
Morgane Rivière,
Louis Rouillard,
Thomas Mesnard,
Geoffrey Cideron,
Jean-bastien Grill,
Sabela Ramos,
Edouard Yvinec,
Michelle Casbon,
Etienne Pot,
Ivo Penchev,
Gaël Liu,
Francesco Visin,
Kathleen Kenealy,
Lucas Beyer,
Xiaohai Zhai,
Anton Tsitsulin
, et al. (191 additional authors not shown)
Abstract:
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages, and longer context (at least 128K tokens). We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
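To make the memory argument concrete, interleaving sliding-window (local) layers with occasional global layers shrinks the KV cache roughly in proportion to the window size; the ratio and window length below are illustrative placeholders, not the released Gemma 3 configuration.

```python
def kv_cache_tokens(num_layers: int, context_len: int,
                    local_per_global: int = 5, window: int = 1024) -> tuple[int, int]:
    """Per-sequence KV-cache size (in cached token positions) for an all-global stack
    versus a pattern where only every (local_per_global + 1)-th layer sees the full context."""
    all_global = num_layers * context_len
    interleaved = sum(
        context_len if i % (local_per_global + 1) == 0 else min(window, context_len)
        for i in range(num_layers)
    )
    return all_global, interleaved

# Example: 32 layers at a 128K-token context.
full, mixed = kv_cache_tokens(32, 131_072)
print(f"KV-cache reduction factor ~ {full / mixed:.1f}x")
```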
Submitted 25 March, 2025;
originally announced March 2025.
-
ON-Traffic: An Operator Learning Framework for Online Traffic Flow Estimation and Uncertainty Quantification from Lagrangian Sensors
Authors:
Jake Rap,
Amritam Das
Abstract:
Accurate traffic flow estimation and prediction are critical for the efficient management of transportation systems, particularly under increasing urbanization. Traditional methods relying on static sensors often suffer from limited spatial coverage, while probe vehicles provide richer, albeit sparse and irregular data. This work introduces ON-Traffic, a novel deep operator network and receding-horizon, learning-based framework tailored for online estimation of the spatio-temporal traffic state, along with quantified uncertainty, using measurements from moving probe vehicles and downstream boundary inputs. Our framework is evaluated on both numerical and simulation datasets, showcasing its ability to handle irregular, sparse input data, adapt to time-shifted scenarios, and provide well-calibrated uncertainty estimates. The results demonstrate that the model captures complex traffic phenomena, including shockwaves and congestion propagation, while maintaining robustness to noise and sensor dropout. These advancements present a significant step toward online, adaptive traffic management systems.
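Deep operator networks approximate an operator $G$ mapping an input function $u$ (here, sparse probe measurements) to an output function evaluated at query points $y$ (here, space-time locations). The canonical branch-trunk form, which ON-Traffic presumably adapts, is

```latex
G_{\theta}(u)(y) \;\approx\; \sum_{k=1}^{p} b_k\big(u(x_1), \dots, u(x_m)\big)\, t_k(y),
```

where the branch net $b$ encodes the input function sampled at fixed sensor locations $x_1, \dots, x_m$ and the trunk net $t$ encodes the query coordinates.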
Submitted 18 March, 2025;
originally announced March 2025.
-
Semi-Streaming Algorithms for Graph Property Certification
Authors:
Avinandan Das,
Pierre Fraigniaud,
Ami Paz,
Adi Rosen
Abstract:
We introduce the certification of solutions to graph problems when access to the input is restricted. This topic has received a lot of attention in the distributed computing setting, and we introduce it here in the context of streaming algorithms, where the input is too large to be stored in memory.
Given a graph property $\mathrm{P}$, a streaming certification scheme for $\mathrm{P}$ is a prover-verifier pair where the prover is a computationally unlimited but non-trustable oracle, and the verifier is a streaming algorithm. For any input graph, the prover provides the verifier with a certificate. The verifier then receives the input graph as a stream of edges in an adversarial order, and must check whether the certificate is indeed a proof that the input graph satisfies $\mathrm{P}$. The main complexity measure for a streaming certification scheme is its space complexity, defined as the sum of the size of the certificate provided by the oracle and the memory space required by the verifier.
We give streaming certification schemes for several graph properties, including maximum matching, diameter, degeneracy, and coloring, with space complexity matching the requirement of semi-streaming, i.e., space complexity $O(n\,\mathrm{polylog}\,n)$ for $n$-node graphs. None of these problems admits a semi-streaming algorithm, showing that, in the (semi-)streaming setting as well, certification is sometimes easier than computation (as is the case for $NP$). For each of these properties, we provide upper and lower bounds on the space complexity of the corresponding certification schemes, many being tight up to logarithmic multiplicative factors. We also show that some graph properties are hard for streaming certification, in the sense that they cannot be certified in semi-streaming, as they require $Ω(n^2)$-bit certificates.
Submitted 17 March, 2025;
originally announced March 2025.
-
New Vertex Ordering Characterizations of Circular-Arc Bigraphs
Authors:
Indrajit Paul,
Ashok Kumar Das
Abstract:
In this article, we present two new characterizations of circular-arc bigraphs based on their vertex ordering. Also, we provide a characterization of circular-arc bigraphs in terms of forbidden patterns with respect to a particular ordering of their vertices.
Submitted 8 April, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
Authors:
Shahnewaz Karim Sakib,
Anindya Bijoy Das,
Shibbir Ahmed
Abstract:
Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary's confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.
Submitted 11 March, 2025;
originally announced March 2025.
-
What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models
Authors:
Abhipsha Das,
Nicholas Lourie,
Siavash Golkar,
Mariel Pettee
Abstract:
The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.
Submitted 28 May, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs
Authors:
Aidan Ferguson,
Perry Gibson,
Lara D'Agata,
Parker McLeod,
Ferhat Yaman,
Amitabh Das,
Ian Colbert,
José Cano
Abstract:
The deployment of deep neural networks (DNNs) in privacy-sensitive environments is constrained by computational overheads in fully homomorphic encryption (FHE). This paper explores unstructured sparsity in FHE matrix multiplication schemes as a means of reducing this burden while maintaining model accuracy requirements. We demonstrate that sparsity can be exploited in arbitrary matrix multiplication, providing runtime benefits compared to a baseline naive algorithm at all sparsity levels. This is a notable departure from the plaintext domain, where there is a trade-off between sparsity and the overhead of the sparse multiplication algorithm. In addition, we propose three sparse multiplication schemes in FHE based on common plaintext sparse encodings. We demonstrate the performance gain is scheme-invariant; however, some sparse schemes vastly reduce the memory storage requirements of the encrypted matrix at high sparsity values. Our proposed sparse schemes yield an average performance gain of 2.5x at 50% unstructured sparsity, with our multi-threading scheme providing a 32.5x performance increase over the equivalent single-threaded sparse computation when utilizing 64 cores.
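In the plaintext domain, the benefit of unstructured sparsity comes from skipping multiplications against zero entries, as in the CSR-style sketch below; under FHE each multiply-accumulate becomes a far more expensive ciphertext operation, which is why skipping pays off at every sparsity level. This sketch uses no FHE library and only illustrates the skipping pattern, not the proposed encrypted schemes.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a CSR matrix; only the nonzero entries of A are ever multiplied."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for idx in range(indptr[row], indptr[row + 1]):
            y[row] += data[idx] * x[indices[idx]]   # under FHE: one ciphertext mult + add
    return y

# 50% unstructured sparsity, mirroring the operating point reported above.
A = sparse_random(64, 64, density=0.5, format="csr", random_state=0)
x = np.random.default_rng(0).normal(size=64)
assert np.allclose(csr_matvec(A.indptr, A.indices, A.data, x), A @ x)
```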
Submitted 3 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.