-
HOTA: Hamiltonian framework for Optimal Transport Advection
Authors:
Nazar Buzun,
Daniil Shlenskii,
Maxim Bobrin,
Dmitry V. Dylov
Abstract:
Optimal transport (OT) has become a natural framework for guiding probability flows. Yet, the majority of recent generative models assume trivial geometry (e.g., Euclidean) and rely on strong density-estimation assumptions, yielding trajectories that do not respect the true principles of optimality in the underlying manifold. We present Hamiltonian Optimal Transport Advection (HOTA), a Hamilton-Jacobi-Bellman-based method that tackles the dual dynamical OT problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization. Our approach effectively evades the need for explicit density modeling, performing well even when the cost functionals are non-smooth. Empirically, HOTA outperforms all baselines on standard benchmarks, as well as on custom datasets with non-differentiable costs, both in terms of feasibility and optimality.
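As a schematic reference (our illustrative summary, not the paper's exact statement), the dual dynamical OT problem for quadratic cost maximizes over a Kantorovich potential $\varphi$ subject to a Hamilton-Jacobi-Bellman constraint: $$ \sup_{\varphi} \int \varphi(1,\cdot)\,d\mu_1 - \int \varphi(0,\cdot)\,d\mu_0 \quad \text{s.t.} \quad \partial_t \varphi + \tfrac{1}{2}\|\nabla_x \varphi\|^2 \leq 0, $$ with the constraint holding with equality ($\partial_t \varphi + H(x, \nabla_x \varphi) = 0$ for a general Hamiltonian $H$) along optimal trajectories, which advect samples via $\dot{x} = \nabla_x \varphi(t, x)$.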
Submitted 23 July, 2025;
originally announced July 2025.
-
Monotone Circuit Complexity of Matching
Authors:
Bruno Cavalar,
Mika Göös,
Artur Riazanov,
Anastasia Sofronova,
Dmitry Sokolov
Abstract:
We show that the perfect matching function on $n$-vertex graphs requires monotone circuits of size $\smash{2^{n^{Ω(1)}}}$. This improves on the $n^{Ω(\log n)}$ lower bound of Razborov (1985). Our proof uses the standard approximation method together with a new sunflower lemma for matchings.
Submitted 21 July, 2025;
originally announced July 2025.
-
Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation
Authors:
Dmitry Bylinkin,
Mikhail Aleksandrov,
Savelii Chezhegov,
Aleksandr Beznosikov
Abstract:
Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.
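For concreteness, one standard way to obtain a nonconvex-strongly concave saddle-point problem from a weighted-residual PINN loss (a sketch under our own assumptions; the paper's exact objective may differ) is to maximize per-residual multipliers $\lambda_i$ with a quadratic regularizer: $$ \min_{\theta} \max_{\lambda} \; \sum_{i} \lambda_i\, r_i(\theta)^2 - \frac{\rho}{2} \|\lambda\|^2, $$ where $r_i(\theta)$ are the PDE and boundary residuals at collocation points; the $-\frac{\rho}{2}\|\lambda\|^2$ term makes the inner maximization strongly concave, so primal-dual methods with convergence guarantees become applicable.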
Submitted 21 July, 2025;
originally announced July 2025.
-
Model Simplification through Refinement
Authors:
Dmitry Brodsky,
Benjamin Watson
Abstract:
As modeling and visualization applications proliferate, there arises a need to simplify large polygonal models at interactive rates. Unfortunately, existing polygon mesh simplification algorithms are not well suited for this task because they are either too slow (requiring the simplified model to be pre-computed) or produce models of too poor a quality. These shortcomings become particularly acute when models are extremely large. We present an algorithm suitable for the simplification of large models at interactive speeds. The algorithm is fast, guarantees displayable results within a given time limit, and produces output of good quality. Inspired by splitting algorithms from the vector quantization literature, we simplify models in reverse, beginning with an extremely coarse approximation and refining it. Approximations of surface curvature guide the simplification process. Previously produced simplifications can be further refined by using them as input to the algorithm.
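A minimal sketch of such a budgeted, curvature-guided refinement loop (hypothetical Python; the regions(), split(), and curvature_error() interface is our own illustration, not the paper's data structures):

    import heapq, itertools, time

    def refine(mesh, reference, budget_seconds):
        # Refine a coarse approximation until the time budget expires,
        # always splitting the region with the largest curvature-based
        # error estimate (splitting in the spirit of vector quantization).
        deadline = time.monotonic() + budget_seconds
        tie = itertools.count()  # tie-breaker so regions are never compared
        heap = [(-r.curvature_error(reference), next(tie), r) for r in mesh.regions()]
        heapq.heapify(heap)
        while heap and time.monotonic() < deadline:
            _, _, region = heapq.heappop(heap)
            for child in region.split():  # replace the region with finer pieces
                mesh.add(child)
                heapq.heappush(heap, (-child.curvature_error(reference), next(tie), child))
        return mesh  # displayable whenever we stop; can be refined further later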
Submitted 20 July, 2025;
originally announced July 2025.
-
Dispute Resolution in Peer Review with Abstract Argumentation and OWL DL
Authors:
Ildar Baimuratov,
Elena Lisanyuk,
Dmitry Prokudin
Abstract:
The peer review process for scientific publications faces significant challenges due to the increasing volume of submissions and inherent reviewer biases. While artificial intelligence offers the potential to facilitate the process, it also risks perpetuating biases present in training data. This research addresses these challenges by applying formal methods from argumentation theory to support transparent and unbiased dispute resolution in peer review. Specifically, we conceptualize scientific peer review as a single mixed argumentative dispute between manuscript authors and reviewers and formalize it using abstract argumentation frameworks. We analyze the resulting peer review argumentation frameworks from semantic, graph-theoretic, and computational perspectives, showing that they are well-founded and decidable in linear time. These frameworks are then implemented using OWL DL and resolved with reasoning engines. We validate our approach by annotating a corpus of scientific peer reviews with abstract argumentation frameworks and applying a proof of concept to resolve the annotated disputes. The results demonstrate that integrating our method could enhance the quality of published work by providing a more rigorous and systematic approach to accounting for reviewer arguments.
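As a concrete illustration of the resolution step, a well-founded framework has a unique grounded extension, computable by fixed-point iteration; below is a self-contained Python sketch of this standard semantics (not the paper's OWL DL implementation; linear-time versions additionally maintain attacker counters):

    def grounded_extension(arguments, attacks):
        # arguments: a set; attacks: a set of (attacker, target) pairs.
        # An argument is accepted once all of its attackers are rejected,
        # and rejected once some attacker is accepted.
        attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
        accepted, rejected = set(), set()
        changed = True
        while changed:
            changed = False
            for a in arguments - accepted - rejected:
                if attackers[a] <= rejected:
                    accepted.add(a); changed = True
                elif attackers[a] & accepted:
                    rejected.add(a); changed = True
        return accepted

    # Example: rebuttal b attacks review r, which attacks claim c -> {b, c}
    print(grounded_extension({"c", "r", "b"}, {("r", "c"), ("b", "r")}))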
Submitted 18 July, 2025;
originally announced July 2025.
-
LightAutoDS-Tab: Multi-AutoML Agentic System for Tabular Data
Authors:
Aleksey Lapin,
Igor Hromov,
Stanislav Chumakov,
Mile Mitrovic,
Dmitry Simakov,
Nikolay O. Nikitin,
Andrey V. Savchenko
Abstract:
AutoML has advanced in handling complex tasks using the integration of LLMs, yet its efficiency remains limited by dependence on specific underlying tools. In this paper, we introduce LightAutoDS-Tab, a multi-AutoML agentic system for tasks with tabular data, which combines LLM-based code generation with several AutoML tools. Our approach improves the flexibility and robustness of pipeline design, outperforming state-of-the-art open-source solutions on several data science tasks from Kaggle. The code of LightAutoDS-Tab is available in the open repository https://github.com/sb-ai-lab/LADS
Submitted 17 July, 2025;
originally announced July 2025.
-
Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control
Authors:
Anton Klenitskiy,
Konstantin Polev,
Daria Denisova,
Alexey Vasilev,
Dmitry Simakov,
Gleb Gusev
Abstract:
Many current state-of-the-art models for sequential recommendations are based on transformer architectures. Interpretation and explanation of such black-box models is an important research question, as a better understanding of their internals can help understand, influence, and control their behavior, which is very important in a variety of real-world applications. Recently, sparse autoencoders (SAE) have been shown to be a promising unsupervised approach for extracting interpretable features from language models. These autoencoders learn to reconstruct hidden states of the transformer's internal layers from sparse linear combinations of directions in their activation space.
This paper is focused on the application of SAE to the sequential recommendation domain. We show that this approach can be successfully applied to the transformer trained on a sequential recommendation task: learned directions turn out to be more interpretable and monosemantic than the original hidden state dimensions. Moreover, we demonstrate that the features learned by SAE can be used to effectively and flexibly control the model's behavior, providing end-users with a straightforward method to adjust their recommendations to different custom scenarios and contexts.
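A minimal sketch of such an autoencoder (PyTorch-style Python; the paper's exact architecture and hyperparameters may differ):

    import torch, torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Reconstructs transformer hidden states from sparse, non-negative
        # combinations of learned directions (the rows of the decoder).
        def __init__(self, d_model, d_dict):
            super().__init__()
            self.enc = nn.Linear(d_model, d_dict)
            self.dec = nn.Linear(d_dict, d_model)

        def forward(self, h):
            z = torch.relu(self.enc(h))  # sparse feature activations
            return self.dec(z), z

    def sae_loss(model, h, l1_coef=1e-3):
        h_hat, z = model(h)  # the L1 penalty on z induces sparsity
        return ((h_hat - h) ** 2).mean() + l1_coef * z.abs().mean()

The control described in the paper then plausibly amounts to scaling or clamping selected coordinates of z before decoding back into the hidden state.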
Submitted 16 July, 2025;
originally announced July 2025.
-
Searching for Falsified Clause in Random (log n)-CNFs is Hard for Randomized Communication
Authors:
Artur Riazanov,
Anastasia Sofronova,
Dmitry Sokolov,
Weiqiang Yuan
Abstract:
We show that for a randomly sampled unsatisfiable $O(\log n)$-CNF over $n$ variables the randomized two-party communication cost of finding a clause falsified by the given variable assignment is linear in $n$.
Submitted 16 July, 2025;
originally announced July 2025.
-
SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
Authors:
Pavel Adamenko,
Mikhail Ivanov,
Aidar Valeev,
Rodion Levichev,
Pavel Zadorozhny,
Ivan Lopatin,
Dmitry Babayev,
Alena Fenogenova,
Valentin Malykh
Abstract:
The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues: e.g., in SWE-bench, 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks, with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power across state-of-the-art models. We report performance for a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
Submitted 17 July, 2025; v1 submitted 15 July, 2025;
originally announced July 2025.
-
TAT: Temporal-Aligned Transformer for Multi-Horizon Peak Demand Forecasting
Authors:
Zhiyuan Zhao,
Sitan Yang,
Kin G. Olivares,
Boris N. Oreshkin,
Stan Vitebsky,
Michael W. Mahoney,
B. Aditya Prakash,
Dmitry Efimov
Abstract:
Multi-horizon time series forecasting has many practical applications such as demand forecasting. Accurate demand prediction is critical to help make buying and inventory decisions for supply chain management of e-commerce and physical retailers, and such predictions are typically required for future horizons extending tens of weeks. This is especially challenging during high-stakes sales events, when demand peaks are particularly difficult to predict accurately. However, these events are important not only for managing supply chain operations but also for ensuring a seamless shopping experience for customers. To address this challenge, we propose the Temporal-Aligned Transformer (TAT), a multi-horizon forecaster leveraging a priori known context variables, such as holiday and promotion event information, for improving predictive performance. Our model consists of an encoder and decoder, both embedded with a novel Temporal Alignment Attention (TAA), designed to learn context-dependent alignment for peak demand forecasting. We conduct extensive empirical analysis on two large-scale proprietary datasets from a large e-commerce retailer. We demonstrate that TAT brings up to 30% accuracy improvement on peak demand forecasting while maintaining competitive overall performance compared to other state-of-the-art methods.
Submitted 14 July, 2025;
originally announced July 2025.
-
Robust RL Control for Bipedal Locomotion with Closed Kinematic Chains
Authors:
Egor Maslennikov,
Eduard Zaliaev,
Nikita Dudorov,
Oleg Shamanin,
Dmitry Karanov,
Gleb Afanasev,
Alexey Burkov,
Egor Lygin,
Simeon Nedelchev,
Evgeny Ponomarev
Abstract:
Developing robust locomotion controllers for bipedal robots with closed kinematic chains presents unique challenges, particularly since most reinforcement learning (RL) approaches simplify these parallel mechanisms into serial models during training. We demonstrate that this simplification significantly impairs sim-to-real transfer by failing to capture essential aspects such as joint coupling, friction dynamics, and motor-space control characteristics. In this work, we present an RL framework that explicitly incorporates closed-chain dynamics and validate it on our custom-built robot TopA. Our approach enhances policy robustness through symmetry-aware loss functions, adversarial training, and targeted network regularization. Experimental results demonstrate that our integrated approach achieves stable locomotion across diverse terrains, significantly outperforming methods based on simplified kinematic models.
Submitted 14 July, 2025;
originally announced July 2025.
-
Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
Authors:
Ekaterina Borodich,
Dmitry Kovalev
Abstract:
In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function $\min_x f(x)$. Recently, several adaptive gradient methods, including GRAAL (Malitsky, 2020), have been developed. These methods estimate the local curvature of the objective function to compute stepsizes, attain the standard convergence rate $\mathcal{O}(1/k)$ of fixed-stepsize gradient descent for Lipschitz-smooth functions, and do not require any line search procedures or hyperparameter tuning. However, a natural question arises: is it possible to accelerate the convergence of these algorithms to match the optimal rate $\mathcal{O}(1/k^2)$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made (Li and Lan, 2023), the capabilities of the existing accelerated algorithms to adapt to the curvature of the objective function are highly limited. Consequently, we provide a positive answer to this question and develop GRAAL with Nesterov acceleration. We prove that our algorithm achieves the desired optimal convergence rate for Lipschitz-smooth functions. Moreover, in contrast to existing methods, it does so with an arbitrary, even excessively small, initial stepsize at the cost of a logarithmic additive term in the iteration complexity.
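For intuition, here is the non-accelerated curvature-adaptive stepsize rule in the spirit of adaptive gradient descent (Malitsky and Mishchenko, 2020), which GRAAL-type methods build on; this sketch omits the paper's Nesterov momentum sequences:

    import numpy as np

    def adaptive_gd(grad, x0, iters=1000, lam0=1e-10):
        # Stepsizes are estimated from local curvature ||dx|| / (2 ||dg||),
        # so even an excessively small initial stepsize recovers quickly,
        # with no line search or tuning.
        x_prev, g_prev = x0, grad(x0)
        lam_prev, theta = lam0, float("inf")
        x = x_prev - lam0 * g_prev
        for _ in range(iters):
            g = grad(x)
            num = np.linalg.norm(x - x_prev)
            den = 2.0 * np.linalg.norm(g - g_prev) + 1e-30
            lam = min(np.sqrt(1.0 + theta) * lam_prev, num / den)
            x_prev, g_prev = x, g
            x = x - lam * g
            theta, lam_prev = lam / lam_prev, lam
        return x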
Submitted 13 July, 2025;
originally announced July 2025.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3284 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it is now able to process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities unlocks new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs. cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Submitted 22 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Modern Methods in Associative Memory
Authors:
Dmitry Krotov,
Benjamin Hoover,
Parikshit Ram,
Bao Pham
Abstract:
Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
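As a taste of the modern formulation, dense associative memory retrieval is a few lines of Python (our sketch; beta is the inverse temperature controlling storage capacity, and the update is the one that links Hopfield networks to Transformer attention):

    import numpy as np

    def hopfield_retrieve(memories, query, beta=4.0, steps=5):
        # memories: (K, d) array whose rows are stored patterns.
        # Iterates x <- M^T softmax(beta * M x), converging to the
        # stored pattern nearest the query.
        x = query.astype(float)
        for _ in range(steps):
            a = beta * (memories @ x)
            a = np.exp(a - a.max())  # numerically stable softmax
            x = memories.T @ (a / a.sum())
        return x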
Submitted 8 July, 2025;
originally announced July 2025.
-
Assessing Linear Control Strategies for Zero-Speed Fin Roll Damping
Authors:
Nikita Savin,
Elena Ambrosovskaya,
Dmitry Romaev,
Anton Proskurnikov
Abstract:
Roll stabilization is a critical aspect of ship motion control, particularly for vessels operating in low-speed or zero-speed conditions, where traditional hydrodynamic fins lose their effectiveness. In this paper, we consider a roll damping system, developed by Navis JSC, based on two actively controlled zero-speed fins. Unlike conventional fin stabilizers, zero-speed fins employ a drag-based mechanism and active oscillations to generate stabilizing forces even when the vessel is stationary. We propose a simple linear control architecture that, however, accounts for nonlinear drag forces and actuator limitations. Simulation results on a high-fidelity vessel model used for HIL testing demonstrate the effectiveness of the proposed approach.
Submitted 8 July, 2025;
originally announced July 2025.
-
MedGemma Technical Report
Authors:
Andrew Sellergren,
Sahar Kazemzadeh,
Tiam Jaroensri,
Atilla Kiraly,
Madeleine Traverse,
Timo Kohlberger,
Shawn Xu,
Fayaz Jamil,
Cían Hughes,
Charles Lau,
Justin Chen,
Fereshteh Mahvar,
Liron Yatziv,
Tiffany Chen,
Bram Sterling,
Stefanie Anna Baby,
Susanna Maria Baby,
Jeremy Lai,
Samuel Schmidgall,
Lu Yang,
Kejia Chen,
Per Bjornsson,
Shashir Reddy,
Ryan Brush,
Kenneth Philbrick
, et al. (56 additional authors not shown)
Abstract:
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
Submitted 12 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction
Authors:
Pablo Alonso-Jiménez,
Pedro Ramoneda,
R. Oguz Araz,
Andrea Poltronieri,
Dmitry Bogdanov
Abstract:
Developing open-source foundation models is essential for advancing research in music audio understanding and ensuring access to powerful, multipurpose representations for music information retrieval. We present OMAR-RQ, a model trained with self-supervision via masked token classification methodologies using a large-scale dataset with over 330,000 hours of music audio. We experiment with different input features and quantization options, and achieve state-of-the-art performance in music tagging, pitch estimation, chord recognition, beat tracking, segmentation, and difficulty estimation among open self-supervised models. We open-source our training and evaluation pipelines and model weights, available at https://github.com/mtg/omar-rq.
Submitted 4 July, 2025;
originally announced July 2025.
-
Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach
Authors:
Elena Ryumina,
Maxim Markitantov,
Alexandr Axyonov,
Dmitry Ryumin,
Mikhail Dolgushin,
Alexey Karpov
Abstract:
Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.
Submitted 4 July, 2025; v1 submitted 2 July, 2025;
originally announced July 2025.
-
Confidence and Stability of Global and Pairwise Scores in NLP Evaluation
Authors:
Georgii Levtsov,
Dmitry Ustalov
Abstract:
With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
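For reference, fitting the Bradley-Terry model takes only the classical minorization-maximization update (a standard sketch, independent of the paper's code; wins[i, j] counts model i's wins over model j, with a zero diagonal and ties excluded):

    import numpy as np

    def bradley_terry(wins, iters=200):
        # MM update of Zermelo/Hunter for P(i beats j) = p_i / (p_i + p_j).
        n = wins.shape[0]
        p = np.ones(n)
        total_wins = wins.sum(axis=1)
        games = wins + wins.T  # games played per pair
        for _ in range(iters):
            denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
            p = total_wins / np.maximum(denom, 1e-12)
            p /= p.sum()  # fix the arbitrary scale
        return p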
Submitted 2 July, 2025;
originally announced July 2025.
-
A Neural Operator based on Dynamic Mode Decomposition
Authors:
Nikita Sakovich,
Dmitry Aksenov,
Ekaterina Pleshakova,
Sergey Gataullin
Abstract:
The development of scientific computation methods in conjunction with artificial intelligence technologies remains a hot research topic. Finding a balance between lightweight and accurate computations is a solid foundation for this direction. The study presents a neural operator based on the dynamic mode decomposition (DMD) algorithm, mapping functional spaces, which combines DMD and deep learning (DL) for efficient modeling of spatiotemporal processes. Solving PDEs for various initial and boundary conditions requires significant computational resources. The suggested method automatically extracts key modes and system dynamics and uses them to construct predictions, reducing computational costs compared to traditional numerical methods. The approach has demonstrated its efficiency through a comparative analysis of performance with its closest analogues, DeepONet and FNO, in approximating solutions of the heat equation, Laplace's equation, and Burgers' equation, where it achieves high reconstruction accuracy.
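For context, the classical (exact) DMD that the operator builds on fits in a few lines of Python (a standard sketch, independent of the paper's network components):

    import numpy as np

    def dmd(X, r):
        # X: snapshot matrix whose columns are states at consecutive times;
        # r: truncation rank. Returns DMD modes and eigenvalues; forecasts
        # follow by evolving the mode amplitudes with the eigenvalues.
        X1, X2 = X[:, :-1], X[:, 1:]
        U, s, Vh = np.linalg.svd(X1, full_matrices=False)
        U, s, Vh = U[:, :r], s[:r], Vh[:r]
        A_tilde = U.conj().T @ X2 @ Vh.conj().T / s  # low-rank evolution operator
        eigvals, W = np.linalg.eig(A_tilde)
        modes = X2 @ Vh.conj().T / s @ W  # exact DMD modes
        return modes, eigvals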
Submitted 1 July, 2025;
originally announced July 2025.
-
Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor
Authors:
Mohammad Firas Sada,
John J. Graham,
Mahidhar Tatineni,
Dmitry Mishin,
Thomas A. DeFanti,
Frank Würthwein
Abstract:
As machine learning (ML) applications become integral to modern network operations, there is an increasing demand for network programmability that enables low-latency ML inference for tasks such as Quality of Service (QoS) prediction and anomaly detection in cybersecurity. ML models provide adaptability through dynamic weight adjustments, making Programming Protocol-independent Packet Processors (P4)-programmable FPGA SmartNICs an ideal platform for investigating In-Network Machine Learning (INML). These devices offer high-throughput, low-latency packet processing and can be dynamically reconfigured via the control plane, allowing for flexible integration of ML models directly at the network edge. This paper explores the application of the P4 programming paradigm to neural networks and regression models, where weights and biases are stored in control plane table lookups. This approach enables flexible programmability and efficient deployment of retrainable ML models at the network edge, independent of core infrastructure at the switch level.
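To illustrate the arithmetic style involved, here is a hypothetical Python model of the data-plane math (not the authors' P4 code; the Q8.8 format and the cubic Taylor truncation are our own choices):

    FRAC_BITS = 8
    ONE = 1 << FRAC_BITS  # 1.0 in Q8.8 fixed point

    def to_fix(x):
        return int(round(x * ONE))

    def fix_mul(a, b):
        return (a * b) >> FRAC_BITS

    def sigmoid_taylor_q88(x_fix):
        # sigmoid(x) ~ 1/2 + x/4 - x^3/48 (Taylor expansion about 0),
        # valid near zero; integer-only, so expressible with P4 actions.
        x3 = fix_mul(fix_mul(x_fix, x_fix), x_fix)
        return ONE // 2 + x_fix // 4 - x3 // 48

    def neuron(inputs_fix, weights_fix, bias_fix):
        # Weights and bias would arrive via control-plane table entries.
        acc = bias_fix + sum(fix_mul(w, x) for w, x in zip(weights_fix, inputs_fix))
        return sigmoid_taylor_q88(acc)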
Submitted 1 July, 2025;
originally announced July 2025.
-
Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
Authors:
Mohammad Firas Sada,
John J. Graham,
Elham E Khoda,
Mahidhar Tatineni,
Dmitry Mishin,
Rajesh K. Gupta,
Rick Wagner,
Larry Smarr,
Thomas A. DeFanti,
Frank Würthwein
Abstract:
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appear to be energy-efficient and perform well on the energy-efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the National Research Platform (NRP).
Submitted 1 July, 2025;
originally announced July 2025.
-
SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration
Authors:
Dmitry Kovalev
Abstract:
In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix smoothness and noise assumptions. This allows us to recover state-of-the-art convergence results for several popular adaptive gradient methods, including AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo. In addition, we establish the fundamental connection between two recently proposed algorithms, Scion and DASGO, and provide the first theoretical guarantees for the latter. Second, we show that the convergence of methods like AdaGrad and DASGO can be provably accelerated beyond the best-known rates using Nesterov momentum. Consequently, we obtain the first theoretical justification that AdaGrad-type algorithms can simultaneously benefit from both diagonal preconditioning and momentum, which may provide an ultimate explanation for the practical efficiency of Adam.
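As a reference point, AdaGrad-Norm, the simplest AdaGrad-type preconditioner covered by such analyses, reads (standard sketch; the paper's matrix-smoothness setting covers far more general preconditioners):

    import numpy as np

    def adagrad_norm(grad, x0, lr=1.0, iters=1000, eps=1e-8):
        # Scalar preconditioning by the accumulated squared gradient norm;
        # full AdaGrad keeps a per-coordinate accumulator instead.
        x, G = x0.astype(float), 0.0
        for _ in range(iters):
            g = grad(x)
            G += float(np.dot(g, g))
            x = x - lr * g / (np.sqrt(G) + eps)
        return x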
Submitted 30 June, 2025;
originally announced June 2025.
-
Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification
Authors:
R. Oguz Araz,
Guillem Cortès-Sebastià,
Emilio Molina,
Joan Serrà,
Xavier Serra,
Yuki Mitsufuji,
Dmitry Bogdanov
Abstract:
Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT-Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first systematic evaluation of various metric learning approaches in the context of AFP, demonstrating that a self-supervised adaptation of the triplet loss yields superior performance. Our results also reveal that training with multiple positive samples per anchor has critically different effects across loss functions. Our approach is built upon these insights and achieves state-of-the-art performance on both a large, synthetically degraded dataset and a real-world dataset recorded using microphones in diverse music venues.
Submitted 27 June, 2025;
originally announced June 2025.
-
A hierarchical Vovk-Azoury-Warmuth forecaster with discounting for online regression in RKHS
Authors:
Dmitry B. Rokhlin
Abstract:
We study the problem of online regression with the unconstrained quadratic loss against a time-varying sequence of functions from a Reproducing Kernel Hilbert Space (RKHS). Recently, Jacobsen and Cutkosky (2024) introduced a discounted Vovk-Azoury-Warmuth (DVAW) forecaster that achieves optimal dynamic regret in the finite-dimensional case. In this work, we lift their approach to the non-parametric domain by synthesizing the DVAW framework with a random feature approximation. We propose a fully adaptive, hierarchical algorithm, which we call H-VAW-D (Hierarchical Vovk-Azoury-Warmuth with Discounting), that learns both the discount factor and the number of random features. We prove that this algorithm, which has a per-iteration computational complexity of $O(T\ln T)$, achieves an expected dynamic regret of $O(T^{2/3}P_T^{1/3} + \sqrt{T}\ln T)$, where $P_T$ is the functional path length of a comparator sequence.
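For orientation, the finite-dimensional discounted VAW forecaster predicts, up to normalization conventions (our schematic rendering, with discount $\gamma \in (0,1]$ and regularization $\lambda > 0$): $$ \hat{y}_t = x_t^{\top} \Big( \lambda I + \sum_{s=1}^{t} \gamma^{\,t-s} x_s x_s^{\top} \Big)^{-1} \sum_{s=1}^{t-1} \gamma^{\,t-s} y_s x_s. $$ As we read the abstract, H-VAW-D applies this with $x_t$ replaced by random features approximating the RKHS kernel, while a hierarchical aggregation layer learns both $\gamma$ and the number of features online.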
Submitted 27 June, 2025;
originally announced June 2025.
-
M3PO: Massively Multi-Task Model-Based Policy Optimization
Authors:
Aditya Narendra,
Dmitry Makarov,
Aleksandr Panov
Abstract:
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.
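The exploration mechanism described above can be pictured as a one-line reward shaping (illustrative Python only; the scale beta and the exact functional form are not specified in the abstract):

    def shaped_reward(r_ext, v_model, v_model_free, beta=0.1):
        # The gap between model-based and model-free value estimates acts
        # as an epistemic-uncertainty proxy added to the extrinsic reward
        # before the trust-region policy update.
        return r_ext + beta * abs(v_model - v_model_free)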
Submitted 26 June, 2025;
originally announced June 2025.
-
MADrive: Memory-Augmented Driving Scene Modeling
Authors:
Polina Karpikova,
Daniil Selikhanovych,
Kirill Struminsky,
Ruslan Musaev,
Maria Golitsyna,
Dmitry Baranchuk
Abstract:
Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ${\sim}70$K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/
Submitted 26 June, 2025;
originally announced June 2025.
-
Compressed and Smooth Latent Space for Text Diffusion Modeling
Authors:
Viacheslav Meshchaninov,
Egor Chimbulatov,
Alexander Shabalin,
Aleksandr Abramov,
Dmitry Vetrov
Abstract:
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks: story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.
Submitted 26 June, 2025;
originally announced June 2025.
-
Multicontinuum Homogenization for Poroelasticity Model
Authors:
Dmitry Ammosov,
Mohammed Al-Kobaisi,
Yalchin Efendiev
Abstract:
In this paper, we derive multicontinuum poroelasticity models using the multicontinuum homogenization method. Poroelasticity models are widely used in many areas of science and engineering to describe coupled flow and mechanics processes in porous media. However, in many applications, the properties of poroelastic media possess high contrast, presenting serious computational challenges. It is well known that standard homogenization approaches often fail to give an accurate solution due to the lack of macroscopic parameters. Multicontinuum approaches allow us to consider such cases by defining several average states known as continua. In the field of poroelasticity, multiple-network models arising from the multiple porous media theory are representatives of these approaches. In this work, we extend previous findings by deriving the generalized multicontinuum poroelasticity model. We apply the recently developed multicontinuum homogenization method and provide a rigorous derivation of multicontinuum equations. For this purpose, we formulate coupled constraint cell problems in oversampled regions to consider different homogenized effects. Then, we obtain a multicontinuum expansion of the fine-scale fields and derive the multicontinuum model supposing the smoothness of macroscopic variables. We present the most general version of equations and the simplified ones based on our numerical experiments. Numerical results are presented for different heterogeneous media cases and demonstrate the high accuracy of our proposed multicontinuum models.
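Schematically (our shorthand; the paper derives the precise constrained cell problems), the multicontinuum expansion represents the fine-scale solution through smooth macroscopic fields $U_i$, one per continuum: $$ u(x) \approx \sum_i \varphi_i(x)\, U_i(x) + \sum_i \varphi_i^{(1)}(x) \cdot \nabla U_i(x), $$ where the auxiliary functions $\varphi_i$ and $\varphi_i^{(1)}$ solve constraint cell problems in oversampled regions; substituting this expansion and averaging yields the coupled macroscopic equations for the $U_i$.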
Submitted 25 June, 2025;
originally announced June 2025.
-
SuperSONIC: Cloud-Native Infrastructure for ML Inferencing
Authors:
Dmitry Kondratyev,
Benedikt Riedel,
Yuan-Tang Chou,
Miles Cochran-Branson,
Noah Paladino,
David Schultz,
Mia Liu,
Javier Duarte,
Philip Harris,
Shih-Chieh Hsu
Abstract:
The increasing computational demand from growing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments has driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) approach. SONIC accelerates ML inference by offloading it to local or remote coprocessors to optimize resource utilization. Leveraging its portability to different types of coprocessors, SONIC enhances data processing and model deployment efficiency for cutting-edge research in high energy physics (HEP) and multi-messenger astrophysics (MMA). We developed the SuperSONIC project, a scalable server infrastructure for SONIC, enabling the deployment of computationally intensive tasks to Kubernetes clusters equipped with graphics processing units (GPUs). Using NVIDIA Triton Inference Server, SuperSONIC decouples client workflows from server infrastructure, standardizing communication, optimizing throughput, load balancing, and monitoring. SuperSONIC has been successfully deployed for the CMS and ATLAS experiments at the CERN Large Hadron Collider (LHC), the IceCube Neutrino Observatory (IceCube), and the Laser Interferometer Gravitational-Wave Observatory (LIGO) and tested on Kubernetes clusters at Purdue University, the National Research Platform (NRP), and the University of Chicago. SuperSONIC addresses the challenges of the Cloud-native era by providing a reusable, configurable framework that enhances the efficiency of accelerator-based inference deployment across diverse scientific domains and industries.
Submitted 25 June, 2025;
originally announced June 2025.
-
PersonalAI: Towards digital twins in the graph form
Authors:
Mikhail Menschikov,
Dmitry Evseev,
Ruslan Kostoev,
Ilya Perepechkin,
Ilnaz Salimov,
Victoria Dochkina,
Petr Anokhin,
Evgeny Burnaev,
Nikita Semenov
Abstract:
The challenge of personalizing language models, specifically the ability to account for a user's history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon the ideas of the AriGraph architecture and, for the first time, introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA, and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust. Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture's ability to maintain and utilize temporal dependencies.
Submitted 20 June, 2025;
originally announced June 2025.
-
PRISM-Loc: a Lightweight Long-range LiDAR Localization in Urban Environments with Topological Maps
Authors:
Kirill Muravyev,
Vasily Yuryev,
Oleg Bulichev,
Dmitry Yudin,
Konstantin Yakovlev
Abstract:
Localization in the environment is one of the crucial tasks of navigation of a mobile robot or a self-driving vehicle. For long-range routes, performing localization within a dense global lidar map in real time may be difficult, and the creation of such a map may require a large amount of memory. To this end, leveraging topological maps may be useful. In this work, we propose PRISM-Loc -- a topological map-based approach for localization in large environments. The proposed approach leverages a twofold localization pipeline, which consists of global place recognition and estimation of the local pose inside the found location. For local pose estimation, we introduce an original lidar scan matching algorithm, which is based on 2D features and point-based optimization. We evaluate the proposed method on the ITLP-Campus dataset on a 3 km route, and compare it against state-of-the-art metric map-based and place recognition-based competitors. The results of the experiments show that the proposed method outperforms its competitors in both quality and computational efficiency.
Submitted 18 June, 2025;
originally announced June 2025.
-
MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
Authors:
Leonid Ivanov,
Vasily Yuryev,
Dmitry Yudin
Abstract:
In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced end-to-end model named MapFM for online vectorized HD map generation. We show that incorporating a powerful foundation model for encoding camera images significantly boosts feature representation quality. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
On the Vertices of Delta-modular Polyhedra
Authors:
Bludov Mikhail,
Gribanov Dmitry,
Klimenko Maxim,
Kupavskii Andrey,
Lángi Zsolt,
Rogozin Alexander,
Voronov Vsevolod
Abstract:
Let $P$ be a polytope defined by the system $A x \leq b$, where $A \in R^{m \times n}$, $b \in R^m$, and $\text{rank}(A) = n$. We give a short geometric proof of the following tight upper bound on the number of vertices of $P$:…
▽ More
Let $P$ be a polytope defined by the system $A x \leq b$, where $A \in R^{m \times n}$, $b \in R^m$, and $\text{rank}(A) = n$. We give a short geometric proof of the following tight upper bound on the number of vertices of $P$: $$ n! \cdot \frac{Δ}{Δ_{\text{average}}} \cdot \text{vol}(B_2) \sim \frac{1}{\sqrt{π n}} \cdot \left(\frac{2 π}{e}\right)^{n/2} \cdot n^{n/2} \cdot \frac{Δ}{Δ_{\text{average}}}, $$ where $Δ$ is the maximum absolute value of $n \times n$ subdeterminants of $A$, and $Δ_{\text{average}}$ is the average absolute value of subdeterminants of $A$ corresponding to a triangulation of $P$'s normal fan. Assuming that $A$ is integer, such polyhedra are called $Δ$-modular polyhedra. Note that in the integer case, the bound can be simplified via the inequality $Δ_{\text{average}} \geq Δ_{\min} \geq 1$, where $Δ_{\min}$ is the minimum absolute value of subdeterminants of $A$ corresponding to feasible bases of $A x \leq b$. For this, we prove and use a symmetric variant of Macbeath's theorem.
Additionally, we give a direct argument based on prior results in the field, showing that the graph diameter of $P$ is bounded by $O\bigl(n^3 \cdot \frac{Δ}{Δ_{\min}} \cdot \ln (n \frac{Δ}{Δ_{\min}}) \bigr)$. Thus, both characteristics of $P$ are linear in $Δ/Δ_{\min}$.
From an algorithmic perspective, we demonstrate that:
Given $A \in Q^{m \times n}$, $b \in Q^m$, and an initial feasible solution to $A x \leq b$, the convex hull of $P$ can be constructed in $O(n)^{n/2} \cdot m^2 \cdot \frac{Δ}{Δ_{\text{average}}}$ operations. For simple polyhedra, the dependence on $m$ reduces to linear;
Given $A \in Z^{m \times n}$ and $b \in Q^m$, the number $|P \cap Z^n|$ can be computed in $O(n)^n \cdot \frac{Δ^4}{Δ_{\text{average}}}$ arithmetic operations.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Solving tricky quantum optics problems with assistance from (artificial) intelligence
Authors:
Manas Pandey,
Bharath Hebbe Madhusudhana,
Saikat Ghosh,
Dmitry Budker
Abstract:
The capabilities of modern artificial intelligence (AI) as a ``scientific collaborator'' are explored by engaging it with three nuanced problems in quantum optics: state populations in optical pumping, resonant transitions between decaying states (the Burshtein effect), and degenerate mirrorless lasing. Through iterative dialogue, the authors observe that AI models--when prompted and corrected--ca…
▽ More
The capabilities of modern artificial intelligence (AI) as a ``scientific collaborator'' are explored by engaging it with three nuanced problems in quantum optics: state populations in optical pumping, resonant transitions between decaying states (the Burshtein effect), and degenerate mirrorless lasing. Through iterative dialogue, the authors observe that AI models--when prompted and corrected--can reason through complex scenarios, refine their answers, and provide expert-level guidance, closely resembling the interaction with an adept colleague. The findings highlight that AI democratizes access to sophisticated modeling and analysis, shifting the focus in scientific practice from technical mastery to the generation and testing of ideas, and reducing the time for completing research tasks from days to minutes.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Sign-Rank of $k$-Hamming Distance is Constant
Authors:
Mika Göös,
Nathaniel Harms,
Valentin Imbach,
Dmitry Sokolov
Abstract:
We prove that the sign-rank of the $k$-Hamming Distance matrix on $n$ bits is $2^{O(k)}$, independent of the number of bits $n$. This strongly refutes the conjecture of Hatami, Hatami, Pires, Tao, and Zhao (RANDOM 2022), and Hatami, Hosseini, and Meng (STOC 2023), repeated in several other papers, that the sign-rank should depend on $n$. This conjecture would have qualitatively separated margin fr…
▽ More
We prove that the sign-rank of the $k$-Hamming Distance matrix on $n$ bits is $2^{O(k)}$, independent of the number of bits $n$. This strongly refutes the conjecture of Hatami, Hatami, Pires, Tao, and Zhao (RANDOM 2022), and Hatami, Hosseini, and Meng (STOC 2023), repeated in several other papers, that the sign-rank should depend on $n$. This conjecture would have qualitatively separated margin from sign-rank (or, equivalently, bounded-error from unbounded-error randomized communication). In fact, our technique gives constant sign-rank upper bounds for all matrices which reduce to $k$-Hamming Distance, as well as large-margin matrices recently shown to be irreducible to $k$-Hamming Distance.
△ Less
Submitted 1 May, 2025;
originally announced June 2025.
-
Dense Associative Memory with Epanechnikov Energy
Authors:
Benjamin Hoover,
Zhaoyang Shi,
Krishnakumar Balasubramanian,
Dmitry Krotov,
Parikshit Ram
Abstract:
We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant addit…
▽ More
We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional \emph{emergent} local minima while preserving perfect pattern recovery -- a characteristic previously unseen in DenseAM literature. Empirical results show that the LSR energy has significantly more local minima (memories) with log-likelihood comparable to that of LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.
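The contrast between the two energies can be seen in a few lines of NumPy. This is only a hedged sketch: the LSE energy below is the standard one, while the LSR variant uses the Epanechnikov kernel $(1 - d^2/h^2)_+$ with an assumed bandwidth and normalization that may differ from the paper's.

```python
# Hedged comparison of a log-sum-exp energy and a log-sum-ReLU energy
# built on the Epanechnikov kernel; parameters are illustrative.
import numpy as np

def lse_energy(x, memories, beta=4.0):
    d2 = ((memories - x) ** 2).sum(axis=1)         # squared distances to memories
    return -np.log(np.exp(-beta * d2).sum()) / beta

def lsr_energy(x, memories, h=1.0):
    d2 = ((memories - x) ** 2).sum(axis=1)
    # ReLU-truncated kernel: contributes only within radius h of a memory
    return -np.log(np.maximum(1.0 - d2 / h**2, 0.0).sum() + 1e-12)

memories = np.random.randn(16, 8)
x = memories[0] + 0.05 * np.random.randn(8)        # query near a stored pattern
print(lse_energy(x, memories), lsr_energy(x, memories))
```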
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Hessian Geometry of Latent Space in Generative Models
Authors:
Alexander Lobashev,
Dmitry Guskov,
Maria Larchenko,
Mikhail Tamm
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential f…
▽ More
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.
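The identity underlying the method, namely that for an exponential family the Fisher metric equals the Hessian of the log-partition function, can be checked numerically. The sketch below is purely an illustration under our own assumptions: a 1D family with sufficient statistics $(x, x^2)$, the log-partition computed on a grid, and a finite-difference Hessian.

```python
# For p(x|θ) ∝ exp(θ·T(x)), the Fisher metric is ∇² log Z(θ).
# Numerical check for T(x) = (x, x²) on a grid (illustrative only).
import numpy as np

def log_partition(theta, xs):
    return np.log(np.trapz(np.exp(theta[0] * xs + theta[1] * xs**2), xs))

def hessian(f, theta, xs, eps=1e-3):
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t = theta.copy()
            t[i] += eps; t[j] += eps; fpp = f(t, xs)
            t[j] -= 2 * eps;          fpm = f(t, xs)
            t[i] -= 2 * eps;          fmm = f(t, xs)
            t[j] += 2 * eps;          fmp = f(t, xs)
            H[i, j] = (fpp - fpm - fmp + fmm) / (4 * eps**2)
    return H

xs = np.linspace(-10, 10, 4001)
theta = np.array([0.3, -0.5])    # θ₂ < 0 so the integral converges
print(hessian(log_partition, theta, xs))  # ≈ Fisher information matrix
```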
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras
Authors:
Ekaterina Filimoshina,
Dmitry Shirokov
Abstract:
We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN). These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. W…
▽ More
We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN). These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. We propose a weight-sharing parametrization technique that takes into account the fundamental structures and operations of geometric algebras. Owing to this technique, the GLGENN architecture is parameter-light and less prone to overfitting than baseline equivariant models. GLGENN outperforms or matches competitors on several benchmarking equivariant tasks, including estimation of an equivariant function and a convex hull experiment, while using significantly fewer optimizable parameters.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture
Authors:
GigaChat team,
Mamedov Valentin,
Evgenii Kosarev,
Gregory Leleytner,
Ilya Shchuckin,
Valeriy Berezovskiy,
Daniil Smirnov,
Dmitry Kozlov,
Sergei Averkiev,
Lukyanenko Ivan,
Aleksandr Proshunin,
Ainur Israfilova,
Ivan Baskov,
Artem Chervyakov,
Emil Shakirov,
Mikhail Kolesov,
Daria Khomich,
Darya Latortseva,
Sergei Porkhun,
Yury Fedorov,
Oleg Kutuzov,
Polina Kudriavtseva,
Sofiia Soldatova,
Kolodin Egor,
Stanislav Pyatkin
, et al. (9 additional authors not shown)
Abstract:
Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, includi…
▽ More
Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three GigaChat models as open source (https://huggingface.co/ai-sage), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
AugmentGest: Can Random Data Cropping Augmentation Boost Gesture Recognition Performance?
Authors:
Nada Aboudeshish,
Dmitry Ignatov,
Radu Timofte
Abstract:
Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations, random cropping, rotation, zooming and intensity-based transformations, brightness and contrast adjustments to simulate real-world variations. Ran…
▽ More
Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations (random cropping, rotation, and zooming) and intensity-based transformations (brightness and contrast adjustments) to simulate real-world variations. Random cropping preserves spatio-temporal integrity while addressing challenges such as viewpoint bias and occlusions. The augmentation pipeline generates three augmented versions of each sample in addition to the original, thus quadrupling the dataset size and enriching the diversity of gesture representations. The proposed augmentation strategy is evaluated on three models: the multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Net. Experiments are conducted on benchmark datasets including DHG14/28, SHREC'17, and JHMDB. The e2eET model is recognized as the state of the art for hand gesture recognition on DHG14/28 and SHREC'17. The FPPR-PCD model, the second-best performing model on SHREC'17, excels in point cloud-based gesture recognition. DD-Net, a lightweight and efficient architecture for skeleton-based action recognition, is evaluated on SHREC'17 and the Human Motion Data Base (JHMDB). The results underline the effectiveness and versatility of the proposed augmentation strategy, significantly improving model generalization and robustness across diverse datasets and architectures. This framework not only establishes state-of-the-art results on all three evaluated models but also offers a scalable solution to advance HGR and action recognition applications in real-world scenarios. The framework is available at https://github.com/NadaAbodeshish/Random-Cropping-augmentation-HGR
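A hedged sketch of such a pipeline is shown below: each sample yields three augmented copies, quadrupling the dataset. All parameter ranges and the exact form of each transform are illustrative assumptions, not the paper's settings.

```python
# Illustrative augmentation pipeline for skeleton sequences:
# one original + three augmented variants per sample.
import numpy as np

def random_crop(seq, keep=0.9):
    # temporal crop that preserves spatio-temporal ordering
    t = len(seq); k = max(1, int(t * keep))
    start = np.random.randint(0, t - k + 1)
    return seq[start:start + k]

def rotate_zoom(seq, max_angle=0.2, zoom=(0.9, 1.1)):
    a = np.random.uniform(-max_angle, max_angle)
    s = np.random.uniform(*zoom)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]) * s
    return seq @ R.T          # seq: (frames, joints, 2), rotated and zoomed

def brightness_contrast(feat, b=0.1, c=0.1):
    # intensity-style perturbation applied to the feature values
    return feat * np.random.uniform(1 - c, 1 + c) + np.random.uniform(-b, b)

def augment(sample):
    return [sample, random_crop(sample), rotate_zoom(sample),
            brightness_contrast(sample)]

sample = np.random.randn(32, 21, 2)   # 32 frames, 21 joints, 2D coordinates
print([a.shape for a in augment(sample)])
```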
△ Less
Submitted 8 June, 2025;
originally announced June 2025.
-
Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation
Authors:
Luka Vetoshkin,
Dmitry Yudin
Abstract:
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual…
▽ More
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9\% IoU and +8.3\% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: https://github.com/richlukich/Talk2SAM
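Schematically, the described flow could look like the sketch below. The functions `clip_text_encode`, `dino_features`, and `sam_hq_predict` are hypothetical stand-ins, mocked here so the sketch runs; they are not real APIs of the corresponding models, and the projection matrix would be learned rather than random.

```python
# High-level sketch of the text-guided prompting idea (all model calls mocked).
import numpy as np

def clip_text_encode(text):            return np.random.randn(512)
def dino_features(image):              return np.random.randn(64, 64, 384)
def sam_hq_predict(image, box, extra): return np.random.rand(64, 64) > 0.5

def talk2sam(image, text, box, W):
    t = clip_text_encode(text)                  # text embedding
    F = dino_features(image)                    # per-pixel DINO features
    sim = F @ (W @ t)                           # project CLIP -> DINO, score pixels
    heat = (sim - sim.min()) / (np.ptp(sim) + 1e-8)
    # relevance map plus the user box are passed as extra prompts
    return sam_hq_predict(image, box=box, extra=heat)

W = np.random.randn(384, 512)                   # learned projection (random here)
mask = talk2sam(np.zeros((512, 512, 3)), "thin wires",
                box=(10, 10, 200, 200), W=W)
```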
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Iterative Neural Rollback Chase-Pyndiah Decoding
Authors:
Dmitry Artemasov,
Oleg Nesterenkov,
Kirill Andreev,
Pavel Rybin,
Alexey Frolov
Abstract:
Iterative decoding is essential in modern communication systems, especially optical communications, where error-correcting codes such as turbo product codes (TPC) and staircase codes are widely employed. A key factor in achieving high error correction performance is the use of soft-decision decoding for component codes. However, implementing optimal maximum a posteriori (MAP) probability decoding…
▽ More
Iterative decoding is essential in modern communication systems, especially optical communications, where error-correcting codes such as turbo product codes (TPC) and staircase codes are widely employed. A key factor in achieving high error correction performance is the use of soft-decision decoding for component codes. However, implementing optimal maximum a posteriori (MAP) probability decoding for commonly used component codes, such as BCH and Polar codes, is computationally prohibitive. Instead, practical systems rely on approximations, with the Chase-Pyndiah algorithm being a widely used suboptimal method. TPC are more powerful than their component codes and begin to function effectively at low signal-to-noise ratios. Consequently, during the initial iterations, the component codes do not perform well and introduce errors in the extrinsic information updates. This phenomenon limits the performance of TPC. This paper proposes a neural network-aided rollback Chase-Pyndiah decoding method to address this issue. A transformer-based neural network identifies cases where extrinsic updates are likely to introduce errors, triggering a rollback mechanism which prevents the update and keeps the component code message intact. Our results demonstrate that a neural network with a relatively small number of parameters can effectively distinguish destructive updates and improve decoding performance. We evaluate the proposed approach using TPC with (256, 239) extended BCH component codes. We show that the proposed method enhances the bit error rate performance of Chase-Pyndiah p=6 decoding, achieving a gain of approximately 0.145 dB in a TPC scheme with four full iterations, significantly outperforming conventional Chase p=7 decoding.
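The rollback rule itself is simple to state in code. In the hedged sketch below, the transformer's per-row destructiveness scores are mocked and the gating threshold is an assumed parameter; only the keep-or-update logic reflects the description above.

```python
# Schematic rollback gate: where an extrinsic update looks destructive,
# keep the previous component-code message intact.
import numpy as np

def gated_extrinsic_update(prev_msg, new_msg, destructive_prob, threshold=0.5):
    """Keep the old extrinsic message wherever the neural gate predicts
    that applying the Chase-Pyndiah update would introduce errors."""
    rollback = destructive_prob > threshold            # (n_rows,) boolean gate
    return np.where(rollback[:, None], prev_msg, new_msg)

prev = np.random.randn(256, 239)                       # extrinsic LLRs, iter i-1
new = prev + 0.3 * np.random.randn(256, 239)           # candidate update, iter i
p_destr = np.random.rand(256)                          # mocked per-row scores
msg = gated_extrinsic_update(prev, new, p_destr)
```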
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning
Authors:
Nikita Oskolkov,
Huzhenyu Zhang,
Dmitry Makarov,
Dmitry Yudin,
Aleksandr Panov
Abstract:
The 3D scene graph models spatial relationships between objects, enabling the agent to efficiently navigate in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with a learnable representation of ope…
▽ More
The 3D scene graph models spatial relationships between objects, enabling the agent to efficiently navigate in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with a learnable representation of an open-vocabulary 3D scene graph. To accelerate and stabilize the training of reinforcement learning-based algorithms, the framework also employs imitation learning and curriculum learning. The first one enables the agent to learn from demonstrations, while the second one structures the training process by gradually increasing task complexity from simple to more advanced scenarios. Numerical experiments conducted in the Isaac Sim environment showed that using a 3D scene graph for reinforcement learning significantly increased the success rate in difficult navigation cases. The code is open-sourced and available at: https://github.com/Xisonik/Aloha_graph.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
cuVSLAM: CUDA accelerated visual odometry and mapping
Authors:
Alexander Korovko,
Dmitry Slepichev,
Alexander Efitorov,
Aigul Dzhumamuratova,
Viktor Kuznetsov,
Hesam Rabeti,
Joydeep Biswas,
Soha Pouya
Abstract:
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 came…
▽ More
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.
△ Less
Submitted 8 July, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM
Authors:
Roman Titkov,
Egor Zubkov,
Dmitry Yudin,
Jaafar Mahmoud,
Malik Mohrat,
Gennady Sidorov
Abstract:
Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM -- a novel approach that fuses an opt…
▽ More
Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM -- a novel approach that fuses an optimized Gaussian Splatting implementation with visual-language feature extraction using DINOv2, followed by a learnable feature compressor based on Principal Component Analysis, while enabling online dense SLAM. Our method simultaneously generates high-quality photorealistic images and semantically labeled scene maps, achieving real-time scene reconstruction with more than 10 fps on the Replica dataset and 18 fps on ScanNet. Experimental results show that our approach significantly outperforms state-of-the-art methods in reconstruction speed while achieving competitive rendering quality. The proposed system eliminates the need for prior data preparation such as the camera's ego-motion or pre-computed static semantic maps. With its potential applications in autonomous robotics, augmented reality, and other interactive domains, LEG-SLAM represents a significant step forward in real-time semantic 3D Gaussian-based SLAM. Project page: https://titrom025.github.io/LEG-SLAM/
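The PCA-based compression idea can be sketched as follows. The feature dimensionality and the number of retained components are illustrative, and the paper's compressor is learnable rather than this plain SVD-based projection.

```python
# Sketch of compressing high-dimensional visual-language features onto
# their top principal components before attaching them to the scene map.
import numpy as np

def fit_pca(features, k=16):
    mu = features.mean(axis=0)
    _, _, Vt = np.linalg.svd(features - mu, full_matrices=False)
    return mu, Vt[:k]                        # mean and top-k directions

def compress(features, mu, components):
    return (features - mu) @ components.T    # (n, k) compact embeddings

def decompress(codes, mu, components):
    return codes @ components + mu           # approximate reconstruction

feats = np.random.randn(10000, 384)          # e.g. per-pixel DINOv2 features
mu, comp = fit_pca(feats, k=16)
codes = compress(feats, mu, comp)
print(codes.shape)                           # (10000, 16)
```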
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Latent Stochastic Interpolants
Authors:
Saurabh Singh,
Dmitry Lagun
Abstract:
Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent sp…
▽ More
Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI), enabling joint learning in a latent space with end-to-end optimized encoder, decoder, and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of standard diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large-scale ImageNet generation benchmark.
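For concreteness, a minimal latent-space interpolant of the standard form $x_t = (1-t) z_0 + t z_1 + γ(t) ε$ is sketched below. The interpolant coefficients are a common default, not necessarily the paper's choice, and the jointly trained encoder/decoder are outside this illustration.

```python
# Minimal stochastic interpolant between prior samples z0 and encoder
# latents z1, with noise that vanishes at both endpoints.
import numpy as np

def gamma(t):
    return np.sqrt(t * (1.0 - t))             # γ(0) = γ(1) = 0

def latent_interpolant(z0, z1, t, rng):
    eps = rng.standard_normal(z0.shape)
    return (1.0 - t) * z0 + t * z1 + gamma(t) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 64))              # samples from the prior
z1 = rng.standard_normal((8, 64))              # encoder latents of data
zt = latent_interpolant(z0, z1, 0.5, rng)      # training-time noisy latents
```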
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Adaptive Destruction Processes for Diffusion Samplers
Authors:
Timofei Gritsaev,
Nikita Morozov,
Kirill Tamogashev,
Daniil Tiapkin,
Sergey Samsonov,
Alexey Naumov,
Dmitry Vetrov,
Nikolay Malkin
Abstract:
This paper explores the challenges and benefits of a trainable destruction process in diffusion samplers -- diffusion-based generative models trained to sample an unnormalised density without access to data samples. Contrary to the majority of work that views diffusion samplers as approximations to an underlying continuous-time model, we view diffusion models as discrete-time policies trained to p…
▽ More
This paper explores the challenges and benefits of a trainable destruction process in diffusion samplers -- diffusion-based generative models trained to sample an unnormalised density without access to data samples. Contrary to the majority of work that views diffusion samplers as approximations to an underlying continuous-time model, we view diffusion models as discrete-time policies trained to produce samples in very few generation steps. We propose to trade some of the elegance of the underlying theory for flexibility in the definition of the generative and destruction policies. In particular, we decouple the generation and destruction variances, enabling both transition kernels to be learned as unconstrained Gaussian densities. We show that, when the number of steps is limited, training both generation and destruction processes results in faster convergence and improved sampling quality on various benchmarks. Through a robust ablation study, we investigate the design choices necessary to facilitate stable training. Finally, we show the scalability of our approach through experiments on GAN latent space sampling for conditional image generation.
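The decoupling can be expressed as two independently parameterized Gaussian transition kernels, as in the hedged PyTorch sketch below; the architecture, sizes, and conditioning are illustrative assumptions rather than the paper's models.

```python
# Two unconstrained Gaussian policies: a learned generation (backward)
# kernel and a learned destruction (forward) kernel with separate variances.
import torch, torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * dim))

    def forward(self, x, t):
        h = self.net(torch.cat([x, t.expand(x.size(0), 1)], dim=1))
        mean, log_std = h.chunk(2, dim=1)
        return mean, log_std.exp()             # per-coordinate Gaussian kernel

dim = 2
generate = GaussianPolicy(dim)                 # learned backward process
destroy = GaussianPolicy(dim)                  # learned forward process
x = torch.randn(16, dim)
mu_g, std_g = generate(x, torch.tensor([[0.5]]))
```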
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
Authors:
Ildus Sadrtdinov,
Ivan Klimov,
Ekaterina Lobacheva,
Dmitry Vetrov
Abstract:
We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function $F=U-TS$, balancing training loss $U$ and the entropy of the weights distribution $S$, with temperature $T$ determined by the LR. This perspective offers a new lens on why…
▽ More
We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function $F=U-TS$, balancing training loss $U$ and the entropy of the weights distribution $S$, with temperature $T$ determined by the LR. This perspective offers a new lens on why high LRs prevent training from converging to the loss minima and how different LRs lead to stabilization at different loss levels. We empirically validate the free energy framework on both underparameterized (UP) and overparameterized (OP) models. UP models consistently follow free energy minimization, with temperature increasing monotonically with LR, while for OP models, the temperature effectively drops to zero at low LRs, causing SGD to minimize the loss directly and converge to an optimum. We attribute this mismatch to differences in the signal-to-noise ratio of stochastic gradients near optima, supported by both a toy example and neural network experiments.
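As a toy illustration of the $F = U - TS$ picture (not from the paper's code), one can compute the Gibbs density for a simple 1D loss at several temperatures and watch the expected loss at which training "stabilizes" rise with $T$.

```python
# Gibbs density p ∝ exp(-U/T) minimises F = <U> - T·S; higher T shifts
# probability mass away from sharp minima, raising the stationary loss.
import numpy as np

w = np.linspace(-3, 3, 2001)
U = 0.5 * w**2 + 0.4 * np.sin(5 * w)           # loss with several local minima

for T in (0.05, 0.5, 2.0):
    p = np.exp(-U / T); p /= np.trapz(p, w)    # Gibbs density at temperature T
    mean_U = np.trapz(p * U, w)                # expected training loss <U>
    S = -np.trapz(p * np.log(p + 1e-300), w)   # differential entropy
    print(f"T={T}: <U>={mean_U:.3f}, S={S:.3f}, F={mean_U - T * S:.3f}")
```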
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs
Authors:
Julia Belikova,
Konstantin Polev,
Rauf Parchiev,
Dmitry Simakov
Abstract:
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively…
▽ More
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.
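A minimal version of this recipe, with mocked hidden states and the 250-sample budget mentioned above, might look like the following scikit-learn sketch. The specific combination of PCA with logistic regression is our illustrative assumption, not necessarily the paper's classifier.

```python
# Data-efficient probe: dimensionality reduction on LLM hidden states,
# then a light classifier fit on a small annotated set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((250, 4096))           # hidden states of 250 answers (mocked)
y = rng.integers(0, 2, 250)                    # 1 = hallucinated, 0 = grounded

probe = make_pipeline(PCA(n_components=32), LogisticRegression(max_iter=1000))
probe.fit(X, y)                                # lightweight hallucination probe
print(probe.predict(X[:5]))
```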
△ Less
Submitted 29 May, 2025;
originally announced May 2025.