-
The Value of Personalized Recommendations: Evidence from Netflix
Authors:
Kevin Zielnicki,
Guy Aridor,
Aurélien Bibaut,
Allen Tran,
Winston Chou,
Nathan Kallus
Abstract:
Personalized recommendation systems shape much of user choice online, yet their targeted nature makes separating out the value of recommendation and the underlying goods challenging. We build a discrete choice model that embeds recommendation-induced utility, low-rank heterogeneity, and flexible state dependence and apply the model to viewership data at Netflix. We exploit idiosyncratic variation…
▽ More
Personalized recommendation systems shape much of user choice online, yet their targeted nature makes separating out the value of recommendation and the underlying goods challenging. We build a discrete choice model that embeds recommendation-induced utility, low-rank heterogeneity, and flexible state dependence and apply the model to viewership data at Netflix. We exploit idiosyncratic variation introduced by the recommendation algorithm to identify and separately value these components as well as to recover model-free diversion ratios that we can use to validate our structural model. We use the model to evaluate counterfactuals that quantify the incremental engagement generated by personalized recommendations. First, we show that replacing the current recommender system with a matrix factorization or popularity-based algorithm would lead to 4% and 12% reduction in engagement, respectively, and decreased consumption diversity. Second, most of the consumption increase from recommendations comes from effective targeting, not mechanical exposure, with the largest gains for mid-popularity goods (as opposed to broadly appealing or very niche goods).
△ Less
Submitted 10 November, 2025; v1 submitted 10 November, 2025;
originally announced November 2025.
-
Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
Authors:
Yaochen Zhu,
Harald Steck,
Dawen Liang,
Yinhan He,
Vito Ostuni,
Jundong Li,
Nathan Kallus
Abstract:
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list.…
▽ More
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
△ Less
Submitted 23 October, 2025; v1 submitted 22 October, 2025;
originally announced October 2025.
-
Does Weighting Improve Matrix Factorization for Recommender Systems?
Authors:
Alex Ayoub,
Samuel Robertson,
Dawen Liang,
Harald Steck,
Nathan Kallus
Abstract:
Matrix factorization is a widely used approach for top-N recommendation and collaborative filtering. When implemented on implicit feedback data (such as clicks), a common heuristic is to upweight the observed interactions. This strategy has been shown to improve performance for certain algorithms. In this paper, we conduct a systematic study of various weighting schemes and matrix factorization al…
▽ More
Matrix factorization is a widely used approach for top-N recommendation and collaborative filtering. When implemented on implicit feedback data (such as clicks), a common heuristic is to upweight the observed interactions. This strategy has been shown to improve performance for certain algorithms. In this paper, we conduct a systematic study of various weighting schemes and matrix factorization algorithms. Somewhat surprisingly, we find that training with unweighted data can perform comparably to, and sometimes outperform, training with weighted data, especially for large models. This observation challenges the conventional wisdom. Nevertheless, we identify cases where weighting can be beneficial, particularly for models with lower capacity and specific regularization schemes. We also derive efficient algorithms for exactly minimizing several weighted objectives that were previously considered computationally intractable. Our work provides a comprehensive analysis of the interplay between weighting, regularization, and model capacity in matrix factorization for recommender systems.
△ Less
Submitted 12 October, 2025;
originally announced October 2025.
-
DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Authors:
Hanyang Zhao,
Dawen Liang,
Wenpin Tang,
David Yao,
Nathan Kallus
Abstract:
We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an appr…
▽ More
We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. This naturally motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, which leads to generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt. By jointly training the sampler, we yield better accuracies with lower number of function evaluations (NFEs) compared to training the model only, obtaining the best performance in improving the Pareto frontier of the inference-time compute of dLLMs. We showcase the effectiveness of our pipeline by training open source large diffusion language models over benchmark math and planning tasks.
△ Less
Submitted 2 October, 2025;
originally announced October 2025.
-
Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting
Authors:
Xi Wang,
James McInerney,
Lequn Wang,
Nathan Kallus
Abstract:
Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts and find that the model often begins to always prod…
▽ More
Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts and find that the model often begins to always produce the correct answer early in the reasoning, making extra reasoning a waste of tokens. To detect and prevent overthinking, we propose a simple and inexpensive novel signal -- Entropy After </Think> (EAT) -- for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13 - 21% without harming accuracy, and it remains effective in black box settings where logits from the reasoning model are not accessible, and EAT is computed with proxy models.
△ Less
Submitted 30 September, 2025;
originally announced September 2025.
-
Inverse Reinforcement Learning Using Just Classification and a Few Regressions
Authors:
Lars van der Laan,
Nathan Kallus,
Aurélien Bibaut
Abstract:
Inverse reinforcement learning (IRL) aims to explain observed behavior by uncovering an underlying reward. In the maximum-entropy or Gumbel-shocks-to-reward frameworks, this amounts to fitting a reward function and a soft value function that together satisfy the soft Bellman consistency condition and maximize the likelihood of observed actions. While this perspective has had enormous impact in imi…
▽ More
Inverse reinforcement learning (IRL) aims to explain observed behavior by uncovering an underlying reward. In the maximum-entropy or Gumbel-shocks-to-reward frameworks, this amounts to fitting a reward function and a soft value function that together satisfy the soft Bellman consistency condition and maximize the likelihood of observed actions. While this perspective has had enormous impact in imitation learning for robotics and understanding dynamic choices in economics, practical learning algorithms often involve delicate inner-loop optimization, repeated dynamic programming, or adversarial training, all of which complicate the use of modern, highly expressive function approximators like neural nets and boosting. We revisit softmax IRL and show that the population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. This observation reduces IRL to two off-the-shelf supervised learning problems: probabilistic classification to estimate the behavior policy, and iterative regression to solve the fixed point. The resulting method is simple and modular across function approximation classes and algorithms. We provide a precise characterization of the optimal solution, a generic oracle-based algorithm, finite-sample error bounds, and empirical results showing competitive or superior performance to MaxEnt IRL.
△ Less
Submitted 25 September, 2025;
originally announced September 2025.
-
Optimization of Epsilon-Greedy Exploration
Authors:
Ethan Che,
Hakan Ceylan,
James McInerney,
Nathan Kallus
Abstract:
Modern recommendation systems rely on exploration to learn user preferences for new items, typically implementing uniform exploration policies (e.g., epsilon-greedy) due to their simplicity and compatibility with machine learning (ML) personalization models. Within these systems, a crucial consideration is the rate of exploration - what fraction of user traffic should receive random item recommend…
▽ More
Modern recommendation systems rely on exploration to learn user preferences for new items, typically implementing uniform exploration policies (e.g., epsilon-greedy) due to their simplicity and compatibility with machine learning (ML) personalization models. Within these systems, a crucial consideration is the rate of exploration - what fraction of user traffic should receive random item recommendations and how this should evolve over time. While various heuristics exist for navigating the resulting exploration-exploitation tradeoff, selecting optimal exploration rates is complicated by practical constraints including batched updates, time-varying user traffic, short time horizons, and minimum exploration requirements. In this work, we propose a principled framework for determining the exploration schedule based on directly minimizing Bayesian regret through stochastic gradient descent (SGD), allowing for dynamic exploration rate adjustment via Model-Predictive Control (MPC). Through extensive experiments with recommendation datasets, we demonstrate that variations in the batch size across periods significantly influence the optimal exploration strategy. Our optimization methods automatically calibrate exploration to the specific problem setting, consistently matching or outperforming the best heuristic for each setting.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Simulation-Based Inference for Adaptive Experiments
Authors:
Brian M Cho,
Aurélien Bibaut,
Nathan Kallus
Abstract:
Multi-arm bandit experimental designs are increasingly being adopted over standard randomized trials due to their potential to improve outcomes for study participants, enable faster identification of the best-performing options, and/or enhance the precision of estimating key parameters. Current approaches for inference after adaptive sampling either rely on asymptotic normality under restricted ex…
▽ More
Multi-arm bandit experimental designs are increasingly being adopted over standard randomized trials due to their potential to improve outcomes for study participants, enable faster identification of the best-performing options, and/or enhance the precision of estimating key parameters. Current approaches for inference after adaptive sampling either rely on asymptotic normality under restricted experiment designs or underpowered martingale concentration inequalities that lead to weak power in practice. To bypass these limitations, we propose a simulation-based approach for conducting hypothesis tests and constructing confidence intervals for arm specific means and their differences. Our simulation-based approach uses positively biased nuisances to generate additional trajectories of the experiment, which we call \textit{simulation with optimism}. Using these simulations, we characterize the distribution potentially non-normal sample mean test statistic to conduct inference. We provide guarantees for (i) asymptotic type I error control, (ii) convergence of our confidence intervals, and (iii) asymptotic strong consistency of our estimator over a wide variety of common bandit designs. Our empirical results show that our approach achieves the desired coverage while reducing confidence interval widths by up to 50%, with drastic improvements for arms not targeted by the design.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Efficient Adaptive Experimentation with Noncompliance
Authors:
Miruna Oprescu,
Brian M Cho,
Nathan Kallus
Abstract:
We study the problem of estimating the average treatment effect (ATE) in adaptive experiments where treatment can only be encouraged -- rather than directly assigned -- via a binary instrumental variable. Building on semiparametric efficiency theory, we derive the efficiency bound for ATE estimation under arbitrary, history-dependent instrument-assignment policies, and show it is minimized by a va…
▽ More
We study the problem of estimating the average treatment effect (ATE) in adaptive experiments where treatment can only be encouraged -- rather than directly assigned -- via a binary instrumental variable. Building on semiparametric efficiency theory, we derive the efficiency bound for ATE estimation under arbitrary, history-dependent instrument-assignment policies, and show it is minimized by a variance-aware allocation rule that balances outcome noise and compliance variability. Leveraging this insight, we introduce AMRIV -- an Adaptive, Multiply-Robust estimator for Instrumental-Variable settings with variance-optimal assignment. AMRIV pairs (i) an online policy that adaptively approximates the optimal allocation with (ii) a sequential, influence-function-based estimator that attains the semiparametric efficiency bound while retaining multiply-robust consistency. We establish asymptotic normality, explicit convergence rates, and anytime-valid asymptotic confidence sequences that enable sequential inference. Finally, we demonstrate the practical effectiveness of our approach through empirical studies, showing that adaptive instrument assignment, when combined with the AMRIV estimator, yields improved efficiency and robustness compared to existing baselines.
△ Less
Submitted 28 October, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Value-Guided Search for Efficient Chain-of-Thought Reasoning
Authors:
Kaiwen Wang,
Jin Peng Zhou,
Jonathan Chang,
Zhaolin Gao,
Nathan Kallus,
Kianté Brantley,
Wen Sun
Abstract:
In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it…
▽ More
In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance of majority voting. Our dataset, model and codebase are open-sourced.
△ Less
Submitted 30 September, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System
Authors:
Rohan Surana,
Junda Wu,
Zhouhang Xie,
Yu Xia,
Harald Steck,
Dawen Liang,
Nathan Kallus,
Julian McAuley
Abstract:
Conversational recommender systems (CRS) typically require extensive domain-specific conversational datasets, yet high costs, privacy concerns, and data-collection challenges severely limit their availability. Although Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities, practical applications often favor smaller, internally managed recommender models due to scala…
▽ More
Conversational recommender systems (CRS) typically require extensive domain-specific conversational datasets, yet high costs, privacy concerns, and data-collection challenges severely limit their availability. Although Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities, practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints, especially in sensitive or rapidly evolving domains. However, training these smaller models effectively still demands substantial domain-specific conversational data, which remains challenging to obtain. To address these limitations, we propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques. Specifically, our method utilizes publicly available non-conversational domain data, including item metadata, user reviews, and collaborative signals, as seed inputs. By employing active learning strategies to select the most informative seed samples, our approach efficiently guides LLMs to generate synthetic, semantically coherent conversational interactions tailored explicitly to the target domain. Extensive experiments validate that conversational data generated by our proposed framework significantly improves the performance of LLM-based CRS models, effectively addressing the challenges of building CRS in no- or low-resource scenarios.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement
Authors:
Brian Cho,
Ana-Roxana Pop,
Ariel Evnine,
Nathan Kallus
Abstract:
To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail outcomes. To provide credible recommendations, experimenters must not only identify policies that satisfy the desir…
▽ More
To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail outcomes. To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce. In practice, however, policy classes are often large, and digital experiments tend to produce datasets with small effect sizes relative to noise. In this setting, standard approaches such as data splitting or multiple testing often result in unstable policy selection and/or insufficient statistical power. In this paper, we provide safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges. Our method enables policy learning while simultaneously providing high-confidence guarantees using the entire dataset, avoiding the need for data-splitting. We present finite-sample and asymptotic versions of our algorithm that ensure the recommended policy satisfies high-probability guarantees for avoiding guardrail regressions and/or achieving goal outcome improvements. We test both variants of our approach approach empirically on a real-world application of personalizing SMS delivery. Our results on real-world data suggest that our approach offers dramatic improvements in settings with large policy classes and low signal-to-noise across both finite-sample and asymptotic safety guarantees, offering up to 300\% improvements in detection rates and 150\% improvements in policy gains at significantly smaller sample sizes.
△ Less
Submitted 21 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Authors:
Jin Peng Zhou,
Kaiwen Wang,
Jonathan Chang,
Zhaolin Gao,
Nathan Kallus,
Kilian Q. Weinberger,
Kianté Brantley,
Wen Sun
Abstract:
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal…
▽ More
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
△ Less
Submitted 19 October, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems
Authors:
Yaochen Zhu,
Chao Wan,
Harald Steck,
Dawen Liang,
Yesu Feng,
Nathan Kallus,
Jundong Li
Abstract:
Conversational recommender systems (CRS) aim to provide personalized recommendations via interactive dialogues with users. While large language models (LLMs) enhance CRS with their superior understanding of context-aware user preferences, they typically struggle to leverage behavioral data, which have proven to be important for classical collaborative filtering (CF)-based approaches. For this reas…
▽ More
Conversational recommender systems (CRS) aim to provide personalized recommendations via interactive dialogues with users. While large language models (LLMs) enhance CRS with their superior understanding of context-aware user preferences, they typically struggle to leverage behavioral data, which have proven to be important for classical collaborative filtering (CF)-based approaches. For this reason, we propose CRAG, Collaborative Retrieval Augmented Generation for LLM-based CRS. To the best of our knowledge, CRAG is the first approach that combines state-of-the-art LLMs with CF for conversational recommendations. Our experiments on two publicly available movie conversational recommendation datasets, i.e., a refined Reddit dataset (which we name Reddit-v2) as well as the Redial dataset, demonstrate the superior item coverage and recommendation performance of CRAG, compared to several CRS baselines. Moreover, we observe that the improvements are mainly due to better recommendation accuracy on recently released movies. The code and data are available at https://github.com/yaochenzhu/CRAG.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding
Authors:
Miruna Oprescu,
David K. Park,
Xihaier Luo,
Shinjae Yoo,
Nathan Kallus
Abstract:
Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding -- wher…
▽ More
Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding -- where covariates are influenced by past treatments and, in turn, affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet), a theoretically grounded neural framework that combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences. GST-UNet explicitly adjusts for time-varying confounders and captures non-linear spatial and temporal dependencies, enabling valid causal inference from a single observed trajectory in data-scarce settings. We validate its effectiveness in synthetic experiments and in a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire. Together, these results position GST-UNet as a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
△ Less
Submitted 28 October, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Semiparametric Double Reinforcement Learning with Applications to Long-Term Causal Inference
Authors:
Lars van der Laan,
David Hubbard,
Allen Tran,
Nathan Kallus,
Aurélien Bibaut
Abstract:
Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend…
▽ More
Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend DRL to a semiparametric setting and develop doubly robust, automatic estimators for general linear functionals of the Q-function in infinite-horizon, time-homogeneous MDPs. By imposing structure on the Q-function, we relax the overlap conditions required by nonparametric methods and obtain efficiency gains. The second obstacle--density-ratio estimation--typically requires computationally expensive and unstable min-max optimization. To address both challenges, we introduce superefficient nonparametric estimators whose limiting variance falls below the generalized Cramer-Rao bound. These estimators treat the Q-function as a one-dimensional summary of the state-action process, reducing high-dimensional overlap requirements to a single-dimensional condition. The procedure is simple to implement: estimate and calibrate the Q-function using fitted Q-iteration, then plug the result into the target functional, thereby avoiding density-ratio estimation altogether.
△ Less
Submitted 12 November, 2025; v1 submitted 12 January, 2025;
originally announced January 2025.
-
Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits
Authors:
Brian Cho,
Dominik Meier,
Kyra Gan,
Nathan Kallus
Abstract:
In multi-armed bandits, the tasks of reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a practical bandit inference objective that aims to label arms with means above a threshold as quickl…
▽ More
In multi-armed bandits, the tasks of reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a practical bandit inference objective that aims to label arms with means above a threshold as quickly as possible. We show that GAI can be efficiently solved by combining a reward-maximizing sampling algorithm with a novel nonparametric anytime-valid sequential test for labeling arm means. We first establish that our sequential test maintains error control under highly nonparametric assumptions and asymptotically achieves the minimax optimal e-power, a notion of power for anytime-valid tests. Next, by pairing regret-minimizing sampling schemes with our sequential test, we provide an approach that achieves minimax optimal stopping times for labeling arms with means above a threshold, under an error probability constraint. Our empirical results validate our approach beyond the minimax setting, reducing the expected number of samples for all stopping times by at least 50% across both synthetic and real-world settings.
△ Less
Submitted 20 October, 2024;
originally announced October 2024.
-
Adjusting Regression Models for Conditional Uncertainty Calibration
Authors:
Ruijiang Gao,
Mingzhang Yin,
James McInerney,
Nathan Kallus
Abstract:
Conformal Prediction methods have finite-sample distribution-free marginal coverage guarantees. However, they generally do not offer conditional coverage guarantees, which can be important for high-stakes decisions. In this paper, we propose a novel algorithm to train a regression function to improve the conditional coverage after applying the split conformal prediction procedure. We establish an…
▽ More
Conformal Prediction methods have finite-sample distribution-free marginal coverage guarantees. However, they generally do not offer conditional coverage guarantees, which can be important for high-stakes decisions. In this paper, we propose a novel algorithm to train a regression function to improve the conditional coverage after applying the split conformal prediction procedure. We establish an upper bound for the miscoverage gap between the conditional coverage and the nominal coverage rate and propose an end-to-end algorithm to control this upper bound. We demonstrate the efficacy of our method empirically on synthetic and real-world datasets.
△ Less
Submitted 25 September, 2024;
originally announced September 2024.
-
The Central Role of the Loss Function in Reinforcement Learning
Authors:
Kaiwen Wang,
Nathan Kallus,
Wen Sun
Abstract:
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algo…
▽ More
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy's cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.
△ Less
Submitted 4 April, 2025; v1 submitted 19 September, 2024;
originally announced September 2024.
-
CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies
Authors:
Brian M Cho,
Ana-Roxana Pop,
Kyra Gan,
Sam Corbett-Davies,
Israel Nir,
Ariel Evnine,
Nathan Kallus
Abstract:
When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on…
▽ More
When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising. Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements, so too often they must revert to the baseline to maintain safety. We overcome these issues by leveraging the most powerful safety test in the asymptotic regime and allowing for multiple candidates to be tested for improvement over the baseline. We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level, even in moderate sample sizes. We present CSPI and CSPI-MT, two novel heuristics for selecting cutoff(s) to maximize the policy improvement from baseline. We demonstrate through both synthetic and external datasets that our approaches improve both the detection rates of safe policies and the realized improvement, particularly under stringent safety requirements and low signal-to-noise conditions.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
Estimating Heterogeneous Treatment Effects by Combining Weak Instruments and Observational Data
Authors:
Miruna Oprescu,
Nathan Kallus
Abstract:
Accurately predicting conditional average treatment effects (CATEs) is crucial in personalized medicine and digital platform analytics. Since the treatments of interest often cannot be directly randomized, observational data is leveraged to learn CATEs, but this approach can incur significant bias from unobserved confounding. One strategy to overcome these limitations is to leverage instrumental v…
▽ More
Accurately predicting conditional average treatment effects (CATEs) is crucial in personalized medicine and digital platform analytics. Since the treatments of interest often cannot be directly randomized, observational data is leveraged to learn CATEs, but this approach can incur significant bias from unobserved confounding. One strategy to overcome these limitations is to leverage instrumental variables (IVs) as latent quasi-experiments, such as randomized intent-to-treat assignments or randomized product recommendations. This approach, on the other hand, can suffer from low compliance, $\textit{i.e.}$, IV weakness. Some subgroups may even exhibit zero compliance, meaning we cannot instrument for their CATEs at all. In this paper, we develop a novel approach to combine IV and observational data to enable reliable CATE estimation in the presence of unobserved confounding in the observational data and low compliance in the IV data, including no compliance for some subgroups. We propose a two-stage framework that first learns $\textit{biased}$ CATEs from the observational data, and then applies a compliance-weighted correction using IV data, effectively leveraging IV strength variability across covariates. We characterize the convergence rates of our method and validate its effectiveness through a simulation study. Additionally, we demonstrate its utility with real data by analyzing the heterogeneous effects of 401(k) plan participation on wealth.
△ Less
Submitted 1 November, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Contextual Linear Optimization with Partial Feedback
Authors:
Yichun Hu,
Nathan Kallus,
Xiaojie Mao,
Yanchen Wu
Abstract:
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients in the objective and thereby improve decision-making performance. A canonical example is the stochastic shortest path problem with random edge costs (e.g., travel time) and contextual features (e.g., lagged traffic, weather). While existing work on CLO assumes fully observed c…
▽ More
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients in the objective and thereby improve decision-making performance. A canonical example is the stochastic shortest path problem with random edge costs (e.g., travel time) and contextual features (e.g., lagged traffic, weather). While existing work on CLO assumes fully observed cost coefficient vectors, in many applications the decision maker observes only partial feedback corresponding to each chosen decision in the history. In this paper, we study both a bandit-feedback setting (e.g., only the overall travel time of each historical path is observed) and a semi-bandit-feedback setting (e.g., travel times of the individual segments on each chosen path are additionally observed). We propose a unified class of offline learning algorithms for CLO with different types of feedback, following a powerful induced empirical risk minimization (IERM) framework that integrates estimation and optimization. We provide a novel fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of estimation methods. To solve the partial-feedback IERM, we also tailor computationally tractable surrogate losses. A byproduct of our theory of independent interest is the fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different methods numerically using stochastic shortest path examples on simulated and real data and provide practical insights from the empirical results.
△ Less
Submitted 9 November, 2025; v1 submitted 26 May, 2024;
originally announced May 2024.
-
Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation
Authors:
Zhankui He,
Zhouhang Xie,
Harald Steck,
Dawen Liang,
Rahul Jha,
Nathan Kallus,
Julian McAuley
Abstract:
Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item p…
▽ More
Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item popularity, on targeted conversational recommendation platforms. In conversational recommendation, LLMs recommend items by generating the titles (as multiple tokens) autoregressively, making it difficult to obtain and control the recommendations over all items. Thus, we propose a Reindex-Then-Adapt (RTA) framework, which converts multi-token item titles into single tokens within LLMs, and then adjusts the probability distributions over these single-token item titles accordingly. The RTA framework marries the benefits of both LLMs and traditional recommender systems (RecSys): understanding complex queries as LLMs do; while efficiently controlling the recommended item distributions in conversational recommendations as traditional RecSys do. Our framework demonstrates improved accuracy metrics across three different conversational recommendation datasets and two adaptation settings
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes
Authors:
Andrew Bennett,
Nathan Kallus,
Miruna Oprescu,
Wen Sun,
Kaiwen Wang
Abstract:
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP), using transition observations from the original MDP, whether they are generated under the same or a different policy. This is an important problem when there is the possibility of a shift between historical and future environments, $\textit{e.g.}$ due to unmeasured confounding, distribu…
▽ More
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP), using transition observations from the original MDP, whether they are generated under the same or a different policy. This is an important problem when there is the possibility of a shift between historical and future environments, $\textit{e.g.}$ due to unmeasured confounding, distributional shift, or an adversarial environment. We propose a perturbation model that allows changes in the transition kernel densities up to a given multiplicative factor or its reciprocal, extending the classic marginal sensitivity model (MSM) for single time-step decision-making to infinite-horizon RL. We characterize the sharp bounds on policy value under this model $\unicode{x2013}$ $\textit{i.e.}$, the tightest possible bounds based on transition observations from the original MDP $\unicode{x2013}$ and we study the estimation of these bounds from such transition observations. We develop an estimator with several important guarantees: it is semiparametrically efficient, and remains so even when certain necessary nuisance functions, such as worst-case Q-functions, are estimated at slow, nonparametric rates. Our estimator is also asymptotically normal, enabling straightforward statistical inference using Wald confidence intervals. Moreover, when certain nuisances are estimated inconsistently, the estimator still provides valid, albeit possibly not sharp, bounds on the policy value. We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and addressing the challenge of learning from finite samples (inference) together leads to credible and reliable policy evaluation.
△ Less
Submitted 1 November, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
-
Variation Due to Regularization Tractably Recovers Bayesian Deep Learning
Authors:
James McInerney,
Nathan Kallus
Abstract:
Uncertainty quantification in deep learning is crucial for safe and reliable decision-making in downstream tasks. Existing methods quantify uncertainty at the last layer or other approximations of the network which may miss some sources of uncertainty in the model. To address this gap, we propose an uncertainty quantification method for large networks based on variation due to regularization. Esse…
▽ More
Uncertainty quantification in deep learning is crucial for safe and reliable decision-making in downstream tasks. Existing methods quantify uncertainty at the last layer or other approximations of the network which may miss some sources of uncertainty in the model. To address this gap, we propose an uncertainty quantification method for large networks based on variation due to regularization. Essentially, predictions that are more (less) sensitive to the regularization of network parameters are less (more, respectively) certain. This principle can be implemented by deterministically tweaking the training loss during the fine-tuning phase and reflects confidence in the output as a function of all layers of the network. We show that regularization variation (RegVar) provides rigorous uncertainty estimates that, in the infinitesimal limit, exactly recover the Laplace approximation in Bayesian deep learning. We demonstrate its success in several deep learning architectures, showing it can scale tractably with the network size while maintaining or improving uncertainty quantification quality. Our experiments across multiple datasets show that RegVar not only identifies uncertain predictions effectively but also provides insights into the stability of learned representations.
△ Less
Submitted 24 April, 2025; v1 submitted 15 March, 2024;
originally announced March 2024.
-
A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents
Authors:
Kaiwen Wang,
Dawen Liang,
Nathan Kallus,
Wen Sun
Abstract:
We study risk-sensitive RL where the goal is learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounde…
▽ More
We study risk-sensitive RL where the goal is learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via a reductions approach, we leverage theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP, where all Markovian policies provably fail.
△ Less
Submitted 27 February, 2025; v1 submitted 10 March, 2024;
originally announced March 2024.
-
Is Cosine-Similarity of Embeddings Really About Similarity?
Authors:
Harald Steck,
Chaitanya Ekanadham,
Nathan Kallus
Abstract:
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors…
▽ More
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless `similarities.' For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations are employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
Authors:
Alex Ayoub,
Kaiwen Wang,
Vincent Liu,
Samuel Robertson,
James McInerney,
Dawen Liang,
Nathan Kallus,
Csaba Szepesvári
Abstract:
We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bo…
▽ More
We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
△ Less
Submitted 1 August, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Applied Causal Inference Powered by ML and AI
Authors:
Victor Chernozhukov,
Christian Hansen,
Nathan Kallus,
Martin Spindler,
Vasilis Syrgkanis
Abstract:
An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclical graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools.
An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclical graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning
Authors:
Kaiwen Wang,
Owen Oertell,
Alekh Agarwal,
Nathan Kallus,
Wen Sun
Abstract:
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributio…
▽ More
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
△ Less
Submitted 11 February, 2024;
originally announced February 2024.
-
Peeking with PEAK: Sequential, Nonparametric Composite Hypothesis Tests for Means of Multiple Data Streams
Authors:
Brian Cho,
Kyra Gan,
Nathan Kallus
Abstract:
We propose a novel nonparametric sequential test for composite hypotheses for means of multiple data streams. Our proposed method, \emph{peeking with expectation-based averaged capital} (PEAK), builds upon the testing-by-betting framework and provides a non-asymptotic $α$-level test across any stopping time. Our contributions are two-fold: (1) we propose a novel betting scheme and provide theoreti…
▽ More
We propose a novel nonparametric sequential test for composite hypotheses for means of multiple data streams. Our proposed method, \emph{peeking with expectation-based averaged capital} (PEAK), builds upon the testing-by-betting framework and provides a non-asymptotic $α$-level test across any stopping time. Our contributions are two-fold: (1) we propose a novel betting scheme and provide theoretical guarantees on type-I error control, power, and asymptotic growth rate/$e$-power in the setting of a single data stream; (2) we introduce PEAK, a generalization of this betting scheme to multiple streams, that (i) avoids using wasteful union bounds via averaging, (ii) is a test of power one under mild regularity conditions on the sampling scheme of the streams, and (iii) reduces computational overhead when applying the testing-as-betting approaches for pure-exploration bandit problems. We illustrate the practical benefits of PEAK using both synthetic and real-world HeartSteps datasets. Our experiments show that PEAK provides up to an 85\% reduction in the number of samples before stopping compared to existing stopping rules for pure-exploration bandit problems, and matches the performance of state-of-the-art sequential tests while improving upon computational complexity.
△ Less
Submitted 2 June, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Multi-Armed Bandits with Interference
Authors:
Su Jia,
Peter Frazier,
Nathan Kallus
Abstract:
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assig…
▽ More
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {\em all} units, where the influence of a unit decays in the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {\em expected} regret $\tilde O(\sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) for any switchback policy suffers a high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {\em expectation} and (ii) admits a high probability bound that vanishes in $N$.
△ Less
Submitted 15 July, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Clustered Switchback Designs for Experimentation Under Spatio-temporal Interference
Authors:
Su Jia,
Nathan Kallus,
Christina Lee Yu
Abstract:
We consider experimentation in the presence of non-stationarity, inter-unit (spatial) interference, and carry-over effects (temporal interference), where we wish to estimate the global average treatment effect (GATE), the difference between average outcomes having exposed all units at all times to treatment or to control. We suppose spatial interference is described by a graph, where a unit's outc…
▽ More
We consider experimentation in the presence of non-stationarity, inter-unit (spatial) interference, and carry-over effects (temporal interference), where we wish to estimate the global average treatment effect (GATE), the difference between average outcomes having exposed all units at all times to treatment or to control. We suppose spatial interference is described by a graph, where a unit's outcome depends on its neighborhood's treatments, and that temporal interference is described by an MDP, where the transition kernel under either treatment (action) satisfies a rapid mixing condition. We propose a clustered switchback design, where units are grouped into clusters and time steps are grouped into blocks, and each whole cluster-block combination is assigned a single random treatment. Under this design, we show that for graphs that admit good clustering, a truncated Horvitz-Thompson estimator achieves a $\tilde O(1/NT)$ mean squared error (MSE), matching the lower bound up to logarithmic terms for sparse graphs. Our results simultaneously generalize the results from \citet{hu2022switchback,ugander2013graph} and \citet{leung2022rate}. Simulation studies validate the favorable performance of our approach.
△ Less
Submitted 26 March, 2025; v1 submitted 24 December, 2023;
originally announced December 2023.
-
Low-Rank MDPs with Continuous Action Spaces
Authors:
Andrew Bennett,
Nathan Kallus,
Miruna Oprescu
Abstract:
Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuo…
▽ More
Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
△ Less
Submitted 1 April, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Off-Policy Evaluation for Large Action Spaces via Policy Convolution
Authors:
Noveen Sachdeva,
Lequn Wang,
Dawen Liang,
Nathan Kallus,
Julian McAuley
Abstract:
Developing accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiase…
▽ More
Developing accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions -- made available through action embeddings -- to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC, especially when either the action space or policy mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over existing estimators.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Large Language Models as Zero-Shot Conversational Recommenders
Authors:
Zhankui He,
Zhouhang Xie,
Rahul Jha,
Harald Steck,
Dawen Liang,
Yesu Feng,
Bodhisattwa Prasad Majumder,
Nathan Kallus,
Julian McAuley
Abstract:
In this paper, we present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting with three primary contributions. (1) Data: To gain insights into model behavior in "in-the-wild" conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. Thi…
▽ More
In this paper, we present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting with three primary contributions. (1) Data: To gain insights into model behavior in "in-the-wild" conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. This is the largest public real-world conversational recommendation dataset to date. (2) Evaluation: On the new dataset and two existing conversational recommendation datasets, we observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. (3) Analysis: We propose various probing tasks to investigate the mechanisms behind the remarkable performance of large language models in conversational recommendation. We analyze both the large language models' behaviors and the characteristics of the datasets, providing a holistic understanding of the models' effectiveness, limitations and suggesting directions for the design of future conversational recommenders
△ Less
Submitted 19 August, 2023;
originally announced August 2023.
-
Source Condition Double Robust Inference on Functionals of Inverse Problems
Authors:
Andrew Bennett,
Nathan Kallus,
Xiaojie Mao,
Whitney Newey,
Vasilis Syrgkanis,
Masatoshi Uehara
Abstract:
We consider estimation of parameters defined as linear functionals of solutions to linear inverse problems. Any such parameter admits a doubly robust representation that depends on the solution to a dual linear inverse problem, where the dual solution can be thought as a generalization of the inverse propensity function. We provide the first source condition double robust inference method that ens…
▽ More
We consider estimation of parameters defined as linear functionals of solutions to linear inverse problems. Any such parameter admits a doubly robust representation that depends on the solution to a dual linear inverse problem, where the dual solution can be thought as a generalization of the inverse propensity function. We provide the first source condition double robust inference method that ensures asymptotic normality around the parameter of interest as long as either the primal or the dual inverse problem is sufficiently well-posed, without knowledge of which inverse problem is the more well-posed one. Our result is enabled by novel guarantees for iterated Tikhonov regularized adversarial estimators for linear inverse problems, over general hypothesis spaces, which are developments of independent interest.
△ Less
Submitted 25 July, 2023;
originally announced July 2023.
-
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning
Authors:
Kaiwen Wang,
Junxiong Wang,
Yueying Li,
Nathan Kallus,
Immanuel Trummer,
Wen Sun
Abstract:
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost and it is the core NP-hard combinatorial optimization problem of query optimization. In this paper, we present JoinGym, a lightweight and easy-to-use query optimization environment for reinforcement learning (RL) that captures both the left-deep and bushy variants of the JOS problem. Compar…
▽ More
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost and it is the core NP-hard combinatorial optimization problem of query optimization. In this paper, we present JoinGym, a lightweight and easy-to-use query optimization environment for reinforcement learning (RL) that captures both the left-deep and bushy variants of the JOS problem. Compared to existing query optimization environments, the key advantages of JoinGym are usability and significantly higher throughput which we accomplish by simulating query executions entirely offline. Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset. We release a novel cardinality dataset for $3300$ SQL queries based on real IMDb workloads which may be of independent interest, e.g., for cardinality estimation. Finally, we extensively benchmark four RL algorithms and find that their cost distributions are heavy-tailed, which motivates future work in risk-sensitive RL. In sum, JoinGym enables users to rapidly prototype RL algorithms on realistic database problems without needing to setup and run live systems.
△ Less
Submitted 17 October, 2023; v1 submitted 21 July, 2023;
originally announced July 2023.
-
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
Authors:
Kaiwen Wang,
Kevin Zhou,
Runzhe Wu,
Nathan Kallus,
Wen Sun
Abstract:
While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than th…
▽ More
While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.
△ Less
Submitted 22 September, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Provable Offline Preference-Based Reinforcement Learning
Authors:
Wenhao Zhan,
Masatoshi Uehara,
Nathan Kallus,
Jason D. Lee,
Wen Sun
Abstract:
In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offl…
▽ More
In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
△ Less
Submitted 29 September, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding
Authors:
Miruna Oprescu,
Jacob Dorn,
Marah Ghoummaid,
Andrew Jesson,
Nathan Kallus,
Uri Shalit
Abstract:
Estimating heterogeneous treatment effects from observational data is a crucial task across many fields, helping policy and decision-makers take better actions. There has been recent progress on robust and efficient methods for estimating the conditional average treatment effect (CATE) function, but these methods often do not take into account the risk of hidden confounding, which could arbitraril…
▽ More
Estimating heterogeneous treatment effects from observational data is a crucial task across many fields, helping policy and decision-makers take better actions. There has been recent progress on robust and efficient methods for estimating the conditional average treatment effect (CATE) function, but these methods often do not take into account the risk of hidden confounding, which could arbitrarily and unknowingly bias any causal estimate based on observational data. We propose a meta-learner called the B-Learner, which can efficiently learn sharp bounds on the CATE function under limits on the level of hidden confounding. We derive the B-Learner by adapting recent results for sharp and valid bounds of the average treatment effect (Dorn et al., 2021) into the framework given by Kallus & Oprescu (2023) for robust and model-agnostic learning of conditional distributional treatment effects. The B-Learner can use any function estimator such as random forests and deep neural networks, and we prove its estimates are valid, sharp, efficient, and have a quasi-oracle property with respect to the constituent estimators under more general conditions than existing methods. Semi-synthetic experimental comparisons validate the theoretical findings, and we use real-world data to demonstrate how the method might be used in practice.
△ Less
Submitted 13 June, 2023; v1 submitted 20 April, 2023;
originally announced April 2023.
-
Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness
Authors:
Andrew Bennett,
Nathan Kallus,
Xiaojie Mao,
Whitney Newey,
Vasilis Syrgkanis,
Masatoshi Uehara
Abstract:
In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (…
▽ More
In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g.,} projected norm) rather than valid metrics (\emph{e.g.,} $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.
△ Less
Submitted 10 February, 2023;
originally announced February 2023.
-
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
Authors:
Kaiwen Wang,
Nathan Kallus,
Wen Sun
Abstract:
In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $τ$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $Ω(\sqrt{τ^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a nov…
▽ More
In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $τ$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $Ω(\sqrt{τ^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $Ω(\sqrt{τ^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{τ^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(τ^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $τ$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.
△ Less
Submitted 24 May, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage
Authors:
Masatoshi Uehara,
Nathan Kallus,
Jason D. Lee,
Wen Sun
Abstract:
In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage…
▽ More
In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
△ Less
Submitted 13 November, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Smooth Non-Stationary Bandits
Authors:
Su Jia,
Qian Xie,
Nathan Kallus,
Peter I. Frazier
Abstract:
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-tha…
▽ More
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $β$-Hölder function, i.e., a function that is $(β-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $β$ increases. When $β=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde Θ(T^{2/3})$. We show the first separation between the smooth (i.e., $β\ge 2$) and non-smooth (i.e., $β=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance. We complement this result by showing that the minimax regret on the $β$-Hölder family of instances is $Ω(T^{(β+1)/(2β+1)})$ for any integer $β\ge 1$. This matches our upper bound for $β=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
△ Less
Submitted 17 November, 2024; v1 submitted 29 January, 2023;
originally announced January 2023.
-
A Review of Off-Policy Evaluation in Reinforcement Learning
Authors:
Masatoshi Uehara,
Chengchun Shi,
Nathan Kallus
Abstract:
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a…
▽ More
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a discussion on the efficiency bound of OPE, some of the existing state-of-the-art OPE methods, their statistical properties and some other related research directions that are currently actively explored.
△ Less
Submitted 12 December, 2022;
originally announced December 2022.
-
The Implicit Delta Method
Authors:
Nathan Kallus,
James McInerney
Abstract:
Epistemic uncertainty quantification is a crucial part of drawing credible conclusions from predictive models, whether concerned about the prediction at a given point or any downstream evaluation that uses the model as input. When the predictive model is simple and its evaluation differentiable, this task is solved by the delta method, where we propagate the asymptotically-normal uncertainty in th…
▽ More
Epistemic uncertainty quantification is a crucial part of drawing credible conclusions from predictive models, whether concerned about the prediction at a given point or any downstream evaluation that uses the model as input. When the predictive model is simple and its evaluation differentiable, this task is solved by the delta method, where we propagate the asymptotically-normal uncertainty in the predictive model through the evaluation to compute standard errors and Wald confidence intervals. However, this becomes difficult when the model and/or evaluation becomes more complex. Remedies include the bootstrap, but it can be computationally infeasible when training the model even once is costly. In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss of the predictive model to automatically assess downstream uncertainty. We show that the change in the evaluation due to regularization is consistent for the asymptotic variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference. This provides both a reliable quantification of uncertainty in terms of standard errors as well as permits the construction of calibrated confidence intervals. We discuss connections to other approaches to uncertainty quantification, both Bayesian and frequentist, and demonstrate our approach empirically.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.
-
Provable Safe Reinforcement Learning with Binary Feedback
Authors:
Andrew Bennett,
Dipendra Misra,
Nathan Kallus
Abstract:
Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We ther…
▽ More
Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provable safe RL when given access to an offline oracle providing binary feedback on the safety of state, action pairs. We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical results shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both theoretical and empirical frameworks.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
Authors:
Masatoshi Uehara,
Haruka Kiyohara,
Andrew Bennett,
Victor Chernozhukov,
Nan Jiang,
Nathan Kallus,
Chengchun Shi,
Wen Sun
Abstract:
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs.…
▽ More
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states, and the Bellman completeness. Finally, we extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.
△ Less
Submitted 14 November, 2023; v1 submitted 26 July, 2022;
originally announced July 2022.
-
Learning Bellman Complete Representations for Offline Policy Evaluation
Authors:
Jonathan D. Chang,
Kaiwen Wang,
Nathan Kallus,
Wen Sun
Abstract:
We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations…
▽ More
We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, with results being mostly theoretical in nature. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Square Policy Evaluation (LSPE) with linear functions in our learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the Deepmind Control Suite. We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieve competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both linear Bellman complete and coverage components of our method are crucial.
△ Less
Submitted 12 July, 2022;
originally announced July 2022.