Introduction

Plasma disruptions, events where control of the plasma is lost, stand as a major barrier to realizing economical fusion energy through tokamak devices. Unmitigated disruptions in existing tokamak pilot plant designs, such as ARC, may disrupt machine operations for months or longer1. Even at the energies of upcoming burning plasma devices, such as SPARC and ITER, disruptions represent a major challenge, driving costly design requirements and posing risks of major delays to tokamak operations2,3. To overcome the challenge of disruptions, considerable advances are needed in avoidance, mitigation, resilience, and recovery (AMRR) techniques. Mitigation involves suddenly terminating the plasma discharge with techniques such as massive gas injection (MGI)4. Given its crucial role in machine protection, considerable effort has been allocated to developing mitigation actuators and disruption prediction algorithms to trigger them5,6,7. While robust, reliable mitigation is crucial for investment protection, it is still expected to induce considerable stress on tokamak components and should be considered a method of last resort8. Thus, avoidance techniques that terminate the plasma safely would be preferable.

To safely terminate the plasma, the plasma current and stored energy must be reduced to a benign level; this is often referred to as a ramp-down. The primary figure of merit for a successful ramp-down is often the plasma current Ip at which the plasma terminates, as both structural loading and the risk posed by runaway electrons scale non-linearly with Ip9,10. In recent years, works have developed both Sequential Quadratic Programming (SQP) and Bayesian Optimization approaches for designing feed-forward ramp-down trajectories that are programmed into the machine11,12,13. However, many challenges remain, such as demonstrating the scalability needed to generate many trajectories for different scenarios and optimizing trajectories that are robust to sources of uncertainty. In addition, active adjustment of the plasma trajectory through real-time feedback remains an underdeveloped area with calls for further advancement14.

The problem setting of ramp-downs faces the considerable challenge of physics uncertainty. While recent work successfully demonstrated the application of reinforcement learning (RL) to the problem of shape control15, the physics relevant to that problem, ideal Magnetohydrodynamics (MHD), is a relatively well-understood and well-simulated aspect of tokamak plasma physics. Disruption avoidance, however, is more challenging as it involves complex physical phenomena such as turbulent transport for which accurate simulation from physics-based principles remains an open research problem. The highest fidelity simulations of core turbulence performed today with nonlinear gyrokinetics take millions of CPU-hours to simulate a steady-state condition and, even then, require boundary condition assumptions on quantities such as the pedestal pressure and impurity densities16. The extreme computational requirements of physics-based simulations stand in stark contrast to tokamak operations requirements. When unexpected physics or technical issues arise during tokamak operations, dynamics models, trajectories, and controllers need to be updated as soon as possible to ensure subsequent discharges are successful. As a consequence, scenario and control design is often performed with highly simplified dynamics models through empirical power-law scalings and linear systems obtained with techniques from classical system identification17. In fact, both the ITER and SPARC nominal operating scenarios were designed using empirical plasma operational contour (POPCON) scalings with scenario validation performed with higher fidelity simulations18,19,20.

This challenge has motivated the development of data-driven dynamics models of existing devices for the purposes of reinforcement learning, but standard sequence-to-sequence models in the deep learning toolbox, such as recurrent neural networks (RNNs) and transformers, are generally not sample efficient. This is a major challenge as burning plasma tokamaks must work reliably with as few discharges of training data as possible, given the high cost of disruptions; existing works often use more than \(10^{4}\) discharges of data21,22. Sample-efficient dynamics modeling methods are needed.

This paper aims to advance the state-of-the-art on three fronts: sample-efficient dynamics model learning, active ramp-down control, and optimizing feed-forward trajectories with robustness to dynamics uncertainty. To address the model learning problem, we train a simple hybrid dynamics model that contains both physics and machine learning components on a relatively small dataset of simulations of SPARC PRD ramp-downs with the control-oriented simulator RAPTOR23,24. This dynamics model is fully implemented in the machine learning framework JAX25, and is thus fully differentiable and GPU-capable, enabling massive parallelization across physics assumptions. We then wrap this dynamics model in an OpenAI Gym26 environment to create what we call “PopDownGym”, which is depicted in Fig. 1a. We use PopDownGym to address the matters of feed-forward trajectory optimization and feedback control in a unified reinforcement learning (RL) approach. As a first step towards experimental demonstration of active ramp-down control, we train a policy on PopDownGym with Proximal Policy Optimization (PPO) and demonstrate its transfer to RAPTOR, where it successfully ramps down a simulated plasma while avoiding user-specified limits. We address the issue of validation by demonstrating how a constraint-conditioned policy can be used to design a library of trajectories to handle different conditions by performing policy rollout on simulators under different physics settings. Finally, we bridge the gap between reinforcement learning and more conventional trajectory optimization approaches by using the RL training environment to perform robust optimization of a feed-forward trajectory with an evolutionary algorithm.

Fig. 1: An overview of the methods developed in this paper.
figure 1

a Depicts the developed methodology for calibrating a dynamics model using the techniques overviewed in the Methods subsection on “Neural Differential Equations and Differentiable Simulation”. The dynamics model calibrated in this work is detailed in the Methods subsection titled “Dynamics Model”. This dynamics model forms the core of the training environment “PopDownGym”, which also contains a reward function and random variables. b Depicts the training workflow for the constraint-conditioned policy using Proximal Policy Optimization (PPO). c Depicts the workflow for applying the policy to generate a library of trajectories, which is demonstrated in Fig. 2. d Depicts the demonstration of running the policy as a controller in closed loop against RAPTOR, with results shown in Fig. 3. e Depicts how the training environment can be used for robust trajectory optimization, which is demonstrated in Fig. 4.

Results

We use the simulator RAPTOR configured to the SPARC primary reference discharge (PRD)18, an 8.7 megaampere (MA) H-mode scenario projected to reach an energy gain, Q, of approximately 11, as our testbed. We first generate a dataset of 481 ramp-down simulations with varying plasma current and auxiliary heating ramp-rates. Of these, 336 simulations are used to train the PopDownGym dynamics model, while the rest are used for validation.

As shown in Table 1, the policy receives an observation from an eight-dimensional observation space to select an action vector from a four-dimensional action space. It has the goal of ramping down the plasma current to below 2 MA while respecting user-specified constraints on eight quantities that the device operator will want to limit. While information on the risk profile of SPARC as a function of plasma current is not available, this goal current was chosen because ITER disruptions below 3 MA are expected to be benign27. As discussed further in the Methods section, we train the control policy to accept user-specified constraint limits as inputs. This allows the constraints to be adjusted at inference time without retraining.

Table 1 Observation and action spaces with bounds

This section highlights two use cases for the trained policy: 1) as a trajectory design assistant to design feed-forward trajectories for a wide range of physics conditions, and 2) as a real-time controller for actively avoiding disruptive limits. As a step towards demonstrating 2) experimentally, we transfer the control policy to RAPTOR. Finally, we demonstrate that the same gym environment can be leveraged to perform robust optimization of trajectories (i.e., optimization of feedforward trajectories that perform well under different conditions).

Application 1: policies as trajectory design assistants

It is desirable for burning plasma devices such as SPARC and ITER to achieve their core mission with as few discharges and disruptions as possible. In this context, the application of a learned control policy, which generally takes a non-trivial amount of trial and error to transfer to reality, is not desirable without extensive validation and operational history on existing devices. However, we demonstrate that the learned control policy can still be used as a trajectory design assistant to design a library of feedforward trajectories to handle a variety of different scenarios. These feedforward trajectories can then be validated against higher fidelity simulations before being programmed into the machine. Figure 1c depicts this workflow.

Constraint-conditioned policies

Due to uncertainty about disruptive and machine limits, a key feature in a useful trajectory design assistant is the ability for the user to adjust the constraints at inference time. For example, there is uncertainty about what values of βp may be correlated with dangerous neoclassical tearing modes (NTMs) and what values of li or the Shafranov coefficient may be correlated with a vertical displacement event (VDE). After new information about such limits is acquired through tokamak operations, it is desirable to redesign trajectories given this new information as quickly as possible to enable operational continuity.

To address this challenge, we train a constraint-conditioned policy that allows the user to specify the constraint boundaries at inference time. Figure 2 shows a library of trajectories generated with two sets of constraint settings: one more relaxed, the other more stringent. The new trajectories are generated without any policy re-training and can therefore be produced with a single policy rollout; in our demonstration, this takes only seconds. In Fig. 2, the policy mostly does a good job of satisfying the constraints, with a small amount of violation in li at the end of the trajectory in the more stringent constraint case. We find that the policy struggles to satisfy the more stringent \(\frac{d{W}_{th}}{dt}\) requirement, even in experiments where we increased the penalty for constraint violation. We hypothesize that, in certain cases, it is physically infeasible to avoid the large stored energy change caused by the confinement regime transition.

Fig. 2: Demonstration of application 1: policies as trajectory design assistants.
figure 2

The figures above show an example of libraries of action trajectories generated under flexible (a) and more stringent (b) constraints. Each trajectory corresponds to one possible sample from the distribution of initial conditions and physics parameters, specified in Table 2. The green shaded region in the Ip plots denotes the target plasma current of below 2 MA, while the red shaded regions denote the constraint values. The constraint values conditioned on for each scenario, in relation to the minimum and maximum values used for training, are visualized as red bars on the right of each plot.

Application 2: demonstration of ramp-down feedback control in RAPTOR

One use case for this learned policy is to serve as a controller that uses real-time observations of the plasma state to make decisions about how to ramp down the plasma. As a step towards experimentally demonstrating such a capability, we demonstrate the successful transfer of the control policy to RAPTOR as a feedback controller. Figure 1 shows the setup for our sim2sim demonstrations. The control policy receives observations from RAPTOR at each time step and decides on an action. To smooth out the effects of numerical spikes from the simulation, we employ an action buffer such that the average of the five latest actions is input into RAPTOR. Similar techniques are employed in other reinforcement learning applications where noise and outlier measurements need to be handled28. We do not directly account for experimental measurement noise and partial observability, but note that the policy can be deployed in conjunction with a previously developed real-time state observer to mitigate this issue29.
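For illustration, a minimal JAX sketch of such a moving-average action buffer is given below; the buffer length of five matches the text, while the function name and array shapes are illustrative rather than the actual implementation:

```python
import jax.numpy as jnp

def smooth_action(buffer: jnp.ndarray, new_action: jnp.ndarray):
    """Push the newest policy action into a fixed-length buffer of the five
    most recent actions and return the buffer mean, which is sent to RAPTOR."""
    buffer = jnp.roll(buffer, shift=-1, axis=0).at[-1].set(new_action)
    return buffer, buffer.mean(axis=0)
```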

Figure 3 shows the results of one such demonstration compared against a naive baseline feedforward trajectory, which violates many of the constraints. The policy overall does a good job of avoiding the constraint boundaries, but there are imperfections. As in the stringent constraint case in Fig. 2, we observe a small amount of li constraint violation at the end of the ramp-down. In the action space, this can be attributed to the policy deciding to ramp the plasma current extremely quickly once it is close to its goal, which is not necessarily undesirable. We also observe spikes in \(\frac{d}{dt}{W}_{th}\) and \(\frac{d}{dt}{B}_{v}\) at the H to L mode transition, which we attribute to the numerics of boundary condition switching at that transition. After the spike, the values quickly settle to within the specified constraints. Similar spikes in \(\frac{d}{dt}{B}_{v}\) also occur towards the end of the simulation.

Fig. 3: Demonstration of application 2: ramp-down feedback control.
figure 3

A comparison of RAPTOR simulation results from a naive baseline feed-forward trajectory (red) against the PPO-trained policy running in closed loop (blue). The goal and constraint trajectories are shown in (a), while the action trajectories are shown in (b). The PPO policy clearly yields a considerable reduction in constraint violation for βp, li, and Γ. While the nominal SPARC PRD ramp-down has a constant ramp-rate of 1 MA/s, our baseline was selected to have the same average ramp-rate as the PPO policy to better highlight how more complex time traces can reduce constraint violation while keeping the same ramp-down time. Note that while the nominal action space is the rates of change of Paux and gs, the action trajectories plot (right) shows the time-integrated values for interpretability. Simulation of the stored energy and vertical field rate of change involves spikes at sawtooth and confinement regime transition events, which are due to RAPTOR's approach to handling boundary conditions at these events.

The policy exhibits some notable qualitative behaviors. For one, it begins with a large plasma current ramp-rate, but decreases the ramp-rate as the plasma state enters the regime where the H to L mode transition is expected. This is an intuitively correct behavior for two reasons: 1) the H to L mode transition can lead to a large \(\frac{d}{dt}{B}_{v}\), and one way to decrease its value is to decrease the ramp-rate, and 2) βp starts approaching the user-specified limit, which happens when the current decreases faster than the stored thermal energy.

Application 3: feed-forward optimization with robustness to physics uncertainty

A natural question that arises when using a trained control policy to generate feedforward trajectories is whether we might instead directly optimize feedforward trajectories, skipping the policy representation step entirely. This approach has a number of potential benefits, particularly given the risk involved in transferring an untested learned control policy to hardware. The main difficulty that arises here is that while a feedback policy is able to observe and adjust to disturbances and uncertainty in the dynamics, purely feedforward trajectories must be designed to be robust to those disturbances without feedback.

To address this challenge, we implement a robust feedforward trajectory optimization pipeline that uses a large number of parallel simulations, each with different physics assumptions (i.e., random choices for the parameters in Table 2). We optimize a single feedforward trajectory to minimize the average constraint violation across all of these simulations. The optimized trajectory and its performance across a range of random parameter values are shown in Fig. 4, showing that the optimized feedforward trajectory achieves similar hitting time and constraint satisfaction rates to the feedback policy learned using RL. The trajectory transferred to RAPTOR is also shown in Fig. 4, where we see it does a similarly good job of avoiding the constraint limits as the RL-trained policy, with the exception of the li limit, which it violates towards the end of the trajectory, likely due to a sim2sim gap between the training and test environments. Given that the feed-forward trajectory experiences this issue, while the learned policy manages to mostly avoid this limit, this result highlights the advantages of real-time feedback for making course corrections.

Table 2 Randomized parameter ranges used for policy training
Fig. 4: Demonstration of application 3: robust feed-forward optimization.
figure 4

a Shows results from parallel PopDownGym instances running the same feed-forward trajectory found via robust optimization. b Shows results from RAPTOR simulation running the feed-forward trajectory. c Shows the feed-forward trajectory used in both cases.

Sensitivity analysis

Our reinforcement learning system also enables sensitivity analysis with respect to constraint limits and physics uncertainty. The former is enabled by the constraint-conditioned policy, which allows exploration of the optimal ramp-down for different constraint settings, and the latter by the gym environment's massive parallelization. Such information is critical for both device design and operations as it helps designers, operators, and physics modelers decide how to allocate scarce resources. For example, our results in Fig. 5 show that the optimal time to ramp down is highly sensitive to the li constraint, indicating that improvements to the vertical stability (VS) system are critical for enabling fast, safe ramp-downs. On the physics uncertainty front, Fig. 5 indicates, for example, strong sensitivity to the H-factor, but relatively low sensitivity to the ratio of electron temperature to ion temperature. While such results require validation, they provide information on where to allocate the efforts of extremely computationally expensive simulations and device experiments.

Fig. 5: Sensitivity of time to goal.
figure 5

This plot shows the sensitivity of the time to goal with respect to constraint limit settings (a) and physics uncertainty (b), with the 95% confidence intervals corresponding to the physics uncertainty. In (b), the given physics parameter is fixed while the other physics uncertainty parameters are sampled randomly.

Robustness analysis

Similar to the sensitivity analysis carried out for the constraint-conditioned policy learned via PPO, we can examine the robustness of the static feedforward trajectory to varying physical parameters. Figure 6 shows a sensitivity analysis indicating which parameters have the strongest effect on the constraint satisfaction of the feedforward trajectory; we see that H-factor has the strongest effect, with low H-factor making constraint violation more likely.

Fig. 6: Sensitivity of constraint satisfaction.
figure 6

This plot shows the effect of varying physical parameters on constraint satisfaction for the feedforward trajectory. Raw data points are shown in gray, and a quadratic best fit and 95% confidence interval are shown in orange.

Discussion

This paper develops a reinforcement learning approach to designing control policies and robust trajectories for the ramp-down phase. We place a particular emphasis on meeting several key needs of burning plasma tokamaks such as SPARC and ITER. Namely, we develop a methodology that: 1) requires relatively few discharges of training data, 2) is adaptable between discharges, and 3) enables control validation by generating a library of trajectories offline.

To train the policy, we propose a new sample-efficient approach to build hybrid physics and machine learning models using the frameworks of neural differential equations and differentiable simulation. The sample efficiency gained from the physics structure of the model enabled a successful demonstration of sim2sim policy transfer while training the dynamics model on only ≈300 simulations, in contrast to prior works which used on the order of \(10^{4}\) simulations. The model is fully implemented in JAX, enabling massive GPU parallelism across physics uncertainty.

We also address the issue of control validation for near-term burning plasma tokamaks such as SPARC and ITER and demonstrate two ways in which fusion experiments may leverage the benefits of reinforcement learning in a way more compatible with this sensitive context. First, we further leverage GPU parallelism during policy training to parallelize across constraint settings to train a constraint-conditioned policy. This constraint-conditioned policy enables the user to adjust constraint settings at inference time and generate new trajectories in seconds with a single policy rollout. We propose that the constraint-conditioned policy may be a useful trajectory design assistant for tokamak operators, as unexpected developments during tokamak operations often demand the adjustment of settings. Since the constraint-conditioned policy is being used to design trajectories offline, its results can be checked and validated by human operators and against validation simulations. It may also be integrated with other, more classical, techniques such as model predictive control (MPC)30. Second, we demonstrate that the gym environment can be used in conjunction with other methods for feed-forward optimization. As a demonstration, we use an evolutionary algorithm to perform robust optimization of a feed-forward trajectory on parallel simulators with different physics settings. The resulting feed-forward trajectory is transferred to RAPTOR and avoids most constraint limits, but does violate the li constraint towards the end of the trajectory by a significant amount, an issue which the policy run in closed loop with RAPTOR largely avoids. This result highlights one benefit of having a policy make corrections in real time, a benefit which can be evaluated against the risks of running a learned policy on an expensive machine.

The biggest limitation of this work is the use of a relatively low-fidelity plasma simulator as a testbed, a choice motivated by the rapid iteration cycles enabled by RAPTOR, which proved critical for advancing and iterating on the large number of new techniques presented in this work. Ongoing work aims to replace the “RAPTOR” box used in this paper with an existing tokamak. In that setting, data from the tokamak will be used to train a dynamics model on which the control policy is then trained. Future work should also apply this approach to the highest fidelity simulations possible of upcoming devices such as SPARC and ITER. It is the hope of the authors that such a research programme will enable the practical application of the powerful tools of machine learning to make progress on the challenging and critical problem of disruptions.

Methods

The first three subsections aim to provide the reader with an overview of the techniques this paper draws from, with subsequent subsections providing further technical detail. The first subsection aims to define the framework and capabilities of RL, and show how it can be applied to the problem of trajectory optimization, which is typically approached from the perspective of optimal control. We also provide a brief explanation of how we train a constraint-conditioned policy by drawing upon techniques from the goal-conditioned RL literature31. The key necessary ingredient for the RL workflow is a time-dependent dynamics model, so in the second subsection we discuss the challenges involved in plasma dynamics modeling. In the third subsection, we propose the framework of neural differential equations and differentiable simulation to address the challenges of time-dependent plasma dynamics modeling by synthesizing physics-based principles with data and provide a brief introduction to that paradigm. Following these higher level subsections, the fourth subsection specifies the reward function used in this work, the fifth subsection details the dynamics model used as the training environment, the sixth subsection details the RL algorithms used, and the final subsection details RAPTOR simulation settings.

Reinforcement learning and optimal control

Reinforcement learning is typically concerned with solving Partially Observable Markov Decision Processes (POMDPs). A key component of a POMDP is the dynamics model, f, which may be stochastic. This model maps a system state vector xt (e.g., stored energy and density) and an action vector at (e.g., auxiliary heating and fueling) to the system state vector at the next time step:

$${{{\bf{x}}}}_{t+1}=f({{{\bf{x}}}}_{t},{{{\bf{a}}}}_{t})$$
(1)

To account for the fact that certain quantities are derived from the system state (e.g., the average pressure can be derived from the stored energy and volume), and to account for the real-world challenges of imperfect sensing, POMDPs also involve an observation function, O, that generates an observation ot from the state and action: ot = O(xt, at). Finally, a POMDP involves a reward function r(xt, at) that should be maximized, and which is usually designed by a human to specify the desired outcome. For example, in the context of video games, the reward function is designed to express wins and losses. In the context of the ramp-down problem addressed by this paper, we design the reward function to reward de-energizing the plasma while avoiding disruptive and machine limits; the details are provided in the Methods.

There are two typical POMDP solution types. The first is a control policy π, which determines the optimal action to take given an observation: at = π(ot). In the modern deep reinforcement learning paradigm, the control policy is typically implemented as a deep neural network. Given that it decides the action from an observation, a control policy can serve as a real-time feedback controller.

The other solution type is an action trajectory, which is a sequence of actions across a T-step time horizon: [a0, a1, …, aT−1]. This action trajectory can then be programmed into a machine. The computational problem of finding optimal trajectories is often referred to as trajectory optimization and has largely been advanced by the field of optimal control. In fact, the POMDP formulation is directly analogous to standard formulations of stochastic optimal control32.

While control policies trained through RL are often thought of as products for real-time control, a process known as policy rollout can also be used to generate trajectories by running the control policy against the dynamics model f and observation function O. This process is shown in Fig. 7. Another key distinction between the RL and optimal control frameworks is that the latter explicitly supports constraints. However, as shown in the Methods section, constraints can be implemented in reward functions via penalty terms. In fact, many nonlinear optimization methods used in optimal control implement constraints as cost terms under the hood with methods such as augmented Lagrangians and barrier functions33,34.
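For concreteness, a minimal JAX sketch of policy rollout is given below, assuming a deterministic dynamics function f, observation function O, and policy; the names are placeholders rather than the actual PopDownGym interface:

```python
import jax

def rollout(policy, f, O, x0, a0, horizon: int):
    """Roll a feedback policy against a dynamics model to generate the
    action/state/observation trajectories visualized in Fig. 7."""
    def step(carry, _):
        x, a_prev = carry
        o = O(x, a_prev)   # observation from the current state and last action
        a = policy(o)      # feedback policy selects the next action
        x_next = f(x, a)   # advance the dynamics one step
        return (x_next, a), (x, o, a)

    _, (xs, os, actions) = jax.lax.scan(step, (x0, a0), None, length=horizon)
    return xs, os, actions
```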

Fig. 7: Graphic depicting policy rollout.
figure 7

Here, the process of a control policy π generating trajectories of action at, state xt, and observation ot is visualized.

In this work, we also leverage techniques from goal-conditioned reinforcement learning31, which enables the training of policies that can be instructed at inference time to switch tasks. We leverage this capability to train a policy that knows how to handle conditions with different constraint limits; we call the resulting policy a constraint-conditioned policy. This can be accomplished by simply conditioning the reward function on the constraint limits \({{\mathscr{L}}}\), adding \({{\mathscr{L}}}\) as an additional input to the policy, and parallelizing training environments across different samples of \({{\mathscr{L}}}\). This methodology is depicted in Fig. 1b. In this setting, the reward function and policy are modified to be:

$$r({{{\bf{x}}}}_{t},{{{\bf{a}}}}_{t};{{\mathscr{L}}})\quad {{{\bf{a}}}}_{t}=\pi ({{{\bf{o}}}}_{t};{{\mathscr{L}}})$$
(2)
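In code, constraint conditioning amounts to little more than sampling a limit vector per parallel environment and appending it to the policy input; a minimal sketch with illustrative names that are not the actual PopDownGym interface:

```python
import jax
import jax.numpy as jnp

def sample_limits(key, low, high):
    """Draw one constraint-limit vector L uniformly between user-chosen
    bounds; during training this is done independently per environment."""
    return jax.random.uniform(key, shape=low.shape, minval=low, maxval=high)

def conditioned_policy_input(obs, limits):
    """Condition the policy on L by concatenating the limits onto the
    observation; the same L is also passed to the reward function."""
    return jnp.concatenate([obs, limits], axis=-1)
```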

Challenges of plasma dynamics modeling

In the context of plasma control, defining the dynamics function f and observation function O is a considerable challenge. While accurate full-discharge simulation of plasmas from first principles is itself an open research problem, application of RL techniques generally requires stable, massively parallelized simulation with randomized physics parameters and thus imposes even more stringent software requirements. These challenging requirements have motivated work on developing neural network dynamics models trained on machine data21,22. However, the relative paucity of available fusion data makes the sample-efficiency and robustness of these models a considerable challenge.

At the same time, reactor and control design are often performed using relatively simple 0-D models, which are used to find design points that are then validated using higher fidelity transport simulations. In the context of reactor design, such models are often referred to as Plasma Operating Contour (POPCON) models35 and combine well-established physics, such as reactivity curves, with empirical scalings, such as τE scalings. Despite their simplicity, such models have produced predictions of key performance metrics such as energy gain and τE that are in line with nonlinear gyrokinetic simulations. In the context of control design, classical system identification techniques are often employed to fit linear dynamical systems to empirical data; controllers are then designed on these linear models36. Despite the complexity of plasma dynamics, such techniques have proven effective for control design in certain cases24,37. In recent years, advances in control-oriented simulation research have gradually improved the fidelity of models that can be used for control design, including efforts such as RAPTOR24, COTSIM38, and METIS39. However, these simulators typically still require assumptions on quantities such as the edge boundary condition, and are not yet deployed for routine use in tokamak control.

Neural differential equations and differentiable simulation

While classical system identification techniques have yielded successes24,37, further advances are needed. Fortunately, recent advances made by the scientific machine learning community in differentiable simulation and neural differential equations offer a powerful new toolbox for complex nonlinear system identification40,41,42. This work applies such advances to train POPCON-like dynamics models that combine physics structure with machine learning to predict dynamics from relatively few samples. In contrast to standard POPCON models, the model is time-dependent and its free parameters can be updated with new empirical data; in contrast to classical system identification techniques, the model can be highly nonlinear.

Consider the following nonlinear dynamical system where fθ is a neural network with free parameters θ:

$$\frac{d{{\bf{x}}}}{dt}={f}_{\theta }({{\bf{x}}})$$
(3)

Such an equation is known as a neural differential equation (NDE). Both training and inference of NDEs involve numerically integrating the neural network from its initial condition x(0) across some time horizon to some terminal value. While standard feed-forward neural networks with parameters θ, NNθ, have the following input-output relationship:

$${{\bf{y}}}={{{\rm{NN}}}}_{\theta }({{\bf{x}}})$$
(4)

NDEs have the following input-output relationship:

$${{\bf{y}}}={{\rm{diffeqsolve}}}\, ({f}_{\theta },{{{\bf{x}}}}_{0})$$
(5)

where diffeqsolve is a numerical integration scheme, such as an explicit or implicit Runge-Kutta method. Then, given data \(\hat{{{\bf{y}}}}\) and a loss function \(L({{\bf{y}}},\hat{{{\bf{y}}}})\) defining the error between simulation and data, adjoint methods40,41 can be used to efficiently compute the gradient of this loss function with respect to the model parameters, \({\nabla }_{\theta }L({{\bf{y}}},\hat{{{\bf{y}}}})\). While we have centered this discussion on the case where fθ is a neural network, the same methodology applies to any function fθ that is differentiable with respect to θ. Historically, the differentiability requirements have limited the complexity of fθ. However, modern automatic differentiation frameworks allow for highly complex choices of fθ that can also include physics. This more general setting is often referred to as differentiable simulation and enables the development of fθ models that contain both physics and neural network components. The particular fθ model employed in this work is detailed in the “Dynamics Model” subsection in the Methods, and the application of this methodology for tuning the dynamics model is depicted in Fig. 1a.
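Equation (5) is not tied to a specific library here; as a minimal sketch, the diffrax package for JAX provides a diffeqsolve routine matching this notation, and gradients of a loss through the solve can be taken directly with jax.grad. The toy hybrid right-hand side below (a linear decay term plus a small learned correction) is purely illustrative and is not the dynamics model of this work:

```python
import jax
import jax.numpy as jnp
import diffrax

def loss(params, x0, y_hat, ts):
    def f(t, x, args):
        # hybrid RHS: a known linear "physics" term plus a learned correction
        return -args["decay"] * x + args["w"] @ jnp.tanh(args["v"] @ x)

    sol = diffrax.diffeqsolve(
        diffrax.ODETerm(f), diffrax.Tsit5(),
        t0=ts[0], t1=ts[-1], dt0=0.01, y0=x0,
        args=params, saveat=diffrax.SaveAt(ts=ts),
    )
    return jnp.mean((sol.ys - y_hat) ** 2)

key = jax.random.PRNGKey(0)
params = {"decay": jnp.array(0.5),
          "v": 0.1 * jax.random.normal(key, (16, 2)),
          "w": 0.1 * jax.random.normal(key, (2, 16))}
x0 = jnp.array([1.0, 0.0])
ts = jnp.linspace(0.0, 1.0, 20)
y_hat = jnp.zeros((20, 2))                       # stand-in for calibration data
grads = jax.grad(loss)(params, x0, y_hat, ts)    # gradient through the ODE solve
```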

Reward function

The constraint-conditioned reward function is defined as:

$$r({{\mathbf{x}}}_t, {{\mathbf{a}}}_t; {{\mathscr{L}}}) = \underbrace{{\mathbb{1}}_{I_{p,t}\leq 2{{\mbox{MA}}}}({{\mathbf{x}}}_t)}_{{{{\mathrm{Reward}}}\ {{\mathrm{for}}}\ {{\mathrm{reaching}}}\ {{\mathrm{goal}}}}} + \underbrace{I({{\mathbf{x}}}_t)}_{{{{\mathrm{Guide}}}\ {{\mathrm{term}}}\ {{\mathrm{for}}}\ {{\mathrm{current}}}}} + \underbrace{\sum\limits_{i=1}^{n_{cons}} b_i({{\mathbf{x}}}_t; {{\mathscr{L}}}_i)}_{{{{\mathrm{Penalty}}}\ {{\mathrm{terms}}}\ {{\mathrm{for}}}\ {{\mathrm{constraints}}}}}$$
(6)

where the first term provides a fixed reward once the goal of 2 MA of plasma current is reached, and the second term I(xt) rewards decreases in current to help guide the policy towards the goal; it is given by:

$$I({{{\bf{x}}}}_{t})=-1+\frac{4}{\exp (0.5{I}_{p,t})+2.0+\exp (-0.5{I}_{p,t})}$$
(7)

This reward term was inspired by one used for tracking error in quadruped locomotion28, with coefficients chosen through manual tuning. Finally, the summation term consists of penalty terms for violating user-specified constraints on undesirable limits, such as disruptive limits. In this work, all state constraints are implemented in the following form:

$${g}_{i}({{{\bf{x}}}}_{t})\le {{{\mathscr{L}}}}_{i}$$
(8)

where gi(xt) is the ith quantity to be constrained, expressed as a function of state, and \({{{\mathscr{L}}}}_{i}\) is the user-specified limit for the constraint. For example, the Greenwald Limit would be implemented with the following g and \({{{\mathscr{L}}}}_{i}\):

$${g}_{i}({{{\bf{x}}}}_{t})={\bar{n}}_{e}\frac{\pi {a}^{2}}{{I}_{p}}\quad {{{\mathscr{L}}}}_{i}=1$$
(9)

There are a number of different approaches for implementing constraints in POMDPs43,44. In our formulation, each constraint is replaced with the following “reward barrier function”, which we find solves our problem well in practice:

$${b}_{i}({{{\bf{x}}}}_{t};{{{\mathscr{L}}}}_{i})=\,{{\mbox{log}}}\,\left(1-\sigma \left(\,{{\mbox{clip}}}\,\left(\frac{{g}_{i}({{{\bf{x}}}}_{t})}{{{{\mathscr{L}}}}_{i}}\right)\right)\right)$$
(10)

where σ( ⋅ ) is the sigmoid function and clip( ⋅ ) is a clipping function. This reward term is largely inspired by the logarithmic barrier functions of classical interior point methods33, with the sigmoid and clip functions ensuring a lower bound on the reward to prevent numerical divergence. The net effect of this function is to take a value close to zero when gi(xt) is far from \({{{\mathscr{L}}}}_{i}\), but to fall off rapidly to a negative value as gi(xt) approaches \({{{\mathscr{L}}}}_{i}\). Supplementary Fig. S2 in the Supplementary Information visualizes this function.
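A minimal JAX sketch of the guide term (7) and the reward barrier (10) is given below; the clip bounds, and hence the exact floor on the penalty, are assumptions for illustration only (the actual shape used in this work is shown in Supplementary Fig. S2):

```python
import jax.numpy as jnp
from jax.nn import sigmoid

def guide_term(ip_MA):
    """Shaping term of Eq. (7): increases as the plasma current decreases."""
    return -1.0 + 4.0 / (jnp.exp(0.5 * ip_MA) + 2.0 + jnp.exp(-0.5 * ip_MA))

def reward_barrier(g, limit, clip_lo=-10.0, clip_hi=10.0):
    """Reward barrier of Eq. (10): becomes increasingly negative as the
    constrained quantity g approaches and exceeds its limit; the clip bounds
    (illustrative placeholders) keep the penalty finite."""
    z = jnp.clip(g / limit, clip_lo, clip_hi)
    return jnp.log(1.0 - sigmoid(z))
```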

Dynamics model

Randomized parameters

One of the goals of this dynamics model was to explore the efficacy of training on highly simplified models, but with massively parallelized, randomized physics simulation. The randomized parameters and their domains of randomization are shown in Table 2. In many cases, accurate simulations of these parameters would require supercomputers. Instead, we take the approach of designing policies and trajectories with robustness to variations in these parameters.

Geometry evolution parameter

To enable the full training environment to run on GPU in the absence of a GPU-capable Grad-Shafranov solver, we use a pre-computed set of ideal MHD equilibrium shapes and restrict the policy to deciding only how quickly to evolve through those shapes. To do so, we define a “progress variable” gs ∈ [0, 1], where 0 corresponds to the first shape and 1 corresponds to the last. The policy then decides on the rate of change of gs at each time step. Key quantities used in the model are interpolated between the shapes and thus become functions of gs. The quantities used are: V(gs), the plasma volume; \(\frac{\partial V}{\partial \rho }(\rho ,{g}_{s})\), the derivative of volume with respect to the toroidal flux coordinate; κa(gs), the area elongation; a(gs), the minor radius; and δ(gs), the triangularity. From here on, the gs argument is suppressed for notational simplicity.
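A minimal sketch of this interpolation follows, using a placeholder grid of shape quantities; the values below are illustrative and are not the pre-computed SPARC equilibria:

```python
import jax.numpy as jnp

# Pre-computed equilibrium quantities tabulated on a grid of the progress
# variable gs in [0, 1]; these numbers are placeholders.
gs_grid     = jnp.linspace(0.0, 1.0, 8)
volume_grid = jnp.linspace(19.0, 12.0, 8)   # placeholder V(gs) values [m^3]
kappa_grid  = jnp.linspace(1.75, 1.40, 8)   # placeholder area elongation

def shape_quantities(gs):
    """Interpolate pre-computed equilibrium quantities at the current value
    of the progress variable chosen by the policy."""
    V = jnp.interp(gs, gs_grid, volume_grid)
    kappa_a = jnp.interp(gs, gs_grid, kappa_grid)
    return V, kappa_a
```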

Power and particle balance

We make use of simple power and particle balance equations17:

$$\frac{d{W}_{th}}{dt}=\frac{-{W}_{th}}{{\tau }_{E}}+{P}_{Ohm}+{P}_{\alpha }-{k}_{rad}{P}_{brems}+{P}_{aux}$$
(11)
$$\frac{d{N}_{i}}{dt}=\frac{-{N}_{i}}{{k}_{N}{\tau }_{E}}+fuel$$
(12)
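A direct transcription of these two right-hand sides is given below as a sketch; the source terms are assumed to be supplied by the profile reconstruction described next:

```python
def power_particle_rhs(W_th, N_i, P_ohm, P_alpha, P_brems, P_aux, fuel,
                       tau_E, k_rad, k_N):
    """Right-hand sides of the 0-D power and particle balance, Eqs. (11)-(12)."""
    dW_dt = -W_th / tau_E + P_ohm + P_alpha - k_rad * P_brems + P_aux
    dN_dt = -N_i / (k_N * tau_E) + fuel
    return dW_dt, dN_dt
```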

The Ohmic power, POhm, alpha heating power, Pα, and Bremsstrahlung radiation power, Pbrems, terms require profile shapes to be calculated accurately. We take the approach of mapping volume-averaged quantities to profile shapes learned via principal component analysis (PCA) on the RAPTOR dataset. First, volume-averaged electron and ion temperatures and densities are obtained from the state variables as follows:

$$\frac{2{W}_{th}}{3V}=\langle p\rangle \approx \langle {T}_{e}\rangle \langle {n}_{e}\rangle +\langle {T}_{i}\rangle \langle {n}_{i}\rangle =\langle {T}_{e}\rangle \frac{\langle {n}_{i}\rangle }{{k}_{dilution}}+\frac{\langle {T}_{e}\rangle }{{k}_{te\_ti}}\langle {n}_{i}\rangle $$
(13)
$$\langle {n}_{i}\rangle = \frac{{N}_{i}}{V},\quad \langle {T}_{e}\rangle =\frac{2{W}_{th}}{3V}{\left(\frac{\langle {n}_{i}\rangle }{{k}_{dilution}}+\frac{\langle {n}_{i}\rangle }{{k}_{te\_ti}}\right)}^{-1},\quad \langle {n}_{e}\rangle =\frac{\langle {n}_{i}\rangle }{{k}_{dilution}},\\ \langle {T}_{i}\rangle =\frac{\langle {T}_{e}\rangle }{{k}_{te\_ti}}$$
(14)

Observe that the only unknown in the set of equations above is the volume-averaged electron temperature 〈Te〉. We represent each of the four profiles with two single-component PCA bases: one for H-mode and one for L-mode. Below, the Te profile is used as an example, but the same methodology applies to the other profiles:

$${T}_{e}(\rho )=cv(\rho )+b(\rho )$$
(15)

where v(ρ) and b(ρ) are learned via PCA on the RAPTOR simulation dataset. The volume integral and its approximation with Legendre-Gauss quadrature can then be computed as:

$$\langle {T}_{e}\rangle =\frac{1}{V}\int_{0}^{1}\left[cv(\rho )+b(\rho )\right]\frac{\partial V}{\partial \rho }(\rho )d\rho \\ \approx \frac{1}{{\sum }_{j = 1}^{{n}_{gauss}}{w}_{j}\frac{\partial V}{\partial \rho }({\rho }_{j})}\left[c\mathop{\sum }_{j=1}^{{n}_{gauss}}{w}_{j}v({\rho }_{j})\frac{\partial V}{\partial \rho }({\rho }_{j})+{\sum }_{j=1}^{{n}_{gauss}}{w}_{j}b({\rho }_{j})\frac{\partial V}{\partial \rho }({\rho }_{j})\right]$$
(16)

where ngauss is the number of Legendre–Gauss points, wj are the weights, and ρj are the Legendre–Gauss points. Rearranging the above equation for c we get:

$$c\approx \frac{\langle {T}_{e}\rangle {\sum }_{j = 1}^{{n}_{gauss}}{w}_{j}\frac{\partial V}{\partial \rho }({\rho }_{j})-\mathop{\sum }_{j = 1}^{{n}_{gauss}}{w}_{j}b({\rho }_{j})\frac{\partial V}{\partial \rho }({\rho }_{j})}{\mathop{\sum }_{j = 1}^{{n}_{gauss}}{w}_{j}v({\rho }_{j})\frac{\partial V}{\partial \rho }({\rho }_{j})}$$
(17)

Now everything on the right-hand side is a function of state or a pre-computed quantity. Since v(ρ) and b(ρ) are constant when running the simulation, c fully parameterizes the Te(ρ) profile.
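A minimal sketch of Eq. (17), with the Legendre–Gauss rule mapped from [−1, 1] to ρ ∈ [0, 1]; the number of quadrature points and the input arrays are illustrative:

```python
import numpy as np
import jax.numpy as jnp

# Legendre-Gauss nodes/weights on [-1, 1], mapped to rho in [0, 1]
nodes, weights = np.polynomial.legendre.leggauss(16)
rho_j = jnp.asarray(0.5 * (nodes + 1.0))
w_j = jnp.asarray(0.5 * weights)

def pca_coefficient(Te_avg, v_j, b_j, dV_drho_j):
    """Solve Eq. (17) for the PCA coefficient c, given the volume-averaged
    temperature and the basis v, offset b, and dV/drho evaluated at the
    quadrature points rho_j (arrays of the same length as rho_j)."""
    norm = jnp.sum(w_j * dV_drho_j)
    numerator = Te_avg * norm - jnp.sum(w_j * b_j * dV_drho_j)
    denominator = jnp.sum(w_j * v_j * dV_drho_j)
    return numerator / denominator
```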

H-mode and L-mode

We use the IPB98 scaling for H-mode45 and the IPB89 scaling46 for L-mode. The H to L mode transition threshold is subject to considerable uncertainty, making our approach of randomizing across physics uncertainty particularly relevant. In this model, we treat the HL back-transition threshold as a constant multiple of the established LH threshold scaling47, with the constant randomized across simulations:

$${P}_{HL}={k}_{HL}2.15{e}^{0.107}{n}_{e,20}^{0.782}{B}_{T}^{0.772}{a}^{0.975}{R}^{0.999}$$
(18)
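As a one-line transcription of Eq. (18) (a sketch; units follow the LH threshold scaling of ref. 47, with density in 1020 m−3 and lengths in metres):

```python
import jax.numpy as jnp

def p_hl_threshold(k_HL, n_e20, B_T, a, R):
    """H-L back-transition power threshold of Eq. (18); k_HL is the
    randomized multiplier applied to the LH threshold scaling of ref. 47."""
    return (k_HL * 2.15 * jnp.exp(0.107)
            * n_e20**0.782 * B_T**0.772 * a**0.975 * R**0.999)
```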

Internal inductance evolution

Romero et al.48 derived a simple three-state ODE system for the coupled internal inductance and plasma current evolution. We make use of their equations for the internal inductance and plasma current evolution, which are exact, and replace their voltage dynamics, which are ad hoc, with a neural network, resulting in the following ODE system:

$$\frac{d{L}_{i}}{dt}=\frac{2}{{I}_{p}}(-{V}_{ind}-{V}_{R})$$
(19)
$$\frac{d{I}_{p}}{dt}=\frac{1}{{L}_{i}}(2{V}_{ind}+{V}_{R})$$
(20)
$$\frac{d{V}_{R}}{dt}=NN({I}_{p},{L}_{i},{V}_{R},{V}_{ind},\langle {T}_{e}\rangle )$$
(21)
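A sketch of the resulting right-hand side is given below, with the neural network left abstract; nn_apply and nn_params stand in for any JAX-callable network (e.g., a small MLP) and are not the specific architecture used in this work:

```python
import jax.numpy as jnp

def li_ip_vr_rhs(Ip, Li, V_R, V_ind, Te_avg, nn_apply, nn_params):
    """Right-hand sides of Eqs. (19)-(21): exact Li and Ip evolution from
    Romero et al., with the resistive-voltage dynamics given by a network."""
    dLi_dt = (2.0 / Ip) * (-V_ind - V_R)
    dIp_dt = (1.0 / Li) * (2.0 * V_ind + V_R)
    nn_in = jnp.array([Ip, Li, V_R, V_ind, Te_avg])
    dVR_dt = nn_apply(nn_params, nn_in).squeeze()
    return dIp_dt, dLi_dt, dVR_dt
```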

Algorithms

Optimizing constraint-conditioned policies with proximal policy optimization

We apply PPO49, a state-of-the-art on-policy reinforcement learning algorithm, to solve for the constraint-conditioned policy that maximizes the reward function (6). A fully connected network with three hidden layers of 256 units each and hyperbolic tangent activation functions is used for both the policy and value networks. The environment observations are concatenated with the desired constraint values to form the input vector for the policy network. The value network additionally takes the current simulation parameters as input to improve performance50,51. All input observations and output actions are normalized to the range [−1.0, +1.0].
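A minimal flax sketch of a policy network matching this description follows (three tanh hidden layers of 256 units, constraint limits concatenated onto the observation); whether the output is squashed with a final tanh or handled through the action distribution is an implementation detail not specified in the text and is assumed here:

```python
import flax.linen as nn
import jax.numpy as jnp

class ConstraintConditionedPolicy(nn.Module):
    """Policy head: observation + constraint limits -> normalized action."""
    action_dim: int = 4

    @nn.compact
    def __call__(self, obs, constraint_limits):
        x = jnp.concatenate([obs, constraint_limits], axis=-1)
        for _ in range(3):
            x = nn.tanh(nn.Dense(256)(x))
        return nn.tanh(nn.Dense(self.action_dim)(x))  # action in [-1, 1]
```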

We exploit the capabilities of JAX and collect transitions from 2048 environments in parallel. During each training step, it took 0.652 (±0.024) seconds to collect experience and 0.048 (±0.002) seconds to update the policy and value networks. Training converged in under two hours on a single Nvidia RTX 4090 GPU. Inference of just the control policy took an average of 38 μs, and a full roll-out of the policy along with the training environment took 15.4 ms. All compute times were determined after Just-In-Time (JIT) compilation.

Robust feedforward trajectory optimization with evolutionary strategies

We apply a highly parallelizable black-box optimization algorithm known as evolutionary strategies (ES) to optimize robust feedforward trajectories52. In contrast to policy gradient methods like PPO, which consider the expected discounted reward, ES requires an end-to-end objective function that assigns a reward to a given fixed-horizon trajectory; we maximize:

$${U}_{dense}({{{\bf{a}}}}_{0:T})={{\mathbb{E}}}_{\theta }\left[\frac{1}{T}\mathop{\sum }_{t=0}^{T}r({{{\bf{x}}}}_{t},{{{\bf{a}}}}_{t})\right]$$
(22)

where the expectation is with respect to the random parameters θ sampled uniformly over the ranges given in Table 2.

The feedforward trajectory is represented as a cubic interpolation between 10 equally spaced control points for each action. Trajectories are rolled out for 100 timesteps with a 0.05 s timestep. We use a population size of 256, approximate the expectation in Eq. (22) using \(10^{3}\) random samples, and run the evolutionary strategy for 250 generations. The learning rate is initialized at \(10^{-1}\) and decays with rate 0.995 to a minimum of \(10^{-3}\). We use the ES defined by Salimans et al.52 as implemented in JAX in the evosax library53.
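For intuition, a single generation of a Salimans-style ES update is sketched below; the paper uses the evosax implementation, so this is only a schematic, and the noise scale sigma is a placeholder not specified in the text:

```python
import jax
import jax.numpy as jnp

def es_generation(key, theta, objective, pop_size=256, sigma=0.05, lr=0.1):
    """One ES generation: perturb the flattened control-point array theta,
    evaluate the robust objective U_dense of Eq. (22) on each candidate,
    and ascend the fitness-weighted-noise gradient estimate."""
    eps = jax.random.normal(key, (pop_size,) + theta.shape)
    fitness = jax.vmap(lambda e: objective(theta + sigma * e))(eps)
    # simple standardization in place of the rank-based shaping of ref. 52
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_est = (fitness @ eps.reshape(pop_size, -1)).reshape(theta.shape)
    return theta + lr * grad_est / (pop_size * sigma)
```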

RAPTOR simulation settings

RAPTOR is configured to evolve the Te, Ti, and current profiles by solving the corresponding transport equations. Following prior work11, the particle transport equations are not solved; rather, a separate instance of the simple particle balance model (12) is implemented on the RAPTOR side and receives the fueling commands. This separate model is configured with the maximum kN value of 9.0 used in the training model to challenge the policy, as gas injection can be used to directly slow down the density decrease, but no actuators exist that can directly speed it up. This approach is motivated by the challenges of modeling plasma density dynamics17; the goal instead is to provide reasonable density set-points for real-time density controllers. The H-L back-transition is triggered when the RAPTOR-simulated power conducted to the scrape-off layer falls below the HL back-transition threshold.