
Comparison of "online" and "offline" agent-environment interactions #903

Open
@salehisaeed

Hi all,

I have had a problem for months getting Tensorforce to converge on my case, and I wonder what causes this issue. I'll explain.

My environment is a C++ fluid flow simulation code that Tensorforce interacts with. I have implemented this interaction in two ways, namely "offline" and "online". In offline mode, the C++ code works just as a black box: the agent reads the state, sends the action to the code, and receives some reward. Therefore, at each DRL time step the C++ code must stop and restart, which is not ideal for practical, expensive simulations.
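
For context, the offline coupling on the Python side looks roughly like the sketch below; the state/action shapes and the calls into the C++ solver are placeholders rather than my actual interface.

```python
import numpy as np
from tensorforce import Environment

class FlowEnvironment(Environment):
    """Black-box wrapper around the C++ solver (offline mode). The solver calls
    are placeholders here; state/action shapes and bounds are made-up examples."""

    def states(self):
        return dict(type='float', shape=(16,))  # e.g. 16 probe readings

    def actions(self):
        return dict(type='float', shape=(1,), min_value=-1.0, max_value=1.0)

    def reset(self):
        # (Re)start the C++ simulation and read back its initial state.
        return np.zeros(16)  # placeholder for the actual solver call

    def execute(self, actions):
        # Send the action to the solver, advance one control interval,
        # then read back the next state and the reward.
        next_state = np.zeros(16)  # placeholder
        reward = 0.0               # placeholder
        terminal = False
        return next_state, terminal, reward
```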

Conversely, in online mode, the agent is exported as a SavedModel and loaded within the C++ code. This way, the environment can simulate a whole episode without any interruption, which is much more computationally efficient. I have verified that the agent loaded in the C++ code works perfectly and produces exactly the same actions as in the offline mode.
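
For reference, the export step is roughly the following (the directory name is arbitrary, and I am assuming the 'saved-model' save format here):

```python
# Export the current policy as a TensorFlow SavedModel, which the C++ code then
# loads and queries for actions during the episode.
agent.save(directory='exported_agent', format='saved-model')
```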

In offline mode, I use the Runner utility, so all the DRL computations are handled by the Runner, and I actually get pretty good convergence: after 500 episodes the agent reaches an optimal state. However, I believe there is no way to use the Runner in online mode (is there?). So, for the online mode, I use the experience-update interface: the recorded episode is fed to the agent as experience and the agent is updated afterwards. Now the problem is that in the online mode, not only is the convergence slower, but the agent never reaches the same optimal state and always acts sub-optimally (even with thousands of episodes).
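
Concretely, the online-mode update path on the Python side looks roughly like this (the file name and array keys are placeholders for whatever my C++ run records):

```python
import numpy as np

# Load one recorded episode written out by the C++ run (placeholder file/keys).
recorded = np.load('episode.npz')

# Feed the whole episode to the agent as experience, then trigger an update,
# following the act-experience-update interface.
agent.experience(
    states=recorded['states'],      # (timesteps, state_dim)
    actions=recorded['actions'],    # (timesteps, action_dim)
    terminal=recorded['terminal'],  # False everywhere except the last step
    reward=recorded['reward'],
    # internals would also have to be recorded and passed here if the policy
    # had internal (e.g. RNN) state
)
agent.update()
```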

The figure below shows the typical convergence graph that I get with the two modes. I have actually run the comparison tens of times with many different settings and configs, but the online-mode convergence is always slower and never reaches the actual optimum. I have even tried to replicate the Runner configuration in the experience-update mode, in terms of how many episodes are kept per experience batch and how often updates happen (a sketch of what I mean is below the figure), but had no luck.

[Figure: online_offline_comparison — typical convergence of the offline (Runner) and online (experience-update) modes]
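
By "replicating the Runner configuration" I mean something along these lines, accumulating the same number of recorded episodes before each manual update (the value 10 and the file list are placeholders, matched to the offline agent's episode batch size):

```python
EPISODES_PER_UPDATE = 10  # chosen to match the offline agent's episode batch size

for i, path in enumerate(episode_files, start=1):  # episode_files: recorded .npz paths
    recorded = np.load(path)
    agent.experience(
        states=recorded['states'],
        actions=recorded['actions'],
        terminal=recorded['terminal'],
        reward=recorded['reward'],
    )
    # Only update after a full batch of episodes, mimicking the Runner-side
    # batch size / update frequency.
    if i % EPISODES_PER_UPDATE == 0:
        agent.update()
```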

Has anyone had any experience with this? Is there some fundamental deficiency in the experience-update mode compared to the Runner utility? I know the documentation of the act-experience-update interaction states that

a few stateful network layers will not be updated correctly in independent-mode (currently, exponential_normalization)

but I am not sure if that could be the cause, as I do not use state preprocessing.

Any help is greatly appreciated as I have been struggling with this for a long time.
Saeed
