
Comparison of "online" and "offline" agent-environment interactions #903

Open
@salehisaeed

Hi all,

I have had a problem for months getting Tensorforce to converge on my case, and I wonder what causes this issue. I'll explain.

My environment is a C++ fluid flow simulation code that Tensorforce interacts with. I have implemented this interaction in two ways, namely "offline" and "online". In offline mode, the C++ code works just as a black box: the agent reads the state, sends the action to the code, and receives some reward. Therefore, at each DRL time step the C++ code must stop and restart, which is not ideal for practical, expensive simulations.
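
For context, the offline coupling on the Python side looks roughly like the sketch below; the state/action shapes and the calls into the C++ solver are placeholders rather than my actual interface.

```python
import numpy as np
from tensorforce import Environment

class FlowEnvironment(Environment):
    """Black-box wrapper around the C++ solver (offline mode). The solver calls
    are placeholders here; state/action shapes and bounds are made-up examples."""

    def states(self):
        return dict(type='float', shape=(16,))  # e.g. 16 probe readings

    def actions(self):
        return dict(type='float', shape=(1,), min_value=-1.0, max_value=1.0)

    def reset(self):
        # (Re)start the C++ simulation and read back its initial state.
        return np.zeros(16)  # placeholder for the actual solver call

    def execute(self, actions):
        # Send the action to the solver, advance one control interval,
        # then read back the next state and the reward.
        next_state = np.zeros(16)  # placeholder
        reward = 0.0               # placeholder
        terminal = False
        return next_state, terminal, reward
```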

Conversely, in online mode, the agent is exported as a SavedModel and loaded within the C++ code. This way, the environment can simulate a whole episode without any interruption, which is much more computationally efficient. I have verified that the agent loaded in the C++ code works perfectly and produces exactly the same actions as in the offline mode.
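
For reference, the export step is roughly the following (the directory name is arbitrary, and I am assuming the 'saved-model' save format here):

```python
# Export the current policy as a TensorFlow SavedModel, which the C++ code then
# loads and queries for actions during the episode.
agent.save(directory='exported_agent', format='saved-model')
```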

In offline mode, I use the Runner utility, so all the DRL computations are handled by the Runner, and I actually get pretty good convergence: after 500 episodes the agent reaches an optimal state. However, I believe there is no way to use the Runner in online mode (is there?). So, for the online mode, I use the experience-update interface: the recorded episode is fed to the agent as experience and the agent is updated afterwards. Now the problem is that in the online mode, not only is the convergence slower, but the agent never reaches the same optimal state and always acts sub-optimally (even with thousands of episodes).
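
Concretely, the online-mode update path on the Python side looks roughly like this (the file name and array keys are placeholders for whatever my C++ run records):

```python
import numpy as np

# Load one recorded episode written out by the C++ run (placeholder file/keys).
recorded = np.load('episode.npz')

# Feed the whole episode to the agent as experience, then trigger an update,
# following the act-experience-update interface.
agent.experience(
    states=recorded['states'],      # (timesteps, state_dim)
    actions=recorded['actions'],    # (timesteps, action_dim)
    terminal=recorded['terminal'],  # False everywhere except the last step
    reward=recorded['reward'],
    # internals would also have to be recorded and passed here if the policy
    # had internal (e.g. RNN) state
)
agent.update()
```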

The figure below shows the typical convergence graph that I get with the two modes. I have actually run the comparison tens of times with many different settings and configs, but the online-mode convergence is always slower and never reaches the actual optimum. I have even tried to replicate the Runner configuration in the experience-update mode, in terms of how many episodes are kept per experience batch and how often updates happen (a sketch of what I mean is below the figure), but had no luck.

[Figure: online_offline_comparison — typical convergence of the offline (Runner) and online (experience-update) modes]
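
By "replicating the Runner configuration" I mean something along these lines, accumulating the same number of recorded episodes before each manual update (the value 10 and the file list are placeholders, matched to the offline agent's episode batch size):

```python
EPISODES_PER_UPDATE = 10  # chosen to match the offline agent's episode batch size

for i, path in enumerate(episode_files, start=1):  # episode_files: recorded .npz paths
    recorded = np.load(path)
    agent.experience(
        states=recorded['states'],
        actions=recorded['actions'],
        terminal=recorded['terminal'],
        reward=recorded['reward'],
    )
    # Only update after a full batch of episodes, mimicking the Runner-side
    # batch size / update frequency.
    if i % EPISODES_PER_UPDATE == 0:
        agent.update()
```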

Has anyone had any experience with this? Is there some fundamental deficiency in the experience-update mode compared to the Runner utility? I know the documentation of the act-experience-update interaction states that

a few stateful network layers will not be updated correctly in independent-mode (currently, exponential_normalization)

but I am not sure if that could be the cause, as I do not use state preprocessing.

Any help is greatly appreciated as I have been struggling with this for a long time.
Saeed
