
Support episodes of different lengths and running times #946

@whxru

Description

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, gymnasium as gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
    
    >> 0.5.0 0.28.1 1.12.1 1.24.3 3.9.16 (main, Mar  8 2023, 14:00:05) [GCC 11.2.0] linux

First, I'd like to thank you for your significant contribution to such an easy-to-use, efficient, and powerful RL library, which has helped my own research quite a lot.

Recently I tried to solve a new problem with PPO, and the custom environment built for this problem has the following properties:

  1. The reward for each action is computed after the whole episode ends.
  2. The length of an episode varies with the actions taken by the RL algorithm/agent and is bounded above by a given threshold (a minimal environment sketch follows this list).
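
To make these two properties concrete, below is a minimal gymnasium-style sketch of such an environment. Names like MAX_STEPS, _episode_should_end, and _compute_all_rewards are placeholders for illustration only, not code from my actual project:

    import gymnasium as gym
    import numpy as np

    MAX_STEPS = 200  # placeholder upper bound on episode length

    class DelayedRewardEnv(gym.Env):
        """Sketch: the reward for every action is only known once the episode ends."""

        observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        action_space = gym.spaces.Discrete(2)

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self._actions = []
            return self.observation_space.sample(), {}

        def step(self, action):
            self._actions.append(action)                     # cheap bookkeeping during the episode
            terminated = self._episode_should_end(action)    # episode length depends on the agent
            truncated = len(self._actions) >= MAX_STEPS      # hard upper bound on episode length
            info = {}
            if terminated or truncated:
                # Expensive terminal computation: one reward per action taken so far.
                info["per_step_rewards"] = self._compute_all_rewards(self._actions)
            return self.observation_space.sample(), 0.0, terminated, truncated, info

        def _episode_should_end(self, action) -> bool:
            return bool(action == 1 and len(self._actions) > 3)  # placeholder termination rule

        def _compute_all_rewards(self, actions):
            return np.zeros(len(actions), dtype=np.float32)      # placeholder for the slow rewards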

Then I found that the current Collector cannot support these features directly, so I overrode the Collector module and adapted my environment as follows:

  1. Cache trajectories in a temporary buffer instead of adding them to the main buffer as soon as they are obtained.
  2. At the last step of an episode, store the reward to be assigned to each action in the info field of that step's batch.
  3. Rewrite the reward assigned to each action according to the info stored in the batch at the last step of the episode.
  4. Once each environment has finished a new episode, add all cached trajectories whose per-action rewards have been rewritten to the main buffer (a minimal sketch of this reward-rewriting step follows the list).
  5. In addition, I disabled the removal of "surplus environments" and adapted the computation of ep_rew and ep_len accordingly.
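
For concreteness, here is a minimal sketch of the reward-rewriting part of steps 2-4. The cached trajectory is represented as a list of single-step Batch objects, and the info key "per_step_rewards" is a hypothetical name for wherever the delayed rewards are stored; the actual Collector override is omitted:

    from tianshou.data import Batch, ReplayBuffer

    def flush_finished_episode(cached_steps: list[Batch], main_buffer: ReplayBuffer) -> None:
        """Rewrite the placeholder rewards of one finished episode, then commit it.

        cached_steps: single-step Batch objects cached for this episode; the last
        step's info is assumed to carry "per_step_rewards", one reward per step,
        computed only when the episode ends.
        """
        per_step_rewards = cached_steps[-1].info["per_step_rewards"]
        assert len(per_step_rewards) == len(cached_steps)
        for step, rew in zip(cached_steps, per_step_rewards):
            step.rew = float(rew)    # overwrite the placeholder reward recorded during the episode
            main_buffer.add(step)    # the transition enters the real buffer only now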

Training now works correctly, but I found the training speed to be lower than when all episodes have equal length. My guess at what happens:

  1. At the last step of an episode, computing the reward to be assigned to each action takes a considerable amount of time (e.g., I first need to run Dijkstra's algorithm to find the shortest paths between all pairs of the 200 nodes in the graph; see the illustrative sketch after this list).
  2. All other steps in an episode are very fast: I basically only need to record the action and update a few state variables.
  3. Because of 1 and 2, and because the environments do not end their episodes at the same step, environments whose current step is cheap have to wait for the slow reward computation running in the other environments (empirically, only one environment at a time is in that slow terminal step). As a result, the multi-process capability of the CPU is not fully exploited, even though I use SubprocVectorEnv and ShmemVectorEnv.
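
For a sense of scale, the terminal-step cost is dominated by an all-pairs shortest-path computation over roughly 200 nodes, along the lines of the following sketch (scipy and the random dense graph here are purely illustrative, not my actual code):

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    rng = np.random.default_rng(0)
    n = 200                                   # number of nodes in the graph
    weights = rng.random((n, n))              # placeholder dense weighted adjacency matrix
    dist = shortest_path(weights, method="D", directed=True)   # all-pairs Dijkstra
    print(dist.shape)                         # (200, 200): one shortest-path length per node pair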

Do you have any ideas for boosting the training speed of my program? It would be even better if a more efficient collector that supports environments with episodes of varying lengths were on your roadmap.

Thanks again for your awesome contributions.

Labels: enhancement (Feature that is not a new algorithm or an algorithm enhancement)