Description
- I have marked all applicable categories:
- exception-raising bug
- RL algorithm bug
- documentation request (i.e. "X is missing from the documentation.")
- new feature request
- I have visited the source website
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
```python
import tianshou, gymnasium as gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
# 0.5.0 0.28.1 1.12.1 1.24.3 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0] linux
```
First, I'd like to thank you for your significant contributions to such an easy-to-use, efficient, and powerful RL library, which has helped my own research quite a lot.
Recently I tried to solve a new problem with PPO, and the custom environment built for this problem has the following properties (a minimal sketch of such an environment follows this list):
- The reward for each action is computed after the whole episode ends.
- The length of an episode varies and depends on the actions taken by the RL algorithm/agent; it is bounded above by a given threshold.
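For concreteness, a stripped-down environment of this kind could look like the sketch below. DelayedRewardEnv, MAX_STEPS, per_step_rewards, and all the placeholder logic are names I made up for illustration, not my actual code:

```python
import gymnasium as gym
import numpy as np

MAX_STEPS = 200  # hard upper bound on episode length (illustrative value)


class DelayedRewardEnv(gym.Env):
    """Per-step reward is a placeholder 0.0; the true per-action rewards are
    only known, and expensive to compute, once the episode ends."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._actions = []
        return self.observation_space.sample(), {}

    def step(self, action):
        # Cheap intermediate step: record the action and update some state variables.
        self._actions.append(int(action))
        terminated = self._actions[-1] == 1              # placeholder termination rule
        truncated = len(self._actions) >= MAX_STEPS      # variable length, bounded above
        info = {}
        if terminated or truncated:
            # Expensive end-of-episode computation: one reward per recorded action.
            info["per_step_rewards"] = self._compute_all_rewards()
        return self.observation_space.sample(), 0.0, terminated, truncated, info

    def _compute_all_rewards(self):
        # Placeholder for the slow computation (all-pairs shortest paths, etc.).
        return np.zeros(len(self._actions), dtype=np.float32)
```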
Then I found that the current Collector does not support such features directly, so I overrode the Collector class and adapted my environment as follows (a sketch of the reward-backfill step follows this list):
- Cache the trajectories in a temporary buffer instead of adding them to the replay buffer as soon as they are obtained.
- At the last step of an episode, store the reward planned to be assigned to each action in the info field of that step's batch.
- Modify the reward assigned to each action according to the info stored in the batch of the episode's last step.
- Once an environment has finished a new episode, add all of its trajectories, in which the rewards of all actions have been modified, to the replay buffer.
- In addition, I disabled the removal of "surplus environments" and adapted the computation of ep_rew and ep_len accordingly.
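For illustration only, the reward-backfill step boils down to something like the sketch below. backfill_rewards and per_step_rewards are names I made up here, not part of Tianshou's API, and my actual override lives inside the Collector's collect logic:

```python
import numpy as np
from tianshou.data import Batch


def backfill_rewards(traj: Batch, per_step_rewards) -> Batch:
    """traj is a cached trajectory Batch with a leading time dimension whose
    per-step rewards are still placeholders; per_step_rewards holds the rewards
    computed at the episode's last step, one entry per action."""
    per_step_rewards = np.asarray(per_step_rewards, dtype=np.float32)
    assert per_step_rewards.shape[0] == len(traj)
    traj.rew = per_step_rewards  # overwrite the placeholder rewards in place
    return traj                  # now safe to add to the replay buffer
```

Only after this call is the trajectory added to the replay buffer; until then it stays in the per-environment cache.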
Now the training process is going well, but I found that the training speed is lower than when the episodes all have the same length. I guess this is what happens:
- At the last step of an episode, it takes a considerable amount of time to compute the reward to be assigned to each action (e.g., I first need to run Dijkstra's algorithm to find the shortest paths between all pairs of the 200 nodes on the graph; see the rough cost illustration after this list).
- The other steps in an episode run very quickly: basically I only need to record the action and update some state variables.
- Because of the two points above, and because not all environments end the current episode at the same step, the environments that finish their quick, non-final steps promptly have to wait for the slow reward computation going on in the other environments (empirically just one environment at a time, from what I observed). As a result, the multi-process capability of the CPU is not fully exploited, even though I used SubprocVectorEnv and ShmemVectorEnv.
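To give a rough sense of the per-episode cost (my actual reward function is more involved than this, so treat the snippet as an order-of-magnitude illustration only):

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.csgraph import dijkstra

n_nodes = 200
graph = sparse_random(n_nodes, n_nodes, density=0.05, format="csr", random_state=0)
dist_matrix = dijkstra(graph, directed=True)  # all-pairs shortest paths
print(dist_matrix.shape)                      # (200, 200)
```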
Do you have any ideas for boosting the training speed of my program? It would be even better if a more efficient collector that supports environments with episodes of varying lengths were on your roadmap.
Thanks again for your awesome contributions.