
Clearer separation between the trainer and the algorithm and refactoring of policy classes #1034

@maxhuettenrauch

Description

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
    • design request (i.e. "X should be changed to Y.")
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    ```python
    import tianshou, gymnasium as gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__,
          numpy.__version__, sys.version, sys.platform)
    ```

While Tianshou's idea of encapsulating everything learning-related into a single method of a base policy class keeps algorithms more or less self-contained, it has led to a cluttered trainer and to the need for kwargs in the update call to remain functional (see #949). Additionally, although Tianshou uses RL jargon such as "policies", its policy classes do not do what policies do in RL terms, i.e., map from states to actions. Rather, Tianshou interweaves the RL policy with parts of the RL algorithm. The trainer, in turn, implements parts that may be algorithm-specific, such as sampling transitions and passing them to Tianshou policies for updates.

In order to create a cleaner framework, I think the following changes would improve things:

  • Separate trainer from algorithm logic
    As already mentioned, the trainer currently holds the logic for training loops and logging, and additionally holds some algorithm logic. Half of the if statements and asserts in the `__next__` function are only necessary because the trainer doesn't really know what type of RL it's doing. Instead, I think it makes much more sense to move algorithm logic into new algorithm classes and let the trainer be solely responsible for running the algorithm, logging, and checkpointing. There would also be only one trainer, instead of the OnPolicyTrainer, etc. Eventually, the trainer may even serve as the factory for the high-level API.
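
    This separation can be sketched roughly as follows. All class and method names here are hypothetical illustrations, not Tianshou's actual API: everything algorithm-specific sits behind a small `Algorithm` interface, and the trainer only loops, logs, and checkpoints.

    ```python
    from abc import ABC, abstractmethod

    class Algorithm(ABC):
        """Hypothetical base class: owns all algorithm-specific logic,
        including how transitions are sampled and how updates are done."""

        @abstractmethod
        def training_step(self) -> dict:
            """Collect data, perform one update, return stats to log."""

    class Trainer:
        """A single trainer for all algorithm types: it only runs the
        loop and records stats -- no if/else on the kind of RL."""

        def __init__(self, algorithm: Algorithm, num_epochs: int) -> None:
            self.algorithm = algorithm
            self.num_epochs = num_epochs
            self.log: list[dict] = []

        def run(self) -> list[dict]:
            for epoch in range(self.num_epochs):
                stats = self.algorithm.training_step()
                self.log.append({"epoch": epoch, **stats})
            return self.log

    class CountingAlgorithm(Algorithm):
        """Toy stand-in showing the trainer needs no algorithm knowledge."""

        def __init__(self) -> None:
            self.updates = 0

        def training_step(self) -> dict:
            self.updates += 1
            return {"updates": self.updates}
    ```

    With this shape, `Trainer(CountingAlgorithm(), num_epochs=3).run()` produces three log entries without the trainer ever branching on the algorithm type.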

  • Separate algorithms from policies
    While this adds an additional layer of abstraction/complexity, things will eventually become cleaner. Right now, many keyword arguments can be passed to the trainer, and they might not even be compatible with each other. A BaseAlgorithm class would allow for more explicit arguments for a given algorithm and would additionally accommodate unconventional algorithms such as ARS. It would also make the algorithm independent of code concerning the type of policy (continuous/discrete). The policy class then becomes merely the function that maps observations to actions.
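    The second split can be sketched the same way, again with hypothetical names: the policy is reduced to a pure observation-to-action map, and a `BaseAlgorithm` subclass takes explicit, algorithm-specific constructor arguments instead of loosely typed kwargs.

    ```python
    from abc import ABC, abstractmethod

    class Policy:
        """A pure policy: maps observations to actions, nothing else."""

        def __init__(self, weight: float = 0.0) -> None:
            self.weight = weight

        def __call__(self, obs: float) -> float:
            return self.weight * obs

    class BaseAlgorithm(ABC):
        """Hypothetical base class: owns the policy and the learning rule."""

        def __init__(self, policy: Policy) -> None:
            self.policy = policy

        @abstractmethod
        def update(self, batch: list[tuple[float, float]]) -> dict: ...

    class LeastSquaresImitation(BaseAlgorithm):
        """Toy algorithm whose hyperparameters (here: lr) are explicit
        constructor arguments rather than kwargs threaded through the
        trainer."""

        def __init__(self, policy: Policy, lr: float) -> None:
            super().__init__(policy)
            self.lr = lr

        def update(self, batch: list[tuple[float, float]]) -> dict:
            # One gradient step on the squared error between the
            # policy's action and a target action per (obs, act) pair.
            grad = sum((self.policy(o) - a) * o for o, a in batch) / len(batch)
            self.policy.weight -= self.lr * grad
            return {"grad": grad}
    ```

    The policy stays oblivious to how it is trained, so the same `Policy` could be reused by a different `BaseAlgorithm` subclass (e.g. an ARS-style one) without change.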

Labels

  • blocked: Can't be worked on for now
  • breaking changes: Changes in public interfaces. Includes small changes or changes in keys
  • major: Large changes that cannot or should not be broken down into smaller ones
  • refactoring: No change to functionality
  • tentative: Up to discussion, may be dismissed
