
Clearer separation between the trainer and the algorithm and refactoring of policy classes #1034

@maxhuettenrauch

Description

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
    • design request (i.e. "X should be changed to Y.")
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    ```python
    import tianshou, gymnasium as gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__,
          numpy.__version__, sys.version, sys.platform)
    ```

While Tianshou's idea of encapsulating everything learning-related into a single method of a base policy class keeps algorithms more or less self-contained, it has led to a cluttered trainer and to the need for kwargs in the update call to remain functional (see #949). Additionally, although Tianshou uses RL jargon such as "policies", its policy classes do not do what policies do in RL terms, i.e., map from states to actions. Rather, Tianshou interweaves the RL policy with parts of the RL algorithm. The trainer, in turn, implements parts that may be algorithm-specific, such as sampling transitions and passing them to Tianshou policies for updates.

In order to create a cleaner framework, I think the following changes would improve things:

  • Separate trainer from algorithm logic
    As already mentioned, the trainer currently holds the logic for training loops and logging, and additionally holds some algorithm logic. Half of the if statements and asserts in the `__next__` function are only necessary because the trainer doesn't really know what type of RL it's doing. Instead, I think it makes much more sense to move algorithm logic into new algorithm classes and let the trainer be solely responsible for running the algorithm, logging, and checkpointing. There would also be only one trainer, instead of the OnPolicyTrainer, etc. Eventually, the trainer may even serve as the factory for the high-level API.
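
    This separation can be sketched roughly as follows. All class and method names here are hypothetical illustrations, not Tianshou's actual API: everything algorithm-specific sits behind a small `Algorithm` interface, and the trainer only loops, logs, and checkpoints.

    ```python
    from abc import ABC, abstractmethod

    class Algorithm(ABC):
        """Hypothetical base class: owns all algorithm-specific logic,
        including how transitions are sampled and how updates are done."""

        @abstractmethod
        def training_step(self) -> dict:
            """Collect data, perform one update, return stats to log."""

    class Trainer:
        """A single trainer for all algorithm types: it only runs the
        loop and records stats -- no if/else on the kind of RL."""

        def __init__(self, algorithm: Algorithm, num_epochs: int) -> None:
            self.algorithm = algorithm
            self.num_epochs = num_epochs
            self.log: list[dict] = []

        def run(self) -> list[dict]:
            for epoch in range(self.num_epochs):
                stats = self.algorithm.training_step()
                self.log.append({"epoch": epoch, **stats})
            return self.log

    class CountingAlgorithm(Algorithm):
        """Toy stand-in showing the trainer needs no algorithm knowledge."""

        def __init__(self) -> None:
            self.updates = 0

        def training_step(self) -> dict:
            self.updates += 1
            return {"updates": self.updates}
    ```

    With this shape, `Trainer(CountingAlgorithm(), num_epochs=3).run()` produces three log entries without the trainer ever branching on the algorithm type.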

  • Separate algorithms from policies
    While this adds an additional layer of abstraction/complexity, things will eventually become cleaner. Right now, many keyword arguments can be passed to the trainer, and they might not even be compatible with each other. A BaseAlgorithm class would allow for more explicit arguments for a given algorithm and would additionally accommodate unconventional algorithms such as ARS. It would also make the algorithm independent of code concerning the type of policy (continuous/discrete). The policy class then becomes merely the function that maps observations to actions.
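    The second split can be sketched the same way, again with hypothetical names: the policy is reduced to a pure observation-to-action map, and a `BaseAlgorithm` subclass takes explicit, algorithm-specific constructor arguments instead of loosely typed kwargs.

    ```python
    from abc import ABC, abstractmethod

    class Policy:
        """A pure policy: maps observations to actions, nothing else."""

        def __init__(self, weight: float = 0.0) -> None:
            self.weight = weight

        def __call__(self, obs: float) -> float:
            return self.weight * obs

    class BaseAlgorithm(ABC):
        """Hypothetical base class: owns the policy and the learning rule."""

        def __init__(self, policy: Policy) -> None:
            self.policy = policy

        @abstractmethod
        def update(self, batch: list[tuple[float, float]]) -> dict: ...

    class LeastSquaresImitation(BaseAlgorithm):
        """Toy algorithm whose hyperparameters (here: lr) are explicit
        constructor arguments rather than kwargs threaded through the
        trainer."""

        def __init__(self, policy: Policy, lr: float) -> None:
            super().__init__(policy)
            self.lr = lr

        def update(self, batch: list[tuple[float, float]]) -> dict:
            # One gradient step on the squared error between the
            # policy's action and a target action per (obs, act) pair.
            grad = sum((self.policy(o) - a) * o for o, a in batch) / len(batch)
            self.policy.weight -= self.lr * grad
            return {"grad": grad}
    ```

    The policy stays oblivious to how it is trained, so the same `Policy` could be reused by a different `BaseAlgorithm` subclass (e.g. an ARS-style one) without change.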

Labels

  • blocked: Can't be worked on for now
  • breaking changes: Changes in public interfaces. Includes small changes or changes in keys
  • major: Large changes that cannot or should not be broken down into smaller ones
  • refactoring: No change to functionality
  • tentative: Up to discussion, may be dismissed
