Releases: thu-ml/tianshou
Beta release of 2.0.0
This is a pre-release (beta) of the 2.0.0 version of tianshou. The full release notes and updated documentation will be created for the release of the non-beta version.
The code on master can be considered stable; see the changelog for changes. Only minor changes, and likely no breaking ones, will be added for the full 2.0.0 release.
The main things still missing are enhanced benchmarking and documentation.
v1.2.0
This is the last major release before version 2.0.0.
It solves the regression in data collection performance, introduces several fixes, and, importantly, adds support for determinism testing, which is used to ensure that the refactoring in the upcoming 2.0.0 release does not affect any aspect of training or inference.
Changes/Improvements
- `trainer`:
  - Custom scoring now supported for selecting the best model. #1202
- `highlevel`:
  - `DiscreteSACExperimentBuilder`: Expose method `with_actor_factory_default` #1248 #1250
  - `ActorFactoryDefault`: Fix parameters for hidden sizes and activation not being
    passed on in the discrete case (affects the `with_actor_factory_default` method of experiment builders)
  - `ExperimentConfig`: Do not inherit from other classes, as this breaks automatic handling by
    `jsonargparse` when the class is used to define interfaces (as in high-level API examples)
  - `AutoAlphaFactoryDefault`: Differentiate discrete and continuous action spaces
    and allow the coefficient to be modified, adding an informative docstring
    (previous implementation was reasonable only for continuous action spaces)
    - Adjust usage in the `atari_sac_hl` example accordingly.
  - `NPGAgentFactory`, `TRPOAgentFactory`: Fix optimizer instantiation including the actor parameters
    (which was misleadingly suggested in the docstrings of the respective policy classes; docstrings were fixed),
    as the actor parameters are intended to be handled via natural gradients internally
- `data`:
  - `ReplayBuffer`: Fix collection of empty episodes being disallowed
  - Collection was slow due to `isinstance` checks on Protocols and due to Buffer integrity validation.
    This was solved by no longer performing `isinstance` checks on Protocols and by making the
    integrity validation disabled by default.
- Tests:
  - We have introduced extensive determinism tests which make it possible to validate whether
    training processes deterministically compute the same results across different development branches.
    This is an important step towards ensuring reproducibility and consistency, which will be
    instrumental in supporting Tianshou developers in their work, especially in the context of
    algorithm development and evaluation.
v1.1.0
Release 1.1.0
Highlights
Evaluation Package
This release introduces a new package `evaluation` that integrates best
practices for running experiments (seeding test and train environments) and for
evaluating them using the `rliable` library. This should be especially useful for
algorithm developers for comparing performance and creating meaningful
visualizations. This functionality is currently in alpha state and will be
further improved in the next releases.
You will need to install tianshou with the extra `eval` to use it.
The creation of multiple experiments with varying random seeds has been greatly
facilitated. Moreover, the `ExpLauncher` interface has been introduced and
implemented with several backends to support the execution of multiple
experiments in parallel.
An example for this using the high-level interfaces can be found here;
examples that use low-level interfaces will follow soon.
Improvements in Batch
Apart from that, several important
extensions have been added to internal data structures, most notably to `Batch`.
Batches now implement `__eq__` and can be meaningfully compared. Applying
operations in a nested fashion has been significantly simplified, and checking
for NaNs and dropping them is now possible.
One more notable change is that torch `Distribution` objects are now sliced when
slicing a batch. Previously, when a Batch with, say, 10 actions and a `dist`
corresponding to them was sliced to `[:3]`, the `dist` in the result would still
correspond to all 10 actions. Now, the `dist` is also "sliced" to be the
distribution of the first 3 actions.
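Below is a minimal sketch of the new `Batch` behavior, assuming standard construction with torch tensors and a `Categorical` distribution; the assertions illustrate the semantics described above rather than a verified test:

```python
import torch
from torch.distributions import Categorical
from tianshou.data import Batch

# Semantic equality: two batches with identical nested content compare equal.
b1 = Batch(obs=torch.zeros(3, 4), info=Batch(done=torch.tensor([0, 0, 1])))
b2 = Batch(obs=torch.zeros(3, 4), info=Batch(done=torch.tensor([0, 0, 1])))
assert b1 == b2

# Slicing a batch that holds a torch Distribution now also slices the
# distribution, keeping it aligned with the remaining actions.
batch = Batch(act=torch.arange(10), dist=Categorical(probs=torch.ones(10, 5) / 5))
head = batch[:3]
assert head.dist.probs.shape[0] == 3  # distribution of the first 3 actions only
```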
A detailed list of changes can be found below.
Changes/Improvements
- `evaluation`: New package for repeating the same experiment with multiple
  seeds and aggregating the results. #1074 #1141 #1183
- `data`:
  - `Batch`:
    - Add methods `to_dict` and `to_list_of_dicts`. #1063 #1098
    - Add methods `to_numpy_` and `to_torch_`. #1098, #1117
    - Add `__eq__` (semantic equality check). #1098
    - `keys()` deprecated in favor of `get_keys()` (needed to make iteration
      consistent with naming). #1105
    - Major: new methods for applying functions to values, to check for NaNs
      and drop them, and to set values. #1181
    - Slicing a batch with a torch distribution now also slices the
      distribution. #1181
  - `data.collector`:
- `trainer`:
  - Trainers can now control whether collectors should be reset prior to
    training. #1063
- `policy`:
- `highlevel`:
  - `SamplingConfig`:
  - `experiment`:
    - `Experiment` now has a `name` attribute, which can be set
      using `ExperimentBuilder.with_name` and which determines the default run name
      and therefore the persistence subdirectory.
      It can still be overridden in `Experiment.run()`, the new parameter name
      being `run_name` rather than `experiment_name` (although the latter will
      still be interpreted correctly). #1074 #1131
    - Add class `ExperimentCollection` for the convenient execution of
      multiple experiment runs. #1131
    - The `World` object, containing all low-level objects needed for experimentation,
      can now be extracted from an `Experiment` instance. This enables customizing
      the experiment prior to its execution, bridging the low and high-level interfaces. #1187
  - `ExperimentBuilder`:
- `env`:
  - Added new `VectorEnvType` called `SUBPROC_SHARED_MEM_AUTO`, used
    for Atari and Mujoco venv creation. #1141
- `utils`:
  - `logger`:
  - `net.continuous.Critic`:
    - Add flag `apply_preprocess_net_to_obs_only` to allow the
      preprocessing network to be applied to the observations only (without
      the actions concatenated), which is essential for the case where we want
      to reuse the actor's preprocessing network. #1128
  - `torch_utils` (new module):
    - Added context managers `torch_train_mode` and `policy_within_training_step`
      (see the sketch after this list). #1123
  - `print`:
    - `DataclassPPrintMixin` now supports outputting a string, not just
      printing the pretty repr. #1141
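Below is a minimal sketch of the `torch_train_mode` context manager, assuming it takes an `nn.Module` and restores the previous train/eval mode on exit; `policy_within_training_step` is used analogously around a policy's training step:

```python
import torch
from tianshou.utils.torch_utils import torch_train_mode

net = torch.nn.Linear(4, 2)
net.eval()  # the module is normally kept in eval mode outside of training

with torch_train_mode(net):
    # inside the context the module is temporarily switched to train mode
    assert net.training
assert not net.training  # the previous (eval) mode is restored on exit
```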
Fixes
- `highlevel`:
  - `CriticFactoryReuseActor`: Enable the `Critic` flag `apply_preprocess_net_to_obs_only`
    for continuous critics,
    fixing the case where we want to reuse an actor's preprocessing network
    for the critic (affects usages of the experiment builder method
    `with_critic_factory_use_actor` with continuous environments). #1128
  - Policy parameter `action_scaling` value `"default"` was not correctly
    transformed to a Boolean value for
    the algorithms SAC, DDPG, TD3 and REDQ. The value `"default"` being truthy
    caused action scaling to be enabled
    even for discrete action spaces. #1191
- `atari_network.DQN`:
- `PPOPolicy`:
  - Fix `max_batchsize` not being used in `logp_old` computation
    inside `process_fn`. #1168
- Fix `Batch.__eq__` to allow comparing Batches with scalar array values. #1185
Internal Improvements
- `Collector`s rely less on state, the few stateful things are stored explicitly
  instead of through a `.data` attribute. #1063
- Introduced a first iteration of a naming convention for vars in `Collector`s. #1063
- Generally improved readability of Collector code and associated tests (still
  quite some way to go). #1063
- Improved typing for `exploration_noise` and within Collector. #1063
- Better variable names related to model outputs (logits, dist input etc.). #1032
- Improved typing for actors and critics, using Tianshou classes
  like `Actor`, `ActorProb`, etc., instead of just `nn.Module`. #1032
- Added interfaces for most `Actor` and `Critic` classes to enforce the presence
  of `forward` methods. #1032
- Simplified `PGPolicy` forward by unifying the `dist_fn` interface (see
  associated breaking change). #1032
- Use `.mode` of distribution instead of relying on knowledge of the
  distribution type. #1032
- Exception no longer raised on `len` of empty `Batch`. #1084
- Tests and examples are covered by `mypy`. #1077
- `NetBase` is more used, stricter typing by making it generic. #1077
- Use explicit multiprocessing context for creating `Pipe` in `subproc.py`. #1102
- Improved documentation and naming in many places
Breaking Changes
- `data`:
  - `Collector`:
  - `Batch`:
    - Fixed `iter(Batch(...))`, which now behaves the same way
      as `Batch(...).__iter__()`.
      Can be considered a bugfix. #1063
    - The methods `to_numpy` and `to_torch` are no longer in-place
      (use `to_numpy_` or `to_torch_` instead). #1098, #1117
    - The method `Batch.is_empty` has been removed. Instead, the user can
      simply check for emptiness of a Batch by using `len`. #1144
    - Stricter `cat_`, only concatenation of batches with the same structure
      is allowed. #1181
    - `to_torch` and `to_numpy` are no longer static methods.
      So `Batch.to_numpy(batch)` should be replaced by `batch.to_numpy()`
      (a migration sketch follows this section). #1200
- `utils`:
  - `logger`:
  - `utils.net`:
    - `Recurrent` now receives and returns
      a `RecurrentStateBatch` instead of a dict. #1077
  - Modules with code that was copied from sensAI have been replaced by
    imports from the new dependency sensAI-utils:
    - `tianshou.utils.logging` is replaced with `sensai.util.logging`
    - `tianshou.utils.string` is replaced with `sensai.util.string`
    - `tianshou.utils.pickle` is replaced with `sensai.util.pickle`
- `env`:
  - All VectorEnvs now return a numpy array of info-dicts ...
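Below is a minimal migration sketch for the `Batch` breaking changes listed above; the method names come from the notes, and the emptiness check via `len` reflects the removal of `Batch.is_empty`:

```python
import numpy as np
from tianshou.data import Batch

batch = Batch(obs=np.zeros((2, 3)))

# Before 1.1.0 (per the notes above), the static, in-place forms were used:
#   Batch.to_torch(batch)  /  Batch.to_numpy(batch)
# Since 1.1.0:
converted = batch.to_torch()  # instance method returning a converted Batch
batch.to_torch_()             # trailing underscore marks the in-place variant

# Batch.is_empty has been removed; check emptiness with len instead.
assert len(Batch()) == 0
```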
1.0.0 - High level API, Improved Interfaces and Typing
Release 1.0.0
This release focuses on updating and improving Tianshou internals (in particular, code quality) while introducing relatively few breaking changes (apart from things like the Python and dependency version requirements).
We view it as a significant step towards transforming Tianshou into the go-to place both for RL researchers and for RL practitioners working on industry projects.
This is the first release after the appliedAI Institute (the TransferLab division) has decided to further develop Tianshou and provide long-term support.
Breaking Changes
- dropped support of python<3.11
- dropped support of gym, from now on only Gymnasium envs are supported
- removed functions like `offpolicy_trainer` in favor of `OffpolicyTrainer(...).run()`
  (this affects all example scripts; see the migration sketch below)
- several breaking changes related to removing `**kwargs` from signatures and
  renaming internal attributes (like `critic1` -> `critic`)
- Outputs of training methods are now dataclasses instead of dicts
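Below is a hedged migration sketch for the trainer change; the keyword arguments are illustrative placeholders mirroring what was previously passed to `offpolicy_trainer`, not a verified signature:

```python
from tianshou.data import Collector
from tianshou.policy import BasePolicy
from tianshou.trainer import OffpolicyTrainer


def train(policy: BasePolicy, train_collector: Collector, test_collector: Collector):
    # Before 1.0.0:
    #   result = offpolicy_trainer(policy, train_collector, test_collector, ...)
    # Since 1.0.0, the trainer is a class and training is started via .run():
    return OffpolicyTrainer(
        policy=policy,
        train_collector=train_collector,
        test_collector=test_collector,
        max_epoch=10,        # placeholder values
        step_per_epoch=1000,
        step_per_collect=10,
        episode_per_test=10,
        batch_size=64,
    ).run()
```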
Functionality Extensions
Major
- High level interfaces for experiments, demonstrated by the new example scripts with names ending in `_hl.py`
Minor
- Method to compute action directly from a policy's observation; can be used for unrolling (see the sketch after this list)
- Support for custom keys in ReplayBuffer
- Support for CalQL as part of CQL
- Support for explicit setting of multiprocessing context for SubprocEnvWorker
- `critic2` no longer has to be explicitly constructed and passed if it is supposed to be
  the same network as `critic` (formerly `critic1`)
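Below is a minimal unrolling sketch for the new per-observation action computation; the method name `compute_action` and its single-observation call are assumptions based on the description above:

```python
import gymnasium as gym
from tianshou.policy import BasePolicy


def unroll_episode(policy: BasePolicy, env: gym.Env, seed: int = 0) -> float:
    """Roll out one episode, querying the policy one observation at a time."""
    obs, _ = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        act = policy.compute_action(obs)  # action computed directly from the observation
        obs, reward, terminated, truncated, _ = env.step(act)
        total_reward += float(reward)
        done = terminated or truncated
    return total_reward
```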
Internal Improvements
Build and Docs
- Completely changed the build pipeline. Tianshou now uses poetry, black, ruff, poethepoet, nbqa and other niceties.
- Notebook tutorials are now part of the repository (previously they were in a drive). They were fixed and are executed during the build as integration tests, in addition to serving as documentation. Parts of the content have been improved.
- Documentation is now built with Jupyter Book. The JavaScript code has been slightly improved, and JS dependencies are included as part of the repository.
- Many improvements in docstrings
Typing
- Adding `BatchPrototypes` to cover the fields needed and returned by methods relying on batches
  in a backwards-compatible way
- Removing `**kwargs` from policies' constructors
- Overall, much stricter and more correct typing. Removing `kwargs` and replacing dicts by
  dataclasses in several places.
- Making use of `Generic` to express different kinds of stats that can be returned by `learn` and `update`
- Improved typing in `tests` and `examples`, close to passing mypy
General
- Reduced duplication, improved readability and simplified code in several places
- Use `dist.mode` instead of inferring `loc` or `argmax` from the `dist_fn` input
Contributions
The OG creators
- @Trinkle23897 participated in almost all aspects of the coordination and reviewed most of the merged PRs
- @nuance1979 participated in several discussions
From appliedAI
The team working on this release of Tianshou consisted of @opcode81 @MischaPanch @maxhuettenrauch @carlocagnetta @bordeauxred
External contributions
- @BFAnas participated in several discussions and contributed the CalQL implementation, extending the pre-processing logic.
- @dantp-ai fixed many mypy issues and improved the tests
- @arnaujc91 improved the logic of computing deterministic actions
- Many other contributors, among them many new ones, participated in this release. The Tianshou team is very grateful for your contributions!
0.5.0: Gymnasium Support
Enhancement
- Gymnasium Integration (#789, @Markus28)
- Implement args/kwargs for init of norm_layers and activation (#788, @janofsun)
- Add "act" to preprocess_fn call in collector. (#801, @jamartinh)
- Various update (#803, #826, @Trinkle23897)
Bug fix
0.4.11
Enhancement
- Hindsight Experience Replay as a replay buffer (#753, @Juno-T)
- Fix Atari PPO example (#780, @nuance1979)
- Update experiment details of MuJoCo benchmark (#779, @ChenDRAG)
- Tiny change since the tests are more than unit tests (#765, @fzyzcjy)
Bug Fix
- Multi-agent: gym->gymnasium; render() update (#769, @WillDudley)
- Updated atari wrappers (#781, @Markus28)
- Fix info not pass issue in PGPolicy (#787, @Trinkle23897)
0.4.10
Enhancement
- Changes to support Gym 0.26.0 (#748, @Markus28)
- Added pre-commit (#752, @Markus28)
- Added support for new PettingZoo API (#751, @Markus28)
- Fix docs tictactoc dummy vector env (#749, @5cat)
Bug fix
- Fix 2 bugs and refactor RunningMeanStd to support dict obs norm (#695, @Trinkle23897)
- Do not allow async simulation for test collector (#705, @cwher)
- Fix venv wrapper reset retval error with gym env (#712, @Trinkle23897)
0.4.9
Bug Fix
- Fix save_checkpoint_fn return value to checkpoint_path (#659, @Trinkle23897)
- Fix an off-by-one bug in trainer iterator (#659, @Trinkle23897)
- Fix a bug in Discrete SAC evaluation; default to deterministic mode (#657, @nuance1979)
- Fix a bug in trainer about test reward not logged because self.env_step is not set for offline setting (#660, @nuance1979)
- Fix exception with watching pistonball environments (#663, @ycheng517)
- Use `env.np_random.integers` instead of `env.np_random.randint` in Atari examples (#613, @ycheng517)
API Change
- Upgrade gym to `>=0.23.1`, support `seed` and `return_info` arguments for reset (#613, @ycheng517)
New Features
- Add BranchDQN for large discrete action spaces (#618, @BFAnas)
- Add show_progress option for trainer (#641, @michalgregor)
- Added support for clipping to DQNPolicy (#642, @michalgregor)
- Implement TD3+BC for offline RL (#660, @nuance1979)
- Add multiDiscrete to discrete gym action space wrapper (#664, @BFAnas)
Enhancement
- Use envpool in vizdoom example (#634, @Trinkle23897)
- Add Atari (discrete) SAC examples (#657, @nuance1979)
0.4.8
Bug fix
Enhancement
- Add write_flush in two loggers, fix argument passing in WandbLogger (#581, @Trinkle23897)
- Update Multi-agent RL docs and upgrade pettingzoo (#595, @ycheng517)
- Add learning rate scheduler to BasePolicy (#598, @alexnikulkov)
- Add Jupyter notebook tutorials using Google Colaboratory (#599, @ChenDRAG)
- Unify `utils.network`: change action_dim to action_shape (#602, @Squeemos)
- Update Mujoco benchmark's webpage (#606, @ChenDRAG)
- Add Atari results (#600, @gogoduan) (#616, @ChenDRAG)
- Convert RL Unplugged Atari datasets to tianshou ReplayBuffer (#621, @nuance1979)
- Implement REDQ (#623, @Jimenius)
- Improve data loading from D4RL and convert RL Unplugged to D4RL format (#624, @nuance1979)
- Add vecenv wrappers for obs_norm to support running mujoco experiment with envpool (#628, @Trinkle23897)
0.4.7
Bug Fix
- Add map_action_inverse for fixing the error of storing random action (#568)
API Change
- Update WandbLogger implementation and update Atari examples; use Tensorboard SummaryWriter as core with `wandb.init(..., sync_tensorboard=True)` (#558, #562)
- Rename save_fn to save_best_fn to avoid ambiguity (#575)
- (Internal) Add `tianshou.utils.deprecation` for a unified deprecation wrapper. (#575)
New Features
- Implement Generative Adversarial Imitation Learning (GAIL), add Mujoco examples (#550)
- Add Trainers as generators: OnpolicyTrainer, OffpolicyTrainer, and OfflineTrainer; remove duplicated code and merge into base trainer (#559)
Enhancement
- Add imitation baselines for offline RL (#566)