Releases: thu-ml/tianshou
Beta release of 2.0.0
This is a pre-release (beta) of the 2.0.0 version of tianshou. The full release notes and updated documentation will be created for the release of the non-beta version.
The code on master can be considered stable; see the changelog for changes. Only minor changes, and likely no breaking ones, will be added for the full 2.0.0 release.
The main things still missing are enhanced benchmarking and documentation.
v1.2.0
This is the last major release before version 2.0.0.
It solves the regression in data collection performance, introduces several fixes, and, importantly, adds support for determinism testing, which is used to ensure that the refactoring in the upcoming 2.0.0 release does not affect any aspect of training or inference.
Changes/Improvements
- `trainer`:
  - Custom scoring now supported for selecting the best model. #1202
- `highlevel`:
  - `DiscreteSACExperimentBuilder`: Expose method `with_actor_factory_default` #1248 #1250
  - `ActorFactoryDefault`: Fix parameters for hidden sizes and activation not being
    passed on in the discrete case (affects the `with_actor_factory_default` method of experiment builders)
  - `ExperimentConfig`: Do not inherit from other classes, as this breaks automatic handling by
    `jsonargparse` when the class is used to define interfaces (as in high-level API examples)
  - `AutoAlphaFactoryDefault`: Differentiate discrete and continuous action spaces
    and allow the coefficient to be modified, adding an informative docstring
    (previous implementation was reasonable only for continuous action spaces)
    - Adjust usage in the `atari_sac_hl` example accordingly.
  - `NPGAgentFactory`, `TRPOAgentFactory`: Fix optimizer instantiation including the actor parameters
    (which was misleadingly suggested in the docstrings of the respective policy classes; docstrings were fixed),
    as the actor parameters are intended to be handled via natural gradients internally
- `data`:
  - `ReplayBuffer`: Fix collection of empty episodes being disallowed
  - Collection was slow due to `isinstance` checks on Protocols and due to Buffer integrity validation.
    This was solved by no longer performing `isinstance` checks on Protocols and by making the
    integrity validation disabled by default.
- Tests:
  - We have introduced extensive determinism tests which make it possible to validate whether
    training processes deterministically compute the same results across different development branches.
    This is an important step towards ensuring reproducibility and consistency, which will be
    instrumental in supporting Tianshou developers in their work, especially in the context of
    algorithm development and evaluation.
v1.1.0
Release 1.1.0
Highlights
Evaluation Package
This release introduces a new package `evaluation` that integrates best
practices for running experiments (seeding test and train environments) and for
evaluating them using the `rliable` library. This should be especially useful for
algorithm developers for comparing performance and creating meaningful
visualizations. This functionality is currently in alpha state and will be
further improved in the next releases.
You will need to install tianshou with the extra `eval` to use it.
The creation of multiple experiments with varying random seeds has been greatly
facilitated. Moreover, the `ExpLauncher` interface has been introduced and
implemented with several backends to support the execution of multiple
experiments in parallel.
An example for this using the high-level interfaces can be found here;
examples that use low-level interfaces will follow soon.
Improvements in Batch
Apart from that, several important
extensions have been added to internal data structures, most notably to `Batch`.
Batches now implement `__eq__` and can be meaningfully compared. Applying
operations in a nested fashion has been significantly simplified, and checking
for NaNs and dropping them is now possible.
One more notable change is that torch `Distribution` objects are now sliced when
slicing a batch. Previously, when a Batch with, say, 10 actions and a `dist`
corresponding to them was sliced to `[:3]`, the `dist` in the result would still
correspond to all 10 actions. Now, the `dist` is also "sliced" to be the
distribution of the first 3 actions.
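Below is a minimal sketch of the new `Batch` behavior, assuming standard construction with torch tensors and a `Categorical` distribution; the assertions illustrate the semantics described above rather than a verified test:

```python
import torch
from torch.distributions import Categorical
from tianshou.data import Batch

# Semantic equality: two batches with identical nested content compare equal.
b1 = Batch(obs=torch.zeros(3, 4), info=Batch(done=torch.tensor([0, 0, 1])))
b2 = Batch(obs=torch.zeros(3, 4), info=Batch(done=torch.tensor([0, 0, 1])))
assert b1 == b2

# Slicing a batch that holds a torch Distribution now also slices the
# distribution, keeping it aligned with the remaining actions.
batch = Batch(act=torch.arange(10), dist=Categorical(probs=torch.ones(10, 5) / 5))
head = batch[:3]
assert head.dist.probs.shape[0] == 3  # distribution of the first 3 actions only
```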
A detailed list of changes can be found below.
Changes/Improvements
- `evaluation`: New package for repeating the same experiment with multiple
  seeds and aggregating the results. #1074 #1141 #1183
- `data`:
  - `Batch`:
    - Add methods `to_dict` and `to_list_of_dicts`. #1063 #1098
    - Add methods `to_numpy_` and `to_torch_`. #1098, #1117
    - Add `__eq__` (semantic equality check). #1098
    - `keys()` deprecated in favor of `get_keys()` (needed to make iteration
      consistent with naming). #1105
    - Major: new methods for applying functions to values, to check for NaNs
      and drop them, and to set values. #1181
    - Slicing a batch with a torch distribution now also slices the
      distribution. #1181
  - `data.collector`:
- `trainer`:
  - Trainers can now control whether collectors should be reset prior to
    training. #1063
- `policy`:
- `highlevel`:
  - `SamplingConfig`:
  - `experiment`:
    - `Experiment` now has a `name` attribute, which can be set
      using `ExperimentBuilder.with_name` and which determines the default run name
      and therefore the persistence subdirectory.
      It can still be overridden in `Experiment.run()`, the new parameter name
      being `run_name` rather than `experiment_name` (although the latter will
      still be interpreted correctly). #1074 #1131
    - Add class `ExperimentCollection` for the convenient execution of
      multiple experiment runs. #1131
    - The `World` object, containing all low-level objects needed for experimentation,
      can now be extracted from an `Experiment` instance. This enables customizing
      the experiment prior to its execution, bridging the low and high-level interfaces. #1187
  - `ExperimentBuilder`:
- `env`:
  - Added new `VectorEnvType` called `SUBPROC_SHARED_MEM_AUTO`, used
    for Atari and Mujoco venv creation. #1141
- `utils`:
  - `logger`:
  - `net.continuous.Critic`:
    - Add flag `apply_preprocess_net_to_obs_only` to allow the
      preprocessing network to be applied to the observations only (without
      the actions concatenated), which is essential for the case where we want
      to reuse the actor's preprocessing network. #1128
  - `torch_utils` (new module):
    - Added context managers `torch_train_mode` and `policy_within_training_step`
      (see the sketch after this list). #1123
  - `print`:
    - `DataclassPPrintMixin` now supports outputting a string, not just
      printing the pretty repr. #1141
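Below is a minimal sketch of the `torch_train_mode` context manager, assuming it takes an `nn.Module` and restores the previous train/eval mode on exit; `policy_within_training_step` is used analogously around a policy's training step:

```python
import torch
from tianshou.utils.torch_utils import torch_train_mode

net = torch.nn.Linear(4, 2)
net.eval()  # the module is normally kept in eval mode outside of training

with torch_train_mode(net):
    # inside the context the module is temporarily switched to train mode
    assert net.training
assert not net.training  # the previous (eval) mode is restored on exit
```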
Fixes
- `highlevel`:
  - `CriticFactoryReuseActor`: Enable the `Critic` flag `apply_preprocess_net_to_obs_only`
    for continuous critics,
    fixing the case where we want to reuse an actor's preprocessing network
    for the critic (affects usages of the experiment builder method
    `with_critic_factory_use_actor` with continuous environments). #1128
  - Policy parameter `action_scaling` value `"default"` was not correctly
    transformed to a Boolean value for
    the algorithms SAC, DDPG, TD3 and REDQ. The value `"default"` being truthy
    caused action scaling to be enabled
    even for discrete action spaces. #1191
- `atari_network.DQN`:
- `PPOPolicy`:
  - Fix `max_batchsize` not being used in `logp_old` computation
    inside `process_fn`. #1168
- Fix `Batch.__eq__` to allow comparing Batches with scalar array values. #1185
Internal Improvements
- `Collector`s rely less on state, the few stateful things are stored explicitly
  instead of through a `.data` attribute. #1063
- Introduced a first iteration of a naming convention for vars in `Collector`s. #1063
- Generally improved readability of Collector code and associated tests (still
  quite some way to go). #1063
- Improved typing for `exploration_noise` and within Collector. #1063
- Better variable names related to model outputs (logits, dist input etc.). #1032
- Improved typing for actors and critics, using Tianshou classes
  like `Actor`, `ActorProb`, etc., instead of just `nn.Module`. #1032
- Added interfaces for most `Actor` and `Critic` classes to enforce the presence
  of `forward` methods. #1032
- Simplified `PGPolicy` forward by unifying the `dist_fn` interface (see
  associated breaking change). #1032
- Use `.mode` of distribution instead of relying on knowledge of the
  distribution type. #1032
- Exception no longer raised on `len` of empty `Batch`. #1084
- Tests and examples are covered by `mypy`. #1077
- `NetBase` is more used, stricter typing by making it generic. #1077
- Use explicit multiprocessing context for creating `Pipe` in `subproc.py`. #1102
- Improved documentation and naming in many places
Breaking Changes
- `data`:
  - `Collector`:
  - `Batch`:
    - Fixed `iter(Batch(...))`, which now behaves the same way
      as `Batch(...).__iter__()`.
      Can be considered a bugfix. #1063
    - The methods `to_numpy` and `to_torch` are no longer in-place
      (use `to_numpy_` or `to_torch_` instead). #1098, #1117
    - The method `Batch.is_empty` has been removed. Instead, the user can
      simply check for emptiness of a Batch by using `len`. #1144
    - Stricter `cat_`, only concatenation of batches with the same structure
      is allowed. #1181
    - `to_torch` and `to_numpy` are no longer static methods.
      So `Batch.to_numpy(batch)` should be replaced by `batch.to_numpy()`
      (a migration sketch follows this section). #1200
- `utils`:
  - `logger`:
  - `utils.net`:
    - `Recurrent` now receives and returns
      a `RecurrentStateBatch` instead of a dict. #1077
  - Modules with code that was copied from sensAI have been replaced by
    imports from the new dependency sensAI-utils:
    - `tianshou.utils.logging` is replaced with `sensai.util.logging`
    - `tianshou.utils.string` is replaced with `sensai.util.string`
    - `tianshou.utils.pickle` is replaced with `sensai.util.pickle`
- `env`:
  - All VectorEnvs now return a numpy array of info-dicts ...
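Below is a minimal migration sketch for the `Batch` breaking changes listed above; the method names come from the notes, and the emptiness check via `len` reflects the removal of `Batch.is_empty`:

```python
import numpy as np
from tianshou.data import Batch

batch = Batch(obs=np.zeros((2, 3)))

# Before 1.1.0 (per the notes above), the static, in-place forms were used:
#   Batch.to_torch(batch)  /  Batch.to_numpy(batch)
# Since 1.1.0:
converted = batch.to_torch()  # instance method returning a converted Batch
batch.to_torch_()             # trailing underscore marks the in-place variant

# Batch.is_empty has been removed; check emptiness with len instead.
assert len(Batch()) == 0
```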
1.0.0 - High level API, Improved Interfaces and Typing
Release 1.0.0
This release focuses on updating and improving Tianshou internals (in particular, code quality) while introducing relatively few breaking changes (apart from things like the Python and dependency version requirements).
We view it as a significant step towards transforming Tianshou into the go-to place both for RL researchers and for RL practitioners working on industry projects.
This is the first release after the appliedAI Institute (the TransferLab division) has decided to further develop Tianshou and provide long-term support.
Breaking Changes
- dropped support of python<3.11
- dropped support of gym, from now on only Gymnasium envs are supported
- removed functions like `offpolicy_trainer` in favor of `OffpolicyTrainer(...).run()`
  (this affects all example scripts; see the migration sketch below)
- several breaking changes related to removing `**kwargs` from signatures and
  renaming internal attributes (like `critic1` -> `critic`)
- Outputs of training methods are now dataclasses instead of dicts
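Below is a hedged migration sketch for the trainer change; the keyword arguments are illustrative placeholders mirroring what was previously passed to `offpolicy_trainer`, not a verified signature:

```python
from tianshou.data import Collector
from tianshou.policy import BasePolicy
from tianshou.trainer import OffpolicyTrainer


def train(policy: BasePolicy, train_collector: Collector, test_collector: Collector):
    # Before 1.0.0:
    #   result = offpolicy_trainer(policy, train_collector, test_collector, ...)
    # Since 1.0.0, the trainer is a class and training is started via .run():
    return OffpolicyTrainer(
        policy=policy,
        train_collector=train_collector,
        test_collector=test_collector,
        max_epoch=10,        # placeholder values
        step_per_epoch=1000,
        step_per_collect=10,
        episode_per_test=10,
        batch_size=64,
    ).run()
```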
Functionality Extensions
Major
- High level interfaces for experiments, demonstrated by the new example scripts with names ending in `_hl.py`
Minor
- Method to compute action directly from a policy's observation; can be used for unrolling (see the sketch after this list)
- Support for custom keys in ReplayBuffer
- Support for CalQL as part of CQL
- Support for explicit setting of multiprocessing context for SubprocEnvWorker
- `critic2` no longer has to be explicitly constructed and passed if it is supposed to be
  the same network as `critic` (formerly `critic1`)
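Below is a minimal unrolling sketch for the new per-observation action computation; the method name `compute_action` and its single-observation call are assumptions based on the description above:

```python
import gymnasium as gym
from tianshou.policy import BasePolicy


def unroll_episode(policy: BasePolicy, env: gym.Env, seed: int = 0) -> float:
    """Roll out one episode, querying the policy one observation at a time."""
    obs, _ = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        act = policy.compute_action(obs)  # action computed directly from the observation
        obs, reward, terminated, truncated, _ = env.step(act)
        total_reward += float(reward)
        done = terminated or truncated
    return total_reward
```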
Internal Improvements
Build and Docs
- Completely changed the build pipeline. Tianshou now uses poetry, black, ruff, poethepoet, nbqa and other niceties.
- Notebook tutorials are now part of the repository (previously they were in a drive). They were fixed and are executed during the build as integration tests, in addition to serving as documentation. Parts of the content have been improved.
- Documentation is now built with Jupyter Book. The JavaScript code has been slightly improved, and JS dependencies are included as part of the repository.
- Many improvements in docstrings
Typing
- Adding `BatchPrototypes` to cover the fields needed and returned by methods relying on batches
  in a backwards-compatible way
- Removing `**kwargs` from policies' constructors
- Overall, much stricter and more correct typing. Removing `kwargs` and replacing dicts by
  dataclasses in several places.
- Making use of `Generic` to express different kinds of stats that can be returned by `learn` and `update`
- Improved typing in `tests` and `examples`, close to passing mypy
General
- Reduced duplication, improved readability and simplified code in several places
- Use `dist.mode` instead of inferring `loc` or `argmax` from the `dist_fn` input
Contributions
The OG creators
- @Trinkle23897 participated in almost all aspects of the coordination and reviewed most of the merged PRs
- @nuance1979 participated in several discussions
From appliedAI
The team working on this release of Tianshou consisted of @opcode81 @MischaPanch @maxhuettenrauch @carlocagnetta @bordeauxred
External contributions
- @BFAnas participated in several discussions and contributed the CalQL implementation, extending the pre-processing logic.
- @dantp-ai fixed many mypy issues and improved the tests
- @arnaujc91 improved the logic of computing deterministic actions
- Many other contributors, among them many new ones, participated in this release. The Tianshou team is very grateful for your contributions!
0.5.0: Gymnasium Support
Enhancement
- Gymnasium Integration (#789, @Markus28)
- Implement args/kwargs for init of norm_layers and activation (#788, @janofsun)
- Add "act" to preprocess_fn call in collector. (#801, @jamartinh)
- Various update (#803, #826, @Trinkle23897)
Bug fix
0.4.11
Enhancement
- Hindsight Experience Replay as a replay buffer (#753, @Juno-T)
- Fix Atari PPO example (#780, @nuance1979)
- Update experiment details of MuJoCo benchmark (#779, @ChenDRAG)
- Tiny change since the tests are more than unit tests (#765, @fzyzcjy)
Bug Fix
- Multi-agent: gym->gymnasium; render() update (#769, @WillDudley)
- Updated atari wrappers (#781, @Markus28)
- Fix info not pass issue in PGPolicy (#787, @Trinkle23897)
0.4.10
Enhancement
- Changes to support Gym 0.26.0 (#748, @Markus28)
- Added pre-commit (#752, @Markus28)
- Added support for new PettingZoo API (#751, @Markus28)
- Fix docs tictactoc dummy vector env (#749, @5cat)
Bug fix
- Fix 2 bugs and refactor RunningMeanStd to support dict obs norm (#695, @Trinkle23897)
- Do not allow async simulation for test collector (#705, @cwher)
- Fix venv wrapper reset retval error with gym env (#712, @Trinkle23897)
0.4.9
Bug Fix
- Fix save_checkpoint_fn return value to checkpoint_path (#659, @Trinkle23897)
- Fix an off-by-one bug in trainer iterator (#659, @Trinkle23897)
- Fix a bug in Discrete SAC evaluation; default to deterministic mode (#657, @nuance1979)
- Fix a bug in trainer about test reward not logged because self.env_step is not set for offline setting (#660, @nuance1979)
- Fix exception with watching pistonball environments (#663, @ycheng517)
- Use `env.np_random.integers` instead of `env.np_random.randint` in Atari examples (#613, @ycheng517)
API Change
- Upgrade gym to `>=0.23.1`, support `seed` and `return_info` arguments for reset (#613, @ycheng517)
New Features
- Add BranchDQN for large discrete action spaces (#618, @BFAnas)
- Add show_progress option for trainer (#641, @michalgregor)
- Added support for clipping to DQNPolicy (#642, @michalgregor)
- Implement TD3+BC for offline RL (#660, @nuance1979)
- Add multiDiscrete to discrete gym action space wrapper (#664, @BFAnas)
Enhancement
- Use envpool in vizdoom example (#634, @Trinkle23897)
- Add Atari (discrete) SAC examples (#657, @nuance1979)
0.4.8
Bug fix
Enhancement
- Add write_flush in two loggers, fix argument passing in WandbLogger (#581, @Trinkle23897)
- Update Multi-agent RL docs and upgrade pettingzoo (#595, @ycheng517)
- Add learning rate scheduler to BasePolicy (#598, @alexnikulkov)
- Add Jupyter notebook tutorials using Google Colaboratory (#599, @ChenDRAG)
- Unify `utils.network`: change action_dim to action_shape (#602, @Squeemos)
- Update Mujoco benchmark's webpage (#606, @ChenDRAG)
- Add Atari results (#600, @gogoduan) (#616, @ChenDRAG)
- Convert RL Unplugged Atari datasets to tianshou ReplayBuffer (#621, @nuance1979)
- Implement REDQ (#623, @Jimenius)
- Improve data loading from D4RL and convert RL Unplugged to D4RL format (#624, @nuance1979)
- Add vecenv wrappers for obs_norm to support running mujoco experiment with envpool (#628, @Trinkle23897)
0.4.7
Bug Fix
- Add map_action_inverse for fixing the error of storing random action (#568)
API Change
- Update WandbLogger implementation and update Atari examples; use Tensorboard SummaryWriter as core with `wandb.init(..., sync_tensorboard=True)` (#558, #562)
- Rename save_fn to save_best_fn to avoid ambiguity (#575)
- (Internal) Add `tianshou.utils.deprecation` for a unified deprecation wrapper. (#575)
New Features
- Implement Generative Adversarial Imitation Learning (GAIL), add Mujoco examples (#550)
- Add Trainers as generators: OnpolicyTrainer, OffpolicyTrainer, and OfflineTrainer; remove duplicated code and merge into base trainer (#559)
Enhancement
- Add imitation baselines for offline RL (#566)