Releases: thu-ml/tianshou
0.4.7
Bug Fix
- Add map_action_inverse to fix the error of storing random actions (#568)
API Change
- Update the WandbLogger implementation and the Atari examples: use a Tensorboard SummaryWriter as the core and sync it with wandb.init(..., sync_tensorboard=True) (#558, #562) (see the sketch after this list)
- Rename save_fn to save_best_fn to avoid ambiguity (#575)
- (Internal) Add tianshou.utils.deprecation for a unified deprecation wrapper (#575)
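Under this setup, anything written through the Tensorboard SummaryWriter is mirrored to Weights & Biases. The sketch below illustrates only the mechanism, with a made-up project name and log directory; it is not the Atari example code.

```python
import wandb
from torch.utils.tensorboard import SummaryWriter

# Illustrative sketch: with sync_tensorboard=True, wandb mirrors every scalar
# written through the Tensorboard SummaryWriter. Project name, log directory,
# and the logged value are placeholders.
wandb.init(project="tianshou-demo", sync_tensorboard=True)
writer = SummaryWriter("log/demo")
writer.add_scalar("train/reward", 123.0, global_step=0)
writer.close()
wandb.finish()
```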
New Features
- Implement Generative Adversarial Imitation Learning (GAIL), add Mujoco examples (#550)
- Add trainers as generators: OnpolicyTrainer, OffpolicyTrainer, and OfflineTrainer; remove duplicated code and merge it into a base trainer (#559)
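A rough, self-contained sketch of the generator-style trainer on CartPole, assuming the 0.4.7 API; the environment choice and all hyperparameters are illustrative and not taken from the release.

```python
import gym
import torch
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import DQNPolicy
from tianshou.trainer import OffpolicyTrainer
from tianshou.utils.net.common import Net

env = gym.make("CartPole-v0")
train_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(4)])
test_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(2)])

net = Net(env.observation_space.shape, env.action_space.n, hidden_sizes=[64, 64])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.99, estimation_step=3, target_update_freq=100)

train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, 4), exploration_noise=True)
test_collector = Collector(policy, test_envs)

# The trainer is now iterable: each iteration runs one epoch and yields its stats.
trainer = OffpolicyTrainer(
    policy, train_collector, test_collector,
    max_epoch=3, step_per_epoch=1000, step_per_collect=8,
    episode_per_test=5, batch_size=64, update_per_step=0.1,
)
for epoch, epoch_stat, info in trainer:
    print(f"epoch {epoch}: {epoch_stat}")
```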
Enhancement
- Add imitation baselines for offline RL (#566)
0.4.6.post1
This release fixes the conda package publishing, supports more gym versions instead of only the newest one, and keeps internal API compatibility. See #536.
0.4.6
Bug Fix
- Fix casts to int by to_torch_as(...) calls in policies when using discrete actions (#521)
API Change
- Change venv internal API name of worker: send_action -> send, get_result -> recv (align with envpool) (#517)
New Features
- Add Intrinsic Curiosity Module (#503)
- Implement CQLPolicy and offline_cql example (#506)
- Add PettingZoo environment support (#494)
- Enable venvs.reset() concurrent execution (#517)
Enhancement
- Remove reset_buffer() from reset method (#501)
- Add Atari PPO example (#523, #529)
- Add VizDoom PPO example and results (#533)
- Upgrade gym version to >=0.21 (#534)
- Switch the Atari examples to use EnvPool by default (#534)
Documentation
- Update the DQN tutorial and add EnvPool to the docs (#526)
0.4.5
Bug Fix
- Fix tqdm issue (#481)
- Fix Atari wrapper to be deterministic (#467)
- Add writer.flush() in TensorboardLogger to ensure real-time logging results (#485)
Enhancement
- Implement set_env_attr and get_env_attr for vector environments (#478) (see the sketch after this list)
- Implement BCQPolicy and an offline_bcq example (#480)
- Enable test_collector=None in the 3 trainers to turn off testing during training (#485)
- Fix an inconsistency in the implementation of Discrete CRR: it now uses the Critic class for its critic, following the conventions in other actor-critic policies (#485)
- Update several offline policies to use the ActorCritic class for their optimizer, eliminating randomness caused by parameter sharing between actor and critic (#485)
- Move Atari offline RL examples to examples/offline and tests to test/offline (#485)
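A small illustration of the new vector-env attribute helpers, assuming the 0.4.5 API; the attribute name my_flag is hypothetical.

```python
import gym
from tianshou.env import DummyVectorEnv

envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(2)])
# set_env_attr broadcasts an attribute assignment to every worker;
# get_env_attr reads the attribute back from each of them.
envs.set_env_attr("my_flag", 42)       # "my_flag" is a made-up attribute
print(envs.get_env_attr("my_flag"))    # expected: [42, 42]
```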
0.4.4
API Change
- Add a new class DataParallelNet for multi-GPU training (#461)
- Add ActorCritic for deterministic parameter grouping for shared-head actor-critic networks (#458)
- collector.collect() now returns 4 extra keys: rew/rew_std/len/len_std (previously this was done in the logger) (#459)
- Rename WandBLogger -> WandbLogger (#441)
Bug Fix
- Fix logging in the Atari examples (#444)
Enhancement
0.4.3
Bug Fix
Enhancement
- Add Rainbow (#386)
- Add WandbLogger (#427)
- Add env_id in preprocess_fn (#391)
- Update README, add a new chart and bibtex (#406)
- Add a Makefile; you can now use make commit-checks to automatically perform almost all checks (#432)
- Add isort and yapf and apply them to the existing codebase (#432)
- Add spelling check via make spelling (#432)
- Update contributing.rst (#432)
0.4.2
Enhancement
- Add model-free DQN-family algorithms: IQN (#371), FQF (#376)
- Add model-free on-policy algorithms: NPG (#344, #347), TRPO (#337, #340)
- Add offline RL algorithms: CQL (#359), CRR (#367)
- Support deterministic evaluation for on-policy algorithms (#354)
- Make trainer resumable (#350)
- Support different state sizes and fix an exception in venv.__del__ (#352, #384)
- Add VizDoom example (#384)
- Add numerical analysis tool and interactive plot (#335, #341)
0.4.1
API Change
- Add observation normalization in BaseVectorEnv (norm_obs, obs_rms, update_obs_rms and RunningMeanStd) (#308) (see the sketch after this list)
- Add policy.map_action to bound the raw action (e.g., map from (-inf, inf) to [-1, 1] by clipping or tanh squashing); the mapped action is not stored in the replay buffer (#313)
- Add lr_scheduler in on-policy algorithms, typically for LambdaLR (#318)
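A minimal sketch of the observation-normalization switch, assuming the keyword names listed above are accepted by the vector-env constructors; the environment and settings are illustrative.

```python
import gym
from tianshou.env import DummyVectorEnv

# Observation normalization is handled by the vector env itself through a
# RunningMeanStd; update_obs_rms keeps the statistics updating during collection.
envs = DummyVectorEnv(
    [lambda: gym.make("CartPole-v0") for _ in range(2)],
    norm_obs=True,
    update_obs_rms=True,
)
obs = envs.reset()  # observations are normalized by the running statistics
```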
Note
To adapt to this version, change action_range=... to action_space=env.action_space in policy initialization (see the sketch below).
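For instance, a continuous-control policy that used to take action_range now receives the environment's action space. The sketch below assumes the 0.4.1 network helpers (Net, Actor, Critic) and uses illustrative hyperparameters; Pendulum-v0 is Pendulum-v1 on newer gym releases.

```python
import gym
import torch
from tianshou.policy import DDPGPolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import Actor, Critic

env = gym.make("Pendulum-v0")  # Pendulum-v1 on newer gym releases
net_a = Net(env.observation_space.shape, hidden_sizes=[64, 64])
actor = Actor(net_a, env.action_space.shape, max_action=float(env.action_space.high[0]))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)
net_c = Net(env.observation_space.shape, env.action_space.shape,
            hidden_sizes=[64, 64], concat=True)
critic = Critic(net_c)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Before 0.4.1: DDPGPolicy(..., action_range=(low, high))
# From 0.4.1 on: pass the gym action space instead.
policy = DDPGPolicy(actor, actor_optim, critic, critic_optim,
                    action_space=env.action_space)
```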
Bug Fix
- Fix incorrect behaviors with on-policy algorithms (error when n/ep == 0, and the reward shown in tqdm) (#306, #328)
- Fix q-value mask_action error for obs_next (#310)
Enhancement
- Release SOTA Mujoco benchmark (DDPG/TD3/SAC: #305, REINFORCE: #320, A2C: #325, PPO: #330) and add corresponding notes in /examples/mujoco/README.md
- Fix numpy>=1.20 typing issue (#323)
- Add cross-platform unittest (#331)
- Add a test on how to deal with finite environments (#324)
- Add value normalization in on-policy algorithms (#319, #321)
- Separate advantage normalization and value normalization in PPO (#329)
0.4.0
This release contains several API and behavior changes.
API Change
Buffer
- Add ReplayBufferManager, PrioritizedReplayBufferManager, VectorReplayBuffer, PrioritizedVectorReplayBuffer, CachedReplayBuffer (#278, #280);
- Change the buffer.add API from buffer.add(obs, act, rew, done, obs_next, info, policy, ...) to buffer.add(batch, buffer_ids) in order to add data more efficiently (#280) (see the sketch after this list);
- Add set_batch method in buffer (#278);
- Add sample_index method, same as sample but returning only the indices instead of both indices and batch data (#278);
- Add prev (one-step previous transition index), next (one-step next transition index) and unfinished_index (the last modified index whose done == False) (#278);
- Add internal method _alloc_by_keys_diff in batch to support newly appearing keys of any form (#280);
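A small sketch of the new add signature, assuming the 0.4.0 API; the transition values are dummies and the keys follow the old keyword list above.

```python
import numpy as np
from tianshou.data import Batch, ReplayBuffer

buf = ReplayBuffer(size=20)
# One transition is now passed as a single Batch instead of separate keyword
# arguments; buffer_ids is only needed for the *ReplayBufferManager variants.
buf.add(Batch(obs=np.zeros(4), act=0, rew=1.0, done=False,
              obs_next=np.ones(4), info={}))
indices = buf.sample_index(4)  # indices only, unlike sample()
batch = buf[indices]           # fetch the corresponding transitions
```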
Collector
- Rewrite the original Collector and split the async functionality into AsyncCollector: Collector only supports sync mode, while AsyncCollector supports both modes (#280);
- Drop collector.collect(n_episode=List[int]) because the new collector can collect episodes without bias (#280);
- Move reward_metric from Collector to trainer (#280);
- Change Collector.collect logic: AsyncCollector.collect keeps the previous semantics, where collect(n_step or n_episode) will not collect exactly n_step or n_episode transitions; Collector.collect(n_step or n_episode) now collects exactly n_step transitions or n_episode episodes (#280);
Policy
- Add policy.exploration_noise(action, batch) -> action method instead of implementing exploration noise inside policy.forward() (#280);
- Add Timelimit.truncate handler in compute_*_returns (#296);
- Remove ignore_done flag (#296);
- Remove reward_normalization option in off-policy algorithms (an Error will be raised if it is set to True) (#298);
Trainer
- Change collect_per_step to step_per_collect (#293) (see the sketch after this list);
- Add update_per_step and episode_per_collect (#293);
- onpolicy_trainer now supports either step-collect or episode-collect (#293)
- Add BasicLogger and LazyLogger to log data more conveniently (#295)
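A brief sketch of the renamed trainer arguments and the new logger, assuming the 0.4.0 names; the trainer call is kept as a comment because the policy/collector setup is elided.

```python
from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger

logger = BasicLogger(SummaryWriter("log/exp"))  # pass to a trainer via logger=...

# Renamed/new arguments (names from this release; surrounding setup elided):
# result = offpolicy_trainer(
#     policy, train_collector, test_collector,
#     max_epoch=10, step_per_epoch=1000,
#     step_per_collect=10, update_per_step=0.1,
#     episode_per_test=10, batch_size=64, logger=logger,
# )
```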