Releases: thu-ml/tianshou
0.4.7
Bug Fix
- Add map_action_inverse to fix the error of storing random actions (#568)
API Change
- Update the WandbLogger implementation and the Atari examples: use a Tensorboard SummaryWriter as the core and sync it with wandb.init(..., sync_tensorboard=True) (#558, #562) (see the sketch after this list)
- Rename save_fn to save_best_fn to avoid ambiguity (#575)
- (Internal) Add tianshou.utils.deprecation for a unified deprecation wrapper (#575)
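Under this setup, anything written through the Tensorboard SummaryWriter is mirrored to Weights & Biases. The sketch below illustrates only the mechanism, with a made-up project name and log directory; it is not the Atari example code.

```python
import wandb
from torch.utils.tensorboard import SummaryWriter

# Illustrative sketch: with sync_tensorboard=True, wandb mirrors every scalar
# written through the Tensorboard SummaryWriter. Project name, log directory,
# and the logged value are placeholders.
wandb.init(project="tianshou-demo", sync_tensorboard=True)
writer = SummaryWriter("log/demo")
writer.add_scalar("train/reward", 123.0, global_step=0)
writer.close()
wandb.finish()
```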
New Features
- Implement Generative Adversarial Imitation Learning (GAIL), add Mujoco examples (#550)
- Add trainers as generators: OnpolicyTrainer, OffpolicyTrainer, and OfflineTrainer; remove duplicated code and merge it into a base trainer (#559)
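A rough, self-contained sketch of the generator-style trainer on CartPole, assuming the 0.4.7 API; the environment choice and all hyperparameters are illustrative and not taken from the release.

```python
import gym
import torch
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import DQNPolicy
from tianshou.trainer import OffpolicyTrainer
from tianshou.utils.net.common import Net

env = gym.make("CartPole-v0")
train_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(4)])
test_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(2)])

net = Net(env.observation_space.shape, env.action_space.n, hidden_sizes=[64, 64])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.99, estimation_step=3, target_update_freq=100)

train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, 4), exploration_noise=True)
test_collector = Collector(policy, test_envs)

# The trainer is now iterable: each iteration runs one epoch and yields its stats.
trainer = OffpolicyTrainer(
    policy, train_collector, test_collector,
    max_epoch=3, step_per_epoch=1000, step_per_collect=8,
    episode_per_test=5, batch_size=64, update_per_step=0.1,
)
for epoch, epoch_stat, info in trainer:
    print(f"epoch {epoch}: {epoch_stat}")
```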
Enhancement
- Add imitation baselines for offline RL (#566)
0.4.6.post1
This release fixes the conda package publishing, supports more gym versions instead of only the newest one, and keeps internal API compatibility. See #536.
0.4.6
Bug Fix
- Fix casts to int by to_torch_as(...) calls in policies when using discrete actions (#521)
API Change
- Change venv internal API name of worker: send_action -> send, get_result -> recv (align with envpool) (#517)
New Features
- Add Intrinsic Curiosity Module (#503)
- Implement CQLPolicy and offline_cql example (#506)
- Add PettingZoo environment support (#494)
- Enable venvs.reset() concurrent execution (#517)
Enhancement
- Remove reset_buffer() from reset method (#501)
- Add Atari PPO example (#523, #529)
- Add VizDoom PPO example and results (#533)
- Upgrade gym version to >=0.21 (#534)
- Switch the Atari examples to use EnvPool by default (#534)
Documentation
- Update the DQN tutorial and add EnvPool to the docs (#526)
0.4.5
Bug Fix
- Fix tqdm issue (#481)
- Fix Atari wrapper to be deterministic (#467)
- Add writer.flush() in TensorboardLogger to ensure real-time logging results (#485)
Enhancement
- Implement set_env_attr and get_env_attr for vector environments (#478) (see the sketch after this list)
- Implement BCQPolicy and an offline_bcq example (#480)
- Enable test_collector=None in the 3 trainers to turn off testing during training (#485)
- Fix an inconsistency in the implementation of Discrete CRR: it now uses the Critic class for its critic, following the conventions in other actor-critic policies (#485)
- Update several offline policies to use the ActorCritic class for their optimizer, eliminating randomness caused by parameter sharing between actor and critic (#485)
- Move Atari offline RL examples to examples/offline and tests to test/offline (#485)
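A small illustration of the new vector-env attribute helpers, assuming the 0.4.5 API; the attribute name my_flag is hypothetical.

```python
import gym
from tianshou.env import DummyVectorEnv

envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(2)])
# set_env_attr broadcasts an attribute assignment to every worker;
# get_env_attr reads the attribute back from each of them.
envs.set_env_attr("my_flag", 42)       # "my_flag" is a made-up attribute
print(envs.get_env_attr("my_flag"))    # expected: [42, 42]
```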
0.4.4
API Change
- Add a new class DataParallelNet for multi-GPU training (#461)
- Add ActorCritic for deterministic parameter grouping for shared-head actor-critic networks (#458)
- collector.collect() now returns 4 extra keys: rew/rew_std/len/len_std (previously this was done in the logger) (#459)
- Rename WandBLogger -> WandbLogger (#441)
Bug Fix
- Fix logging in the Atari examples (#444)
Enhancement
0.4.3
Bug Fix
Enhancement
- Add Rainbow (#386)
- Add WandbLogger (#427)
- Add env_id in preprocess_fn (#391)
- Update README, add a new chart and bibtex (#406)
- Add a Makefile; you can now use make commit-checks to automatically perform almost all checks (#432)
- Add isort and yapf and apply them to the existing codebase (#432)
- Add spelling check via make spelling (#432)
- Update contributing.rst (#432)
0.4.2
Enhancement
- Add model-free DQN-family algorithms: IQN (#371), FQF (#376)
- Add model-free on-policy algorithms: NPG (#344, #347), TRPO (#337, #340)
- Add offline RL algorithms: CQL (#359), CRR (#367)
- Support deterministic evaluation for on-policy algorithms (#354)
- Make trainer resumable (#350)
- Support different state sizes and fix an exception in venv.__del__ (#352, #384)
- Add VizDoom example (#384)
- Add numerical analysis tool and interactive plot (#335, #341)
0.4.1
API Change
- Add observation normalization in BaseVectorEnv (norm_obs, obs_rms, update_obs_rms and RunningMeanStd) (#308) (see the sketch after this list)
- Add policy.map_action to bound the raw action (e.g., map from (-inf, inf) to [-1, 1] by clipping or tanh squashing); the mapped action is not stored in the replay buffer (#313)
- Add lr_scheduler in on-policy algorithms, typically for LambdaLR (#318)
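A minimal sketch of the observation-normalization switch, assuming the keyword names listed above are accepted by the vector-env constructors; the environment and settings are illustrative.

```python
import gym
from tianshou.env import DummyVectorEnv

# Observation normalization is handled by the vector env itself through a
# RunningMeanStd; update_obs_rms keeps the statistics updating during collection.
envs = DummyVectorEnv(
    [lambda: gym.make("CartPole-v0") for _ in range(2)],
    norm_obs=True,
    update_obs_rms=True,
)
obs = envs.reset()  # observations are normalized by the running statistics
```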
Note
To adapt to this version, change action_range=... to action_space=env.action_space in policy initialization (see the sketch below).
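For instance, a continuous-control policy that used to take action_range now receives the environment's action space. The sketch below assumes the 0.4.1 network helpers (Net, Actor, Critic) and uses illustrative hyperparameters; Pendulum-v0 is Pendulum-v1 on newer gym releases.

```python
import gym
import torch
from tianshou.policy import DDPGPolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import Actor, Critic

env = gym.make("Pendulum-v0")  # Pendulum-v1 on newer gym releases
net_a = Net(env.observation_space.shape, hidden_sizes=[64, 64])
actor = Actor(net_a, env.action_space.shape, max_action=float(env.action_space.high[0]))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)
net_c = Net(env.observation_space.shape, env.action_space.shape,
            hidden_sizes=[64, 64], concat=True)
critic = Critic(net_c)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Before 0.4.1: DDPGPolicy(..., action_range=(low, high))
# From 0.4.1 on: pass the gym action space instead.
policy = DDPGPolicy(actor, actor_optim, critic, critic_optim,
                    action_space=env.action_space)
```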
Bug Fix
- Fix incorrect behaviors with on-policy algorithms (error when n/ep == 0, and the reward shown in tqdm) (#306, #328)
- Fix q-value mask_action error for obs_next (#310)
Enhancement
- Release SOTA Mujoco benchmark (DDPG/TD3/SAC: #305, REINFORCE: #320, A2C: #325, PPO: #330) and add corresponding notes in /examples/mujoco/README.md
- Fix numpy>=1.20 typing issue (#323)
- Add cross-platform unittest (#331)
- Add a test on how to deal with finite environments (#324)
- Add value normalization in on-policy algorithms (#319, #321)
- Separate advantage normalization and value normalization in PPO (#329)
0.4.0
This release contains several API and behavior changes.
API Change
Buffer
- Add ReplayBufferManager, PrioritizedReplayBufferManager, VectorReplayBuffer, PrioritizedVectorReplayBuffer, CachedReplayBuffer (#278, #280);
- Change the buffer.add API from buffer.add(obs, act, rew, done, obs_next, info, policy, ...) to buffer.add(batch, buffer_ids) in order to add data more efficiently (#280) (see the sketch after this list);
- Add set_batch method in buffer (#278);
- Add sample_index method, same as sample but returning only the indices instead of both indices and batch data (#278);
- Add prev (one-step previous transition index), next (one-step next transition index) and unfinished_index (the last modified index whose done == False) (#278);
- Add internal method _alloc_by_keys_diff in batch to support newly appearing keys of any form (#280);
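A small sketch of the new add signature, assuming the 0.4.0 API; the transition values are dummies and the keys follow the old keyword list above.

```python
import numpy as np
from tianshou.data import Batch, ReplayBuffer

buf = ReplayBuffer(size=20)
# One transition is now passed as a single Batch instead of separate keyword
# arguments; buffer_ids is only needed for the *ReplayBufferManager variants.
buf.add(Batch(obs=np.zeros(4), act=0, rew=1.0, done=False,
              obs_next=np.ones(4), info={}))
indices = buf.sample_index(4)  # indices only, unlike sample()
batch = buf[indices]           # fetch the corresponding transitions
```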
Collector
- Rewrite the original Collector and split the async functionality into AsyncCollector: Collector only supports sync mode, while AsyncCollector supports both modes (#280);
- Drop collector.collect(n_episode=List[int]) because the new collector can collect episodes without bias (#280);
- Move reward_metric from Collector to trainer (#280);
- Change Collector.collect logic: AsyncCollector.collect keeps the previous semantics, where collect(n_step or n_episode) will not collect exactly n_step or n_episode transitions; Collector.collect(n_step or n_episode) now collects exactly n_step transitions or n_episode episodes (#280);
Policy
- Add policy.exploration_noise(action, batch) -> action method instead of implementing exploration noise inside policy.forward() (#280);
- Add Timelimit.truncate handler in compute_*_returns (#296);
- Remove ignore_done flag (#296);
- Remove reward_normalization option in off-policy algorithms (an Error will be raised if it is set to True) (#298);
Trainer
- Change collect_per_step to step_per_collect (#293) (see the sketch after this list);
- Add update_per_step and episode_per_collect (#293);
- onpolicy_trainer now supports either step-collect or episode-collect (#293)
- Add BasicLogger and LazyLogger to log data more conveniently (#295)
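A brief sketch of the renamed trainer arguments and the new logger, assuming the 0.4.0 names; the trainer call is kept as a comment because the policy/collector setup is elided.

```python
from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger

logger = BasicLogger(SummaryWriter("log/exp"))  # pass to a trainer via logger=...

# Renamed/new arguments (names from this release; surrounding setup elided):
# result = offpolicy_trainer(
#     policy, train_collector, test_collector,
#     max_epoch=10, step_per_epoch=1000,
#     step_per_collect=10, update_per_step=0.1,
#     episode_per_test=10, batch_size=64, logger=logger,
# )
```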