
Tianshou v2 #1259

Merged: 257 commits, Jul 15, 2025

Changes from all commits
8ec2023
v2: Restore high-level API support for DDPG and DeepQLearning
opcode81 Mar 4, 2025
e34d37b
v2: Set train mode on Algorithm instead of Policy
opcode81 Mar 4, 2025
f32e51b
v2: Adjust QRDQN and corresponding test, adapt test_dqn
opcode81 Mar 4, 2025
8d4e182
ActorFactoryDefault: Fix hidden sizes and activation not being passed…
opcode81 Mar 4, 2025
b63dcd5
v2: Adapt C51, test_c51 and atari_c51
opcode81 Mar 5, 2025
6eb3170
v2: Adapt A2C and test_a2c_with_il (skipping the il part)
opcode81 Mar 5, 2025
df9d4bc
v2: Adapt PPO and test_ppo
opcode81 Mar 5, 2025
60f19e1
v2: Adapt TD3 and test_td3
opcode81 Mar 5, 2025
9ef9a84
v2: Adapt NPG and test_npg
opcode81 Mar 5, 2025
2555eb8
v2: Fix class hierarchy issues: NPG now no longer inherits from A2C
opcode81 Mar 5, 2025
629905b
v2: Adapt TRPO and test_trpo
opcode81 Mar 5, 2025
571ce91
Add registration of log configuration callback to tianshou.__init__
opcode81 Mar 5, 2025
85d3a19
v2: Restore high-level API support for A2C, PPO, TRPO, NPG
opcode81 Mar 5, 2025
7f79994
Fix method reference (map_action_inverse)
opcode81 Mar 5, 2025
74f5956
v2: Restore high-level API support for TD3
opcode81 Mar 5, 2025
e36bbd6
v2: Adapt SAC and test_sac_with_il (without the il part)
opcode81 Mar 5, 2025
9b06cd9
v2: Use train mode for full Algorithm in update() [fix]
opcode81 Mar 5, 2025
d1fd07f
v2: Major refactoring of DDPG, TD3 and SAC
opcode81 Mar 6, 2025
228326d
v2: Adapt DiscreteSAC and test_discrete_sac
opcode81 Mar 6, 2025
a22c89a
v2: Refactor SAC and DiscreteSAC to use new abstractions for alpha ha…
opcode81 Mar 6, 2025
ad74776
AutoAlphaFactoryDefault: Differentiate discrete and continuous action…
opcode81 Mar 7, 2025
dc4dce2
v2: Adapt REDQ (and test_redq), fixing problematic inheritance from DDPG
opcode81 Mar 7, 2025
1d58866
v2: Adapt BranchingDuelingQNetwork (BDQN) and test_bdqn, adding a suc…
opcode81 Mar 7, 2025
3845905
v2: Adapt FQF and test_fqf
opcode81 Mar 7, 2025
cbc478e
v2: Adapt IQN and test_iqn
opcode81 Mar 7, 2025
fb28e47
v2: Adapt RainbowDQN and test_rainbow
opcode81 Mar 7, 2025
2d61f78
v2: Adapt ImitationLearning, test_a2c_with_il and test_sac_with_il
opcode81 Mar 7, 2025
d1f3962
v2: Rename Trainer classes, correcting capitalisation
opcode81 Mar 9, 2025
835792f
v2: Major refactoring of the Trainer classes
opcode81 Mar 8, 2025
31280fb
v2: Adapt BCQ and test_bcq
opcode81 Mar 8, 2025
319469b
v2: Move the functions gather_info and test_episode from trainer.util…
opcode81 Mar 10, 2025
928c7a7
v2: Trainer: Turn functions moved from trainer.util into methods with…
opcode81 Mar 10, 2025
4cc7aaa
Do not allow ruff to remove unused imports from __init__.py files
opcode81 Mar 10, 2025
9e2fa6e
v2: Trainer: Factorise/resolve inconsistencies in performance evaluat…
opcode81 Mar 10, 2025
9da2c00
v2: Trainer: Replace flawed gradient step counter with an update step…
opcode81 Mar 10, 2025
fd19577
v2: Move function for adding exploration noise from Algorithm to Policy,
opcode81 Mar 10, 2025
34965c7
v2: Adapt DiscreteCRR, test_discrete_crr and gather_cartpole_data
opcode81 Mar 11, 2025
5e5ee4e
v2: Adapt discrete/test_ppo
opcode81 Mar 11, 2025
3fb465d
v2: Adapt RandomActionPolicy
opcode81 Mar 11, 2025
5995429
v2: Adapt GAIL and test_gail
opcode81 Mar 11, 2025
09d8cf2
v2: Adapt DiscreteBCQ and test_bcq
opcode81 Mar 11, 2025
f47e97a
v2: Adapt CQL and test_cql
opcode81 Mar 11, 2025
bf85a58
v2: Introduce mixins for the handling of lagged networks (target netw…
opcode81 Mar 12, 2025
ffe9d8e
v2: Improve on-policy class hierarchy
opcode81 Mar 12, 2025
7a6c29d
v2: Adapt ICM, test_dqn_icm and test_ppo_icm
opcode81 Mar 12, 2025
ff5f889
v2: Adapt PSRL and test_psrl
opcode81 Mar 12, 2025
e4b1ba7
v2: Rename algorithms back to acronym-based name (DQN, BDQN)
opcode81 Mar 14, 2025
26e87b9
v2: Improve class hierarchy of deep Q-learning algorithms
opcode81 Mar 14, 2025
ed7a43a
v2: Adapt DiscreteCQL (using diamond inheritance to convert from off-…
opcode81 Mar 14, 2025
cec9a3c
v2: Adapt TD3BC (using diamond inheritance to convert from off-policy…
opcode81 Mar 14, 2025
00badd4
v2: Fix Algorithm updating/learning interface
opcode81 Mar 14, 2025
20231bd
v2: Adapt multi-agent RL algorithms
opcode81 Mar 14, 2025
bd98c3f
v2: Adapt Atari examples
opcode81 Mar 14, 2025
a82555c
v2: Use OptimizerFactory instances instead of torch.optim.Optimizer i…
opcode81 Mar 14, 2025
f20dbce
v2: Adapt high-level API (new optimizer handling, remaining adaptations)
opcode81 Mar 15, 2025
cd77724
v2: Adapt MuJoCo examples
opcode81 Mar 17, 2025
652c7f3
v2: Refactor vanilla imitation learning, separating off-policy from o…
opcode81 Mar 17, 2025
45bf412
v2: Adapt offline examples
opcode81 Mar 17, 2025
877023b
v2: Adapt test_drqn
opcode81 Mar 17, 2025
b17e03d
v2: Adapt remaining examples
opcode81 Mar 17, 2025
5aa8b3a
v2: Rename *TrainingConfig classes to *TrainerParams
opcode81 Mar 17, 2025
9bfdb89
v2: High-Level API: Replace SamplingConfig with *TrainingConfig
opcode81 Mar 17, 2025
ed23c11
v2: Change default of test_in_train from True to False,
opcode81 Mar 17, 2025
f0c160a
v2: Remove obsolete module utils.lr_scheduler with now unused class M…
opcode81 Mar 17, 2025
41049fd
v2: Fix typing issues in *WrapperAlgorithm
opcode81 Mar 18, 2025
33414aa
v2: Remove unused type-ignores
opcode81 Mar 18, 2025
a2958c7
v2: Fix typing issues in Trainer
opcode81 Mar 18, 2025
b855ab4
v2: Adapt test_policy and test_collector
opcode81 Mar 18, 2025
cd73407
v2: Fix some incorrect types, make mypy happier
opcode81 Mar 18, 2025
0c26b33
v2: AutoAlpha: Use optimizer factory and create the tensor internally
opcode81 Mar 18, 2025
c9a8e2f
v2: Improve Algorithm (formerly BasePolicy) method names
opcode81 Mar 18, 2025
d0d1f14
v2: Rename Actor classes to improve clarity
opcode81 Mar 18, 2025
dbd472f
v2: Rename Critic classes to improve clarity
opcode81 Mar 18, 2025
864d17e
v2: Fix device assignment issue #810
opcode81 Mar 19, 2025
df3c116
v2: Remove updating flag of Algorithm
opcode81 Mar 20, 2025
e3ba271
v2: Algorithms now internally use a wrapper (Algorithm.Optimizer) aro…
opcode81 Mar 20, 2025
c5455b9
v2: Improve handling of epsilon-greedy exploration for discrete Q-lea…
opcode81 Mar 20, 2025
2434b79
Merge remote-tracking branch 'thuml/dev-v1' into dev-v2
opcode81 Apr 22, 2025
80f1b21
Merge branch 'dev-v1' into dev-v2
opcode81 Apr 22, 2025
9f5e055
Fix return type
opcode81 Apr 22, 2025
5fceae6
Merge branch 'dev-v1' into dev-v2
opcode81 Apr 22, 2025
de1e8b7
v2: Fix mypy/typing issues
opcode81 May 2, 2025
7a72da9
v2: Clean up handling of modules that define attribute `output_dim`
opcode81 May 2, 2025
24d7a4a
v2: Update README: Algorithm abstraction, high-level example
opcode81 May 3, 2025
66e9e24
v2: Improve description of 'estimation_step'
opcode81 May 4, 2025
1c08dd8
v2: Consistently use `gamma` instead of `discount_factor` and improve…
opcode81 May 4, 2025
99e3c5a
v2: Improve docstrings for optimizer factories (and related/neighbour…
opcode81 May 4, 2025
41c4563
v2: Improve description of parameter 'tau'
opcode81 May 4, 2025
7e4d696
v2: Update references to parameters
opcode81 May 4, 2025
186178f
v2: Add/improve docstrings of algorithm base classes
opcode81 May 4, 2025
793297a
v2: Improve description of parameter 'gae_lambda'
opcode81 May 4, 2025
2796c5b
v2: Improve description of parameter 'actor_step_size'
opcode81 May 4, 2025
6a3646b
v2: Improve description of parameter 'max_batchsize'
opcode81 May 4, 2025
3fc22e9
v2: Improve description of parameter 'optim_critic_iters'
opcode81 May 4, 2025
d1b2e32
v2: Improve description of parameter 'dist_fn'
opcode81 May 4, 2025
2e7a3b9
v2: Standardise descriptions for 'action_space' and 'observation_space'
opcode81 May 4, 2025
785524c
v2: Improve descriptions of parameters 'action_scaling' and 'action_b…
opcode81 May 4, 2025
796022d
v2: Improve description of parameter 'deterministic_eval'
opcode81 May 4, 2025
a59c3b5
v2: Rename PGParams, PGExperimentBuilder -> Reinforce* (high-level API)
opcode81 May 4, 2025
039a04a
v2: Remove obsolete comments/docstrings
opcode81 May 4, 2025
91760fa
v2: policy_wrapper: duplicated call to instantiation, enabling more e…
MischaPanch May 4, 2025
fcdb5ad
v2: Actor: make get_preprocess_net always return ModuleWithVectorOutput
MischaPanch May 4, 2025
4025b82
v2: Imitation: removed unnecessary (and incorrect) generic TImitation…
MischaPanch May 4, 2025
f7f2dcd
v2: A bunch of small fixes in typing, param names and attribute refer…
MischaPanch May 4, 2025
b9c6774
Merge remote-tracking branch 'thuml/dev-v2' into dev-v2
MischaPanch May 4, 2025
336615f
Merge branch 'dev-v2' of github.com:thu-ml/tianshou into dev-v2
opcode81 May 4, 2025
84fa2bc
v2: Improve description of parameter eps_clip
opcode81 May 5, 2025
4006a02
Ignore .serena
opcode81 May 5, 2025
5f50547
v2: Improve description of parameter 'dual_clip'
opcode81 May 5, 2025
86c0a3a
v2: Improve description of parameter 'value_clip'
opcode81 May 5, 2025
0f226ee
v2: Improve description of parameter 'alpha'
opcode81 May 5, 2025
beb9ef8
v2: handle case of empty module parameters in torch_device
MischaPanch May 5, 2025
0293e06
v2: remove the generic-specification pattern for TrainingStats
MischaPanch May 5, 2025
3c94128
v2: removed mujoco-py from dependencies
MischaPanch May 5, 2025
5088ce4
v2: processing functions in Algorithm are now private and have strict…
MischaPanch May 5, 2025
3f8948b
v2: Clean up 'reward_normalization' parameter
opcode81 May 5, 2025
ab603d5
v2: Update parameter references (discount_factor -> gamma)
opcode81 May 5, 2025
abc7529
Merge branch 'dev-v1' into dev-v2
opcode81 May 5, 2025
e2f1ca2
Fix message assignment
opcode81 May 5, 2025
cf24f8a
Merge branch 'dev-v1' into dev-v2
opcode81 May 5, 2025
6556778
Merge branch 'dev-v1' into dev-v2
opcode81 May 5, 2025
9148647
Merge branch 'dev-v1' into dev-v2
opcode81 May 5, 2025
78d42fd
v2: Fix handling of eps in DiscreteBCQ (moved to policy, inheriting f…
opcode81 May 5, 2025
521123a
v2: Disable determinism tests by default (only on demand)
opcode81 May 5, 2025
c7fd661
v2: Update Algorithm method name references
opcode81 May 6, 2025
3107bd3
v2: Remove obsolete module utils.optim (superseded by Algorithm.Optim…
opcode81 May 6, 2025
d2c75eb
v2: Improve description of parameter 'max_grad_norm'
opcode81 May 6, 2025
26c588b
v2: Improve descriptions of parameters 'vf_coef' and 'ent_coef'
opcode81 May 6, 2025
e607b00
v2: Improve descriptions of ICM parameters
opcode81 May 6, 2025
225012f
v2: Complete removal of reward_normalization/return_scaling parameter…
opcode81 May 6, 2025
05e0d0d
v2: Improve description of parameter 'is_double'
opcode81 May 6, 2025
02b3f01
v2: DiscreteBCQ: Remove parameters 'is_double' and 'clip_loss_grad'
opcode81 May 6, 2025
3ed750c
v2: Improve description of parameter 'clip_loss_grad'
opcode81 May 6, 2025
1cf3d47
v2: Update DDPG parameter docstrings [addendum]
opcode81 May 6, 2025
eea6ba7
v2: Improve description of parameter 'policy_noise'
opcode81 May 6, 2025
650ae77
v2: Improve description of parameter 'update_actor_freq'
opcode81 May 6, 2025
11b2320
v2: Improve description of parameter 'noise_clip'
opcode81 May 6, 2025
aff12e6
v2: Improve descriptions of REDQ parameters
opcode81 May 7, 2025
ab49383
v2: Improve descriptions of CQL parameters
opcode81 May 7, 2025
ab15bfc
v2: Improve descriptions of GAIL parameters disc_*
opcode81 May 7, 2025
2e42c85
v2: Improve description of parameter 'num_quantiles'
opcode81 May 7, 2025
02c0318
v2: Improve description of parameter 'target_update_freq'
opcode81 May 7, 2025
a66396b
v2: Improve description of parameter 'add_done_loop'
opcode81 May 7, 2025
f459556
v2: Remove docstrings for removed parameters
opcode81 May 7, 2025
0404b79
v2: Fix/remove references to BasePolicy
opcode81 May 7, 2025
dbfb462
v2: Rename DDPGPolicy -> ContinuousDeterministicPolicy
opcode81 May 12, 2025
d5a5289
v2: Rename DQNPolicy -> DiscreteQLearningPolicy
opcode81 May 12, 2025
24532ba
v2: Fix references to SamplingConfig/sampling_config
opcode81 May 12, 2025
c347f40
v2: Update references to reward_normalization parameter in high-level…
opcode81 May 12, 2025
6639d9d
v2: Improve description of 'action_boun_method' in SACPolicy (largely…
opcode81 May 12, 2025
0db25dc
Merge branch 'dev-v1' into dev-v2
opcode81 May 12, 2025
65a16e7
Fix determinism test name
opcode81 May 12, 2025
8e593e5
v2: Fix parameter initialization in AutoAlpha
opcode81 May 13, 2025
829be23
Fix test name
opcode81 May 13, 2025
9b40321
v2: Rename test file
opcode81 May 13, 2025
ed72314
v2: Update docstring
opcode81 May 13, 2025
cd2c170
Merge branch 'dev-v1' into dev-v2
opcode81 May 13, 2025
495c4e6
Merge branch 'dev-v1' into dev-v2
opcode81 May 13, 2025
5833051
v2: Fix docstring
opcode81 May 14, 2025
cb24b86
Merge branch 'dev-v1' into dev-v2
opcode81 May 14, 2025
9878981
v2: Rainbow: Do not wrap model_old in EvalModeModuleWrapper
opcode81 May 14, 2025
66c0c86
v2: Fix LaggedNetworkCollection.full_parameter_update forcing target …
opcode81 May 14, 2025
40381eb
v2: NoisyLinear: Treat the noise parameters as parameters instead of …
opcode81 May 14, 2025
cd08323
Merge branch 'dev-v1' into dev-v2
opcode81 May 14, 2025
ad9a5f3
Merge branch 'dev-v1' into dev-v2
opcode81 May 14, 2025
aa23efb
v2: typing, use EvalModeModuleWrapper in annotations
MischaPanch May 14, 2025
dcd6f87
Merge branch 'dev-v2' of github.com:thu-ml/tianshou into dev-v2
MischaPanch May 14, 2025
4f9600a
v2: minor, typing
MischaPanch May 14, 2025
7e2ab50
v2: A2C: Fix gradient step counter not being incremented
opcode81 May 15, 2025
8b30028
v2: Fix docstring
opcode81 May 15, 2025
87c7fb5
v2: Improve change log
opcode81 May 15, 2025
17449c4
Merge branch 'dev-v1' into dev-v2
opcode81 May 15, 2025
a724abb
v1: minor rename
MischaPanch May 15, 2025
f5c5fce
v2: Add base classes/marker interfaces for actors
MischaPanch May 15, 2025
16eda46
v2: remove irrelevant action_bound param in SACPolicy
MischaPanch May 15, 2025
e4e7b75
v2: Change parameter clip_loss_grad to huber_loss_delta
opcode81 May 15, 2025
99d3204
v2: Rename package policy -> algorithm
opcode81 May 15, 2025
d88eff5
v2: Rename module algorithm.base -> algorithm.algorithm_base
opcode81 May 15, 2025
de18c80
v2: Rename module trainer.base -> trainer.trainer
opcode81 May 15, 2025
33a8235
Rename module mapolicy -> marl
opcode81 May 15, 2025
43be06b
Rename module policy_params -> algorithm_params
opcode81 May 15, 2025
1d03c1c
Rename module logger.base -> logger.logger_base
opcode81 May 15, 2025
3061e5b
Rename module buffer.base -> buffer.buffer_base
opcode81 May 15, 2025
d058d92
Rename module imitation.base -> imitation.imitation_base
opcode81 May 15, 2025
37e7f8d
Rename module env.worker.base -> env.worker.worker_base
opcode81 May 15, 2025
b593806
v2: Clean imports/apply formatter after renamings
opcode81 May 15, 2025
8f8ea0b
v2: Mention renamed packages and modules in change log
opcode81 May 15, 2025
29887a7
v2: Rename parameter estimation_step -> n_step_return_horizon (more p…
opcode81 May 15, 2025
59d5916
v2: BDQN: Remove parameter 'n_step_horizon' (formerly 'estimation_step')
opcode81 May 15, 2025
272e70f
v2: Rename 'pg' module and associated scripts to 'reinforce'
opcode81 May 16, 2025
d0c72e6
v2: Improve docstrings of actors
opcode81 May 16, 2025
ed195c1
v2: Fix import
opcode81 May 16, 2025
9728bc0
v2: Fix assertion (stats can be None)
opcode81 May 16, 2025
9ba76b7
v2: Rename trainer parameters:
opcode81 May 16, 2025
d2800cf
v2: Update test names for Reinforce
opcode81 May 16, 2025
839a929
v2: major refactoring - all Actors now follow a proper forward interface
MischaPanch May 16, 2025
6ae76cc
Merge remote-tracking branch 'origin/dev-v2' into dev-v2
MischaPanch May 16, 2025
5ced882
Minor fix in type and getattr (needs explicit None)
MischaPanch May 16, 2025
a4877ed
Relax determinism tests:
opcode81 May 16, 2025
ada0eaa
v2: Transfer recent algorithm parameter changes to high-level API
opcode81 May 16, 2025
b2fd31f
v2: Relax discrete_sac determinism test (to account for v1 inheritanc…
opcode81 May 16, 2025
1f9416f
Merge branch 'dev-v1' into dev-v2
opcode81 May 16, 2025
a06c6ae
v2: Establish backward compatibility with persisted v1 buffers
opcode81 May 16, 2025
fd3d194
v2: renamed many params
MischaPanch May 16, 2025
fb795b0
Merge branch 'dev-v2' of github.com:thu-ml/tianshou into dev-v2
MischaPanch May 16, 2025
90ec2bd
v1: Removed unused bash script
MischaPanch May 16, 2025
ba26f8b
v2: Replace hyphenated args in argparsers with snake case args
opcode81 May 16, 2025
ba32173
Merge branch 'dev-v2' of github.com:thu-ml/tianshou into dev-v2
opcode81 May 16, 2025
0fa36cd
v2: Better handling of max_action in actor
MischaPanch May 16, 2025
2ab33f4
v2: Rename trainer parameter reward_metric -> multi_agent_return_redu…
opcode81 May 16, 2025
cf1a34d
v2: block comment
MischaPanch May 16, 2025
91b9120
v2: minor, renamed kwarg
MischaPanch May 16, 2025
54c6eba
v2: Remove some TODOs
opcode81 May 16, 2025
0913e75
Merge branch 'dev-v2' of github.com:thu-ml/tianshou into dev-v2
opcode81 May 16, 2025
efa64a3
v2: Minor typefix
MischaPanch May 16, 2025
50ef0c9
v2: removed unused kwarg
MischaPanch May 16, 2025
2972b13
v2: Comments, typos, minor renaming
MischaPanch May 16, 2025
0312351
v2: add_exploration_noise - raise error on wrong type instead of doin…
MischaPanch May 16, 2025
2f95cec
v2: Remove TODOs
opcode81 May 16, 2025
0d38340
v2: Fix mypy issues
opcode81 May 17, 2025
d059481
v1: improvement in doc-build commands
MischaPanch May 17, 2025
04290c8
v2: docs - removed outdated documents, fixed remaning
MischaPanch May 17, 2025
0d622f3
v1: minor improvement in doc-build command
MischaPanch May 17, 2025
e9fe650
v2: minor fixes in docstrings, doc build runs through
MischaPanch May 17, 2025
62bd07f
v2: Fix import
opcode81 May 17, 2025
3ff9423
v2: Rename BranchingActor back to BranchingNet
opcode81 May 17, 2025
86fba86
Merge remote-tracking branch 'origin/dev-v2' into dev-v2
MischaPanch May 17, 2025
105043e
v2: Fix renamed class reference
opcode81 May 17, 2025
03132cc
v2: Improve neural network class hierarchy
opcode81 May 17, 2025
75cfaf0
v2: Add mock import for cv2 (used in atari_wrapper)
opcode81 May 19, 2025
0d8afaf
Merge branch 'dev-v1' into dev-v2
opcode81 May 19, 2025
5209218
v2: Fix argument references: test_num -> num_test_envs
opcode81 May 19, 2025
f68b6f5
v2: Disable determinism tests for CI
opcode81 May 19, 2025
87fbd6d
Merge branch 'dev-v1' into dev-v2
opcode81 May 19, 2025
c264917
Merge branch 'dev-v1' into dev-v2
opcode81 May 19, 2025
556224e
v2: Update parameter names (mainly test_num -> num_test_envs)
opcode81 May 19, 2025
d5960cf
v2: Fix logic error introduced in commit 03123510
opcode81 May 20, 2025
6c3abb0
v2: Handle nested algorithms in Algorithm.state_dict
opcode81 May 20, 2025
78d52ed
v2: Update identifier names (policy -> algorithm)
opcode81 May 20, 2025
989ecc6
v2: Rename hl module: policy_wrapper -> algorithm_wrapper
opcode81 May 20, 2025
3fb51cd
v2: HL: Move optim module to params package
opcode81 May 20, 2025
6ebb6de
v2: Add issue references
opcode81 May 20, 2025
9436396
Merge branch 'master' into dev-v2
opcode81 Jul 3, 2025
87cc6fa
v2: Set version to 2.0.0b1
opcode81 Jul 3, 2025
6bf4f03
v2: adjusted dqn.rst to reflect the new API
MischaPanch Jul 8, 2025
16270cb
v2: Docs. Improved concepts_rst, mentioned that parts of the docs are…
MischaPanch Jul 14, 2025
3d5ab5f
v2: Docs. Updated readme and concepts_rst to use v2 structure policies
MischaPanch Jul 14, 2025
4215eaf
v2: Changelog: Add information on changes pertaining to lagged networks
opcode81 Jul 14, 2025
8a49660
v2: Fix typo in docstring
opcode81 Jul 14, 2025
c74cc17
v2: Update examples in README
opcode81 Jul 14, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -160,5 +160,8 @@ docs/conf.py
/temp
/temp*.py

# Serena
/.serena

# determinism test snapshots
/test/resources/determinism/
265 changes: 263 additions & 2 deletions CHANGELOG.md

Large diffs are not rendered by default.

235 changes: 131 additions & 104 deletions README.md
@@ -12,7 +12,6 @@
1. Convenient high-level interfaces for applications of RL (training an implemented algorithm on a custom environment).
1. Large scope: online (on- and off-policy) and offline RL, experimental support for multi-agent RL (MARL), experimental support for model-based RL, and more


Unlike other reinforcement learning libraries, which may have complex codebases,
unfriendly high-level APIs, or may not be optimized for speed, Tianshou provides a high-performance, modularized framework
and user-friendly interfaces for building deep reinforcement learning agents. One more aspect that sets Tianshou apart is its
@@ -149,9 +148,11 @@ If no errors are reported, you have successfully installed Tianshou.

## Documentation

Tutorials and API documentation are hosted on [tianshou.readthedocs.io](https://tianshou.readthedocs.io/).
Find example scripts in the [test/]( https://github.com/thu-ml/tianshou/blob/master/test) and [examples/](https://github.com/thu-ml/tianshou/blob/master/examples) folders.

Find example scripts in the [test/](https://github.com/thu-ml/tianshou/blob/master/test) and [examples/](https://github.com/thu-ml/tianshou/blob/master/examples) folders.
Tutorials and API documentation are hosted on [tianshou.readthedocs.io](https://tianshou.readthedocs.io/).
**Important**: The documentation is currently being updated to reflect the changes in Tianshou v2.0.0. Not all features are documented yet, and some parts are outdated (they are marked as such). The documentation will be fully updated when
the v2.0.0 release is finalized.

## Why Tianshou?

@@ -180,20 +181,23 @@ Check out the [GitHub Actions](https://github.com/thu-ml/tianshou/actions) page

Atari and MuJoCo benchmark results can be found in the [examples/atari/](examples/atari/) and [examples/mujoco/](examples/mujoco/) folders respectively. **Our MuJoCo results reach or exceed the level of performance of most existing benchmarks.**

### Policy Interface
### Algorithm Abstraction

Reinforcement learning algorithms are built on abstractions for

- on-policy algorithms (`OnPolicyAlgorithm`),
- off-policy algorithms (`OffPolicyAlgorithm`), and
- offline algorithms (`OfflineAlgorithm`),

all of which clearly separate the core algorithm from the training process and the respective environment interactions.

All algorithms implement the following, highly general API:
In each case, implementing an algorithm requires only the implementation of methods for

- `__init__`: initialize the policy;
- `forward`: compute actions based on given observations;
- `process_buffer`: process initial buffer, which is useful for some offline learning algorithms
- `process_fn`: preprocess data from the replay buffer (since we have reformulated _all_ algorithms to replay buffer-based algorithms);
- `learn`: learn from a given batch of data;
- `post_process_fn`: update the replay buffer from the learning process (e.g., prioritized replay buffer needs to update the weight);
- `update`: the main interface for training, i.e., `process_fn -> learn -> post_process_fn`.
- pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
- updating model parameters based on an augmented batch of data (`_update_with_batch`).

The implementation of this API suffices for a new algorithm to be applicable within Tianshou,
making experimenation with new approaches particularly straightforward.
The implementation of these methods suffices for a new algorithm to be applicable within Tianshou,
making experimentation with new approaches particularly straightforward.
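
For illustration, the following is a small, self-contained sketch of what this two-method split looks like for a toy Q-learning update. It does not subclass the actual Tianshou base classes, and the signatures assumed for `_preprocess_batch` and `_update_with_batch` are illustrative only (the real interface may differ in detail):

```python
# Standalone sketch of the two-method structure described above (toy Q-learning).
# It does NOT use the real Tianshou base classes; the signatures of
# _preprocess_batch and _update_with_batch are assumptions made for illustration.
import torch


class ToyOffPolicyAlgorithm:
    def __init__(self, q_net: torch.nn.Module, lr: float = 1e-3, gamma: float = 0.99):
        self.q_net = q_net
        self.gamma = gamma
        self.optim = torch.optim.Adam(q_net.parameters(), lr=lr)

    def _preprocess_batch(self, batch: dict) -> dict:
        # Augment the sampled batch with the quantities needed for learning,
        # here simple 1-step TD targets.
        with torch.no_grad():
            next_q = self.q_net(batch["obs_next"]).max(dim=1).values
            batch["target"] = batch["rew"] + self.gamma * (1.0 - batch["done"]) * next_q
        return batch

    def _update_with_batch(self, batch: dict) -> float:
        # Update model parameters based on the augmented batch.
        q = self.q_net(batch["obs"]).gather(1, batch["act"].unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(q, batch["target"])
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.item()

    def update(self, batch: dict) -> float:
        # The surrounding trainer machinery would invoke these two steps per update.
        return self._update_with_batch(self._preprocess_batch(batch))
```

In the real library, the base classes additionally take care of concerns such as lagged/target networks, buffer interaction and trainer integration; the sketch only indicates where algorithm-specific code would live.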

## Quick Start

@@ -203,70 +207,68 @@ Tianshou provides two API levels:
- the procedural interface, which provides a maximum of control, especially for very advanced users and developers of reinforcement learning algorithms.

In the following, let us consider an example application using the _CartPole_ gymnasium environment.
We shall apply the deep Q network (DQN) learning algorithm using both APIs.
We shall apply the deep Q-network (DQN) learning algorithm using both APIs.

### High-Level API

To get started, we need some imports.

```python
from tianshou.highlevel.config import SamplingConfig
from tianshou.highlevel.env import (
    EnvFactoryRegistered,
    VectorEnvType,
)
from tianshou.highlevel.experiment import DQNExperimentBuilder, ExperimentConfig
from tianshou.highlevel.params.policy_params import DQNParams
from tianshou.highlevel.trainer import (
    EpochTestCallbackDQNSetEps,
    EpochTrainCallbackDQNSetEps,
    EpochStopCallbackRewardThreshold
)
```

In the high-level API, the basis for an RL experiment is an `ExperimentBuilder`
with which we can build the experiment we then seek to run.
Since we want to use DQN, we use the specialization `DQNExperimentBuilder`.
The other imports serve to provide configuration options for our experiment.

The high-level API provides largely declarative semantics, i.e. the code is
almost exclusively concerned with configuration that controls what to do
(rather than how to do it).

```python
from tianshou.highlevel.config import OffPolicyTrainingConfig
from tianshou.highlevel.env import (
    EnvFactoryRegistered,
    VectorEnvType,
)
from tianshou.highlevel.experiment import DQNExperimentBuilder, ExperimentConfig
from tianshou.highlevel.params.algorithm_params import DQNParams
from tianshou.highlevel.trainer import (
    EpochStopCallbackRewardThreshold,
)

experiment = (
    DQNExperimentBuilder(
        EnvFactoryRegistered(task="CartPole-v1", train_seed=0, test_seed=0, venv_type=VectorEnvType.DUMMY),
        ExperimentConfig(
            persistence_enabled=False,
            watch=True,
            watch_render=1 / 35,
            watch_num_episodes=100,
        ),
        SamplingConfig(
            num_epochs=10,
            step_per_epoch=10000,
            batch_size=64,
            num_train_envs=10,
            num_test_envs=100,
            buffer_size=20000,
            step_per_collect=10,
            update_per_step=1 / 10,
        ),
    )
    .with_dqn_params(
        DQNParams(
            lr=1e-3,
            discount_factor=0.9,
            estimation_step=3,
            target_update_freq=320,
        ),
    )
    .with_model_factory_default(hidden_sizes=(64, 64))
    .with_epoch_train_callback(EpochTrainCallbackDQNSetEps(0.3))
    .with_epoch_test_callback(EpochTestCallbackDQNSetEps(0.0))
    .with_epoch_stop_callback(EpochStopCallbackRewardThreshold(195))
    .build()
    DQNExperimentBuilder(
        EnvFactoryRegistered(
            task="CartPole-v1",
            venv_type=VectorEnvType.DUMMY,
            train_seed=0,
            test_seed=10,
        ),
        ExperimentConfig(
            persistence_enabled=False,
            watch=True,
            watch_render=1 / 35,
            watch_num_episodes=100,
        ),
        OffPolicyTrainingConfig(
            max_epochs=10,
            epoch_num_steps=10000,
            batch_size=64,
            num_train_envs=10,
            num_test_envs=100,
            buffer_size=20000,
            collection_step_num_env_steps=10,
            update_step_num_gradient_steps_per_sample=1 / 10,
        ),
    )
    .with_dqn_params(
        DQNParams(
            lr=1e-3,
            gamma=0.9,
            n_step_return_horizon=3,
            target_update_freq=320,
            eps_training=0.3,
            eps_inference=0.0,
        ),
    )
    .with_model_factory_default(hidden_sizes=(64, 64))
    .with_epoch_stop_callback(EpochStopCallbackRewardThreshold(195))
    .build()
)
experiment.run()
```
@@ -281,24 +283,25 @@ The experiment builder takes three arguments:
episodes (`watch_num_episodes=100`). We have disabled persistence, because
we do not want to save training logs, the agent or its configuration for
future use.
- the sampling configuration, which controls fundamental training parameters,
- the training configuration, which controls fundamental training parameters,
such as the total number of epochs we run the experiment for (`max_epochs=10`)
and the number of environment steps each epoch shall consist of
(`step_per_epoch=10000`).
(`epoch_num_steps=10000`).
Every epoch consists of a series of data collection (rollout) steps and
training steps.
The parameter `step_per_collect` controls the amount of data that is
The parameter `collection_step_num_env_steps` controls the amount of data that is
collected in each collection step; after each collection step, we
perform a training step, applying a gradient-based update based on a sample
of data (`batch_size=64`) taken from the buffer of data that has been
collected. For further details, see the documentation of `SamplingConfig`.
collected. For further details, see the documentation of the configuration class
(a short sanity check of the resulting schedule is given below).
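
To make this concrete, here is a rough calculation of the schedule these settings imply. This is a sketch only; the reading of `update_step_num_gradient_steps_per_sample` as the number of gradient steps per collected environment step is our interpretation of the description above, not an API call:

```python
# Back-of-the-envelope check of the training schedule implied by the settings above.
# The parameter semantics assumed here follow the textual description.
epoch_num_steps = 10_000
collection_step_num_env_steps = 10
update_step_num_gradient_steps_per_sample = 1 / 10

collection_steps_per_epoch = epoch_num_steps // collection_step_num_env_steps
gradient_steps_per_collection_step = (
    collection_step_num_env_steps * update_step_num_gradient_steps_per_sample
)
print(collection_steps_per_epoch)           # 1000 collection steps per epoch
print(gradient_steps_per_collection_step)   # 1.0 gradient step after each collection step
```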

We then proceed to configure some of the parameters of the DQN algorithm itself
and of the neural network model we want to use.
A DQN-specific detail is the use of callbacks to configure the algorithm's
epsilon parameter for exploration. We want to use random exploration during rollouts
(train callback), but we don't when evaluating the agent's performance in the test
environments (test callback).
We then proceed to configure some of the parameters of the DQN algorithm itself:
for instance, we control the epsilon parameter for exploration.
We want to use random exploration during rollouts for training (`eps_training`),
but not when evaluating the agent's performance in the test environments
(`eps_inference`).
Furthermore, we configure model parameters of the network for the Q function,
parametrising the sizes of the hidden layers of the default MLP factory.

Find the script in [examples/discrete/discrete_dqn_hl.py](examples/discrete/discrete_dqn_hl.py).
Here's a run (with the training time cut short):
@@ -309,15 +312,15 @@ Here's a run (with the training time cut short):

Find many further applications of the high-level API in the `examples/` folder;
look for scripts ending with `_hl.py`.
Note that most of these examples require the extra package `argparse`
Note that most of these examples require the extra `argparse`
(install it by adding `--extras argparse` when invoking poetry).

### Procedural API

Let us now consider an analogous example in the procedural API.
Find the full script in [examples/discrete/discrete_dqn.py](https://github.com/thu-ml/tianshou/blob/master/examples/discrete/discrete_dqn.py).

First, import some relevant packages:
First, import the relevant packages:

```python
import gymnasium as gym
@@ -326,7 +329,7 @@ from torch.utils.tensorboard import SummaryWriter
import tianshou as ts
```

Define some hyper-parameters:
Define hyper-parameters:

```python
task = 'CartPole-v1'
@@ -335,14 +338,13 @@ train_num, test_num = 10, 100
gamma, n_step, target_freq = 0.9, 3, 320
buffer_size = 20000
eps_train, eps_test = 0.1, 0.05
step_per_epoch, step_per_collect = 10000, 10
epoch_num_steps, collection_step_num_env_steps = 10000, 10
```

Initialize the logger:

```python
logger = ts.utils.TensorboardLogger(SummaryWriter('log/dqn'))
# For other loggers, see https://tianshou.readthedocs.io/en/master/01_tutorials/05_logger.html
```

Make environments:
@@ -353,53 +355,78 @@ train_envs = ts.env.DummyVectorEnv([lambda: gym.make(task) for _ in range(train_
test_envs = ts.env.DummyVectorEnv([lambda: gym.make(task) for _ in range(test_num)])
```

Create the network as well as its optimizer:
Create the network, policy, and algorithm:

```python
from tianshou.utils.net.common import Net
from tianshou.algorithm import DQN
from tianshou.algorithm.modelfree.dqn import DiscreteQLearningPolicy
from tianshou.algorithm.optim import AdamOptimizerFactory

# Note: You can easily define other networks.
# See https://tianshou.readthedocs.io/en/master/01_tutorials/00_dqn.html#build-the-network
env = gym.make(task, render_mode="human")
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
net = Net(state_shape=state_shape, action_shape=action_shape, hidden_sizes=[128, 128, 128])
optim = torch.optim.Adam(net.parameters(), lr=lr)
```

Set up the policy and collectors:
net = Net(
    state_shape=state_shape, action_shape=action_shape,
    hidden_sizes=[128, 128, 128]
)

```python
policy = ts.policy.DQNPolicy(
policy = DiscreteQLearningPolicy(
    model=net,
    optim=optim,
    discount_factor=gamma,
    action_space=env.action_space,
    estimation_step=n_step,
    eps_training=eps_train,
    eps_inference=eps_test
)

# Create the algorithm with the policy and optimizer factory
algorithm = DQN(
    policy=policy,
    optim=AdamOptimizerFactory(lr=lr),
    gamma=gamma,
    n_step_return_horizon=n_step,
    target_update_freq=target_freq
)
train_collector = ts.data.Collector(policy, train_envs, ts.data.VectorReplayBuffer(buffer_size, train_num), exploration_noise=True)
test_collector = ts.data.Collector(policy, test_envs, exploration_noise=True) # because DQN uses epsilon-greedy method
```

Let's train it:
Set up the collectors:

```python
result = ts.trainer.OffpolicyTrainer(
    policy=policy,
train_collector = ts.data.Collector(policy, train_envs,
    ts.data.VectorReplayBuffer(buffer_size, train_num), exploration_noise=True)
test_collector = ts.data.Collector(policy, test_envs,
    exploration_noise=True)  # because DQN uses epsilon-greedy method
```

Let's train it using the algorithm:

```python
from tianshou.highlevel.config import OffPolicyTrainingConfig

# Create training configuration
training_config = OffPolicyTrainingConfig(
    max_epochs=epoch,
    epoch_num_steps=epoch_num_steps,
    batch_size=batch_size,
    num_train_envs=train_num,
    num_test_envs=test_num,
    buffer_size=buffer_size,
    collection_step_num_env_steps=collection_step_num_env_steps,
    update_step_num_gradient_steps_per_sample=1 / collection_step_num_env_steps,
    test_step_num_episodes=test_num,
)

# Run training (trainer is created automatically by the algorithm)
result = algorithm.run_training(
    training_config=training_config,
    train_collector=train_collector,
    test_collector=test_collector,
    max_epoch=epoch,
    step_per_epoch=step_per_epoch,
    step_per_collect=step_per_collect,
    episode_per_test=test_num,
    batch_size=batch_size,
    update_per_step=1 / step_per_collect,
    logger=logger,
    train_fn=lambda epoch, env_step: policy.set_eps(eps_train),
    test_fn=lambda epoch, env_step: policy.set_eps(eps_test),
    stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,
    logger=logger,
).run()
)
print(f"Finished training in {result.timing.total_time} seconds")
```
