
Add multi-agent example: tic-tac-toe #122

Merged
merged 127 commits on Jul 21, 2020
Commits
6e2a582
make fileds with empty Batch rather than None after reset
youkaichao Jul 10, 2020
2a2b887
dummy code
youkaichao Jul 10, 2020
cf32249
remove dummy
youkaichao Jul 10, 2020
62ac1d3
add reward_length argument for collector
youkaichao Jul 10, 2020
e976d74
Improve Batch (#126)
youkaichao Jul 11, 2020
e1322c4
bugfix for reward_length
youkaichao Jul 11, 2020
bc33b7c
add get_final_reward_fn argument to collector to deal with marl
youkaichao Jul 11, 2020
ddbaef4
minor polish
Trinkle23897 Jul 11, 2020
1d5058b
Merge branch 'dev' into collector
Trinkle23897 Jul 11, 2020
306dd68
remove multibuf
Trinkle23897 Jul 11, 2020
3328252
minor polish
Trinkle23897 Jul 11, 2020
6fe3ac5
improve and implement Batch.cat_
youkaichao Jul 11, 2020
f34b0f8
bugfix for buffer.sample with field impt_weight
youkaichao Jul 11, 2020
eba3b36
restore the usage of a.cat_(b)
youkaichao Jul 11, 2020
42ee76d
fix 2 bugs in batch and add corresponding unittest
Trinkle23897 Jul 11, 2020
7e42b76
code fix for update
youkaichao Jul 11, 2020
d6d0a20
update is_empty to recognize empty over empty; bugfix for len
youkaichao Jul 11, 2020
988f0f3
bugfix for update and add testcase
youkaichao Jul 11, 2020
1c27de8
add testcase of update
youkaichao Jul 11, 2020
7d90c7f
make fileds with empty Batch rather than None after reset
youkaichao Jul 10, 2020
0a6a74c
dummy code
youkaichao Jul 10, 2020
336a300
remove dummy
youkaichao Jul 10, 2020
ab34c2b
add reward_length argument for collector
youkaichao Jul 10, 2020
7034674
bugfix for reward_length
youkaichao Jul 11, 2020
44abe88
add get_final_reward_fn argument to collector to deal with marl
youkaichao Jul 11, 2020
2688d6e
make sure the key type of Batch is string, and add unit tests
youkaichao Jul 10, 2020
ecb77d4
add is_empty() function and unit tests
youkaichao Jul 10, 2020
9ac35e5
enable cat of mixing dict and Batch, just like stack
youkaichao Jul 10, 2020
2c1478e
dummy code
youkaichao Jul 10, 2020
adeafec
remove dummy
youkaichao Jul 10, 2020
39f9397
add multi-agent example: tic-tac-toe
youkaichao Jul 9, 2020
79c1a56
move TicTacToeEnv to a separate file
youkaichao Jul 9, 2020
0413df0
remove dummy MANet
youkaichao Jul 9, 2020
19a3971
code refactor
youkaichao Jul 9, 2020
2b19b1f
move tic-tac-toe example to test
youkaichao Jul 9, 2020
1f45783
update doc with marl-example
youkaichao Jul 9, 2020
2b2a7ef
fix docs
Trinkle23897 Jul 10, 2020
2941161
reduce the threshold
Trinkle23897 Jul 10, 2020
c221bc4
revert
Trinkle23897 Jul 10, 2020
64edf55
update player id to start from 1 and change player to agent; keep coding
youkaichao Jul 10, 2020
7c5dbf6
add reward_length argument for collector
youkaichao Jul 10, 2020
d1a2037
Improve Batch (#128)
youkaichao Jul 11, 2020
4ed64fd
Merge branch 'dev' into collector
Trinkle23897 Jul 11, 2020
c519af8
refact
Trinkle23897 Jul 11, 2020
aefcdfc
re-implement Batch.stack and add testcases
youkaichao Jul 11, 2020
b3ab3fe
add doc for Batch.stack
youkaichao Jul 11, 2020
9c4eb51
reward_metric
Trinkle23897 Jul 11, 2020
b6beb67
modify flag
Trinkle23897 Jul 11, 2020
5ce2692
minor fix
youkaichao Jul 11, 2020
f7d7482
reuse _create_values and refactor stack_ & cat_
youkaichao Jul 11, 2020
6bf23c7
fix pep8
youkaichao Jul 11, 2020
39c1b39
fix reward stat in collector
Trinkle23897 Jul 12, 2020
96ee017
fix stat of collector, simplify test/base/env.py
Trinkle23897 Jul 12, 2020
f434a26
fix docs
Trinkle23897 Jul 12, 2020
20ea6a1
minor fix
Trinkle23897 Jul 12, 2020
a78d4ac
raise exception for stacking with partial keys and axis!=0
youkaichao Jul 12, 2020
1f08c4b
minor fix
Trinkle23897 Jul 12, 2020
69181d8
minor fix
Trinkle23897 Jul 12, 2020
c187390
minor fix
Trinkle23897 Jul 12, 2020
6436f9d
marl-examples
youkaichao Jul 11, 2020
e64d10c
Merge branch 'stack' into marl-example
youkaichao Jul 12, 2020
f6177b6
add condense; bugfix for torch.Tensor; code refactor
youkaichao Jul 12, 2020
9277b30
Merge branch 'collector' into marl-example
youkaichao Jul 12, 2020
0dd6a8a
marl example can run now
youkaichao Jul 12, 2020
81d0038
enable tic tac toe with larger board size and win-size
youkaichao Jul 12, 2020
741a255
add test dependency
youkaichao Jul 12, 2020
a55ad33
Fix padding of inconsistent keys with Batch.stack and Batch.cat (#130)
youkaichao Jul 12, 2020
75a292c
Merge branch 'dev' into marl-example
Trinkle23897 Jul 12, 2020
22c5fb2
stash
youkaichao Jul 12, 2020
70b6d06
let agent learn to play as agent 2 which is harder
youkaichao Jul 12, 2020
4230241
code refactor
youkaichao Jul 12, 2020
92f6031
Merge remote-tracking branch 'origin/marl-example' into marl-example
youkaichao Jul 12, 2020
885fbc1
Improve collector (#125)
youkaichao Jul 12, 2020
a42572a
Merge remote-tracking branch 'original/dev' into marl-example
youkaichao Jul 12, 2020
7fe79ae
marl for tic-tac-toe and general gomoku
youkaichao Jul 12, 2020
075ec73
update default gamma to 0.1 for tic tac toe to win earlier
youkaichao Jul 13, 2020
767ef99
fix name typo; change default game config; add rew_norm option
youkaichao Jul 13, 2020
ed0340c
fix pep8
youkaichao Jul 13, 2020
0468f96
test commit
Trinkle23897 Jul 13, 2020
4e27789
mv test dir name
Trinkle23897 Jul 13, 2020
f9ae704
add rew flag
Trinkle23897 Jul 13, 2020
d50278c
fix torch.optim import error and madqn rew_norm
Trinkle23897 Jul 13, 2020
3710b33
remove useless kwargs
youkaichao Jul 13, 2020
cee8088
Vector env enable select worker (#132)
duburcqa Jul 13, 2020
ea53c37
Merge branch 'dev' into marl-example
Trinkle23897 Jul 13, 2020
af86db4
show the last move of tictactoe by capital letters
youkaichao Jul 14, 2020
f390778
add multi-agent tutorial
youkaichao Jul 14, 2020
da99922
Merge remote-tracking branch 'origin/marl-example' into marl-example
youkaichao Jul 14, 2020
414430f
fix link
youkaichao Jul 14, 2020
f8ad6df
Standardized behavior of Batch.cat and misc code refactor (#137)
youkaichao Jul 16, 2020
fa542f8
write tutorials to specify the standard of Batch (#142)
youkaichao Jul 19, 2020
50e5992
Merge branch 'dev' into marl-example
youkaichao Jul 19, 2020
a458dcc
bugfix for mapolicy
youkaichao Jul 19, 2020
762c57e
pretty code
youkaichao Jul 19, 2020
8cf0d13
remove debug code; remove condense
youkaichao Jul 19, 2020
44624dd
doc fix
youkaichao Jul 19, 2020
c4f7311
check before get_agents in tutorials/tictactoe
Trinkle23897 Jul 19, 2020
7eb2f18
tutorial
Trinkle23897 Jul 19, 2020
13f7834
fix
Trinkle23897 Jul 19, 2020
d3985d5
minor fix for batch doc
Trinkle23897 Jul 19, 2020
3b50fcb
minor polish
Trinkle23897 Jul 19, 2020
65e2064
faster test_ttt
Trinkle23897 Jul 20, 2020
954b4f9
improve tic-tac-toe environment
youkaichao Jul 20, 2020
b052bf0
change default epoch and step-per-epoch for tic-tac-toe
youkaichao Jul 20, 2020
e12749f
fix mapolicy
Trinkle23897 Jul 20, 2020
6065166
minor polish for mapolicy
Trinkle23897 Jul 20, 2020
78cc1d5
90% to 80% (need to change the tutorial)
Trinkle23897 Jul 20, 2020
6052792
win rate
Trinkle23897 Jul 20, 2020
2757299
show step number at board
youkaichao Jul 20, 2020
2df6eaa
Merge branch 'dev' into marl-example
Trinkle23897 Jul 20, 2020
35be0e8
simplify mapolicy
Trinkle23897 Jul 20, 2020
721364f
minor polish for mapolicy
Trinkle23897 Jul 20, 2020
a0c8d60
remove MADQN
Trinkle23897 Jul 20, 2020
58485ec
fix pep8
Trinkle23897 Jul 20, 2020
4a44517
change legal_actions to mask (need to update docs)
Trinkle23897 Jul 20, 2020
4b73651
simplify maenv
Trinkle23897 Jul 20, 2020
43e908e
fix typo
Trinkle23897 Jul 20, 2020
ed61a6a
Merge branch 'dev' into marl-example
Trinkle23897 Jul 20, 2020
6141459
move basevecenv to single file
Trinkle23897 Jul 21, 2020
6dab46f
separate RandomAgent
Trinkle23897 Jul 21, 2020
bbd89fd
update docs
Trinkle23897 Jul 21, 2020
b0d8f64
grammarly
Trinkle23897 Jul 21, 2020
ae067ea
fix pep8
Trinkle23897 Jul 21, 2020
2017e27
win rate typo
Trinkle23897 Jul 21, 2020
267340f
format in cheatsheet
Trinkle23897 Jul 21, 2020
3c76bd2
use bool mask directly
youkaichao Jul 21, 2020
a04ddf6
update doc for boolean mask
youkaichao Jul 21, 2020
2 changes: 2 additions & 0 deletions .gitignore
@@ -143,3 +143,5 @@ MUJOCO_LOG.TXT
*.pth
.vscode/
.DS_Store
*.zip
*.pstats
1 change: 1 addition & 0 deletions README.md
@@ -38,6 +38,7 @@ Here is Tianshou's other features:
- Support any type of environment state (e.g. a dict, a self-defined class, ...) [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#user-defined-environment-and-different-state-representation)
- Support customized training process [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#customize-training-process)
- Support n-step returns estimation for all Q-learning based algorithms
- Support multi-agent RL easily [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html##multi-agent-reinforcement-learning)

In Chinese, Tianshou means divinely ordained and is derived from "the gift of being born with". Tianshou is a reinforcement learning platform, and the RL algorithm does not learn from humans. So taking "Tianshou" means that there is no teacher to study with, but rather learning by oneself through constant interaction with the environment.

Binary file added docs/_static/images/marl.png
Binary file added docs/_static/images/tic-tac-toe.png
1 change: 1 addition & 0 deletions docs/contributor.rst
@@ -6,3 +6,4 @@ We always welcome contributions to help make Tianshou better. Below are an incom
* Jiayi Weng (`Trinkle23897 <https://github.com/Trinkle23897>`_)
* Minghao Zhang (`Mehooz <https://github.com/Mehooz>`_)
* Alexis Duburcq (`duburcqa <https://github.com/duburcqa>`_)
* Kaichao You (`youkaichao <https://github.com/youkaichao>`_)
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ Here is Tianshou's other features:
* Support any type of environment state (e.g. a dict, a self-defined class, ...): :ref:`self_defined_env`
* Support customized training process: :ref:`customize_training`
* Support n-step returns estimation :meth:`~tianshou.policy.BasePolicy.compute_nstep_return` for all Q-learning based algorithms
* Support multi-agent RL easily (a tutorial is available at :doc:`/tutorials/tictactoe`)

The Chinese documentation is available at https://tianshou.readthedocs.io/zh/latest/

@@ -71,6 +72,7 @@ Tianshou is still under development, you can also check out the documents in sta
tutorials/dqn
tutorials/concepts
tutorials/batch
tutorials/tictactoe
tutorials/trick
tutorials/cheatsheet

43 changes: 43 additions & 0 deletions docs/tutorials/cheatsheet.rst
@@ -244,3 +244,46 @@ But the state stored in the buffer may be a shallow-copy. To make sure each of y
def step(a):
...
return copy.deepcopy(self.graph), reward, done, {}

.. _marl_example:

Multi-Agent Reinforcement Learning
----------------------------------

This is related to `Issue 121 <https://github.com/thu-ml/tianshou/issues/121>`_. The discussion is still ongoing.

With its flexible core APIs, Tianshou can support multi-agent reinforcement learning with minimal effort.

Currently, we support three types of multi-agent reinforcement learning paradigms:

1. Simultaneous move: at each timestep, all the agents take their actions (example: MOBA games)

2. Cyclic move: players take actions in turn (example: Go)

3. Conditional move: at each timestep, the environment conditionally selects an agent to take an action (example: `Pig Game <https://en.wikipedia.org/wiki/Pig_(dice_game)>`_)

We mainly address these multi-agent RL problems by converting them into traditional RL formulations.

For simultaneous moves, the solution is simple: we can just add a ``num_agent`` dimension to state, action, and reward. Nothing else needs to change.
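
A minimal sketch of this idea (the array shapes and the ``num_agents`` variable below are hypothetical, for illustration only, and not part of Tianshou's API):
::

    import numpy as np

    num_agents = 2
    obs = np.zeros((num_agents, 4))    # one observation row per agent
    act = np.array([1, 3])             # one action per agent
    rew = np.array([0.0, 1.0])         # one reward per agent
    # env.step(act) would return obs, rew, done, info batched over the agent axis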

Cases 2 and 3 (cyclic move and conditional move) can be unified into a single framework: at each timestep, the environment selects an agent identified by ``agent_id`` to play. Since the multiple agents are usually wrapped into one object (which we call the "abstract agent"), we can pass ``agent_id`` to the abstract agent and leave it to dispatch the call to the specific agent.

In addition, legal actions in multi-agent RL often vary across timesteps (as in Go), so the environment should also pass a legal-action mask to the "abstract agent". The mask is a boolean array in which ``True`` marks actions that are available and ``False`` marks actions that are illegal at the current step. The figure below illustrates the abstract agent.

.. image:: /_static/images/marl.png
:align: center
:height: 300

The above description gives rise to the following formulation of multi-agent RL:
::

action = policy(state, agent_id, mask)
(next_state, next_agent_id, next_mask), reward = env.step(action)
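
For intuition, here is a tiny sketch of how ``policy(state, agent_id, mask)`` can be realized by an abstract agent that routes each call to the concrete agent whose turn it is (the ``AbstractAgent`` class and the ``policies`` dict below are hypothetical, for illustration only):
::

    class AbstractAgent:
        """Dispatch each observation to the agent selected by the environment."""
        def __init__(self, policies):
            self.policies = policies                 # e.g. {1: dqn_agent, 2: random_agent}

        def __call__(self, state, agent_id, mask):
            agent = self.policies[agent_id]          # pick the acting agent
            return agent(state, mask)                # the mask rules out illegal actions

Calling ``AbstractAgent({1: agent_1, 2: agent_2})`` then matches the ``policy(state, agent_id, mask)`` interface above.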

By constructing a new state ``state_ = (state, agent_id, mask)``, we essentially return to the typical single-agent formulation of RL:
::

action = policy(state_)
next_state_, reward = env.step(action)
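
A self-contained sketch of an environment that returns this combined state as a dict observation (the class and its fields are hypothetical and are not the actual tic-tac-toe environment shipped with Tianshou):
::

    import numpy as np

    class TurnBasedEnv:
        """Toy 2-player environment whose observation bundles obs, agent_id and mask."""
        def __init__(self, num_actions=9):
            self.num_actions = num_actions
            self.reset()

        def reset(self):
            self.board = np.zeros(self.num_actions, dtype=int)
            self.agent_id = 1                        # agents are numbered 1 and 2
            return self._observe()

        def _observe(self):
            return {
                'obs': self.board.copy(),            # raw state for the acting agent
                'agent_id': self.agent_id,           # whose turn it is
                'mask': self.board == 0,             # True for legal actions
            }

        def step(self, action):
            self.board[action] = self.agent_id       # apply the current agent's move
            done = bool((self.board != 0).all())
            reward = 0.0                             # a real game would score wins here
            self.agent_id = 3 - self.agent_id        # switch turns: 1 <-> 2
            return self._observe(), reward, done, {}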

Following this idea, we write a tiny example of playing `Tic Tac Toe <https://en.wikipedia.org/wiki/Tic-tac-toe>`_ against a random player using a Q-learning algorithm. The tutorial is at :doc:`/tutorials/tictactoe`.
2 changes: 1 addition & 1 deletion docs/tutorials/dqn.rst
@@ -88,7 +88,7 @@ We use the defined ``net`` and ``optim``, with extra policy hyper-parameters, to

policy = ts.policy.DQNPolicy(net, optim,
discount_factor=0.9, estimation_step=3,
use_target_network=True, target_update_freq=320)
target_update_freq=320)


Setup Collector