
Add multi-agent example: tic-tac-toe #122

Merged
merged 127 commits on Jul 21, 2020
Commits
6e2a582
make fileds with empty Batch rather than None after reset
youkaichao Jul 10, 2020
2a2b887
dummy code
youkaichao Jul 10, 2020
cf32249
remove dummy
youkaichao Jul 10, 2020
62ac1d3
add reward_length argument for collector
youkaichao Jul 10, 2020
e976d74
Improve Batch (#126)
youkaichao Jul 11, 2020
e1322c4
bugfix for reward_length
youkaichao Jul 11, 2020
bc33b7c
add get_final_reward_fn argument to collector to deal with marl
youkaichao Jul 11, 2020
ddbaef4
minor polish
Trinkle23897 Jul 11, 2020
1d5058b
Merge branch 'dev' into collector
Trinkle23897 Jul 11, 2020
306dd68
remove multibuf
Trinkle23897 Jul 11, 2020
3328252
minor polish
Trinkle23897 Jul 11, 2020
6fe3ac5
improve and implement Batch.cat_
youkaichao Jul 11, 2020
f34b0f8
bugfix for buffer.sample with field impt_weight
youkaichao Jul 11, 2020
eba3b36
restore the usage of a.cat_(b)
youkaichao Jul 11, 2020
42ee76d
fix 2 bugs in batch and add corresponding unittest
Trinkle23897 Jul 11, 2020
7e42b76
code fix for update
youkaichao Jul 11, 2020
d6d0a20
update is_empty to recognize empty over empty; bugfix for len
youkaichao Jul 11, 2020
988f0f3
bugfix for update and add testcase
youkaichao Jul 11, 2020
1c27de8
add testcase of update
youkaichao Jul 11, 2020
7d90c7f
make fileds with empty Batch rather than None after reset
youkaichao Jul 10, 2020
0a6a74c
dummy code
youkaichao Jul 10, 2020
336a300
remove dummy
youkaichao Jul 10, 2020
ab34c2b
add reward_length argument for collector
youkaichao Jul 10, 2020
7034674
bugfix for reward_length
youkaichao Jul 11, 2020
44abe88
add get_final_reward_fn argument to collector to deal with marl
youkaichao Jul 11, 2020
2688d6e
make sure the key type of Batch is string, and add unit tests
youkaichao Jul 10, 2020
ecb77d4
add is_empty() function and unit tests
youkaichao Jul 10, 2020
9ac35e5
enable cat of mixing dict and Batch, just like stack
youkaichao Jul 10, 2020
2c1478e
dummy code
youkaichao Jul 10, 2020
adeafec
remove dummy
youkaichao Jul 10, 2020
39f9397
add multi-agent example: tic-tac-toe
youkaichao Jul 9, 2020
79c1a56
move TicTacToeEnv to a separate file
youkaichao Jul 9, 2020
0413df0
remove dummy MANet
youkaichao Jul 9, 2020
19a3971
code refactor
youkaichao Jul 9, 2020
2b19b1f
move tic-tac-toe example to test
youkaichao Jul 9, 2020
1f45783
update doc with marl-example
youkaichao Jul 9, 2020
2b2a7ef
fix docs
Trinkle23897 Jul 10, 2020
2941161
reduce the threshold
Trinkle23897 Jul 10, 2020
c221bc4
revert
Trinkle23897 Jul 10, 2020
64edf55
update player id to start from 1 and change player to agent; keep coding
youkaichao Jul 10, 2020
7c5dbf6
add reward_length argument for collector
youkaichao Jul 10, 2020
d1a2037
Improve Batch (#128)
youkaichao Jul 11, 2020
4ed64fd
Merge branch 'dev' into collector
Trinkle23897 Jul 11, 2020
c519af8
refact
Trinkle23897 Jul 11, 2020
aefcdfc
re-implement Batch.stack and add testcases
youkaichao Jul 11, 2020
b3ab3fe
add doc for Batch.stack
youkaichao Jul 11, 2020
9c4eb51
reward_metric
Trinkle23897 Jul 11, 2020
b6beb67
modify flag
Trinkle23897 Jul 11, 2020
5ce2692
minor fix
youkaichao Jul 11, 2020
f7d7482
reuse _create_values and refactor stack_ & cat_
youkaichao Jul 11, 2020
6bf23c7
fix pep8
youkaichao Jul 11, 2020
39c1b39
fix reward stat in collector
Trinkle23897 Jul 12, 2020
96ee017
fix stat of collector, simplify test/base/env.py
Trinkle23897 Jul 12, 2020
f434a26
fix docs
Trinkle23897 Jul 12, 2020
20ea6a1
minor fix
Trinkle23897 Jul 12, 2020
a78d4ac
raise exception for stacking with partial keys and axis!=0
youkaichao Jul 12, 2020
1f08c4b
minor fix
Trinkle23897 Jul 12, 2020
69181d8
minor fix
Trinkle23897 Jul 12, 2020
c187390
minor fix
Trinkle23897 Jul 12, 2020
6436f9d
marl-examples
youkaichao Jul 11, 2020
e64d10c
Merge branch 'stack' into marl-example
youkaichao Jul 12, 2020
f6177b6
add condense; bugfix for torch.Tensor; code refactor
youkaichao Jul 12, 2020
9277b30
Merge branch 'collector' into marl-example
youkaichao Jul 12, 2020
0dd6a8a
marl example can run now
youkaichao Jul 12, 2020
81d0038
enable tic tac toe with larger board size and win-size
youkaichao Jul 12, 2020
741a255
add test dependency
youkaichao Jul 12, 2020
a55ad33
Fix padding of inconsistent keys with Batch.stack and Batch.cat (#130)
youkaichao Jul 12, 2020
75a292c
Merge branch 'dev' into marl-example
Trinkle23897 Jul 12, 2020
22c5fb2
stash
youkaichao Jul 12, 2020
70b6d06
let agent learn to play as agent 2 which is harder
youkaichao Jul 12, 2020
4230241
code refactor
youkaichao Jul 12, 2020
92f6031
Merge remote-tracking branch 'origin/marl-example' into marl-example
youkaichao Jul 12, 2020
885fbc1
Improve collector (#125)
youkaichao Jul 12, 2020
a42572a
Merge remote-tracking branch 'original/dev' into marl-example
youkaichao Jul 12, 2020
7fe79ae
marl for tic-tac-toe and general gomoku
youkaichao Jul 12, 2020
075ec73
update default gamma to 0.1 for tic tac toe to win earlier
youkaichao Jul 13, 2020
767ef99
fix name typo; change default game config; add rew_norm option
youkaichao Jul 13, 2020
ed0340c
fix pep8
youkaichao Jul 13, 2020
0468f96
test commit
Trinkle23897 Jul 13, 2020
4e27789
mv test dir name
Trinkle23897 Jul 13, 2020
f9ae704
add rew flag
Trinkle23897 Jul 13, 2020
d50278c
fix torch.optim import error and madqn rew_norm
Trinkle23897 Jul 13, 2020
3710b33
remove useless kwargs
youkaichao Jul 13, 2020
cee8088
Vector env enable select worker (#132)
duburcqa Jul 13, 2020
ea53c37
Merge branch 'dev' into marl-example
Trinkle23897 Jul 13, 2020
af86db4
show the last move of tictactoe by capital letters
youkaichao Jul 14, 2020
f390778
add multi-agent tutorial
youkaichao Jul 14, 2020
da99922
Merge remote-tracking branch 'origin/marl-example' into marl-example
youkaichao Jul 14, 2020
414430f
fix link
youkaichao Jul 14, 2020
f8ad6df
Standardized behavior of Batch.cat and misc code refactor (#137)
youkaichao Jul 16, 2020
fa542f8
write tutorials to specify the standard of Batch (#142)
youkaichao Jul 19, 2020
50e5992
Merge branch 'dev' into marl-example
youkaichao Jul 19, 2020
a458dcc
bugfix for mapolicy
youkaichao Jul 19, 2020
762c57e
pretty code
youkaichao Jul 19, 2020
8cf0d13
remove debug code; remove condense
youkaichao Jul 19, 2020
44624dd
doc fix
youkaichao Jul 19, 2020
c4f7311
check before get_agents in tutorials/tictactoe
Trinkle23897 Jul 19, 2020
7eb2f18
tutorial
Trinkle23897 Jul 19, 2020
13f7834
fix
Trinkle23897 Jul 19, 2020
d3985d5
minor fix for batch doc
Trinkle23897 Jul 19, 2020
3b50fcb
minor polish
Trinkle23897 Jul 19, 2020
65e2064
faster test_ttt
Trinkle23897 Jul 20, 2020
954b4f9
improve tic-tac-toe environment
youkaichao Jul 20, 2020
b052bf0
change default epoch and step-per-epoch for tic-tac-toe
youkaichao Jul 20, 2020
e12749f
fix mapolicy
Trinkle23897 Jul 20, 2020
6065166
minor polish for mapolicy
Trinkle23897 Jul 20, 2020
78cc1d5
90% to 80% (need to change the tutorial)
Trinkle23897 Jul 20, 2020
6052792
win rate
Trinkle23897 Jul 20, 2020
2757299
show step number at board
youkaichao Jul 20, 2020
2df6eaa
Merge branch 'dev' into marl-example
Trinkle23897 Jul 20, 2020
35be0e8
simplify mapolicy
Trinkle23897 Jul 20, 2020
721364f
minor polish for mapolicy
Trinkle23897 Jul 20, 2020
a0c8d60
remove MADQN
Trinkle23897 Jul 20, 2020
58485ec
fix pep8
Trinkle23897 Jul 20, 2020
4a44517
change legal_actions to mask (need to update docs)
Trinkle23897 Jul 20, 2020
4b73651
simplify maenv
Trinkle23897 Jul 20, 2020
43e908e
fix typo
Trinkle23897 Jul 20, 2020
ed61a6a
Merge branch 'dev' into marl-example
Trinkle23897 Jul 20, 2020
6141459
move basevecenv to single file
Trinkle23897 Jul 21, 2020
6dab46f
separate RandomAgent
Trinkle23897 Jul 21, 2020
bbd89fd
update docs
Trinkle23897 Jul 21, 2020
b0d8f64
grammarly
Trinkle23897 Jul 21, 2020
ae067ea
fix pep8
Trinkle23897 Jul 21, 2020
2017e27
win rate typo
Trinkle23897 Jul 21, 2020
267340f
format in cheatsheet
Trinkle23897 Jul 21, 2020
3c76bd2
use bool mask directly
youkaichao Jul 21, 2020
a04ddf6
update doc for boolean mask
youkaichao Jul 21, 2020
2 changes: 2 additions & 0 deletions .gitignore
@@ -143,3 +143,5 @@ MUJOCO_LOG.TXT
*.pth
.vscode/
.DS_Store
*.zip
*.pstats
1 change: 1 addition & 0 deletions README.md
@@ -38,6 +38,7 @@ Here is Tianshou's other features:
- Support any type of environment state (e.g. a dict, a self-defined class, ...) [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#user-defined-environment-and-different-state-representation)
- Support customized training process [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#customize-training-process)
- Support n-step returns estimation for all Q-learning based algorithms
- Support multi-agent RL easily [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html##multi-agent-reinforcement-learning)

In Chinese, Tianshou means divinely ordained and is derived from "the gift of being born with". Tianshou is a reinforcement learning platform, and the RL algorithm does not learn from humans. So taking "Tianshou" means that there is no teacher to study with, but rather learning by oneself through constant interaction with the environment.

Binary file added docs/_static/images/marl.png
Binary file added docs/_static/images/tic-tac-toe.png
1 change: 1 addition & 0 deletions docs/contributor.rst
@@ -6,3 +6,4 @@ We always welcome contributions to help make Tianshou better. Below are an incom
* Jiayi Weng (`Trinkle23897 <https://github.com/Trinkle23897>`_)
* Minghao Zhang (`Mehooz <https://github.com/Mehooz>`_)
* Alexis Duburcq (`duburcqa <https://github.com/duburcqa>`_)
* Kaichao You (`youkaichao <https://github.com/youkaichao>`_)
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ Here is Tianshou's other features:
* Support any type of environment state (e.g. a dict, a self-defined class, ...): :ref:`self_defined_env`
* Support customized training process: :ref:`customize_training`
* Support n-step returns estimation :meth:`~tianshou.policy.BasePolicy.compute_nstep_return` for all Q-learning based algorithms
* Support multi-agent RL easily (a tutorial is available at :doc:`/tutorials/tictactoe`)

The Chinese documentation is available at https://tianshou.readthedocs.io/zh/latest/

@@ -71,6 +72,7 @@ Tianshou is still under development, you can also check out the documents in sta
tutorials/dqn
tutorials/concepts
tutorials/batch
tutorials/tictactoe
tutorials/trick
tutorials/cheatsheet

43 changes: 43 additions & 0 deletions docs/tutorials/cheatsheet.rst
@@ -244,3 +244,46 @@ But the state stored in the buffer may be a shallow-copy. To make sure each of y
def step(a):
...
return copy.deepcopy(self.graph), reward, done, {}

.. _marl_example:

Multi-Agent Reinforcement Learning
----------------------------------

This is related to `Issue 121 <https://github.com/thu-ml/tianshou/issues/121>`_. The discussion is still ongoing.

With its flexible core APIs, Tianshou can support multi-agent reinforcement learning with minimal effort.

Currently, we support three types of multi-agent reinforcement learning paradigms:

1. Simultaneous move: at each timestep, all the agents take their actions (example: MOBA games)

2. Cyclic move: players take actions in turn (example: Go)

3. Conditional move: at each timestep, the environment conditionally selects an agent to take an action (example: `Pig Game <https://en.wikipedia.org/wiki/Pig_(dice_game)>`_)

We mainly address these multi-agent RL problems by converting them into traditional RL formulations.

For simultaneous moves, the solution is simple: we can just add a ``num_agent`` dimension to state, action, and reward. Nothing else needs to change.
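
A minimal sketch of this idea (the array shapes and the ``num_agents`` variable below are hypothetical, for illustration only, and not part of Tianshou's API):
::

    import numpy as np

    num_agents = 2
    obs = np.zeros((num_agents, 4))    # one observation row per agent
    act = np.array([1, 3])             # one action per agent
    rew = np.array([0.0, 1.0])         # one reward per agent
    # env.step(act) would return obs, rew, done, info batched over the agent axis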

Cases 2 and 3 (cyclic move and conditional move) can be unified into a single framework: at each timestep, the environment selects an agent identified by ``agent_id`` to play. Since the multiple agents are usually wrapped into one object (which we call the "abstract agent"), we can pass ``agent_id`` to the abstract agent and leave it to dispatch the call to the specific agent.

In addition, legal actions in multi-agent RL often vary across timesteps (as in Go), so the environment should also pass a legal-action mask to the "abstract agent". The mask is a boolean array in which ``True`` marks actions that are available and ``False`` marks actions that are illegal at the current step. The figure below illustrates the abstract agent.

.. image:: /_static/images/marl.png
:align: center
:height: 300

The above description gives rise to the following formulation of multi-agent RL:
::

action = policy(state, agent_id, mask)
(next_state, next_agent_id, next_mask), reward = env.step(action)
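
For intuition, here is a tiny sketch of how ``policy(state, agent_id, mask)`` can be realized by an abstract agent that routes each call to the concrete agent whose turn it is (the ``AbstractAgent`` class and the ``policies`` dict below are hypothetical, for illustration only):
::

    class AbstractAgent:
        """Dispatch each observation to the agent selected by the environment."""
        def __init__(self, policies):
            self.policies = policies                 # e.g. {1: dqn_agent, 2: random_agent}

        def __call__(self, state, agent_id, mask):
            agent = self.policies[agent_id]          # pick the acting agent
            return agent(state, mask)                # the mask rules out illegal actions

Calling ``AbstractAgent({1: agent_1, 2: agent_2})`` then matches the ``policy(state, agent_id, mask)`` interface above.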

By constructing a new state ``state_ = (state, agent_id, mask)``, we essentially return to the typical single-agent formulation of RL:
::

action = policy(state_)
next_state_, reward = env.step(action)
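
A self-contained sketch of an environment that returns this combined state as a dict observation (the class and its fields are hypothetical and are not the actual tic-tac-toe environment shipped with Tianshou):
::

    import numpy as np

    class TurnBasedEnv:
        """Toy 2-player environment whose observation bundles obs, agent_id and mask."""
        def __init__(self, num_actions=9):
            self.num_actions = num_actions
            self.reset()

        def reset(self):
            self.board = np.zeros(self.num_actions, dtype=int)
            self.agent_id = 1                        # agents are numbered 1 and 2
            return self._observe()

        def _observe(self):
            return {
                'obs': self.board.copy(),            # raw state for the acting agent
                'agent_id': self.agent_id,           # whose turn it is
                'mask': self.board == 0,             # True for legal actions
            }

        def step(self, action):
            self.board[action] = self.agent_id       # apply the current agent's move
            done = bool((self.board != 0).all())
            reward = 0.0                             # a real game would score wins here
            self.agent_id = 3 - self.agent_id        # switch turns: 1 <-> 2
            return self._observe(), reward, done, {}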

Following this idea, we write a tiny example of playing `Tic Tac Toe <https://en.wikipedia.org/wiki/Tic-tac-toe>`_ against a random player using a Q-learning algorithm. The tutorial is at :doc:`/tutorials/tictactoe`.
2 changes: 1 addition & 1 deletion docs/tutorials/dqn.rst
@@ -88,7 +88,7 @@ We use the defined ``net`` and ``optim``, with extra policy hyper-parameters, to

policy = ts.policy.DQNPolicy(net, optim,
discount_factor=0.9, estimation_step=3,
use_target_network=True, target_update_freq=320)
target_update_freq=320)


Setup Collector