Description
- I have marked all applicable categories:
    - exception-raising bug
    - RL algorithm bug
    - documentation request (i.e. "X is missing from the documentation.")
    - new feature request
- I have visited the source website
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
Tianshou's multi-agent policy manager has an issue where it ignores certain rewards coming in from the environment.
In particular, the only reward an agent ever sees is the reward it receives immediately after taking its own action. Any reward credited to it while an opponent is acting is silently dropped.
To demonstrate this, consider the following extremely simple environment.
Each agent's action should be its opponent's previous action + 5. If it chooses the correct action, it is rewarded +1; otherwise +0. This is a trivial environment to learn, and Tianshou's multi-agent policy manager learns it successfully.
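For reference, here is a minimal sketch of such an environment (the class and attribute names are made up for illustration, the rule is the plain "+5" one stated above with no action-space wrap-around, and the `delay_reward` flag implements the delayed variant discussed below; the full runnable version is in the gist linked further down):

```python
import numpy as np


class GuessPlusFiveEnv:
    """Two agents alternate; an agent earns +1 if its action equals the
    opponent's previous action + 5, otherwise +0."""

    def __init__(self, delay_reward=False):
        self.delay_reward = delay_reward  # if True, pay the reward one turn late
        self.reset()

    def reset(self):
        self.last_action = 0                # "start action = 0"
        self.current_agent = 0              # agent 1 moves first
        self.pending_reward = np.zeros(2)   # only used in the delayed variant
        return self.last_action

    def step(self, action):
        rewards = np.zeros(2)               # one reward slot per agent
        earned = 1.0 if action == self.last_action + 5 else 0.0
        if self.delay_reward:
            # pay out whatever was earned on the previous turn,
            # and queue this turn's reward for the next step
            rewards += self.pending_reward
            self.pending_reward = np.zeros(2)
            self.pending_reward[self.current_agent] = earned
        else:
            rewards[self.current_agent] = earned
        self.last_action = action
        self.current_agent = 1 - self.current_agent
        return self.last_action, rewards, False, {}
```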
But what if the reward is delayed? In the ordinary case, where the reward is not delayed, you would expect the following (start action = 0):
agent | action | reward_1 | reward_2 |
---|---|---|---|
1 | 2 | 0 | 0 |
2 | 7 | 0 | 1 |
1 | 12 | 1 | 0 |
2 | 1 | 0 | 1 |
And in this setup, the agent learns fine.
However, it should be perfectly valid to delay the reward by one turn:
agent | action | reward_1 | reward_2 |
---|---|---|---|
1 | 2 | 0 | 0 |
2 | 7 | 0 | 0 |
1 | 12 | 0 | 1 |
2 | 1 | 1 | 0 |
And the agent should still learn. However, Tianshou's MultiAgentPolicyManager
is unable to learn this environment at all. The reason is that it only considers the reward an agent receives on the very step in which that agent acts.
Fully reproducible example here: https://gist.github.com/benblack769/80fe3ea5637108bf4e63d94de53e28b1
The problematic bit of code is this line here: https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/multiagent/mapolicy.py#L121
Note that this line only fetches a subset of the rewards returned by the environment.
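To illustrate the effect, here is a simplified reproduction of that slicing behaviour (plain NumPy, not the library's actual code): when each sub-policy's batch keeps only the acting agent's reward column, every delayed reward from the table above disappears.

```python
import numpy as np

# Per-step reward vectors from the delayed-reward rollout above,
# one column per agent; rows are the four steps in the table.
rew = np.array([
    [0, 0],   # agent 1 acted
    [0, 0],   # agent 2 acted
    [0, 1],   # agent 1 acted, but the reward went to agent 2
    [1, 0],   # agent 2 acted, but the reward went to agent 1
])
acting_agent = np.array([0, 1, 0, 1])  # index of the agent that stepped

# Keeping only the acting agent's column -- roughly what the linked line
# does when building each sub-policy's batch -- drops every delayed reward.
seen_by_actor = rew[np.arange(len(rew)), acting_agent]
print(seen_by_actor)  # [0 0 0 0] -- all learning signal is lost
```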
Note that the tic-tac-toe example bundled with Tianshou also suffers from this problem: only the player who moves last is rewarded, so the player who did not step on the final turn never receives any reward.
I am a maintainer of PettingZoo. Our solution to this problem is to accumulate all the rewards an agent receives between when it takes an action and when it takes its next action (this is stored in an env._cumulative_rewards
dict and then exposed through the last() method). This can be done on the environment side, but it requires agents to keep stepping after the environment is done. We are looking to add PettingZoo support to Tianshou and will hopefully make a PR soon adding an environment wrapper that correctly converts a PettingZoo environment into a Tianshou multi-agent environment.
This is on Tianshou 0.4.2, but I believe the problem has been around for a long time.
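As a rough illustration of the accumulation idea (this is only a sketch with hypothetical names, not the planned wrapper or PettingZoo's internal implementation): rewards emitted while it is not an agent's turn are banked and handed over the next time that agent acts.

```python
from collections import defaultdict


class RewardAccumulator:
    """Sketch of per-agent reward accumulation, mirroring the behaviour of
    PettingZoo's env._cumulative_rewards / last(): every reward emitted while
    it is NOT an agent's turn is banked and paid out when that agent next acts."""

    def __init__(self, agent_ids):
        self.agent_ids = agent_ids
        self._bank = defaultdict(float)

    def add(self, reward_dict):
        # Called after every environment step with the per-agent rewards.
        for agent_id, r in reward_dict.items():
            self._bank[agent_id] += r

    def pop(self, agent_id):
        # Called right before agent_id observes/acts: hand over everything
        # accumulated since its previous action.
        r = self._bank[agent_id]
        self._bank[agent_id] = 0.0
        return r
```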