Description
- I have marked all applicable categories:
    - exception-raising bug
    - RL algorithm bug
    - documentation request (i.e. "X is missing from the documentation.")
    - new feature request
- I have visited the source website
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
Tianshou's multi-agent policy manager has an issue where it ignores certain rewards coming in from the environment.
In particular, the only reward an agent ever sees is the reward it receives immediately after taking its own action. Any reward credited to it while an opponent is acting is silently dropped.
To demonstrate this, consider the following extremely simple environment.
Each agent's action should be its opponent's previous action + 5. If it chooses the correct action, it is rewarded +1; otherwise +0. This is a trivial environment to learn, and Tianshou's multi-agent policy manager learns it successfully.
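For reference, here is a minimal sketch of such an environment (the class and attribute names are made up for illustration, the rule is the plain "+5" one stated above with no action-space wrap-around, and the `delay_reward` flag implements the delayed variant discussed below; the full runnable version is in the gist linked further down):

```python
import numpy as np


class GuessPlusFiveEnv:
    """Two agents alternate; an agent earns +1 if its action equals the
    opponent's previous action + 5, otherwise +0."""

    def __init__(self, delay_reward=False):
        self.delay_reward = delay_reward  # if True, pay the reward one turn late
        self.reset()

    def reset(self):
        self.last_action = 0                # "start action = 0"
        self.current_agent = 0              # agent 1 moves first
        self.pending_reward = np.zeros(2)   # only used in the delayed variant
        return self.last_action

    def step(self, action):
        rewards = np.zeros(2)               # one reward slot per agent
        earned = 1.0 if action == self.last_action + 5 else 0.0
        if self.delay_reward:
            # pay out whatever was earned on the previous turn,
            # and queue this turn's reward for the next step
            rewards += self.pending_reward
            self.pending_reward = np.zeros(2)
            self.pending_reward[self.current_agent] = earned
        else:
            rewards[self.current_agent] = earned
        self.last_action = action
        self.current_agent = 1 - self.current_agent
        return self.last_action, rewards, False, {}
```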
But what if the reward is delayed? In the ordinary case, where the reward is not delayed, you would expect the following (start action = 0):
agent | action | reward_1 | reward_2 |
---|---|---|---|
1 | 2 | 0 | 0 |
2 | 7 | 0 | 1 |
1 | 12 | 1 | 0 |
2 | 1 | 0 | 1 |
And in this setup, the agent learns fine.
However, it should be perfectly valid to delay the reward by one turn:
agent | action | reward_1 | reward_2 |
---|---|---|---|
1 | 2 | 0 | 0 |
2 | 7 | 0 | 0 |
1 | 12 | 0 | 1 |
2 | 1 | 1 | 0 |
And the agent should still learn. However, Tianshou's MultiAgentPolicyManager
is unable to learn this environment at all. The reason is that it only considers the reward an agent receives on the very step in which that agent acts.
Fully reproducible example here: https://gist.github.com/benblack769/80fe3ea5637108bf4e63d94de53e28b1
The problematic bit of code is this line here: https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/multiagent/mapolicy.py#L121
Note that this line only fetches a subset of the rewards returned by the environment.
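To illustrate the effect, here is a simplified reproduction of that slicing behaviour (plain NumPy, not the library's actual code): when each sub-policy's batch keeps only the acting agent's reward column, every delayed reward from the table above disappears.

```python
import numpy as np

# Per-step reward vectors from the delayed-reward rollout above,
# one column per agent; rows are the four steps in the table.
rew = np.array([
    [0, 0],   # agent 1 acted
    [0, 0],   # agent 2 acted
    [0, 1],   # agent 1 acted, but the reward went to agent 2
    [1, 0],   # agent 2 acted, but the reward went to agent 1
])
acting_agent = np.array([0, 1, 0, 1])  # index of the agent that stepped

# Keeping only the acting agent's column -- roughly what the linked line
# does when building each sub-policy's batch -- drops every delayed reward.
seen_by_actor = rew[np.arange(len(rew)), acting_agent]
print(seen_by_actor)  # [0 0 0 0] -- all learning signal is lost
```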
Note that the tic-tac-toe example bundled with Tianshou also suffers from this problem: only the player who moves last is rewarded, so the player who did not step on the final turn never receives any reward.
I am a maintainer of PettingZoo. Our solution to this problem is to accumulate all the rewards an agent receives between when it takes an action and when it takes its next action (this is stored in an env._cumulative_rewards
dict and then exposed through the last() method). This can be done on the environment side, but it requires agents to keep stepping after the environment is done. We are looking to add PettingZoo support to Tianshou and will hopefully make a PR soon adding an environment wrapper that correctly converts a PettingZoo environment into a Tianshou multi-agent environment.
This is on Tianshou 0.4.2, but I believe the problem has been around for a long time.
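As a rough illustration of the accumulation idea (this is only a sketch with hypothetical names, not the planned wrapper or PettingZoo's internal implementation): rewards emitted while it is not an agent's turn are banked and handed over the next time that agent acts.

```python
from collections import defaultdict


class RewardAccumulator:
    """Sketch of per-agent reward accumulation, mirroring the behaviour of
    PettingZoo's env._cumulative_rewards / last(): every reward emitted while
    it is NOT an agent's turn is banked and paid out when that agent next acts."""

    def __init__(self, agent_ids):
        self.agent_ids = agent_ids
        self._bank = defaultdict(float)

    def add(self, reward_dict):
        # Called after every environment step with the per-agent rewards.
        for agent_id, r in reward_dict.items():
            self._bank[agent_id] += r

    def pop(self, agent_id):
        # Called right before agent_id observes/acts: hand over everything
        # accumulated since its previous action.
        r = self._bank[agent_id]
        self._bank[agent_id] = 0.0
        return r
```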