Description
I am trying to adapt the tic-tac-toe example to train through self-play. I train an agent against a fixed agent (with the same network architecture) until the average win rate reaches 95%. Then I copy the weights from the trained agent to the fixed agent and repeat the process for 10 generations of evolution.
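To make the setup concrete, here is a rough but runnable sketch of the weight-copying part of the self-play loop. The `Net` class and its layer sizes are just placeholders rather than the actual tic-tac-toe network, and the training call itself is elided:

```python
import torch.nn as nn


class Net(nn.Module):
    """Tiny stand-in for the Q-network (placeholder: 9 board cells -> 9 moves)."""

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))

    def forward(self, obs):
        return self.model(obs)


learner, opponent = Net(), Net()

for generation in range(10):
    # ... train `learner` against the frozen `opponent` here until its
    #     average win rate reaches 95% (training code elided) ...
    # then promote it: the opponent becomes a copy of the learner
    opponent.load_state_dict(learner.state_dict())
```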
My first question is how to freeze an agent in the multi-agent setting. Currently I am using an optimizer with lr=0 on the agent I want to freeze as an ad hoc workaround, but I am not sure this is the right way, and there should be a solution that does not compute gradients at all. Is there an existing API for this that I am not aware of? Or should I make the fixed agent part of the environment and train without the multi-agent setup?
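For reference, this is what the lr=0 workaround looks like next to what I imagine a gradient-free freeze would be, in plain PyTorch. The `opponent` module here is a placeholder, and I don't know whether Tianshou already provides something equivalent:

```python
import torch
import torch.nn as nn

# stand-in for the fixed agent's network, just to make the snippet runnable
opponent = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))

# what I do now: an optimizer with lr=0, so gradients are still computed
# but every update is a no-op
frozen_optim = torch.optim.Adam(opponent.parameters(), lr=0.0)

# what I suspect is cleaner: disable gradient computation entirely
# (plain PyTorch; I don't know whether Tianshou exposes a dedicated API for this)
for p in opponent.parameters():
    p.requires_grad_(False)
opponent.eval()
```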
My second problem is that training has no variance at all (the log at the end of each epoch says `test_reward: 1.000000 ± 0.000000`). Is this because the fixed DQN opponent acts deterministically at test time and is therefore too easy to beat? What should I change to get meaningful training progress, like what I see when training against a random policy?
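To illustrate what I mean by deterministic: with eps = 0 the action is a pure argmax over the Q-values, so the same board state always maps to the same move and every evaluation game against the fixed opponent plays out identically. A toy example with made-up Q-values:

```python
import torch

# toy illustration of greedy test-time action selection
q_values = torch.tensor([0.1, 0.9, 0.3])  # hypothetical Q-values for one state
eps = 0.0  # test-time epsilon

if torch.rand(1).item() < eps:
    action = torch.randint(len(q_values), (1,)).item()  # exploratory move
else:
    action = int(q_values.argmax())  # greedy move, always the same one

print(action)  # prints 1 every time, hence zero spread in test_reward
```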