Description
I am trying to adapt the tic-tac-toe example to train through self-play. I train an agent against a fixed agent (with the same network architecture) until the average win rate reaches 95%. Then I copy the weights from the trained agent to the fixed agent and repeat the process for 10 generations of evolution.
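To make the setup concrete, here is a rough but runnable sketch of the weight-copying part of the self-play loop. The `Net` class and its layer sizes are just placeholders rather than the actual tic-tac-toe network, and the training call itself is elided:

```python
import torch.nn as nn


class Net(nn.Module):
    """Tiny stand-in for the Q-network (placeholder: 9 board cells -> 9 moves)."""

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))

    def forward(self, obs):
        return self.model(obs)


learner, opponent = Net(), Net()

for generation in range(10):
    # ... train `learner` against the frozen `opponent` here until its
    #     average win rate reaches 95% (training code elided) ...
    # then promote it: the opponent becomes a copy of the learner
    opponent.load_state_dict(learner.state_dict())
```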
My first question is how to freeze an agent in the multi-agent setting. Currently I am using an optimizer with lr=0 on the agent I want to freeze as an ad hoc workaround, but I am not sure this is the right way, and there should be a solution that does not compute gradients at all. Is there an existing API for this that I am not aware of? Or should I make the fixed agent part of the environment and train without the multi-agent setup?
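For reference, this is what the lr=0 workaround looks like next to what I imagine a gradient-free freeze would be, in plain PyTorch. The `opponent` module here is a placeholder, and I don't know whether Tianshou already provides something equivalent:

```python
import torch
import torch.nn as nn

# stand-in for the fixed agent's network, just to make the snippet runnable
opponent = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))

# what I do now: an optimizer with lr=0, so gradients are still computed
# but every update is a no-op
frozen_optim = torch.optim.Adam(opponent.parameters(), lr=0.0)

# what I suspect is cleaner: disable gradient computation entirely
# (plain PyTorch; I don't know whether Tianshou exposes a dedicated API for this)
for p in opponent.parameters():
    p.requires_grad_(False)
opponent.eval()
```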
My second problem is that training has no variance at all (the log at the end of each epoch says `test_reward: 1.000000 ± 0.000000`). Is this because the fixed DQN opponent acts deterministically at test time and is therefore too easy to beat? What should I change to get meaningful training progress, like what I see when training against a random policy?
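To illustrate what I mean by deterministic: with eps = 0 the action is a pure argmax over the Q-values, so the same board state always maps to the same move and every evaluation game against the fixed opponent plays out identically. A toy example with made-up Q-values:

```python
import torch

# toy illustration of greedy test-time action selection
q_values = torch.tensor([0.1, 0.9, 0.3])  # hypothetical Q-values for one state
eps = 0.0  # test-time epsilon

if torch.rand(1).item() < eps:
    action = torch.randint(len(q_values), (1,)).item()  # exploratory move
else:
    action = int(q_values.argmax())  # greedy move, always the same one

print(action)  # prints 1 every time, hence zero spread in test_reward
```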