
How to do self-play correctly for tic-tac-toe? #381

@dzy1997

Description


I am trying to adapt the tic-tac-toe example to train through self-play. I train an agent against a fixed agent (with the same network architecture) until the average win rate reaches 95%. Then I copy the weights from the trained agent to the fixed agent and repeat the process for 10 generations of evolution.
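For concreteness, this is roughly the schedule I am running (a plain-PyTorch sketch; `make_net`, the layer sizes, and the 95% stopping rule inside the loop are illustrative, and the actual training loop is omitted):

```python
import copy

import torch.nn as nn


def make_net():
    # 9 board cells in, 9 action values out; the layer sizes are illustrative
    return nn.Sequential(nn.Linear(9, 128), nn.ReLU(), nn.Linear(128, 9))


learner = make_net()               # the agent being trained
opponent = copy.deepcopy(learner)  # the fixed agent, same architecture

N_GENERATIONS = 10
for gen in range(N_GENERATIONS):
    # ... train `learner` against the frozen `opponent` here until its
    #     average win rate reaches 95% (training loop omitted) ...
    # then promote the trained weights to become the next fixed opponent
    opponent.load_state_dict(learner.state_dict())
```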
My first question is how to freeze an agent in a multi-agent setting. Currently I use an optimizer with lr=0 on the agent I want to freeze as an ad hoc workaround, but I am not sure this is the right approach; there should be a solution that does not compute gradients for the frozen agent at all. Is there an existing API for this that I am not aware of? Or should I instead treat the fixed agent as part of the environment and train without the multi-agent setup?
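To illustrate what I mean, here is the workaround next to the kind of solution I am asking about (a plain-PyTorch sketch reusing `opponent` from the snippet above; nothing here is specific to the library's API):

```python
import torch

# Current ad hoc workaround: an optimizer with lr=0 still allows gradients to
# be computed for the frozen agent; it just never changes its weights.
frozen_optim = torch.optim.Adam(opponent.parameters(), lr=0.0)

# What I would expect a proper solution to look like: mark the opponent's
# parameters as not requiring gradients and run its forward pass under
# torch.no_grad(), so no graph is built for it at all.
for p in opponent.parameters():
    p.requires_grad_(False)
opponent.eval()

with torch.no_grad():
    q_values = opponent(torch.zeros(1, 9))  # dummy observation, just to illustrate
```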
My second problem is that training shows no variance at all (the log at the end of each epoch reports test_reward: 1.000000 ± 0.000000). Is this because DQN is deterministic at test time and therefore too easy to beat? What should I change to get meaningful training progress, as when training against a random policy?
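To make my guess concrete, this is the kind of action selection I have in mind (a simplified sketch; `select_action` is illustrative and masking of illegal moves is omitted):

```python
import torch


def select_action(net, obs, eps):
    # epsilon-greedy over the 9 tic-tac-toe moves
    if torch.rand(()).item() < eps:
        return torch.randint(0, 9, ()).item()
    with torch.no_grad():
        return net(obs).argmax(dim=-1).item()

# With eps = 0.0 at test time, the argmax always picks the same move for the
# same board, so every test game against the fixed opponent plays out
# identically, hence the 1.000000 ± 0.000000. A small test-time eps (or
# evaluating against a random opponent) would reintroduce variance.
```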

Metadata

Labels: question (further information is requested)
