
Reproducibility for Atari tasks #456

@nuance1979

Description

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, numpy, sys
    print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

Related to #449. I was trying to fix the reproducibility issue in atari_bcq.py and initially believed it came from set(a.parameters()).union(b.parameters()), since Python sets are unordered, so the parameter order handed to the optimizer can differ between runs (a minimal sketch of this follows the logs below). However, even after fixing the parameter order, I still got different results between two runs with the same seed. For example, I ran python3 ./atari_dqn.py --task PongNoFrameskip-v4 --epoch 5 three times and got 3 different results:

❯ grep best_reward log.dqn.pong.epoch_5*
log.dqn.pong.epoch_5:Epoch #1: test_reward: -21.000000 ± 0.000000, best_reward: -21.000000 ± 0.000000 in #0
log.dqn.pong.epoch_5:Epoch #2: test_reward: -21.000000 ± 0.000000, best_reward: -21.000000 ± 0.000000 in #0
log.dqn.pong.epoch_5:Epoch #3: test_reward: -21.000000 ± 0.000000, best_reward: -21.000000 ± 0.000000 in #0
log.dqn.pong.epoch_5:Epoch #4: test_reward: -21.000000 ± 0.000000, best_reward: -21.000000 ± 0.000000 in #0
log.dqn.pong.epoch_5:Epoch #5: test_reward: -21.000000 ± 0.000000, best_reward: -21.000000 ± 0.000000 in #0
log.dqn.pong.epoch_5: 'best_reward': -21.0,
log.dqn.pong.epoch_5.1:Epoch #1: test_reward: -19.000000 ± 1.264911, best_reward: -19.000000 ± 1.264911 in #1
log.dqn.pong.epoch_5.1:Epoch #2: test_reward: -16.100000 ± 2.385372, best_reward: -16.100000 ± 2.385372 in #2
log.dqn.pong.epoch_5.1:Epoch #3: test_reward: -19.100000 ± 0.943398, best_reward: -16.100000 ± 2.385372 in #2
log.dqn.pong.epoch_5.1:Epoch #4: test_reward: -18.600000 ± 1.624808, best_reward: -16.100000 ± 2.385372 in #2
log.dqn.pong.epoch_5.1:Epoch #5: test_reward: 2.400000 ± 2.059126, best_reward: 2.400000 ± 2.059126 in #5
log.dqn.pong.epoch_5.1: 'best_reward': 2.4,
log.dqn.pong.epoch_5.2:Epoch #1: test_reward: -20.800000 ± 0.600000, best_reward: -20.800000 ± 0.600000 in #1
log.dqn.pong.epoch_5.2:Epoch #2: test_reward: -12.800000 ± 0.600000, best_reward: -12.800000 ± 0.600000 in #2
log.dqn.pong.epoch_5.2:Epoch #3: test_reward: -16.000000 ± 1.843909, best_reward: -12.800000 ± 0.600000 in #2
log.dqn.pong.epoch_5.2:Epoch #4: test_reward: -12.700000 ± 3.689173, best_reward: -12.700000 ± 3.689173 in #4
log.dqn.pong.epoch_5.2:Epoch #5: test_reward: -10.400000 ± 1.562050, best_reward: -10.400000 ± 1.562050 in #5
log.dqn.pong.epoch_5.2: 'best_reward': -10.4,
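For reference, here is a minimal sketch of the ordering issue I suspected; a and b are hypothetical stand-ins for the two networks in atari_bcq.py. torch.Tensor hashes by object identity, so iterating a set of parameters depends on memory addresses, which vary between interpreter runs, while itertools.chain preserves module definition order:

    import itertools

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins for the two networks whose parameters get merged.
    a = nn.Linear(4, 4)
    b = nn.Linear(4, 4)

    # set(...).union(...) iterates in hash order; torch.Tensor hashes by id(),
    # i.e. by memory address, so this order can change between runs:
    unordered = set(a.parameters()).union(b.parameters())

    # itertools.chain keeps the definition order, so the optimizer always sees
    # the parameters in the same order:
    optim = torch.optim.Adam(
        itertools.chain(a.parameters(), b.parameters()), lr=1e-4
    )

But as the logs above show, fixing the ordering alone wasn't enough.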

I wonder where the randomness comes from, given that we already set the seeds like this:

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)

Does it come from the vector replay buffer? Or does it come from the GPU? Is it possible to remove it?
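In case it helps narrow things down, here is a sketch of every determinism knob I'm aware of in PyTorch, beyond the four lines above; I'm not sure whether any of them covers the vector replay buffer:

    import random

    import numpy as np
    import torch

    random.seed(args.seed)                     # Python's built-in RNG
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)               # CPU RNG
    torch.cuda.manual_seed_all(args.seed)      # RNG on every CUDA device
    torch.backends.cudnn.deterministic = True  # pick deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning
    # Available in PyTorch >= 1.8; errors out on any nondeterministic op:
    # torch.use_deterministic_algorithms(True)

As far as I know, even with all of these set, some CUDA kernels (e.g. atomicAdd-based scatter/index ops) remain nondeterministic unless torch.use_deterministic_algorithms(True) is enabled.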

Metadata

Labels: bug (Something isn't working)