
Buffer should not store remapped action #312

@ChenDRAG

Description

For now, for example in PPO, actions are sampled from a Gaussian distribution, whose support is (-inf, inf). We then remap the action to (-1, 1) by clipping.

Do this, and you will find that the algorithm doesn't converge. This is actually a bug: when you clip the action, the log probability of the stored action no longer matches the action the policy actually sampled. That makes the original update equation incorrect, which can severely harm performance in certain circumstances.
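To see the mismatch concretely, here is a minimal sketch in plain PyTorch (illustrative only, not Tianshou code): once the sampled action is clipped, evaluating the policy's log_prob on the stored clipped action no longer reflects the probability of the action the policy actually took, so the importance ratio pi/pi_old in the PPO objective is computed on the wrong quantity.

import torch
from torch.distributions import Normal

dist = Normal(torch.tensor(0.0), torch.tensor(1.0))
raw_action = torch.tensor(2.5)                 # sampled from the Gaussian
clipped_action = raw_action.clamp(-1.0, 1.0)   # remapped to (-1, 1)

print(dist.log_prob(raw_action))      # log-prob of the action actually taken
print(dist.log_prob(clipped_action))  # what PPO evaluates if the buffer stores
                                      # the clipped action -- a different value,
                                      # so the ratio pi/pi_old is wrong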
This is where the comment in test_ppo.py comes from:

policy = PPOPolicy(
    actor, critic, optim, dist, args.gamma,
    max_grad_norm=args.max_grad_norm,
    eps_clip=args.eps_clip,
    vf_coef=args.vf_coef,
    ent_coef=args.ent_coef,
    reward_normalization=args.rew_norm,
    # dual_clip=args.dual_clip,
    # dual clip causes monotonically increasing log_std :)
    value_clip=args.value_clip,
    # action_range=[env.action_space.low[0], env.action_space.high[0]],)
    # if we clip the action, PPO will not converge :)
    gae_lambda=args.gae_lambda)

One way to fix this is to not store the remapped action in the buffer, and instead regard the remapping as part of the environment (a black box). This is what gym/SB3 actually do.
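A minimal sketch of that approach, assuming a Gym-style environment (the wrapper name is illustrative, not an existing API): the agent stores the raw Gaussian sample and its log-prob in the buffer, and the clipping happens inside an environment wrapper, so the remapping is invisible to the update rule.

import gym
import numpy as np

class ClipActionWrapper(gym.ActionWrapper):
    """Clip the incoming action to the env's bounds, inside the env."""
    def action(self, action):
        return np.clip(action, self.action_space.low, self.action_space.high)

# Usage sketch: step the wrapped env with the raw action, but store the
# raw action (and its log-prob) in the replay buffer.
# env = ClipActionWrapper(gym.make("Pendulum-v0"))
# obs, rew, done, info = env.step(raw_action)  # env sees the clipped action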

I will open a PR shortly to fix this.

Labels: bug (Something isn't working)
