Description
Currently, for example in PPO, actions are sampled from a Gaussian distribution, whose support is (-inf, inf). The action is then remapped to (-1, 1) by clipping.
Do this and you will find that the algorithm does not converge. This is actually a bug: when you clip the action, the log probability stored for the sampled action no longer matches the action that was actually executed, which makes the original update equation incorrect and can severely harm performance in certain circumstances.
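To make the problem concrete, here is a minimal sketch (plain PyTorch, with names chosen for illustration rather than taken from the Tianshou API): once the sample is clipped, the log probability used in the PPO ratio no longer corresponds to the action the environment actually received (and the clipped Gaussian even puts point mass on the boundary, which a plain `log_prob` cannot express).

```python
import torch
from torch.distributions import Normal

# Gaussian policy output for a single action dimension
dist = Normal(loc=torch.tensor(0.9), scale=torch.tensor(0.5))

raw_action = dist.sample()                    # unbounded sample from (-inf, inf)
clipped_action = raw_action.clamp(-1.0, 1.0)  # what the environment actually receives

# Whenever the sample falls outside (-1, 1), these two values disagree, so the
# surrogate objective is evaluated with the wrong probability for the executed action.
print(dist.log_prob(raw_action))      # log-density of the stored / raw action
print(dist.log_prob(clipped_action))  # log-density evaluated at the clipped action
```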
This is where the comment in test_ppo.py comes from:
policy = PPOPolicy(
    actor, critic, optim, dist, args.gamma,
    max_grad_norm=args.max_grad_norm,
    eps_clip=args.eps_clip,
    vf_coef=args.vf_coef,
    ent_coef=args.ent_coef,
    reward_normalization=args.rew_norm,
    # dual_clip=args.dual_clip,
    # dual clip cause monotonically increasing log_std :)
    value_clip=args.value_clip,
    # action_range=[env.action_space.low[0], env.action_space.high[0]],)
    # if clip the action, ppo would not converge :)
    gae_lambda=args.gae_lambda)
(see tianshou/test/continuous/test_ppo.py, line 103 at commit ec23c7e)
One way to fix this is to not store the remapped action in the buffer, and instead treat the remapping as part of the environment (i.e. as a black box); this is what Gym / Stable-Baselines3 actually do. A sketch of this idea follows below.
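Here is a minimal sketch of that idea (the collector loop, the `buffer.add` signature, and the field names are illustrative, not the actual Tianshou code): sample and store the unbounded action, and clip only at the moment the environment is stepped.

```python
import numpy as np

def collect_step(policy_dist, env, buffer, obs):
    """Sample in the unbounded policy space; treat clipping as part of the env."""
    raw_action = policy_dist.sample()            # stays in (-inf, inf)
    log_prob = policy_dist.log_prob(raw_action)  # consistent with raw_action

    # Clipping happens only at the environment boundary (black-box view):
    env_action = np.clip(raw_action.numpy(),
                         env.action_space.low, env.action_space.high)
    obs_next, reward, done, info = env.step(env_action)

    # Only the raw action and its log-prob go into the buffer, so the PPO ratio
    # exp(log pi_new(a|s) - log pi_old(a|s)) is computed for the same action
    # that was actually sampled, not the clipped one.
    buffer.add(obs=obs, act=raw_action, rew=reward, done=done,
               obs_next=obs_next, log_prob=log_prob)
    return obs_next, done
```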
I will open a PR shortly to fix this.