
A bug about dual clip PPO implementation #433

@Ending2015a

Description

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, numpy, sys
    print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

Hello, I have a question about the dual clipping implementation for PPO.

```python
ratio = (dist.log_prob(b.act) - b.logp_old).exp().float()
ratio = ratio.reshape(ratio.size(0), -1).transpose(0, 1)
surr1 = ratio * b.adv
surr2 = ratio.clamp(1.0 - self._eps_clip, 1.0 + self._eps_clip) * b.adv
if self._dual_clip:
    clip_loss = -torch.max(
        torch.min(surr1, surr2), self._dual_clip * b.adv
    ).mean()
else:
    clip_loss = -torch.min(surr1, surr2).mean()
```
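A minimal numeric sketch (standalone, not tianshou code; `eps_clip` and `dual_clip` are typical hypothetical values) of what the unconditional `max(..., c * adv)` does when the advantage is positive:

```python
import torch

# With a positive advantage, c * adv (c > 1) exceeds both surrogate terms,
# so max(min(surr1, surr2), c * adv) replaces the clipped surrogate entirely.
ratio = torch.tensor([1.5])      # hypothetical importance ratio
adv = torch.tensor([2.0])        # positive advantage
eps_clip, dual_clip = 0.2, 3.0   # assumed values; dual_clip is the paper's c

surr1 = ratio * adv                                     # 3.0
surr2 = ratio.clamp(1.0 - eps_clip, 1.0 + eps_clip) * adv  # 1.2 * 2.0 = 2.4
clipped = torch.min(surr1, surr2)                       # standard PPO term: 2.4
dual = torch.max(clipped, dual_clip * adv)              # 6.0 -- c * adv wins
print(clipped.item(), dual.item())
```

Here the dual-clipped objective is 6.0 instead of the standard 2.4, which is exactly the inflated policy loss described below.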

Is this implementation correct? The paper (arXiv:1912.09729) applies dual clipping, `max(min(surr1, surr2), c * adv)`, only when the advantage (\hat{A} in the paper) is negative. If \hat{A} is positive, `c * adv` can be much larger than `surr1` and `surr2` (note that c is always > 1.0), so taking `max(..., c * adv)` produces a huge policy loss. Should it be changed to something like this?

if self._dual_clip:
    clip1 = torch.min(surr1, surr2)
    clip2 = torch.max(clip1, self._dual_clip * b.adv)
    clip_loss = -torch.where(b.adv < 0, clip2, clip1).mean()
else:
    clip_loss = -torch.min(surr1, surr2).mean()
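A standalone sketch of the proposed fix (variable names and constants are assumptions, not tianshou internals), checking both advantage signs: the `torch.where` keeps the standard clipped term for positive advantages and only applies the dual-clip lower bound where the advantage is negative.

```python
import torch

adv = torch.tensor([2.0, -2.0])   # one positive, one negative advantage
ratio = torch.tensor([1.5, 0.3])  # hypothetical importance ratios
eps_clip, dual_clip = 0.2, 3.0    # assumed values; dual_clip is the paper's c

surr1 = ratio * adv
surr2 = ratio.clamp(1.0 - eps_clip, 1.0 + eps_clip) * adv
clip1 = torch.min(surr1, surr2)            # [2.4, -1.6]
clip2 = torch.max(clip1, dual_clip * adv)  # [6.0, -1.6]
# Positive adv keeps the standard clip1; negative adv uses the bounded clip2.
loss = -torch.where(adv < 0, clip2, clip1).mean()
```

For the negative-advantage entry, `max(clip1, c * adv)` bounds how negative the surrogate can get (−1.6 rather than −6.0), which is the effect dual clipping is meant to have.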

Metadata

Labels: bug (Something isn't working)