Description
- I have marked all applicable categories:
  - exception-raising bug
  - RL algorithm bug
  - documentation request (i.e. "X is missing from the documentation.")
  - new feature request
- I have visited the source website
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
  import tianshou, torch, numpy, sys
  print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
Hello, I have a question about the dual clipping implementation for PPO.
tianshou/tianshou/policy/modelfree/ppo.py
Lines 114 to 123 in 8a5e219
ratio = (dist.log_prob(b.act) - b.logp_old).exp().float()
ratio = ratio.reshape(ratio.size(0), -1).transpose(0, 1)
surr1 = ratio * b.adv
surr2 = ratio.clamp(1.0 - self._eps_clip, 1.0 + self._eps_clip) * b.adv
if self._dual_clip:
    clip_loss = -torch.max(
        torch.min(surr1, surr2), self._dual_clip * b.adv
    ).mean()
else:
    clip_loss = -torch.min(surr1, surr2).mean()
Is this implementation correct? The paper arXiv:1912.09729 applies the dual clipping
max(min(surr1, surr2), c * adv)
only when the advantage (\hat{A} in the paper) is negative. If \hat{A} is positive, c * \hat{A} can be much larger than both surr1 and surr2 (note that c is always > 1.0), so taking max(..., c * adv) produces a disproportionately large policy loss. Should this be changed to something like the following?
if self._dual_clip:
    clip1 = torch.min(surr1, surr2)
    clip2 = torch.max(clip1, self._dual_clip * b.adv)
    clip_loss = -torch.where(b.adv < 0, clip2, clip1).mean()
else:
    clip_loss = -torch.min(surr1, surr2).mean()
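For reference, here is a minimal, self-contained sketch that compares the two formulations on a toy batch. It is not tianshou code; the ratios, advantages, eps_clip and dual_clip values are made up purely to illustrate the difference when some advantages are positive.

import torch

# Toy batch: ratios near 1 and a mix of positive / negative advantages (made-up numbers).
ratio = torch.tensor([1.2, 0.7, 1.1, 0.9])
adv = torch.tensor([5.0, -3.0, 2.0, -1.0])
eps_clip, dual_clip = 0.2, 3.0  # dual_clip (c in the paper) is always > 1.0

surr1 = ratio * adv
surr2 = ratio.clamp(1.0 - eps_clip, 1.0 + eps_clip) * adv
clip1 = torch.min(surr1, surr2)

# Current implementation: the lower bound c * adv is applied unconditionally.
loss_current = -torch.max(clip1, dual_clip * adv).mean()

# Proposed implementation: the lower bound is only applied where adv < 0.
clip2 = torch.max(clip1, dual_clip * adv)
loss_proposed = -torch.where(adv < 0, clip2, clip1).mean()

print(loss_current, loss_proposed)
# For the positive-advantage samples, c * adv (15.0 and 6.0) dominates surr1/surr2,
# so the current loss (-4.425) is pulled far more negative than the proposed one (-1.225).

On the negative-advantage samples the two versions agree, which is the case the paper's dual clipping is meant to address.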