Add discrete Conservative Q-Learning for offline RL #359
Conversation
nuance1979 commented on May 5, 2021
- A tianshou implementation of discrete Conservative Q-Learning (CQL) for offline/batch RL (the objective is sketched below)
- Reference: paper and code
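For context, the objective as I understand it from the reference: discrete CQL(H) just adds a conservative regularizer, weighted by α (the `--min-q-weight` argument used below), on top of the base QRDQN loss:

```
\min_\theta \; \alpha \, \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\log \sum_{a'} \exp Q_\theta(s,a') \;-\; Q_\theta(s,a)\Big]
  \;+\; \mathcal{L}_{\mathrm{QRDQN}}(\theta)
```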
Codecov Report

|          | master | #359   | +/-    |
| -------- | ------ | ------ | ------ |
| Coverage | 94.21% | 94.55% | +0.33% |
| Files    | 53     | 54     | +1     |
| Lines    | 3472   | 3505   | +33    |
| Hits     | 3271   | 3314   | +43    |
| Misses   | 201    | 191    | -10    |
| Task | online QRDQN | behavioral | CQL | parameters |
| ---- | ------------ | ---------- | --- | ---------- |
| BreakoutNoFrameskip-v4 | 394.3 | 46.9 | 248.3 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.hdf5 --epoch 12 --min-q-weight 50` |
I'd like to suggest using `--eps-test 0.9` or `--eps-test 0.99` when generating the buffer to see what happens, because in the original paper the authors used 1% and 10% of the expert data to train CQL and still got good results (Table 3).
I've tried `--eps-test 0.9` on Pong and Breakout. Pong can easily achieve +20 reward but it is not very stable; Breakout cannot achieve >20 reward with `min-q-weight` at either 10 or 50. I also tested with only the first 10% of the random data instead of the mixed 10% data; the results are the same. Could you please help check what's going wrong?
I tried `--eps-test 0.9` and `--eps-test 0.99` and got terrible results; the agent basically failed to learn. Then I checked this paper, where the data were generated, and found the following in Section 6:

> We train offline QR-DQN and REM with reduced data obtained via randomly subsampling entire trajectories from the logged DQN experiences, thereby maintaining the same data distribution. Figure 6 presents the performance of the offline REM and QR-DQN agents with N% of the tuples in the DQN replay dataset where N ∈ {1, 10, 20, 50, 100}.

So they were simply using less data, as opposed to using worse data. I therefore generated 1% and 10% of the 1M data, i.e., 10k and 100k transitions, and tuned the parameters a little. Indeed, smaller datasets need a smaller `--min-q-weight` to work. (I initially tried to subsample from my 1M buffer data but couldn't get the format right.) These results are within expectations, IMHO.
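For the record, trajectory-level subsampling along the lines of that quote could look roughly like the sketch below. It operates on hypothetical flat NumPy arrays (`obs`, `act`, `rew`, `done`), not on tianshou's actual HDF5 buffer layout (which is the part I couldn't get right):

```python
import numpy as np

def subsample_trajectories(obs, act, rew, done, fraction=0.1, seed=0):
    """Keep a random `fraction` of whole trajectories (episode boundaries
    taken from `done`), so the data distribution itself is unchanged."""
    rng = np.random.default_rng(seed)
    ends = np.flatnonzero(done) + 1                  # index just past each episode end
    episodes = np.split(np.arange(len(done)), ends)  # per-episode index arrays
    episodes = [ep for ep in episodes if len(ep) > 0]
    n_keep = max(1, int(fraction * len(episodes)))
    keep = rng.choice(len(episodes), size=n_keep, replace=False)
    idx = np.concatenate([episodes[i] for i in sorted(keep)])
    return obs[idx], act[idx], rew[idx], done[idx]
```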
| Task | online QRDQN | behavioral | CQL | parameters |
| ---- | ------------ | ---------- | --- | ---------- |
| PongNoFrameskip-v4 | 20.5 | nan | 1.8 (epoch 5) | `python3 atari_cql.py --task "PongNoFrameskip-v4" --load-buffer-name log/PongNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 5 --min-q-weight 1` |
| BreakoutNoFrameskip-v4 | 394.3 | 31.7 | 22.5 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 12 --min-q-weight 10` |
There's no need to stop at epoch 5 and 12; that's too few and it hasn't converged yet. In fact, the original paper indicates they used ~25 iterations to reach optimal performance on Breakout. Would that be about 25 * 1M / 64 / 10000 ≈ 39 epochs in our setting?
I understand that epoch 5 and 12 are arbitrary; I just wanted a rough comparison with the BCQ results above. I reran the two Breakout runs to epoch 40 and the test results were worse than at epoch 12. (However, I do think evaluating over only 10 episodes is too few to smooth out the randomness.) The learning curves show that the loss seemed to be behaving, but the test rewards were all over the place.
I double-checked my implementation against the reference but couldn't find any errors. (It is essentially three lines of code on top of QRDQN anyway; see the sketch below.)
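For reviewers, a minimal sketch of that change (the names here are illustrative, not the exact code in this PR): on top of the usual QRDQN loss, discrete CQL adds the logsumexp regularizer scaled by `--min-q-weight`:

```python
import torch

def discrete_cql_loss(q_values: torch.Tensor, acts: torch.Tensor,
                      qrdqn_loss: torch.Tensor, min_q_weight: float) -> torch.Tensor:
    """q_values: (batch, num_actions) Q estimates (mean over quantiles for QRDQN);
    acts: (batch,) long tensor of actions stored in the offline buffer."""
    # CQL(H) regularizer: push down logsumexp over all actions while
    # pushing up the Q value of the action actually taken in the data.
    q_data = q_values.gather(1, acts.unsqueeze(1)).squeeze(1)
    cql_term = (torch.logsumexp(q_values, dim=1) - q_data).mean()
    return qrdqn_loss + min_q_weight * cql_term
```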
Is this TensorBoard or some other visualization tool? I'm quite curious.
It's TensorBoard. I usually run it like this: `tensorboard --logdir log/BreakoutNoFrameskip-v4/cql/ --host $(hostname -i) --port 8086`, then click the printed server address to open it in a browser.
I don't know. I had many runs in my directory, so TensorBoard used all kinds of colors, and it just happened that the two runs I cared about got nice ones.
Co-authored-by: Yi Su <yi.su@antgroup.com>
Co-authored-by: Yi Su <yi.su@antfin.com>