
Add discrete Conservative Q-Learning for offline RL #359


Merged: 13 commits merged into thu-ml:master on May 12, 2021

Conversation

nuance1979 (Collaborator):

  • A tianshou implementation of the discrete Conservative Q-Learning for offline/batch RL
  • Reference: paper and code

Trinkle23897 linked an issue on May 5, 2021 that may be closed by this pull request.
Trinkle23897 (Collaborator):

pytest collects test scripts under the same directory in alphabetical order (for now), so cql would run before dqn. You can either modify the code to add a test dependency/ordering in pytest, or change the filename to test_il_cql.
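(One possible way to make that ordering explicit, as a rough sketch assuming the third-party pytest-order plugin is installed; the test names below are illustrative placeholders, not the actual tianshou test files.)

```python
import pytest

# Rough sketch assuming the third-party `pytest-order` plugin;
# the test names are illustrative, not the real files in the test directory.

@pytest.mark.order(1)
def test_qrdqn_generates_buffer():
    ...  # would train the agent and save the expert buffer to disk

@pytest.mark.order(2)
def test_discrete_cql_uses_buffer():
    ...  # would load the buffer written by the test above
```

Renaming the file so that alphabetical collection already puts it after the buffer-generating test is the simpler fix and avoids the extra plugin.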


codecov-commenter commented May 5, 2021

Codecov Report

Merging #359 (8335a69) into master (84f5863) will increase coverage by 0.33%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #359      +/-   ##
==========================================
+ Coverage   94.21%   94.55%   +0.33%     
==========================================
  Files          53       54       +1     
  Lines        3472     3505      +33     
==========================================
+ Hits         3271     3314      +43     
+ Misses        201      191      -10     
| Flag | Coverage Δ |
|------|------------|
| unittests | 94.55% <100.00%> (+0.33%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| tianshou/policy/__init__.py | 100.00% <100.00%> (ø) |
| tianshou/policy/imitation/discrete_cql.py | 100.00% <100.00%> (ø) |
| tianshou/policy/modelfree/trpo.py | 93.33% <100.00%> (ø) |
| tianshou/policy/modelfree/npg.py | 98.85% <0.00%> (+1.14%) ⬆️ |
| tianshou/utils/log_tools.py | 93.33% <0.00%> (+10.00%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84f5863...8335a69.

Comment on lines +84 to +85
| BreakoutNoFrameskip-v4 | 394.3 | 46.9 | 248.3 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.hdf5 --epoch 12 --min-q-weight 50` |
Trinkle23897 (Collaborator) commented May 7, 2021:

I'd like to suggest using --eps-test 0.9 or --eps-test 0.99 when generating the buffer to see what happens, because in the original paper the authors used 1% and 10% of the expert data to train CQL and got good results (Table 3).

Trinkle23897 (Collaborator) commented May 10, 2021:

I've tried --eps-test 0.9 on Pong and Breakout. Pong can easily reach +20 reward, but it is not very stable; Breakout cannot reach >20 reward with min-q-weight set to either 10 or 50. I also tested with only the first 10% of the random data instead of the mixed 10% data, and the results are the same. Could you please help check what's going wrong?

nuance1979 (Collaborator, Author):

I tried --eps-test 0.9 and --eps-test 0.99 and got terrible results; it basically failed to learn. Then I checked the paper where the data were generated and found this in Section 6: "We train offline QR-DQN and REM with reduced data obtained via randomly subsampling entire trajectories from the logged DQN experiences, thereby maintaining the same data distribution. Figure 6 presents the performance of the offline REM and QR-DQN agents with N% of the tuples in the DQN replay dataset where N ∈ {1, 10, 20, 50, 100}."

So they were simply using less data, as opposed to using worse data. I therefore generated 1% and 10% of the 1M data, i.e., 10k and 100k transitions, and tuned the parameters a little. Indeed, smaller datasets need a smaller --min-q-weight to work. (I initially tried to sample from my 1M buffer data but couldn't get the format right.) These results are within expectations, IMHO.
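(For reference, a rough sketch of one way such subsampling could be done with tianshou's ReplayBuffer HDF5 helpers; the paths, the 10% ratio, and the exact add() call are assumptions and may need adjusting for the tianshou version in use.)

```python
import numpy as np
from tianshou.data import ReplayBuffer

# Rough sketch; the path is illustrative and assumes the buffer was written
# with ReplayBuffer.save_hdf5 (as in the atari_qrdqn example).
full_buf = ReplayBuffer.load_hdf5("log/BreakoutNoFrameskip-v4/qrdqn/expert.hdf5")

# Keep only the first 10% of transitions in a fresh, smaller buffer.
small_size = len(full_buf) // 10
small_buf = ReplayBuffer(small_size)
for i in range(small_size):
    # full_buf[np.array([i])] yields a one-transition Batch with the reserved
    # keys (obs/act/rew/done/obs_next/info); buffer_ids=[0] tells add() that
    # the batch has a leading dimension of size one.
    small_buf.add(full_buf[np.array([i])], buffer_ids=[0])

small_buf.save_hdf5("log/BreakoutNoFrameskip-v4/qrdqn/expert.size_1e5.hdf5")
```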

Comment on lines +100 to +101
| PongNoFrameskip-v4 | 20.5 | nan | 1.8 (epoch 5) | `python3 atari_cql.py --task "PongNoFrameskip-v4" --load-buffer-name log/PongNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 5 --min-q-weight 1` |
| BreakoutNoFrameskip-v4 | 394.3 | 31.7 | 22.5 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 12 --min-q-weight 10` |
Trinkle23897 (Collaborator):

There's no need to stop at epoch 5 and 12: that's too few epochs and the model has not converged yet. In fact, the original paper indicates that they use ~25 iterations to reach optimal performance in Breakout; would that be (25*1M/64/10000 = 39) epochs in our setting?

nuance1979 (Collaborator, Author):

I understand that epoch 5 and 12 are arbitrary; I just wanted a rough comparison with the BCQ results above. I reran the two Breakout runs to epoch 40 and the test results were worse than at epoch 12. (However, I do think an evaluation protocol of only 10 episodes is too small to smooth out the randomness.) The learning curves show that the loss seemed well behaved but the test rewards were all over the place:

[Screenshot: learning curves, Screen Shot 2021-05-11 at 11 15 23 AM]

I double-checked my implementation against the reference but couldn't find any errors. (It is essentially 3 lines of code on top of QRDQN anyway.)
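(For context, the extra term is the discrete CQL(H) regularizer added on top of the QR-DQN loss; below is a schematic PyTorch sketch of the idea, with illustrative names, not the exact lines in tianshou/policy/imitation/discrete_cql.py.)

```python
import torch

def cql_loss(quantiles: torch.Tensor, acts: torch.Tensor,
             qrdqn_loss: torch.Tensor, min_q_weight: float) -> torch.Tensor:
    """Schematic discrete CQL(H) objective on top of the QR-DQN loss.

    quantiles: (batch, n_actions, n_quantiles) predicted quantile values
    acts: (batch,) LongTensor of actions taken in the logged data
    qrdqn_loss: the usual quantile-regression TD loss (a scalar tensor)
    """
    q = quantiles.mean(dim=2)  # expected Q-values, shape (batch, n_actions)
    # Push down Q-values over all actions (logsumexp) while pushing up the
    # Q-values of the actions actually present in the dataset.
    cql_term = (torch.logsumexp(q, dim=1)
                - q.gather(1, acts.unsqueeze(1)).squeeze(1)).mean()
    return qrdqn_loss + min_q_weight * cql_term
```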

Trinkle23897 (Collaborator):

Is this TensorBoard or some other visualization tool? I'm quite curious.

nuance1979 (Collaborator, Author):

It's TensorBoard. I usually run it like this: `tensorboard --logdir log/BreakoutNoFrameskip-v4/cql/ --host $(hostname -i) --port 8086`, then click the printed server address to open it in a browser.

Trinkle23897 (Collaborator) commented May 12, 2021:

May I ask how to change the CSS style? Mine is orange and looks quite different from yours:
[Screenshot: 2021-05-12 09-09-58]

nuance1979 (Collaborator, Author):

I don't know. I had many runs in my log directory, so TensorBoard used all kinds of colors, and it just happened that the two runs I cared about got nice colors.

Trinkle23897 previously approved these changes on May 11, 2021.
Trinkle23897 merged commit b5c3dda into thu-ml:master on May 12, 2021.
nuance1979 deleted the cql branch on October 6, 2021.
BFAnas pushed a commit to BFAnas/tianshou that referenced this pull request on May 5, 2024.

Successfully merging this pull request may close these issues.

Conservative Q-Learning (CQL) for offline/batch reinforcement learning