Add discrete Conservative Q-Learning for offline RL #359
Conversation
nuance1979 commented on May 5, 2021
- A tianshou implementation of discrete Conservative Q-Learning (CQL) for offline/batch RL (the objective is sketched below)
- Reference: paper and code
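For context, the objective as I understand it from the reference: discrete CQL(H) just adds a conservative regularizer, weighted by α (the `--min-q-weight` argument used below), on top of the base QRDQN loss:

```
\min_\theta \; \alpha \, \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\log \sum_{a'} \exp Q_\theta(s,a') \;-\; Q_\theta(s,a)\Big]
  \;+\; \mathcal{L}_{\mathrm{QRDQN}}(\theta)
```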
Codecov Report

|          | master | #359   | +/-    |
| -------- | ------ | ------ | ------ |
| Coverage | 94.21% | 94.55% | +0.33% |
| Files    | 53     | 54     | +1     |
| Lines    | 3472   | 3505   | +33    |
| Hits     | 3271   | 3314   | +43    |
| Misses   | 201    | 191    | -10    |
| Task | online QRDQN | behavioral | CQL | parameters |
| ---- | ------------ | ---------- | --- | ---------- |
| BreakoutNoFrameskip-v4 | 394.3 | 46.9 | 248.3 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.hdf5 --epoch 12 --min-q-weight 50` |
I'd like to suggest using `--eps-test 0.9` or `--eps-test 0.99` when generating the buffer to see what happens, because in the original paper the authors used 1% and 10% of the expert data to train CQL and still got good results (Table 3).
I've tried `--eps-test 0.9` on Pong and Breakout. Pong can easily achieve +20 reward but it is not very stable; Breakout cannot achieve >20 reward with `min-q-weight` at either 10 or 50. I also tested with only the first 10% of the random data instead of the mixed 10% data; the results are the same. Could you please help check what's going wrong?
I tried `--eps-test 0.9` and `--eps-test 0.99` and got terrible results; the agent basically failed to learn. Then I checked this paper, where the data were generated, and found the following in Section 6:

> We train offline QR-DQN and REM with reduced data obtained via randomly subsampling entire trajectories from the logged DQN experiences, thereby maintaining the same data distribution. Figure 6 presents the performance of the offline REM and QR-DQN agents with N% of the tuples in the DQN replay dataset where N ∈ {1, 10, 20, 50, 100}.

So they were simply using less data, as opposed to using worse data. I therefore generated 1% and 10% of the 1M data, i.e., 10k and 100k transitions, and tuned the parameters a little. Indeed, smaller datasets need a smaller `--min-q-weight` to work. (I initially tried to subsample from my 1M buffer data but couldn't get the format right.) These results are within expectations, IMHO.
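For the record, trajectory-level subsampling along the lines of that quote could look roughly like the sketch below. It operates on hypothetical flat NumPy arrays (`obs`, `act`, `rew`, `done`), not on tianshou's actual HDF5 buffer layout (which is the part I couldn't get right):

```python
import numpy as np

def subsample_trajectories(obs, act, rew, done, fraction=0.1, seed=0):
    """Keep a random `fraction` of whole trajectories (episode boundaries
    taken from `done`), so the data distribution itself is unchanged."""
    rng = np.random.default_rng(seed)
    ends = np.flatnonzero(done) + 1                  # index just past each episode end
    episodes = np.split(np.arange(len(done)), ends)  # per-episode index arrays
    episodes = [ep for ep in episodes if len(ep) > 0]
    n_keep = max(1, int(fraction * len(episodes)))
    keep = rng.choice(len(episodes), size=n_keep, replace=False)
    idx = np.concatenate([episodes[i] for i in sorted(keep)])
    return obs[idx], act[idx], rew[idx], done[idx]
```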
| Task | online QRDQN | behavioral | CQL | parameters |
| ---- | ------------ | ---------- | --- | ---------- |
| PongNoFrameskip-v4 | 20.5 | nan | 1.8 (epoch 5) | `python3 atari_cql.py --task "PongNoFrameskip-v4" --load-buffer-name log/PongNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 5 --min-q-weight 1` |
| BreakoutNoFrameskip-v4 | 394.3 | 31.7 | 22.5 (epoch 12) | `python3 atari_cql.py --task "BreakoutNoFrameskip-v4" --load-buffer-name log/BreakoutNoFrameskip-v4/qrdqn/expert.size_1e4.hdf5 --epoch 12 --min-q-weight 10` |
There's no need to stop at epoch 5 and 12; that's too few and it hasn't converged yet. In fact, the original paper indicates they used ~25 iterations to reach optimal performance on Breakout. Would that be about 25 * 1M / 64 / 10000 ≈ 39 epochs in our setting?
I understand that epoch 5 and 12 are arbitrary; I just wanted a rough comparison with the BCQ results above. I reran the two Breakout runs to epoch 40 and the test results were worse than at epoch 12. (However, I do think evaluating over only 10 episodes is too few to smooth out the randomness.) The learning curves show that the loss seemed to be behaving, but the test rewards were all over the place.
I double-checked my implementation against the reference but couldn't find any errors. (It is essentially three lines of code on top of QRDQN anyway; see the sketch below.)
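For reviewers, a minimal sketch of that change (the names here are illustrative, not the exact code in this PR): on top of the usual QRDQN loss, discrete CQL adds the logsumexp regularizer scaled by `--min-q-weight`:

```python
import torch

def discrete_cql_loss(q_values: torch.Tensor, acts: torch.Tensor,
                      qrdqn_loss: torch.Tensor, min_q_weight: float) -> torch.Tensor:
    """q_values: (batch, num_actions) Q estimates (mean over quantiles for QRDQN);
    acts: (batch,) long tensor of actions stored in the offline buffer."""
    # CQL(H) regularizer: push down logsumexp over all actions while
    # pushing up the Q value of the action actually taken in the data.
    q_data = q_values.gather(1, acts.unsqueeze(1)).squeeze(1)
    cql_term = (torch.logsumexp(q_values, dim=1) - q_data).mean()
    return qrdqn_loss + min_q_weight * cql_term
```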
Is this TensorBoard or some other visualization tool? I'm quite curious.
It's TensorBoard. I usually run it like this: `tensorboard --logdir log/BreakoutNoFrameskip-v4/cql/ --host $(hostname -i) --port 8086`, then click the printed server address to open it in a browser.
I don't know. I had many runs in my directory, so TensorBoard used all kinds of colors, and it just happened that the two runs I cared about got nice ones.
Co-authored-by: Yi Su <yi.su@antgroup.com>
Co-authored-by: Yi Su <yi.su@antfin.com>