-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Method to compute actions from observations #991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Method to compute actions from observations #991
Conversation
39ad259 to
bf9c971
Compare
dad715a to
ad196d0
Compare
|
The cql integration test is failing, as it is on master currently. I don't know why, and it doesn't happen locally.. Any idea, @Trinkle23897 ? |
17e7bcf to
010cf32
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #991 +/- ##
==========================================
+ Coverage 88.06% 88.11% +0.05%
==========================================
Files 96 96
Lines 7505 7512 +7
==========================================
+ Hits 6609 6619 +10
+ Misses 896 893 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
We set an arbitrary threshold here and check to see if after 5 iterations the model can beat it. The difference is most probably due to floating point math error but could be a hidden bug. I'd suggest lowering the threshold to -1210 to let it pass for now and revisit it as part of the benchmarking work in the future. |
|
I'd suggest changing seed but I'm fine with lowering threshold |
fa9a45d to
a216c14
Compare
|
@Trinkle23897 I removed the Would be nice to merge this one soon, as we need the |
This PR adds a new method for getting actions from an env's observation and info. This is useful for standard inference and stands in contrast to batch-based methods that are currently used in training and evaluation. Without this, users have to do some kind of gymnastics to actually perform inference with a trained policy. I have also added a test for the new method.
In future PRs, this method should be included in the examples (in the the "watch" section).
To add this required improving multiple typing things and, importantly, simplifying the signature of
forwardin many policies! This is a breaking change, but it will likely affect no users. Theinputparameter of forward was a rather hacky mechanism, I believe it is good that it's gone now. It will also help with #948 .The main functional change is the addition of
compute_actiontoBasePolicy.Other minor changes:
max_action_numCloses #981