Improve offline algo performance #1261
Open
For @arnaujc91
Background: currently, the pre-processing of most offline learning algorithms is done in `_preprocess_batch`, which is highly suboptimal. Instead, it should be done in `process_buffer`. In this PR, a general class is introduced that implements `process_buffer` using `_preprocess_batch`, which allows converting an `OffPolicyAlgorithm` into an efficient `OfflineAlgorithm`.
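Roughly, the intended mechanism looks like the following minimal sketch (an illustration of the idea only, not the actual Tianshou implementation; the stand-in classes and the `data` attribute used here are hypothetical, and only the method names `_preprocess_batch` and `process_buffer` come from this description):

```python
# Minimal sketch of the adapter idea, NOT the actual Tianshou implementation.
# The stand-in classes below (SimpleBuffer, the bare OffPolicyAlgorithm) and the
# `data` attribute are hypothetical; only the method names _preprocess_batch and
# process_buffer come from the PR description.
from typing import Any


class SimpleBuffer:
    """Hypothetical stand-in for a replay buffer holding one batch of transitions."""

    def __init__(self, data: dict[str, Any]) -> None:
        self.data = data


class OffPolicyAlgorithm:
    """Hypothetical stand-in for an off-policy algorithm with per-batch preprocessing."""

    def _preprocess_batch(self, batch: dict[str, Any], buffer: SimpleBuffer) -> dict[str, Any]:
        # In the real algorithms this computes e.g. returns or normalization targets.
        return batch


class OfflineAlgorithmFromOffPolicyAlgorithm(OffPolicyAlgorithm):
    """Adapter sketch: run the preprocessing once, up front, on the whole buffer."""

    def process_buffer(self, buffer: SimpleBuffer) -> SimpleBuffer:
        # Apply the (formerly per-minibatch) preprocessing a single time to all the
        # data the buffer manages, and write the result back so that later sampling
        # during offline training needs no further preprocessing.
        buffer.data = self._preprocess_batch(buffer.data, buffer)
        return buffer
```

The delicate step is writing the preprocessed data back into the buffer's managed batch, which is the part described as hacky below.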
Current status: when using it to improve the performance of `TD3BC`, the algorithm converges and the tests pass, but the determinism test fails, meaning something changed in the processing, or at least in the random number generation. If it's the latter, the failure is not a problem, but I currently don't see why anything RNG-related should have changed.

The implementation that changes the buffer's managed batch is rather hacky. I suspect something goes wrong with the indexing, but after 20 minutes of debugging I haven't yet pinned down what causes the determinism test to fail. Understanding this will require reading through `sample_indices` in `ReplayBuffer` and `ReplayBufferManager`. Note that there is an inconsistency in how `sample_indices(None)` is handled between the two, but it shouldn't play a role for this PR.
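For whoever picks this up, one quick way to eyeball that difference is a throwaway script along these lines (a sketch assuming the dev-v2 buffer API; `VectorReplayBuffer` is used here as the stock `ReplayBufferManager` subclass, and the exact keys `add` expects may differ slightly):

```python
# Throwaway inspection script, assuming the dev-v2 buffer API; not part of the PR.
import numpy as np
from tianshou.data import Batch, ReplayBuffer, VectorReplayBuffer

# Plain ReplayBuffer with a few transitions.
buf = ReplayBuffer(size=10)
for i in range(4):
    buf.add(
        Batch(obs=i, act=i, rew=float(i), terminated=False, truncated=False, obs_next=i + 1, info={})
    )

# VectorReplayBuffer is a ReplayBufferManager subclass; add transitions for both sub-buffers.
vbuf = VectorReplayBuffer(total_size=10, buffer_num=2)
for i in range(2):
    vbuf.add(
        Batch(
            obs=np.array([i, i]),
            act=np.array([i, i]),
            rew=np.array([float(i), float(i)]),
            terminated=np.array([False, False]),
            truncated=np.array([False, False]),
            obs_next=np.array([i + 1, i + 1]),
        ),
        buffer_ids=[0, 1],
    )

# Compare how "sample everything" is interpreted by the two implementations.
print("ReplayBuffer.sample_indices(None):       ", buf.sample_indices(None))
print("ReplayBufferManager.sample_indices(None):", vbuf.sample_indices(None))
```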
In the course of this PR, determinism snapshots should be created from the dev-v2 branch; then, after switching to this branch, the determinism test of TD3BC should succeed. See the docstring of `AlgorithmDeterminismTest` in `determinism_test.py` for further details.
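Conceptually, the snapshot workflow amounts to something like the sketch below (purely illustrative; the actual mechanics are whatever `AlgorithmDeterminismTest` implements, and every helper name here is hypothetical):

```python
# Hypothetical illustration of a snapshot-based determinism check; the real logic
# lives in AlgorithmDeterminismTest (determinism_test.py) and differs in detail.
import hashlib
import json
from pathlib import Path


def fingerprint(history: list[float]) -> str:
    """Reduce a seeded training run (e.g. per-update loss values) to a stable hash."""
    payload = json.dumps([round(value, 10) for value in history]).encode()
    return hashlib.sha256(payload).hexdigest()


def record_snapshot(name: str, history: list[float], directory: Path) -> None:
    """Step 1: create the reference snapshot (run from the dev-v2 branch)."""
    (directory / f"{name}.snapshot").write_text(fingerprint(history))


def check_against_snapshot(name: str, history: list[float], directory: Path) -> bool:
    """Step 2: on this branch, re-run with the same seed and compare."""
    expected = (directory / f"{name}.snapshot").read_text()
    return fingerprint(history) == expected
```

If the refactoring only changed the order in which random numbers are drawn, the runs will legitimately diverge, which is the benign failure mode mentioned above.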
The PR is finished when all relevant offline algorithms inherit from `OfflineAlgorithmFromOffPolicyAlgorithm` and either the determinism tests pass, or a source of different random number generation caused by the refactoring has been identified.