[WIP] Hindsight Experience Replay Transform #1819

BY571 · 2024-01-19T15:27:33Z

Description

Adds Hindsight Experience Replay (HER) Transform

Motivation and Context

The first draft for the HER transform. However, I am not sure if it should be a Transform or if we create an extra Augmentation class as we are not transforming a single element in the tensordict but augmenting existing collection data. Could be interesting for future "data augmentation strategies", which I think we do not have until now.

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)
Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide (required)
My change requires a change to the documentation.
I have updated the tests accordingly (required for a bug fix or a new feature).
I have updated the documentation accordingly.

pytorch-bot · 2024-01-19T15:27:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/1819

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 19 Unrelated Failures

As of commit 90eef75 with merge base 57139bd ():

NEW FAILURES - The following jobs have failed:

Wheels / build-wheel-mac (3.10, 3.10.3) (gh)
##[error]The operation was canceled.
Wheels / build-wheel-mac (3.8, 3.8) (gh)
##[error]The operation was canceled.
Wheels / build-wheel-mac (3.9, 3.9) (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

Continuous Benchmark (PR) / CPU Pytest benchmark (gh)
Workflow failed! Resource not accessible by integration
Continuous Benchmark (PR) / GPU Pytest benchmark (gh)
Workflow failed! Resource not accessible by integration
Examples Tests on Linux / tests (3.9, 12.1) / linux-job (gh)
RuntimeError: Command docker exec -t 184df0bbc8416d4f6dffaaa35d1ffa8c1c3da392c7de351de1865a4a86e8a267 /exec failed with exit code 1
Habitat Tests on Linux / tests (3.9, 11.6) / linux-job (gh)
RuntimeError: Command docker exec -t 3258c420522b3517fb8db3ff7719123fd31be6c04aff78f84940c3720934e876 /exec failed with exit code 139
Lint / python-source-and-configs / linux-job (gh)
RuntimeError: Command docker exec -t 674e3f0b4d94994149bc3c569aff7186428763cc7b6d437e2b4a4aadf9cab60b /exec failed with exit code 1
Unit-tests on Linux / tests-cpu (3.10) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-cpu (3.11) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-cpu (3.8) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-cpu (3.9) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-gpu (3.8, 12.1) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-olddeps (3.8, 11.6) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-optdeps (3.9, 12.1) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Linux / tests-stable-gpu (3.8, 11.8) / linux-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on MacOS CPU / tests (3.11) / macos-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on MacOS CPU / tests (3.8) / macos-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Windows / unittests-cpu / windows-job (gh)
test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
Unit-tests on Windows / unittests-gpu / windows-job (gh)
##[error]The operation was canceled.
Wheels / build-wheel-mac (3.11, 3.11) (gh)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/torch/include/ATen/core/function_schema.h:603:46: error: 'value' is unavailable: introduced in macOS 10.13

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Libs Tests on Linux / unittests-sklearn (3.9, 12.1) / linux-job (gh)
test/test_libs.py::TestOpenML::test_data[mushroom_onehot]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

BY571 · 2024-01-19T15:30:26Z

torchrl/envs/transforms/transforms.py

+
+        augmentation_td = TensorDict(
+            {
+                "observation": sampled_td.get("observation").repeat_interleave(


If we keep it a transform we probably need to specify all those tensordict keys ... Not sure what a better alternative would be. Any idea?

BY571 · 2024-01-19T15:34:03Z

torchrl/envs/transforms/transforms.py

+
+    def _inv_call(self, tensordict: TensorDictBase) -> TensorDictBase:
+        augmentation_td = self.her_augmentation(tensordict)
+        return torch.cat([tensordict, augmentation_td], dim=0)


As explained above. It doesnt feel like a transform as we create a new tensordict and have to combine original and augmented data before adding to the replay buffer. I think ideally the "augmentations" would done directly after the collection. So as a postproc for collectors or as here in the example as an inverse_transform for the replay buffer.

Why not a transform? I think it's pretty neat to use a transform. Who said a transform had to change things in-place?
Our API to modify samples at writing time is to use either a transform or a different writer. If you think this can be achieved with a writer I'm on board. But I don't think there's anything wrong with the transform.

An advantage of using a writer instead os that it feels more natural (transforms can be used with envs unless specified otherwise, writers are dedicated to RBs)

vmoens

@ahmed-touati suggested we use a sampler for this rather than a transform. I'm not strongly opinionated on the matter, mostly because I need more context on what we're trying to achieve here.
Can you elaborate a bit more on what this transform does, maybe with a bunch of examples?

BY571 · 2024-02-08T14:34:21Z

@ahmed-touati suggested we use a sampler for this rather than a transform. I'm not strongly opinionated on the matter, mostly because I need more context on what we're trying to achieve here. Can you elaborate a bit more on what this transform does, maybe with a bunch of examples?

So HER is mainly used in goal-conditioned RL with sparse reward signals where the agent has to reach/achieve a goal state and only gets a reward (+1) when the goal state is achieved, otherwise no reward. The observation consists of three elements: the observation the agent sees, the state the agent had (could be x,y,z position), and the goal state the agent should reach (x,y,z). A typical task could be a robot that has to reach a goal position. The observation will include the agent position but its mostly added as additional information also helps here for understanding.

Now as we have a sparse reward function most of the trajectories will have no learning signal for the agent as it might not be possible for the agent to reach the goal position randomly or by pure luck.
What HER now does is that for each step in a trajectory that you want to add to the buffer, you sample a new goal state and pretend that this was the actual goal the agent had to reach.

So lets say you have a real transition (obs, action, reward, done, next obs, achieved_position, goal_position) for this tuple you now want to sample a new goal_position and then calculate the reward based on this new goal_position and the real achieved_position. So you then add the real transition (obs, action, reward, done, next obs, achieved_position, goal_position) but also the HER augmented transition (obs, action, new_reward, done, new next obs, achieved_position, new_goal_position). The sampling can happen in different ways but is not important for now. However, I think important will be that we need the reward function, Im not sure if we can pass it to the writer/sampler for the buffer, that's why my first thought was a transform. Most of the time the reward function might just be Euclidean distance but maybe for other tasks the user needs to provide a more sophisticated reward function.

vmoens · 2024-02-08T15:17:12Z

However, I think important will be that we need the reward function, Im not sure if we can pass it to the writer/sampler for the buffer, that's why my first thought was a transform.

Why not? I would guess that even if it's a complex nn.Module you can still do pretty much everything with a well tailored function (at least nothing less than with a transform).

vmoens · 2024-02-08T15:17:20Z

Thanks for the context btw!

BY571 · 2024-02-22T09:24:45Z

Why not? I would guess that even if it's a complex nn.Module you can still do pretty much everything with a well tailored function (at least nothing less than with a transform).

Revisiting this I think it would make much more sense to do it with a writer. We want to augment current incoming data with new sampled goal states and store them all together in the buffer. I think this would be generally a good way to add other data augmentation strategies with writer instead of transforms. Having a closer look right now on the writer classes and will update the code here

BY571 · 2024-02-22T09:28:54Z

But this would not allow us to stack multiple augmentations on top of each other... so maybe not that ideal for augmentations

vmoens · 2024-02-22T15:53:24Z

You could still transform your data before passing it to the writer, but not after

dtsaras

While hindsight experience replay is pretty useful, I think it falls under the category of specialized algorithm rather than a building block @vmoens

dtsaras · 2024-04-12T08:00:56Z

torchrl/envs/transforms/transforms.py

+            new_goals.append(splitted_achieved_goals[i][ids])
+
+        # calculate rewards given new desired goals and old achieved goals
+        vmap_rewards = torch.vmap(distance_reward_function)


I think you wanna call self.reward_function instead of distance_reward_function. Also maybe the reward_function should be a TensorDictModule such it can be more easily customized for a given environment. There is torchrl.modules.VmapModule for wrapping TensorDictModules with vmap.

dtsaras · 2024-04-12T08:04:45Z

torchrl/envs/transforms/transforms.py

+        cat_rewards = torch.cat(rewards).reshape(b, t, self.samples, -1).squeeze(-1)
+        cat_new_goals = torch.cat(new_goals).reshape(b, t, self.samples, -1)
+
+        augmentation_td = TensorDict(


I think the augmentation_td should still maintain other metadata that are related to the state rather than selecting only the keys: observation, action, terminated, truncated, ...

vmoens · 2024-04-12T11:27:19Z

Not sure about that one.
We've had that request already 3 times but for 3 different purposes so if there's a way to make it a modular component of the lib I'd like to consider that above a script that is harder to reuse (and more error prone on the user side)

cedricgrothues · 2024-12-15T09:56:46Z

Are you planning to continue working on this PR? If not, I'd be happy to help out and look into finishing it myself.

Additionally, @vmoens, do you now have a clearer sense of whether this should be implemented as a transform, writer, or sampler? It would be great to make sure we're aligned on the best approach before moving forward.

Looking forward to your thoughts!

dtsaras · 2024-12-15T12:01:13Z

Are you planning to continue working on this PR? If not, I'd be happy to help out and look into finishing it myself.

Additionally, @vmoens, do you now have a clearer sense of whether this should be implemented as a transform, writer, or sampler? It would be great to make sure we're aligned on the best approach before moving forward.

Looking forward to your thoughts!

I have an already complete version of it in a gist but I have not made a PR for it yet. Maybe you can have a look and provide feedback. I had talked with Alexandre on discord long ago over the implementation and he liked it. Any feedback? @vmoens https://gist.github.com/dtsaras/f321aed253a64e4849ce95bd232d1635

cedricgrothues · 2024-12-16T14:57:01Z

I really like the modular approach you've taken! That said, I have a few questions about some details of the implementation.

From what I understand, the reward function would be implemented as a HERRewardTransform. I wonder if this might be more complex than necessary compared to an alternative approach, such as passing a compute_reward_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] directly to the HERTransform class.

Could you share your thoughts on the reasoning behind including the HERSubGoalSampler? If we implement the common sampling methods (random, final, future), it seems like the additional complexity might not be strictly necessary – but I'd love to hear your perspective!

dtsaras · 2024-12-17T14:23:00Z

I really like the modular approach you've taken! That said, I have a few questions about some details of the implementation.

From what I understand, the reward function would be implemented as a HERRewardTransform. I wonder if this might be more complex than necessary compared to an alternative approach, such as passing a compute_reward_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] directly to the HERTransform class.

Could you share your thoughts on the reasoning behind including the HERSubGoalSampler? If we implement the common sampling methods (random, final, future), it seems like the additional complexity might not be strictly necessary – but I'd love to hear your perspective!

You are correct, the HERRewardTransform would be responsible to assign the rewards for all the intermediate states. While it's not necessary that it has to be its own special transform, it has to be "something" that reassigns the rewards to intermediate states of a trajectory. The reason I choose for it to be a Transform rather than a callable is to utilize the torchRL API. For example, the user does not need to reimplement the discounted reward function as Reward2GoTransform exists and multiple transforms can also be nicely composed into one.

The HERSubgoalSampler I have implemented only contains [final (marked as last in the code), future]. I did not implement random since it did not produce good results in the original paper if I recall correctly. The HERSubgoalSampler is only responsible for sampling states i.e. identifying the indices to be used for HERSubgoalAssigner. Do you mind elaborating more on your concerns about the complexity?

Maybe I should create a PR and work on it together where you can make changes.

P.S. You can join the Discord server as well and maybe more can provide some feedback there.

cedricgrothues · 2024-12-17T15:31:21Z

You are correct, the HERRewardTransform would be responsible to assign the rewards for all the intermediate states. While it's not necessary that it has to be its own special transform, it has to be "something" that reassigns the rewards to intermediate states of a trajectory. The reason I choose for it to be a Transform rather than a callable is to utilize the torchRL API. For example, the user does not need to reimplement the discounted reward function as Reward2GoTransform exists and multiple transforms can also be nicely composed into one.

I'm sorry, but I'm having a bit of trouble following your reasoning here. Could you elaborate on why it can't be a Callable? Wouldn't that still allow us to use Reward2GoTransform and compose it as needed?

The HERSubgoalSampler I have implemented only contains [final (marked as last in the code), future]. I did not implement random since it did not produce good results in the original paper if I recall correctly. The HERSubgoalSampler is only responsible for sampling states i.e. identifying the indices to be used for HERSubgoalAssigner. Do you mind elaborating more on your concerns about the complexity?

Ah, I see – sorry for the confusion! What I meant is that, as long as we implement all the sampling methods outlined in the paper, it seems unlikely that we'd need another one. This makes me think we could move the subgoal sampling logic directly into the HERTransform and potentially remove the HERSubgoalSampler class altogether.

I understand that random sampling didn't yield good results, but I thought it might be valuable to include it for the sake of completeness, since it was mentioned in the original paper.

Maybe I should create a PR and work on it together where you can make changes.

That sounds like a great idea.

P.S. You can join the Discord server as well and maybe more can provide some feedback there.

Thanks for letting me know! I didn't realize there was a Discord server; I've just joined.

her augmentation transformer init

90eef75

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 19, 2024

BY571 commented Jan 19, 2024

View reviewed changes

vmoens reviewed Feb 8, 2024

View reviewed changes

vmoens added the enhancement New feature or request label Feb 8, 2024

dtsaras reviewed Apr 12, 2024

View reviewed changes

dtsaras mentioned this pull request Dec 19, 2024

First draft for modular Hindsight Experience Replay Transform #2667

Draft

10 tasks

[WIP] Hindsight Experience Replay Transform #1819

Are you sure you want to change the base?

[WIP] Hindsight Experience Replay Transform #1819

Uh oh!

Conversation

BY571 commented Jan 19, 2024

Description

Motivation and Context

Types of changes

Checklist

Uh oh!

pytorch-bot bot commented Jan 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/1819

❌ 3 New Failures, 19 Unrelated Failures

Uh oh!

BY571 Jan 19, 2024

Choose a reason for hiding this comment

Uh oh!

BY571 Jan 19, 2024

Choose a reason for hiding this comment

Uh oh!

vmoens Feb 8, 2024

Choose a reason for hiding this comment

Uh oh!

vmoens left a comment

Choose a reason for hiding this comment

Uh oh!

BY571 commented Feb 8, 2024

Uh oh!

vmoens commented Feb 8, 2024

Uh oh!

vmoens commented Feb 8, 2024

Uh oh!

BY571 commented Feb 22, 2024

Uh oh!

BY571 commented Feb 22, 2024

Uh oh!

vmoens commented Feb 22, 2024

Uh oh!

dtsaras left a comment

Choose a reason for hiding this comment

Uh oh!

dtsaras Apr 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dtsaras Apr 12, 2024

Choose a reason for hiding this comment

Uh oh!

vmoens commented Apr 12, 2024

Uh oh!

cedricgrothues commented Dec 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dtsaras commented Dec 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cedricgrothues commented Dec 16, 2024

Uh oh!

dtsaras commented Dec 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cedricgrothues commented Dec 17, 2024

Uh oh!

Uh oh!

pytorch-bot bot commented Jan 19, 2024 •

edited

Loading

dtsaras Apr 12, 2024 •

edited

Loading

cedricgrothues commented Dec 15, 2024 •

edited

Loading

dtsaras commented Dec 15, 2024 •

edited

Loading

dtsaras commented Dec 17, 2024 •

edited

Loading