
Implement MBPO (#16) and REDQ #514


Closed

wants to merge 2 commits
Conversation

Jimenius
Contributor

  • I have marked all applicable categories:

    • exception-raising fix
    • algorithm implementation fix
    • documentation modification
    • new feature
  • I have reformatted the code using make format (required)

  • I have checked the code using make commit-checks (required)

  • If applicable, I have mentioned the relevant/related issue(s)

  • If applicable, I have listed every item in this Pull Request below

  • Both algorithms are tested on Hopper-v2. See examples/modelbased/mujoco_mbpo.py and examples/mujoco/mujoco_redq.py.

  • Results are shown below. The orange line represents SAC baseline. The blue line represents MBPO result. The red line represents REDQ result.

  • Add ensemble network structures in tianshou/utils/net/common.py and tianshou/utils/net/continuous.py.

[results figure: Hopper-v2 learning curves for the SAC baseline (orange), MBPO (blue), and REDQ (red)]

@codecov-commenter

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #514 (4d2499b) into master (40289b8) will decrease coverage by 9.74%.
The diff coverage is 17.37%.


@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
- Coverage   94.35%   84.60%   -9.75%     
==========================================
  Files          63       69       +6     
  Lines        4251     4865     +614     
==========================================
+ Hits         4011     4116     +105     
- Misses        240      749     +509     
Flag Coverage Δ
unittests 84.60% <17.37%> (-9.75%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
tianshou/data/collector.py 80.93% <8.00%> (-14.65%) ⬇️
tianshou/trainer/dyna.py 11.11% <11.11%> (ø)
tianshou/env/fake.py 14.39% <14.39%> (ø)
tianshou/policy/modelfree/redq.py 14.70% <14.70%> (ø)
tianshou/utils/net/common.py 56.65% <16.84%> (-35.02%) ⬇️
tianshou/utils/net/continuous.py 90.00% <23.52%> (-7.92%) ⬇️
tianshou/data/buffer/simple.py 23.61% <23.61%> (ø)
tianshou/policy/modelbased/mbpo.py 31.57% <31.57%> (ø)
tianshou/data/dataset.py 40.00% <40.00%> (ø)
tianshou/data/__init__.py 100.00% <100.00%> (ø)
... and 2 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 40289b8...4d2499b.

from tianshou.policy import DDPGPolicy


class REDQPolicy(DDPGPolicy):
Collaborator

Why don't you inherit SACPolicy to match the setup in the paper as well as your MBPO implementation?

Contributor Author

Inheriting from SACPolicy would require changing the currently working sac.py, because the 'critic1, ...' parameters are required there. I'm not sure whether that is acceptable at the moment.

The same issue arises with EnsembleMLP in tianshou/utils/net/common.py. There would be less duplicated code if I could pass ensemble_size as a parameter of EnsembleLinear into:

linear_layer: Type[nn.Linear] = nn.Linear,
) -> List[nn.Module]:
"""Construct a miniblock with given input/output-size, norm layer and \
activation."""
layers: List[nn.Module] = [linear_layer(input_size, output_size)]

Collaborator

I see. Since SACPolicy is a sub-class of DDPGPolicy, you only go up one level. I think that's a good compromise. Thanks for the explanation!

Collaborator

Regarding EnsembleMLP and friends, how about something like this:

def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    ensemble_size: int = 1,
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [
        linear_layer(input_size, output_size)
        if ensemble_size == 1
        else linear_layer(ensemble_size, input_size, output_size)
    ]

Contributor Author

I was considering

def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    **layer_args
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [linear_layer(input_size, output_size, **layer_args)]

and then changing the parameter order of EnsembleLinear, so that it stays compatible with other custom layers in the future.
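For illustration, a call using this **layer_args variant could look like the following. This is a hypothetical usage sketch: it assumes EnsembleLinear's signature is reordered to (input_size, output_size, ensemble_size), so the extra keyword can simply be forwarded.

import torch.nn as nn

# hypothetical usage; EnsembleLinear with the reordered signature is assumed
hidden = miniblock(
    input_size=128,
    output_size=256,
    activation=nn.ReLU,
    linear_layer=EnsembleLinear,
    ensemble_size=10,  # forwarded to EnsembleLinear via **layer_args
)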

Collaborator

Wait, there might be a simpler solution: keep EnsembleLinear as-is, then define a small wrapper (or lambda) function and pass it into MLP or miniblock() as linear_layer:

def linear_layer(input_size, output_size):
    return EnsembleLinear(ensemble_size, input_size, output_size)

Similar to

def linear(x, y):
    if is_noisy:
        return NoisyLinear(x, y, noisy_std)
    else:
        return nn.Linear(x, y)
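As a concrete sketch of that closure pattern (illustrative only; EnsembleLinear refers to the layer added in this PR, and MLP is assumed to forward linear_layer to miniblock as in current master):

import torch.nn as nn
from tianshou.utils.net.common import MLP, EnsembleLinear  # EnsembleLinear is added by this PR

ensemble_size = 10

def linear_layer(input_size: int, output_size: int) -> nn.Module:
    # adapt the three-argument EnsembleLinear to the two-argument interface
    return EnsembleLinear(ensemble_size, input_size, output_size)

critic_body = MLP(
    input_dim=14,
    output_dim=1,
    hidden_sizes=[256, 256],
    linear_layer=linear_layer,
)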

return target_q

def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
# critic ensemble
Collaborator

Is it possible to call super().learn() here to reduce code duplication? If so, is it possible to make REDQPolicy a decorator so that it can be used with any off-policy algorithm?

Contributor Author

It is possible if the actor and critic training steps are written as helper methods; then we could override a train_critic method in REDQ. As for the decorator idea, I don't have a clear solution in mind at the moment.
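A rough sketch of that helper-method split (class and method names are illustrative, not existing Tianshou API; the bodies are stubs that only indicate where the real updates would go):

from typing import Any, Dict

from tianshou.data import Batch


class SACLikePolicy:
    """Illustrative base: learn() delegates to overridable helpers."""

    def train_critic(self, batch: Batch) -> Dict[str, float]:
        # plain SAC would update critic1/critic2 against the soft target here
        return {"loss/critic1": 0.0, "loss/critic2": 0.0}

    def train_actor(self, batch: Batch) -> Dict[str, float]:
        # shared actor (and alpha) update
        return {"loss/actor": 0.0}

    def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
        result = self.train_critic(batch)
        result.update(self.train_actor(batch))
        return result


class REDQLikePolicy(SACLikePolicy):
    def train_critic(self, batch: Batch) -> Dict[str, float]:
        # REDQ would update the whole critic ensemble here, with the target
        # built from a random subset of the ensemble; everything else is reused
        return {"loss/critics": 0.0}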

Trinkle23897 linked an issue on Feb 10, 2022 that may be closed by this pull request.
@Trinkle23897
Collaborator

Sorry for the delay. Could you please answer the following questions? I'm not very familiar with these two algorithms and need some context.

  1. What's the difference between SimpleReplayBuffer and ReplayBuffer?
  2. What's the difference between RolloutCollector and Collector?
  3. Why did you need to construct TransitionDataset instead of reusing ReplayBuffer?
  4. Why did you put training logic under env/? Is there a better abstraction or location for it?
  5. What's special about dyna_trainer? (I'm not very familiar with this algorithm...)
  6. Can we add some extra arguments to the original MLP code so that most of it can be reused by the ensemble networks?
  7. Similar to the issue raised by @nuance1979: can we reuse as much code as possible in REDQPolicy?
  8. env/mujoco/static.py looks like an env-specific configuration to me; maybe move it to examples/mujoco?

@Jimenius
Contributor Author

Jimenius commented Mar 2, 2022

Thanks @Trinkle23897 and @nuance1979 for the discussions and helpful suggestions. I reimplemented MBPO in response to the questions. I'll submit another PR to keep a clear view, as there are a lot of changes.

Currently I still can't replace SimpleReplayBuffer with VectorReplayBuffer. Take the following code as an example:

import time
import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

if __name__ == '__main__':
    buf1 = VectorReplayBuffer(400000, 100000)  # total size 400000, split into 100000 sub-buffers
    for i in range(3):
        obs = np.ones((100000, 11)) * i
        act = np.ones((100000, 3)) * i
        rew = np.ones((100000, 1)) * i
        done = np.zeros((100000, 1), dtype=bool)
        obs_next = np.ones((100000, 11))
        batch = Batch(obs=obs, act=act, done=done, rew=rew, obs_next=obs_next)
        buf1.add(batch)
    start_time = time.time()
    buf1.sample(256)
    print(time.time() - start_time)

The sample call takes over 0.3 s on my machine, and there are update_per_step (20) * step_per_epoch (1000) = 20000 such calls per epoch, i.e. over 0.3 s × 20000 ≈ 6000 s, roughly 2 hours per epoch, which is not acceptable. The SimpleReplayBuffer.sample call is 20-300 times faster, but it breaks the sequential relationship of the data, so the algorithm settings and parameters are restricted.

One way to optimize the sample method would be to keep the available indices of _meta in an array and sample with a single np.random.choice call, instead of computing counts and sampling within each sub-buffer separately. I'm not sure whether this contradicts the logic of other algorithms.

@Trinkle23897
Collaborator

Trinkle23897 commented Mar 6, 2022

One way to optimize the sample method would be to keep the available indices of _meta in an array and sample with a single np.random.choice call, instead of computing counts and sampling within each sub-buffer separately. I'm not sure whether this contradicts the logic of other algorithms.

What if we set a flag:

if not self.full:
    # fall back to per-sub-buffer sampling (pseudocode)
    return np.concatenate([sample(sub_buffer[i]) + offset[i] for i in range(self.buffer_num)])
else:
    # the whole buffer is valid: one flat randint suffices
    return np.random.randint(len(self), size=batch_size)
In [5]: %timeit np.random.randint(400000, size=256)
9.01 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

[screenshot: 2022-03-06 09-44-06]

sample with a single np.random.choice

That also works. From the buffer's perspective:

  1. We originally chose a 1-D vector buffer because of the vectorized prioritized replay buffer;
  2. The flag approach above only resolves the off-policy case, because on-policy training needs buffer.maxsize >= trajectory_len to store the whole trajectory, and the trajectory length is often smaller than the buffer size; in that case np.random.choice may be a good solution.
  3. A 2-D vector buffer or a linked-list buffer would need a lot of code refactoring.

@nuance1979
Collaborator

Thanks @Trinkle23897 and @nuance1979 for the discussions and helpful suggestions. I reimplemented MBPO in response to the questions. I'll submit another PR to keep a clear view, as there are a lot of changes.

Is it possible to move REDQ into a separate PR? I think it's an independent algorithm which doesn't really share code with MBPO.

@Jimenius
Contributor Author

What if we set a flag

Reply to @Trinkle23897: However, in MBPO the model buffer is often not full, because the buffer size grows with the rollout length (usually linearly). I am thinking of keeping an "available mask" of the same length as _meta, whose elements are set to True in buffer.add. The sampling would look like:

available_indices = np.arange(maxsize)[mask]
np.random.choice(available_indices, size=batch_size)

Maybe this solution also handles stacking? It transfers the cost to the add operation, but that is worth it, since sampling is a much more frequent operation.
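A minimal, self-contained sketch of that mask idea (hypothetical class and method names; the real _meta bookkeeping in Tianshou is more involved):

import numpy as np


class MaskedIndexSampler:
    """Track which slots of a flat buffer hold valid data and sample them."""

    def __init__(self, maxsize: int) -> None:
        self.mask = np.zeros(maxsize, dtype=bool)

    def add(self, indices: np.ndarray) -> None:
        # the cost moves here: mark freshly written slots as available
        self.mask[indices] = True

    def sample_indices(self, batch_size: int) -> np.ndarray:
        available_indices = np.flatnonzero(self.mask)
        # one vectorized draw instead of sampling each sub-buffer separately
        return np.random.choice(available_indices, size=batch_size)


# usage: 100000 slots written so far out of 400000
sampler = MaskedIndexSampler(400000)
sampler.add(np.arange(100000))
idx = sampler.sample_indices(256)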

Is it possible to move REDQ into a separate PR? I think it's an independent algorithm which doesn't really share code with MBPO.

Reply to @nuance1979: Yes, I'm doing so. The only resemblance is that both algorithms use an ensemble network. Such an ensemble is usually implemented as a ModuleList, but I implement it as a 3D matrix multiplication, as in the original MBPO implementation, which should be faster. Based on it, I'm thinking of treating REDQ as a base algorithm and rewriting SAC and TD3 to use a critic ensemble instead of critic 1 & 2. Intrinsically, they are specific instances of REDQ.
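For context, the two ensemble styles mentioned above can be sketched as follows (illustrative code, not the PR's actual implementation): a ModuleList evaluates its members in a Python loop, while the batched form evaluates all of them with a single 3D matrix multiplication.

import torch
import torch.nn as nn

ensemble_size, batch, in_dim, out_dim = 10, 256, 14, 1
x = torch.randn(batch, in_dim)

# style 1: ModuleList -- one forward pass per ensemble member (Python loop)
heads = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(ensemble_size))
out_loop = torch.stack([head(x) for head in heads])        # (ensemble, batch, out)

# style 2: batched weights -- all members in one 3D matmul (broadcast over the
# ensemble dimension), which is usually faster than the loop above
weight = torch.randn(ensemble_size, in_dim, out_dim)
bias = torch.zeros(ensemble_size, 1, out_dim)
out_batched = torch.matmul(x.unsqueeze(0), weight) + bias  # (ensemble, batch, out)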

It seems that I have a lot of work left on these two algorithms, but currently I'm focusing my time and energy on my own research, and I think it's better to keep the code stable for now. Feel free to submit a PR that uses part of my code.

@nuance1979
Collaborator

Based on it, I'm thinking of treating REDQ as a base algorithm and rewriting SAC and TD3 to use a critic ensemble instead of critic 1 & 2. Intrinsically, they are specific instances of REDQ.

While I understand your point, I feel it's better to leave the current implementations of SAC and TD3 unchanged. Of course, it's just my personal opinion. What do you think, @Trinkle23897?

It seems that I have a lot of work left on these two algorithms, but currently I'm focusing my time and energy on my own research, and I think it's better to keep the code stable for now. Feel free to submit a PR that uses part of my code.

I'm also currently busy with other work. Might take it up when I have the time.

@MischaPanch
Collaborator

Seems like this was implemented in #623; closing this one.

Successfully merging this pull request may close these issues.

Model-based algorithm?