Implement MBPO (#16) and REDQ #514
Conversation
Codecov Report
@@ Coverage Diff @@
## master #514 +/- ##
==========================================
- Coverage 94.35% 84.60% -9.75%
==========================================
Files 63 69 +6
Lines 4251 4865 +614
==========================================
+ Hits 4011 4116 +105
- Misses 240 749 +509
from tianshou.policy import DDPGPolicy

class REDQPolicy(DDPGPolicy):
Why don't you inherit SACPolicy to match the setup in the paper as well as your MBPO implementation?
Inheriting SACPolicy requires changes to the currently working code in sac.py, since the 'critic1, ...' parameters would now be required. I'm not sure if this is acceptable at the moment.
The same thing happens with EnsembleMLP in tianshou.utils.net.common.py. There would be less duplicated code if I could pass the ensemble_size parameter of EnsembleLinear through to:
tianshou/tianshou/utils/net/common.py, Lines 15 to 19 in bc53ead
    linear_layer: Type[nn.Linear] = nn.Linear,
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [linear_layer(input_size, output_size)]
I see. Since SACPolicy is a sub-class of DDPGPolicy, you only go up one level. I think that's a good compromise. Thanks for the explanation!
Regarding EnsembleMLP and friends, how about something like this:
def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    ensemble_size: int = 1,
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [
        linear_layer(input_size, output_size)
        if ensemble_size == 1
        else linear_layer(ensemble_size, input_size, output_size)
    ]
I was considering
def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    **layer_args,
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [linear_layer(input_size, output_size, **layer_args)]
and then change the parameter order of EnsembleLinear, so that it can be compatible with some other fancy layers in the future.
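For concreteness, a reordered EnsembleLinear might look something like the sketch below. The weight shapes, initialization, and broadcasting behavior are illustrative assumptions rather than the actual layer from this PR; the point is only that ensemble_size comes last with a default, so the **layer_args variant above can forward it.

import torch
from torch import nn


class EnsembleLinear(nn.Module):
    """Sketch: first two positional arguments match nn.Linear, ensemble_size
    can then be supplied through **layer_args."""

    def __init__(self, in_features: int, out_features: int, ensemble_size: int = 1) -> None:
        super().__init__()
        # Illustrative initialization; one weight matrix per ensemble member.
        self.weight = nn.Parameter(torch.randn(ensemble_size, in_features, out_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(ensemble_size, 1, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Accepts (batch, in) or (ensemble_size, batch, in); matmul broadcasts
        # the ensemble dimension in either case.
        return torch.matmul(x, self.weight) + self.bias

With this ordering, a call such as miniblock(obs_dim, hidden_dim, linear_layer=EnsembleLinear, ensemble_size=10) would work without special casing.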
Wait. There might be a simpler solution. Keep EnsembleLinear as it is, then define a small wrapper function (or lambda) to pass to MLP or miniblock() as linear_layer:
def linear_layer(input_size, output_size):
    return EnsembleLinear(ensemble_size, input_size, output_size)
Similar to tianshou/examples/atari/atari_network.py, Lines 110 to 114 in c25926d:
def linear(x, y):
    if is_noisy:
        return NoisyLinear(x, y, noisy_std)
    else:
        return nn.Linear(x, y)
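As a usage sketch (not code from this PR): assuming the EnsembleLinear from this PR with the (ensemble_size, input_size, output_size) argument order used above, a small closure keeps miniblock's existing linear_layer parameter untouched. The helper name make_ensemble_linear and the sizes are hypothetical.

from torch import nn
from tianshou.utils.net.common import miniblock  # signature shown above


def make_ensemble_linear(ensemble_size):
    # Returns a callable matching the (input_size, output_size) signature that
    # miniblock expects, with the ensemble size fixed by the closure.
    def linear_layer(input_size, output_size):
        # EnsembleLinear is assumed to be the layer added in this PR.
        return EnsembleLinear(ensemble_size, input_size, output_size)
    return linear_layer


layers = miniblock(128, 256, activation=nn.ReLU, linear_layer=make_ensemble_linear(10))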
        return target_q

    def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
        # critic ensemble
Is it possible to call super().learn() here to reduce code duplication? If so, is it possible to make REDQPolicy a decorator so that it can be used with any off-policy algorithm?
It is possible if the training of the actor and the critic is written in helper methods; then we could override a train_critic method in REDQ. As for the decorator idea, I don't have a clear solution in mind at the moment.
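A rough sketch of that refactoring, assuming hypothetical method names (_update_critic, _update_actor are not existing tianshou API), could look like this:

from typing import Any, Dict

from tianshou.data import Batch


class OffPolicySketch:
    """Template-method style: learn() only orchestrates; subclasses override
    just the pieces they change."""

    def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
        critic_loss = self._update_critic(batch)
        actor_loss = self._update_actor(batch)
        return {"loss/critic": critic_loss, "loss/actor": actor_loss}

    def _update_critic(self, batch: Batch) -> float:
        raise NotImplementedError  # e.g. the twin-critic update in SAC/TD3

    def _update_actor(self, batch: Batch) -> float:
        raise NotImplementedError


class REDQSketch(OffPolicySketch):
    def _update_critic(self, batch: Batch) -> float:
        # REDQ would update the whole critic ensemble against a target computed
        # from a random subset of target critics, reusing learn() unchanged.
        raise NotImplementedError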
Sorry for the delay. Could you please answer the following questions? I'm not quite familiar with these two algorithms and need some context.
Thanks @Trinkle23897 and @nuance1979 for the discussions and helpful suggestions. I reimplemented MBPO in response to the questions. I'll submit another PR to give a clearer view, as there are a lot of changes. Currently I still can't replace SimpleReplayBuffer with VectorReplayBuffer. Take the following code as an example:

import time

import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

if __name__ == '__main__':
    buf1 = VectorReplayBuffer(400000, 100000)
    for i in range(3):
        obs = np.ones((100000, 11)) * i
        act = np.ones((100000, 3)) * i
        rew = np.ones((100000, 1)) * i
        done = np.zeros((100000, 1), dtype=bool)
        obs_next = np.ones((100000, 11))
        batch = Batch(obs=obs, act=act, done=done, rew=rew, obs_next=obs_next)
        buf1.add(batch)
    start_time = time.time()
    buf1.sample(256)
    print(time.time() - start_time)

The sample call takes over 0.3s on my machine, and there are update_per_step (20) * step_per_epoch (1000) such calls in an epoch, about 2 hours in total, which is not acceptable. The SimpleReplayBuffer.sample call is 20-300 times faster, but it breaks the sequential relationship of the data, so the choice of algorithm and parameter settings is restricted. One way to optimize the sample method that I have in mind is to compute or keep the available indices of _meta in an array and sample with a single call.
What if we set a flag:

if not self.full:
    return np.concatenate(
        [sample(sub_buffer[i]) + offset[i] for i in range(self.buffer_num)]
    )
else:
    return np.random.randint(len(self), size=batch_size)

In [5]: %timeit np.random.randint(400000, size=256)
9.01 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That also works. From the buffer's perspective:
Is it possible to move REDQ into a separate PR? I think it's an independent algorithm which doesn't really share code with MBPO.
Reply to @Trinkle23897: However, in MBPO, the model buffer is often not full, as the buffer size increases with the rollout length (usually linearly). I am thinking of keeping an "available mask" of the same length as the buffer:

available_indices = np.arange(maxsize)[mask]
np.random.choice(available_indices, size=batch_size)

Maybe this solution also handles that case.
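A minimal sketch of how that mask-based sampling could look, wrapped in a hypothetical helper (the class, attribute, and method names here are made up for illustration, not tianshou API):

import numpy as np


class AvailableIndexSampler:
    """Keep a boolean mask over the flat buffer and sample valid slots in one call."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self.mask = np.zeros(maxsize, dtype=bool)  # True where a transition exists

    def mark_written(self, indices: np.ndarray) -> None:
        # Called whenever the buffer writes new transitions.
        self.mask[indices] = True

    def sample_indices(self, batch_size: int) -> np.ndarray:
        available_indices = np.arange(self.maxsize)[self.mask]
        # One vectorized draw, no per-sub-buffer concatenation.
        return np.random.choice(available_indices, size=batch_size)

In practice the available_indices array would probably be maintained incrementally rather than rebuilt on every call.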
Reply to @nuance1979: Yes, I'm doing so. The only resemblance is that they both use an ensemble network. It is usually implemented as a ModuleList, but I implement it as a 3D matrix multiplication, as in the original MBPO implementation, which should be faster. Based on that, I'm thinking of treating REDQ as a base algorithm and rewriting SAC and TD3 to use a critic ensemble instead of critic 1 & 2; intrinsically, they are specific instances of REDQ. It seems I have a lot of work to do to implement the two algorithms, but currently I'm focusing my time and energy on my own research, and I think it's better to keep the code stable. Feel free to submit a PR with part of my code if you want.
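For context, a minimal illustration of the ModuleList-versus-3D-matmul point; the shapes and the baddbmm trick are illustrative and not taken from this PR's actual EnsembleLinear:

import torch
from torch import nn

ensemble_size, batch, in_dim, out_dim = 10, 256, 64, 64
x = torch.randn(ensemble_size, batch, in_dim)

# ModuleList version: one nn.Linear per ensemble member, applied in a Python loop.
heads = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(ensemble_size)])
out_loop = torch.stack([head(x[i]) for i, head in enumerate(heads)])

# 3D-matmul version: stack the weights once and do a single batched multiplication,
# which avoids the Python loop over ensemble members.
weight = torch.stack([head.weight.t() for head in heads])       # (E, in, out)
bias = torch.stack([head.bias for head in heads]).unsqueeze(1)  # (E, 1, out)
out_bmm = torch.baddbmm(bias, x, weight)

assert torch.allclose(out_loop, out_bmm, atol=1e-5)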
While I understand your point, I feel it's better to leave the current implementations of SAC and TD3 unchanged. Of course it's just my personal opinion. What do you think, @Trinkle23897 ?
I'm also currently busy with other work. Might take it up when I have the time.
Seems like this was implemented in #623, closing this one.
- I have marked all applicable categories
- I have reformatted the code using make format (required)
- I have checked the code using make commit-checks (required)
- If applicable, I have mentioned the relevant/related issue(s)
- If applicable, I have listed every item in this Pull Request below

Both tested on Hopper-v2. See examples/modelbased/mujoco_mbpo.py and examples/mujoco/mujoco_redq.py. Results are shown below: the orange line represents the SAC baseline, the blue line the MBPO result, and the red line the REDQ result.

Add ensemble network structures in tianshou/utils/net/common.py and tianshou/utils/net/continuous.py.