
Implement MBPO (#16) and REDQ #514


Closed

wants to merge 2 commits
Conversation

Jimenius
Contributor

  • I have marked all applicable categories:

    • exception-raising fix
    • algorithm implementation fix
    • documentation modification
    • new feature
  • I have reformatted the code using make format (required)

  • I have checked the code using make commit-checks (required)

  • If applicable, I have mentioned the relevant/related issue(s)

  • If applicable, I have listed every item in this Pull Request below

  • Both algorithms are tested on Hopper-v2. See examples/modelbased/mujoco_mbpo.py and examples/mujoco/mujoco_redq.py.

  • Results are shown below. The orange line represents SAC baseline. The blue line represents MBPO result. The red line represents REDQ result.

  • Add ensemble network structures in tianshou/utils/net/common.py and tianshou/utils/net/continuous.py.

[results figure: Hopper-v2 learning curves for the SAC baseline (orange), MBPO (blue), and REDQ (red)]

@codecov-commenter

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #514 (4d2499b) into master (40289b8) will decrease coverage by 9.74%.
The diff coverage is 17.37%.


@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
- Coverage   94.35%   84.60%   -9.75%     
==========================================
  Files          63       69       +6     
  Lines        4251     4865     +614     
==========================================
+ Hits         4011     4116     +105     
- Misses        240      749     +509     
Flag Coverage Δ
unittests 84.60% <17.37%> (-9.75%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
tianshou/data/collector.py 80.93% <8.00%> (-14.65%) ⬇️
tianshou/trainer/dyna.py 11.11% <11.11%> (ø)
tianshou/env/fake.py 14.39% <14.39%> (ø)
tianshou/policy/modelfree/redq.py 14.70% <14.70%> (ø)
tianshou/utils/net/common.py 56.65% <16.84%> (-35.02%) ⬇️
tianshou/utils/net/continuous.py 90.00% <23.52%> (-7.92%) ⬇️
tianshou/data/buffer/simple.py 23.61% <23.61%> (ø)
tianshou/policy/modelbased/mbpo.py 31.57% <31.57%> (ø)
tianshou/data/dataset.py 40.00% <40.00%> (ø)
tianshou/data/__init__.py 100.00% <100.00%> (ø)
... and 2 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 40289b8...4d2499b.

from tianshou.policy import DDPGPolicy


class REDQPolicy(DDPGPolicy):
Collaborator

Why don't you inherit SACPolicy to match the setup in the paper as well as your MBPO implementation?

Contributor Author

Inheriting from SACPolicy would require changing the currently working sac.py, because the 'critic1, ...' parameters are required there. I'm not sure whether that is acceptable at the moment.

The same issue arises with EnsembleMLP in tianshou/utils/net/common.py. There would be less duplicated code if I could pass ensemble_size as a parameter of EnsembleLinear into:

linear_layer: Type[nn.Linear] = nn.Linear,
) -> List[nn.Module]:
"""Construct a miniblock with given input/output-size, norm layer and \
activation."""
layers: List[nn.Module] = [linear_layer(input_size, output_size)]

Collaborator

I see. Since SACPolicy is a sub-class of DDPGPolicy, you only go up one level. I think that's a good compromise. Thanks for the explanation!

Collaborator

Regarding EnsembleMLP and friends, how about something like this:

def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    ensemble_size: int = 1,
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [
        linear_layer(input_size, output_size)
        if ensemble_size == 1
        else linear_layer(ensemble_size, input_size, output_size)
    ]

Contributor Author

I was considering

def miniblock(
    input_size: int,
    output_size: int = 0,
    norm_layer: Optional[ModuleType] = None,
    activation: Optional[ModuleType] = None,
    linear_layer: Type[nn.Linear] = nn.Linear,
    **layer_args
) -> List[nn.Module]:
    """Construct a miniblock with given input/output-size, norm layer and \
    activation."""
    layers: List[nn.Module] = [linear_layer(input_size, output_size, **layer_args)]

and then changing the parameter order of EnsembleLinear, so that it stays compatible with other custom layers in the future.
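For illustration, a call using this **layer_args variant could look like the following. This is a hypothetical usage sketch: it assumes EnsembleLinear's signature is reordered to (input_size, output_size, ensemble_size), so the extra keyword can simply be forwarded.

import torch.nn as nn

# hypothetical usage; EnsembleLinear with the reordered signature is assumed
hidden = miniblock(
    input_size=128,
    output_size=256,
    activation=nn.ReLU,
    linear_layer=EnsembleLinear,
    ensemble_size=10,  # forwarded to EnsembleLinear via **layer_args
)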

Collaborator

Wait, there might be a simpler solution: keep EnsembleLinear as-is, then define a small wrapper (or lambda) function and pass it into MLP or miniblock() as linear_layer:

def linear_layer(input_size, output_size):
    return EnsembleLinear(ensemble_size, input_size, output_size)

Similar to

def linear(x, y):
    if is_noisy:
        return NoisyLinear(x, y, noisy_std)
    else:
        return nn.Linear(x, y)
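As a concrete sketch of that closure pattern (illustrative only; EnsembleLinear refers to the layer added in this PR, and MLP is assumed to forward linear_layer to miniblock as in current master):

import torch.nn as nn
from tianshou.utils.net.common import MLP, EnsembleLinear  # EnsembleLinear is added by this PR

ensemble_size = 10

def linear_layer(input_size: int, output_size: int) -> nn.Module:
    # adapt the three-argument EnsembleLinear to the two-argument interface
    return EnsembleLinear(ensemble_size, input_size, output_size)

critic_body = MLP(
    input_dim=14,
    output_dim=1,
    hidden_sizes=[256, 256],
    linear_layer=linear_layer,
)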

return target_q

def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
# critic ensemble
Collaborator

Is it possible to call super().learn() here to reduce code duplication? If so, is it possible to make REDQPolicy a decorator so that it can be used with any off-policy algorithm?

Contributor Author

It is possible if the actor and critic training steps are written as helper methods; then we could override a train_critic method in REDQ. As for the decorator idea, I don't have a clear solution in mind at the moment.
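A rough sketch of that helper-method split (class and method names are illustrative, not existing Tianshou API; the bodies are stubs that only indicate where the real updates would go):

from typing import Any, Dict

from tianshou.data import Batch


class SACLikePolicy:
    """Illustrative base: learn() delegates to overridable helpers."""

    def train_critic(self, batch: Batch) -> Dict[str, float]:
        # plain SAC would update critic1/critic2 against the soft target here
        return {"loss/critic1": 0.0, "loss/critic2": 0.0}

    def train_actor(self, batch: Batch) -> Dict[str, float]:
        # shared actor (and alpha) update
        return {"loss/actor": 0.0}

    def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
        result = self.train_critic(batch)
        result.update(self.train_actor(batch))
        return result


class REDQLikePolicy(SACLikePolicy):
    def train_critic(self, batch: Batch) -> Dict[str, float]:
        # REDQ would update the whole critic ensemble here, with the target
        # built from a random subset of the ensemble; everything else is reused
        return {"loss/critics": 0.0}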

Trinkle23897 linked an issue on Feb 10, 2022 that may be closed by this pull request.
@Trinkle23897
Collaborator

Sorry for the delay. Could you please answer the following questions? I'm not very familiar with these two algorithms and need some context.

  1. What's the difference between SimpleReplayBuffer and ReplayBuffer?
  2. What's the difference between RolloutCollector and Collector?
  3. Why did you need to construct TransitionDataset instead of reusing ReplayBuffer?
  4. Why did you put training logic under env/? Is there a better abstraction or location for it?
  5. What's special about dyna_trainer? (I'm not very familiar with this algorithm...)
  6. Can we add some extra arguments to the original MLP code so that most of it can be reused by the ensemble networks?
  7. Similar to the issue raised by @nuance1979: can we reuse as much code as possible in REDQPolicy?
  8. env/mujoco/static.py looks like an env-specific configuration to me; maybe move it to examples/mujoco?

@Jimenius
Contributor Author

Jimenius commented Mar 2, 2022

Thanks @Trinkle23897 and @nuance1979 for the discussions and helpful suggestions. I reimplemented MBPO in response to the questions. I'll submit another PR to keep a clear view, as there are a lot of changes.

Currently I still can't replace SimpleReplayBuffer with VectorReplayBuffer. Take the following code as an example:

import time
import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

if __name__ == '__main__':
    buf1 = VectorReplayBuffer(400000, 100000)  # total size 400000, split into 100000 sub-buffers
    for i in range(3):
        obs = np.ones((100000, 11)) * i
        act = np.ones((100000, 3)) * i
        rew = np.ones((100000, 1)) * i
        done = np.zeros((100000, 1), dtype=bool)
        obs_next = np.ones((100000, 11))
        batch = Batch(obs=obs, act=act, done=done, rew=rew, obs_next=obs_next)
        buf1.add(batch)
    start_time = time.time()
    buf1.sample(256)
    print(time.time() - start_time)

The sample call takes over 0.3 s on my machine, and there are update_per_step (20) * step_per_epoch (1000) = 20000 such calls per epoch, i.e. over 0.3 s × 20000 ≈ 6000 s, roughly 2 hours per epoch, which is not acceptable. The SimpleReplayBuffer.sample call is 20-300 times faster, but it breaks the sequential relationship of the data, so the algorithm settings and parameters are restricted.

One way to optimize the sample method would be to keep the available indices of _meta in an array and sample with a single np.random.choice call, instead of computing counts and sampling within each sub-buffer separately. I'm not sure whether this contradicts the logic of other algorithms.

@Trinkle23897
Collaborator

Trinkle23897 commented Mar 6, 2022

One way to optimize the sample method would be to keep the available indices of _meta in an array and sample with a single np.random.choice call, instead of computing counts and sampling within each sub-buffer separately. I'm not sure whether this contradicts the logic of other algorithms.

What if we set a flag:

if not self.full:
    # fall back to per-sub-buffer sampling (pseudocode)
    return np.concatenate([sample(sub_buffer[i]) + offset[i] for i in range(self.buffer_num)])
else:
    # the whole buffer is valid: one flat randint suffices
    return np.random.randint(len(self), size=batch_size)
In [5]: %timeit np.random.randint(400000, size=256)
9.01 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

[screenshot: 2022-03-06 09-44-06]

sample with a single np.random.choice

That also works. From the buffer's perspective:

  1. We originally chose a 1-D vector buffer because of the vectorized prioritized replay buffer;
  2. The flag approach above only resolves the off-policy case, because on-policy training needs buffer.maxsize >= trajectory_len to store the whole trajectory, and the trajectory length is often smaller than the buffer size; in that case np.random.choice may be a good solution.
  3. A 2-D vector buffer or a linked-list buffer would need a lot of code refactoring.

@nuance1979
Collaborator

Thanks @Trinkle23897 and @nuance1979 for the discussions and helpful suggestions. I reimplemented MBPO in response to the questions. I'll submit another PR to keep a clear view, as there are a lot of changes.

Is it possible to move REDQ into a separate PR? I think it's an independent algorithm which doesn't really share code with MBPO.

@Jimenius
Contributor Author

What if we set a flag

Reply to @Trinkle23897: However, in MBPO the model buffer is often not full, because the buffer size grows with the rollout length (usually linearly). I am thinking of keeping an "available mask" of the same length as _meta, whose elements are set to True in buffer.add. The sampling would look like:

available_indices = np.arange(maxsize)[mask]
np.random.choice(available_indices, size=batch_size)

Maybe this solution also handles stacking? It transfers the cost to the add operation, but that is worth it, since sampling is a much more frequent operation.
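A minimal, self-contained sketch of that mask idea (hypothetical class and method names; the real _meta bookkeeping in Tianshou is more involved):

import numpy as np


class MaskedIndexSampler:
    """Track which slots of a flat buffer hold valid data and sample them."""

    def __init__(self, maxsize: int) -> None:
        self.mask = np.zeros(maxsize, dtype=bool)

    def add(self, indices: np.ndarray) -> None:
        # the cost moves here: mark freshly written slots as available
        self.mask[indices] = True

    def sample_indices(self, batch_size: int) -> np.ndarray:
        available_indices = np.flatnonzero(self.mask)
        # one vectorized draw instead of sampling each sub-buffer separately
        return np.random.choice(available_indices, size=batch_size)


# usage: 100000 slots written so far out of 400000
sampler = MaskedIndexSampler(400000)
sampler.add(np.arange(100000))
idx = sampler.sample_indices(256)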

Is it possible to move REDQ into a separate PR? I think it's an independent algorithm which doesn't really share code with MBPO.

Reply to @nuance1979: Yes, I'm doing so. The only resemblance is that both algorithms use an ensemble network. Such an ensemble is usually implemented as a ModuleList, but I implement it as a 3D matrix multiplication, as in the original MBPO implementation, which should be faster. Based on it, I'm thinking of treating REDQ as a base algorithm and rewriting SAC and TD3 to use a critic ensemble instead of critic 1 & 2. Intrinsically, they are specific instances of REDQ.
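For context, the two ensemble styles mentioned above can be sketched as follows (illustrative code, not the PR's actual implementation): a ModuleList evaluates its members in a Python loop, while the batched form evaluates all of them with a single 3D matrix multiplication.

import torch
import torch.nn as nn

ensemble_size, batch, in_dim, out_dim = 10, 256, 14, 1
x = torch.randn(batch, in_dim)

# style 1: ModuleList -- one forward pass per ensemble member (Python loop)
heads = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(ensemble_size))
out_loop = torch.stack([head(x) for head in heads])        # (ensemble, batch, out)

# style 2: batched weights -- all members in one 3D matmul (broadcast over the
# ensemble dimension), which is usually faster than the loop above
weight = torch.randn(ensemble_size, in_dim, out_dim)
bias = torch.zeros(ensemble_size, 1, out_dim)
out_batched = torch.matmul(x.unsqueeze(0), weight) + bias  # (ensemble, batch, out)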

It seems that I have a lot of work left on these two algorithms, but currently I'm focusing my time and energy on my own research, and I think it's better to keep the code stable for now. Feel free to submit a PR that uses part of my code.

@nuance1979
Collaborator

Based on it, I'm thinking of treating REDQ as a base algorithm and rewriting SAC and TD3 to use a critic ensemble instead of critic 1 & 2. Intrinsically, they are specific instances of REDQ.

While I understand your point, I feel it's better to leave the current implementations of SAC and TD3 unchanged. Of course, it's just my personal opinion. What do you think, @Trinkle23897?

It seems that I have a lot of work left on these two algorithms, but currently I'm focusing my time and energy on my own research, and I think it's better to keep the code stable for now. Feel free to submit a PR that uses part of my code.

I'm also currently busy with other work. Might take it up when I have the time.

@MischaPanch
Collaborator

Seems like this was implemented in #623; closing this one.

Successfully merging this pull request may close these issues.

Model-based algorithm?