这是indexloc提供的服务,不要输入任何密码
Skip to content

A Bug in HERReplayBuffer. #811

@sunkafei

Description

@sunkafei
  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

Hello, I have some questions in the code of rewrite_transitions in HERReplayBuffer:

    def rewrite_transitions(self, indices: np.ndarray) -> None:
        """Re-write the goal of some sampled transitions' episodes according to HER.
        Currently applies only HER's 'future' strategy. The new goals will be written \
        directly to the internal batch data temporarily and will be restored right \
        before the next sampling or when using some of the buffer's method (e.g. \
        `add`, `save_hdf5`, etc.). This is to make sure that n-step returns \
        calculation etc., performs correctly without additional alteration.
        """
        if indices.size == 0:
            return

        # Sort indices keeping chronological order
        indices[indices < self._index] += self.maxsize
        indices = np.sort(indices)
        indices[indices >= self.maxsize] -= self.maxsize

        # Construct episode trajectories
        indices = [indices]
        for _ in range(self.horizon - 1):
            indices.append(self.next(indices[-1]))
        indices = np.stack(indices)

        # Calculate future timestep to use
        current = indices[0]
        terminal = indices[-1]
        future_offset = np.random.uniform(size=len(indices[0])) * (terminal - current)
        future_offset = future_offset.astype(int)
        future_t = (current + future_offset)
        ...

As ReplayBuffer is implemented as a circular queue, the indices in terminal may be less than the corresponding indices in current. So that some elements in future_offset may be negative, which will make the states in future_t not desired future states.
@Juno-T

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions