
Revisit and maybe optimize Collectors #1069

@MischaPanch

Description

The main assumption Tianshou makes is that batch-style data transfer reduces a lot of overhead: by sending data to the GPU in batches we improve GPU utilization and thus overall system throughput. That's why the initial version of the collector is batch-style.

This assumption rests on several constraints:

  1. We cannot easily achieve the same throughput with sequential (per-step) data transfer to the GPU as with batched transfer
  2. The model is relatively small, and it's not memory-bound
  3. The environment's step function takes a small amount of time (including reward calculation), at least shorter than a policy forward pass
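Constraint (1) can be illustrated with a simple cost model (this is a hypothetical sketch, not Tianshou code, and the constants are made up): every device call pays a fixed launch/transfer overhead, so one batched forward over n environments amortizes that overhead once, while n sequential forwards pay it n times.

```python
# Hypothetical cost model for batched vs. sequential policy forwards.
# OVERHEAD and PER_ITEM are illustrative constants, not measurements.

OVERHEAD = 1.0   # fixed cost per GPU call (host->device transfer, kernel launch)
PER_ITEM = 0.1   # compute cost per observation

def batched_forward_time(n_envs: int) -> float:
    """One call processing all n observations at once: pays OVERHEAD once."""
    return OVERHEAD + n_envs * PER_ITEM

def sequential_forward_time(n_envs: int) -> float:
    """n separate calls, each paying the fixed overhead."""
    return n_envs * (OVERHEAD + PER_ITEM)

if __name__ == "__main__":
    for n in (1, 8, 64):
        b = batched_forward_time(n)
        s = sequential_forward_time(n)
        print(f"n={n:3d}  batched={b:6.1f}  sequential={s:6.1f}  speedup={s / b:4.1f}x")
```

Under this model the batched speedup grows with the number of environments, which is exactly why the assumption breaks once per-request overhead can be hidden (as in the fully async LLM case below).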

These are very strong constraints. If any of them fails to hold, we can switch to a fully async rollout implementation to get better throughput, i.e., a shorter wall-clock collector.collect time. For example, in the RLHF case:

  • An LLM's completion function can be implemented in a fully async style and still match the throughput of batch completion, as long as enough threads/processes are provided to handle each request. That invalidates (1) and (2);
  • The environment needs a reward model to calculate rewards. In batch style, we have to finish all policy sampling first, synchronize, and only then run the reward calculation. The system can then become environment-throughput-bound because not enough compute is invested in the reward computation. But if the policy and reward calculations run fully asynchronously, all those pipeline bubbles can be removed. That invalidates (3).
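The fully async rollout described above can be sketched as follows (a minimal illustration with a hypothetical API, not Tianshou's actual collector; the `policy_forward` and `compute_reward` sleeps stand in for LLM completion and reward-model latency): each environment drives its own forward → step → reward loop concurrently, so there is no per-step synchronization barrier and a slow reward computation in one environment never stalls the others.

```python
# Hedged sketch of fully async rollout: no batch barrier between
# policy sampling and reward calculation. All names are illustrative.
import asyncio
import random

async def policy_forward(obs: float) -> float:
    # Stand-in for an async LLM completion request.
    await asyncio.sleep(random.uniform(0.001, 0.003))
    return obs + 1.0

async def compute_reward(action: float) -> float:
    # Stand-in for an async reward-model call.
    await asyncio.sleep(random.uniform(0.001, 0.003))
    return action * 0.5

async def rollout_one_env(env_id: int, n_steps: int) -> list[float]:
    # One environment's independent loop; env dynamics collapsed for brevity.
    obs, rewards = float(env_id), []
    for _ in range(n_steps):
        action = await policy_forward(obs)       # no global sync point here
        rewards.append(await compute_reward(action))
        obs = action
    return rewards

async def collect(n_envs: int, n_steps: int) -> list[list[float]]:
    # All environments progress independently and interleave freely.
    return await asyncio.gather(*(rollout_one_env(i, n_steps) for i in range(n_envs)))

if __name__ == "__main__":
    trajectories = asyncio.run(collect(n_envs=4, n_steps=3))
    print(len(trajectories), [len(t) for t in trajectories])
```

The design point is that concurrency, not batching, hides the per-request overhead, which is what makes constraints (1)–(3) unnecessary in this setting.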

Originally posted by @Trinkle23897 in #1058 (comment)

Metadata

Labels: optimization (performance optimization: throughput, memory, processing speed), tentative (up to discussion, may be dismissed)

Status: To do
