
Revisit and maybe optimize Collectors #1069

@MischaPanch

Description

The main assumption Tianshou makes is that batch-style data transfer reduces a lot of overhead: by sending data to the GPU in batches we improve GPU utilization and thus overall system throughput. That's why the initial version of the collector is batch-style.

This assumption rests on several constraints:

  1. We cannot easily achieve the same throughput with sequential (per-step) data transfer to the GPU as with batched transfer
  2. The model is relatively small, and it's not memory-bound
  3. The environment's step function takes a small amount of time (including reward calculation), at least shorter than a policy forward pass
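Constraint (1) can be illustrated with a simple cost model (this is a hypothetical sketch, not Tianshou code, and the constants are made up): every device call pays a fixed launch/transfer overhead, so one batched forward over n environments amortizes that overhead once, while n sequential forwards pay it n times.

```python
# Hypothetical cost model for batched vs. sequential policy forwards.
# OVERHEAD and PER_ITEM are illustrative constants, not measurements.

OVERHEAD = 1.0   # fixed cost per GPU call (host->device transfer, kernel launch)
PER_ITEM = 0.1   # compute cost per observation

def batched_forward_time(n_envs: int) -> float:
    """One call processing all n observations at once: pays OVERHEAD once."""
    return OVERHEAD + n_envs * PER_ITEM

def sequential_forward_time(n_envs: int) -> float:
    """n separate calls, each paying the fixed overhead."""
    return n_envs * (OVERHEAD + PER_ITEM)

if __name__ == "__main__":
    for n in (1, 8, 64):
        b = batched_forward_time(n)
        s = sequential_forward_time(n)
        print(f"n={n:3d}  batched={b:6.1f}  sequential={s:6.1f}  speedup={s / b:4.1f}x")
```

Under this model the batched speedup grows with the number of environments, which is exactly why the assumption breaks once per-request overhead can be hidden (as in the fully async LLM case below).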

These are very strong constraints. If any of them fails to hold, we can switch to a fully async rollout implementation to get better throughput, i.e., a shorter wall-clock collector.collect time. For example, in the RLHF case:

  • An LLM's completion function can be implemented in a fully async style and still match the throughput of batch completion, as long as enough threads/processes are provided to handle each request. That invalidates (1) and (2);
  • The environment needs a reward model to calculate rewards. In batch style, we have to finish all policy sampling first, synchronize, and only then run the reward calculation. The system can then become environment-throughput-bound because not enough compute is invested in the reward computation. But if the policy and reward calculations run fully asynchronously, all those pipeline bubbles can be removed. That invalidates (3).
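The fully async rollout described above can be sketched as follows (a minimal illustration with a hypothetical API, not Tianshou's actual collector; the `policy_forward` and `compute_reward` sleeps stand in for LLM completion and reward-model latency): each environment drives its own forward → step → reward loop concurrently, so there is no per-step synchronization barrier and a slow reward computation in one environment never stalls the others.

```python
# Hedged sketch of fully async rollout: no batch barrier between
# policy sampling and reward calculation. All names are illustrative.
import asyncio
import random

async def policy_forward(obs: float) -> float:
    # Stand-in for an async LLM completion request.
    await asyncio.sleep(random.uniform(0.001, 0.003))
    return obs + 1.0

async def compute_reward(action: float) -> float:
    # Stand-in for an async reward-model call.
    await asyncio.sleep(random.uniform(0.001, 0.003))
    return action * 0.5

async def rollout_one_env(env_id: int, n_steps: int) -> list[float]:
    # One environment's independent loop; env dynamics collapsed for brevity.
    obs, rewards = float(env_id), []
    for _ in range(n_steps):
        action = await policy_forward(obs)       # no global sync point here
        rewards.append(await compute_reward(action))
        obs = action
    return rewards

async def collect(n_envs: int, n_steps: int) -> list[list[float]]:
    # All environments progress independently and interleave freely.
    return await asyncio.gather(*(rollout_one_env(i, n_steps) for i in range(n_envs)))

if __name__ == "__main__":
    trajectories = asyncio.run(collect(n_envs=4, n_steps=3))
    print(len(trajectories), [len(t) for t in trajectories])
```

The design point is that concurrency, not batching, hides the per-request overhead, which is what makes constraints (1)–(3) unnecessary in this setting.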

Originally posted by @Trinkle23897 in #1058 (comment)

Metadata

Labels: optimization (performance optimization: throughput, memory, processing speed), tentative (up to discussion, may be dismissed)

Status: To do
