
How to better tune peak memory usage #260

@cyc


I have some datasets and transformations I want to run that unfortunately won't fit on n1-highmem-16 instances (which is what FlexRS requires). The features are fairly standard: scalar features with the tft.quantiles analyzer and string features with the tft.vocabulary analyzer (but there are a lot of each type of feature). Generally the analyze step runs fine up until the final combine, which typically runs on a very small number of machines and causes them to repeatedly OOM.

Of course I could use a larger machine type, or even a custom machine type, but those don't work with FlexRS and would be more expensive. I'm curious whether either of the following two options would be a viable solution:

  1. Shard the analyze step by feature: split the set of features into separate groups and run multiple analyze steps sequentially, which should reduce peak memory usage. The challenge would be how to merge the outputs of the analyze steps together at the end.
  2. Add Beam resource hints specifically to the problematic combine tasks so that they do not get scheduled onto the same machine.
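To make option 1 concrete, here is a minimal sketch of the sharding half of the idea. The `shard_features` helper below is hypothetical (not part of tensorflow_transform); the assumption is that each resulting group would get its own preprocessing_fn and its own sequential `tft_beam.AnalyzeDataset` pass, and the per-group transform outputs would then need to be merged or applied in sequence at transform time:

```python
def shard_features(feature_names, num_shards):
    """Deterministically split feature names into disjoint groups.

    Hypothetical helper: each group would feed a separate
    preprocessing_fn / AnalyzeDataset pass run sequentially, so only
    one group's analyzers (tft.quantiles, tft.vocabulary, ...) hold
    combine state in memory at a time. Merging the resulting
    transform_fn SavedModels back together is the open question.
    """
    feature_names = sorted(feature_names)  # stable assignment across runs
    shards = [[] for _ in range(num_shards)]
    for i, name in enumerate(feature_names):
        # Round-robin keeps shard sizes balanced within one feature.
        shards[i % num_shards].append(name)
    return shards
```

For example, `shard_features(["a", "b", "c", "d", "e"], 2)` yields `[["a", "c", "e"], ["b", "d"]]`. Balancing by estimated analyzer cost (e.g. vocabulary cardinality) rather than count would likely matter more in practice.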

Is either of these options viable, or is there a solution I have not considered yet?
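For context on option 2, Beam does support resource hints at both the pipeline and transform level. A pipeline-wide hint can be passed as a flag (sketch below; the exact RAM value is a placeholder), though targeting only the problematic combine would require setting the hint in code on that transform via `PTransform.with_resource_hints(min_ram=...)`. Whether Dataflow then schedules those steps onto a separate, larger worker pool is the part I'm unsure about:

```shell
# Hypothetical Dataflow invocation. --resource_hints applies the hint to
# the whole pipeline; per-transform hints must be set in code with
# .with_resource_hints(min_ram="26GB") to target only the final combine.
python my_pipeline.py \
  --runner=DataflowRunner \
  --flexrs_goal=COST_OPTIMIZED \
  --resource_hints=min_ram=26GB
```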
