TFMA on Flink does not seems to be parallelizing work.

Hello I am running into an issue running the evaluator component of a tfx pipeline. I use the FlinkRunner for beam and the evaluator component is super slow as the size of data scales. It seems it is because the work is done only by a single Task Manager.

### System information

I am running a TFX pipeline using python 3.7. TFX version 1.8.1 which comes with TFMA version `tensorflow-model-analysis==0.39.0`.
I don't have a small example to reproduce, I can work on one if you think it will help.

### Describe the problem

I use the evaluator TFX component as such

```
evaluator = Evaluator(
        examples=example_gen.outputs[standard_component_specs.EXAMPLES_KEY],
        model=trainer.outputs[standard_component_specs.MODEL_KEY],
        eval_config=eval_config,
    )
```

With a simple eval_config without any splits. So we only have the `eval_split` which is used for evaluation.

To run the TFX pipeline we use the FlinkRunner for beam. The sidecar image is built from tensorflow/tfx:1.8.1.

We run flink with a parellism of 10. So 10 files of tf_records are in input of the evaluator component.

From what we could gather, beam tells flink to build a single task for the 3 p_transforms:

"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"

Our issue is that this ends up creating a single subtask for Flink, so a single task manager is doing all the work as you can see in the attached screenshot. So the issue seems to be with the beam workflow which does not parallelized.

I have two main questions:

- Is this behavior normal ? 
- Is it possible to better dispatch the workload between the different taskmanagers ?

![Screenshot from 2023-01-26 15-18-44](https://user-images.githubusercontent.com/23744708/215800317-58aca649-7732-4a5b-a2ed-cd8ca45a71e5.png)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TFMA on Flink does not seems to be parallelizing work. #170

System information

Describe the problem

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

TFMA on Flink does not seems to be parallelizing work. #170

Description

System information

Describe the problem

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions