这是indexloc提供的服务,不要输入任何密码
Skip to content

TFMA on Flink does not seems to be parallelizing work. #170

@jccarles

Description

@jccarles

Hello I am running into an issue running the evaluator component of a tfx pipeline. I use the FlinkRunner for beam and the evaluator component is super slow as the size of data scales. It seems it is because the work is done only by a single Task Manager.

System information

I am running a TFX pipeline using python 3.7. TFX version 1.8.1 which comes with TFMA version tensorflow-model-analysis==0.39.0.
I don't have a small example to reproduce, I can work on one if you think it will help.

Describe the problem

I use the evaluator TFX component as such

evaluator = Evaluator(
        examples=example_gen.outputs[standard_component_specs.EXAMPLES_KEY],
        model=trainer.outputs[standard_component_specs.MODEL_KEY],
        eval_config=eval_config,
    )

With a simple eval_config without any splits. So we only have the eval_split which is used for evaluation.

To run the TFX pipeline we use the FlinkRunner for beam. The sidecar image is built from tensorflow/tfx:1.8.1.

We run flink with a parellism of 10. So 10 files of tf_records are in input of the evaluator component.

From what we could gather, beam tells flink to build a single task for the 3 p_transforms:

"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"

Our issue is that this ends up creating a single subtask for Flink, so a single task manager is doing all the work as you can see in the attached screenshot. So the issue seems to be with the beam workflow which does not parallelized.

I have two main questions:

  • Is this behavior normal ?
  • Is it possible to better dispatch the workload between the different taskmanagers ?

Screenshot from 2023-01-26 15-18-44

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions