Description
Hello, I am running into an issue with the Evaluator component of a TFX pipeline. I use the FlinkRunner for Beam, and the Evaluator component becomes very slow as the size of the data scales. This seems to be because all the work is done by a single Task Manager.
System information
I am running a TFX pipeline using Python 3.7 with TFX version 1.8.1, which comes with tensorflow-model-analysis==0.39.0.
I don't have a small example to reproduce the issue, but I can work on one if you think it will help.
Describe the problem
I use the Evaluator TFX component as follows:
evaluator = Evaluator(
    examples=example_gen.outputs[standard_component_specs.EXAMPLES_KEY],
    model=trainer.outputs[standard_component_specs.MODEL_KEY],
    eval_config=eval_config,
)
The eval_config is simple and does not specify any splits, so only the eval split is used for evaluation.
To run the TFX pipeline we use the FlinkRunner for Beam. The sidecar image is built from tensorflow/tfx:1.8.1.
We run Flink with a parallelism of 10, so the Evaluator component receives 10 TFRecord files as input.
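For readers unfamiliar with this kind of setup, here is a minimal sketch of how Beam arguments for a FlinkRunner with a parallelism of 10 are commonly assembled. The helper name `build_flink_beam_args` and the master address `localhost:8081` are hypothetical placeholders, not the values from our deployment, and a real setup usually needs further options (for example for the SDK harness environment).

```python
from typing import List


def build_flink_beam_args(flink_master: str, parallelism: int) -> List[str]:
    """Assemble Beam pipeline arguments for the FlinkRunner.

    Hypothetical helper for illustration only; it just formats
    the standard Beam/Flink command-line flags as strings.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        "--runner=FlinkRunner",
        # Address of the Flink JobManager REST endpoint (placeholder value).
        f"--flink_master={flink_master}",
        # Default parallelism Flink applies to the job's operators.
        f"--parallelism={parallelism}",
    ]


# Assumed values: a local Flink master and the parallelism of 10 described above.
beam_pipeline_args = build_flink_beam_args("localhost:8081", 10)
```

In TFX, a list like this is typically passed as the `beam_pipeline_args` argument when constructing the pipeline.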
From what we could gather, Beam tells Flink to build a single task for the 3 PTransforms:
"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"
Our issue is that this ends up creating a single subtask in Flink, so a single Task Manager does all the work, as you can see in the attached screenshot. The problem therefore seems to be that the Beam workflow is not parallelized.
I have two main questions:
- Is this behavior normal?
- Is it possible to better distribute the workload across the different Task Managers?