Copy of [AIRFLOW-5071] JIRA: Thousands of Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally? #10790

@dmariassy

Description

Apache Airflow version: 1.10.9

Kubernetes version (if you are using kubernetes) (use kubectl version): Server: v1.10.13, Client: v1.17.0

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): Linux airflow-web-54fc4fb694-ftkp5 4.19.123-coreos #1 SMP Fri May 22 19:21:11 -00 2020 x86_64 GNU/Linux
  • Others: Redis, CeleryExecutor

What happened:

In line with the guidelines laid out in AIRFLOW-7120, I'm copying over a JIRA for a bug that has a significant negative impact on our pipeline SLAs. The original ticket is AIRFLOW-5071, which has a lot of details from various users who use ExternalTaskSensors in reschedule mode and see their tasks going through the following unexpected state transitions:

running -> up_for_reschedule -> scheduled -> queued -> up_for_retry

In our case, this issue seems to affect approximately 2000 tasks per day.

What you expected to happen:

I would expect tasks to go through the following state transitions instead:

running -> up_for_reschedule -> scheduled -> queued -> running
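To make the difference between the two sequences concrete, here is a minimal sketch that classifies a task's state history as matching the unexpected or the expected pattern. The state names are taken from the transitions above. The helper names (`contains_run`, `classify`) and the idea of holding a task's history as a plain list of strings are assumptions for illustration only, not part of Airflow's API.

```python
from typing import List

# State sequences copied from this report.
OBSERVED_BUG = ["running", "up_for_reschedule", "scheduled", "queued", "up_for_retry"]
EXPECTED = ["running", "up_for_reschedule", "scheduled", "queued", "running"]


def contains_run(history: List[str], pattern: List[str]) -> bool:
    """Return True if `pattern` occurs as a contiguous run inside `history`.

    Hypothetical helper, not an Airflow function.
    """
    n = len(pattern)
    return any(history[i:i + n] == pattern for i in range(len(history) - n + 1))


def classify(history: List[str]) -> str:
    """Label a state history as 'unexpected', 'expected', or 'unknown'.

    Hypothetical helper: the buggy pattern is checked first, so a history
    that contains both patterns is still reported as 'unexpected'.
    """
    if contains_run(history, OBSERVED_BUG):
        return "unexpected"
    if contains_run(history, EXPECTED):
        return "expected"
    return "unknown"
```

A history such as `["queued", "running", "up_for_reschedule", "scheduled", "queued", "up_for_retry"]` would be labelled `"unexpected"`. How the state history is collected for a given task instance is left open here.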

How to reproduce it:

Unfortunately, I don't have a configuration available at the moment that could be used to easily reproduce the issue. However, based on the thread in AIRFLOW-5071, the problem seems to arise in deployments that use a large number of sensors in reschedule mode.

Labels

  • Stale Bug Report
  • area:Scheduler (including HA (high availability) scheduler)
  • kind:bug (this is clearly a bug)
  • pinned (protect from Stalebot auto closing)
