
Kubernetes operator retry regression #12111

Closed
pceric opened this issue Nov 5, 2020 · 7 comments
Labels
kind:bug (This is clearly a bug), provider:cncf-kubernetes (Kubernetes provider related issues)

Comments

@pceric

pceric commented Nov 5, 2020

Apache Airflow version: 1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.15.9

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian 9 (Stretch)

What happened: As of Airflow 1.10.12, and going back to sometime around 1.10.10 or 1.10.11, the behavior of the retry mechanism in the KubernetesPodOperator regressed. Previously, when a pod failed due to an error, Airflow would spin up a new pod in Kubernetes on retry. As of 1.10.12, Airflow tries to re-use the same broken pod over and over:

INFO - found a running pod with labels {'dag_id': 'my_dag', 'task_id': 'my_task', 'execution_date': '2020-11-04T1300000000-e807cde8a', 'try_number': '6'} but a different try_number. Will attach to this pod and monitor instead of starting new one

This is bad because most failures we encounter are due to the underlying "physical" hardware failing, so retrying on the same pod is pointless: it will never succeed.

What you expected to happen: I would expect the Airflow Kubernetes operator to start a new pod, allowing it to be scheduled on a new Kubernetes node that does not have an underlying "physical" hardware problem, just as it did in earlier versions of Airflow.

How to reproduce it: Run a KubernetesPodOperator task with a retry count set and make the task fail in a way that can never succeed (for example, by breaking the underlying node).
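
For illustration, here is a minimal sketch of such a task using the contrib KubernetesPodOperator from 1.10.x (the DAG name, image, and command are placeholders; the command simply fails on every try to stand in for an unrecoverable node problem):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG("retry_repro", start_date=datetime(2020, 11, 1), schedule_interval=None) as dag:
    always_fails = KubernetesPodOperator(
        task_id="my_task",
        name="my-task",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c", "exit 1"],  # always fails, mimicking a pod that can never succeed
        retries=5,
        retry_delay=timedelta(minutes=1),
    )
```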

@pceric pceric added the kind:bug (This is clearly a bug) label Nov 5, 2020
@boring-cyborg

boring-cyborg bot commented Nov 5, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@philipherrmann
Contributor

I think this is because of the reattach_on_restart idea. The parameter is documented as "if the scheduler dies while the pod is running, reattach and monitor", but "while the pod is running" seems to be checked in an incorrect manner. It seems that the operator searches the pod's metadata for an already_checked label with value True. This label seems to be set only in the "call stack"

handle_pod_overlap > monitor_launched_pod > final_state != State.SUCCESS

This might be intended for idempotency reasons - that is unclear to me. At the least, it is unexpected to me.
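
To check whether a leftover pod has been marked this way, you can inspect its labels, e.g. with kubectl get pod <pod-name> -n <namespace> --show-labels (pod name and namespace are placeholders), and look for already_checked.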

Possible workarounds are setting is_delete_operator_pod = True or reattach_on_restart = False when constructing the operator. While both settings allow retries, they have different side effects.
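
As a rough sketch (not a full DAG; names and image are placeholders, and only one of the two settings is normally needed):

```python
KubernetesPodOperator(
    task_id="my_task",
    name="my-task",
    namespace="default",
    image="busybox",
    is_delete_operator_pod=True,  # delete the pod after each try, so a retry always starts a fresh pod
    # reattach_on_restart=False,  # alternatively: never reattach to an existing pod on retry
    retries=5,
)
```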

@pceric
Author

pceric commented Nov 28, 2020

Yes, I ended up enabling is_delete_operator_pod as a workaround, at the cost of no longer being able to jump into a finished pod for easier debugging.

@philipherrmann
Contributor

I think this might be fixed in version 1.10.13 by #11368. Has anyone "already checked"?

@pceric
Author

pceric commented Dec 4, 2020

That does look promising, although it introduced critical bug #12659. Waiting for 1.10.14 before upgrading.

@dimberman
Contributor

Hi @philipherrmann @pceric, please let me know if this bug is fixed. I would also recommend using the cncf.kubernetes backport provider instead of the operator in Airflow itself, as those are being deprecated in 2.0 (you'll also get fixes much faster through providers).
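
If anyone wants to try that route on 1.10.x, the rough steps would be to install the backport package and switch the import, e.g.:

```python
# after: pip install apache-airflow-backport-providers-cncf-kubernetes
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
```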

@pceric
Author

pceric commented Dec 14, 2020

I installed 1.10.14 today and while the behavior is a bit odd, it does work. If I have retries set to 5, Airflow will run 10 retries with every odd retry being a "dummy retry", gathering the output from the previous failure. But since everything works as expected I'm calling this fixed.

@pceric pceric closed this as completed Dec 14, 2020