-
Notifications
You must be signed in to change notification settings - Fork 16.3k
remove ECS Operator retry mechanism on task failed to start #53083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove ECS Operator retry mechanism on task failed to start #53083
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
|
|
@o-nikolas , can you please review my solution to the issue #52943? |
o-nikolas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally like this approach of simplification, especially since we have a retry mechanism at the Airflow task level. Was there anything that this ECS specific retry was giving us that simple Airflow Task level retries won't? Also, relatedly, was this retry ever working? You mention in the issue and description that this retry was being attempted on the wrong method and with the wrong arn. Is there ever a condition where this was working? Since it will be a change in behaviour (thus a breaking change) to remove it.
|
@o-nikolas This retry was invented to allow a separate retry mechanism on infrastructure errors, which can make sense if the triggered business logic is not idempotent (so the developer turns off Airflow retry); otherwise, Airflow retry can do the same job. This retry mechanism was invented before deferrable mode (#25413) and worked as expected, while Regarding the breaking change, the retry is not working since version |
o-nikolas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, you've convinced me :) I think we can merge this as a bug fix.
@jason810496 You were looking into the other PR for this as well, do you approve of this one?
@vincbeck thoughts?
jason810496
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! LGTM as well
vincbeck
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
Draft PR.
Closes: #52943
Currently, in the case of an
EcsTaskFailToStarterror, theEcsOperatorperforms a retry of the _check_success_task function (and not the task) with an emptyself.arnparameter, which returns None, leads Airflow to mark the DAG as successful despite the task failing to start.Since no actual task retry was performed, I suggest removing all the "retry" logic, which will result in Airflow marking the task as failed.
Copying relevant context from the issue: