@vladimirbinshtok vladimirbinshtok commented Jul 9, 2025

Draft PR.

Closes: #52943

Currently, when an EcsTaskFailToStart error occurs, the EcsOperator retries the _check_success_task function (not the task itself) with an empty self.arn. That call returns None, which leads Airflow to mark the DAG as successful even though the task failed to start.
Since no actual task retry was performed, I suggest removing all the "retry" logic, which will result in Airflow marking the task as failed.
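To illustrate the failure mode, here is a minimal, self-contained sketch. It is a toy model, not the provider's code: `ToyOperator`, `TaskFailedToStart`, `retry_on_start_failure` and `run` are hypothetical names, and the step that clears the ARN before raising is an assumption made so that the retried call sees an empty `arn`.

```python
import functools


class TaskFailedToStart(Exception):
    """Hypothetical stand-in for the provider's EcsTaskFailToStart."""


def retry_on_start_failure(fn):
    """Hypothetical retry wrapper: it re-invokes only the wrapped function."""

    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        try:
            return fn(self, *args, **kwargs)
        except TaskFailedToStart:
            # The check is retried; no new ECS task is launched.
            return fn(self, *args, **kwargs)

    return wrapper


class ToyOperator:
    """Toy model of the operator; not Airflow code."""

    def __init__(self, arn):
        self.arn = arn

    def check_success(self):
        if not self.arn:
            # Nothing to inspect, so return None (read as "no error").
            return None
        # Assumption: the ARN is cleared before the error is raised.
        self.arn = None
        raise TaskFailedToStart("task failed to start")

    # Same check, wrapped in the retry.
    check_success_with_retry = retry_on_start_failure(check_success)


def run(operator, with_retry):
    """Return the state a scheduler would record for the toy operator."""
    check = operator.check_success_with_retry if with_retry else operator.check_success
    try:
        check()
    except TaskFailedToStart:
        return "failed"
    return "success"
```

With the retry in place, `run(ToyOperator("some-arn"), with_retry=True)` reports success, because the second call sees an empty `arn` and returns None. Without it, the exception propagates and the run is reported as failed, which is the behaviour this PR aims for.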

Copying relevant context from the issue:

The code currently doesn't attempt to retry the task; it only retries the _check_success_task function call, which is pointless. The implemented retry mechanism doesn't support the deferrable and wait_for_completion flags, and adding that support would be fairly complicated, because the current retry number would have to be passed to the new task outside of the standard Airflow retry policy. No other Amazon operator has a retry policy for "task failed to start" errors (any service can have initialization issues), so I suggest aligning the ECS operator with that approach.
As mentioned in the document you attached, the recommended way to handle the Fargate task start error is to use Step Functions (the Step Function operator in Airflow), which allows an independent retry mechanism.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jul 9, 2025
@vladimirbinshtok
Contributor Author

@o-nikolas, can you please review my solution to issue #52943?

Contributor

@o-nikolas o-nikolas left a comment

I personally like this approach of simplification, especially since we already have a retry mechanism at the Airflow task level. Was this ECS-specific retry giving us anything that plain Airflow task-level retries won't? Relatedly, was this retry ever working? You mention in the issue and the description that it was being attempted on the wrong method and with the wrong ARN. Was there ever a condition under which it worked? I ask because removing it would be a change in behaviour (and thus a breaking change).

@vladimirbinshtok
Contributor Author

@o-nikolas This retry was introduced to provide a separate retry mechanism for infrastructure errors. That can make sense when the triggered business logic is not idempotent (so the developer turns off Airflow retries); otherwise, Airflow retries can do the same job.
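For the idempotent case, relying on Airflow's own retries could look roughly like the fragment below. The values are illustrative only; `retries` and `retry_delay` are standard Airflow operator arguments, and passing them through `default_args` is just one way to set them.

```python
from datetime import timedelta

# Illustrative values only. An Airflow retry re-runs the whole task,
# so this suits workloads whose business logic is idempotent.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}
```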

This retry mechanism was introduced before deferrable mode (#25413) and worked as expected as long as @AwsBaseHook.retry(should_retry_eni) was wrapping the start task function. During the migration to support deferrable mode, this retry was removed (#32104), and the retry mechanism stopped working.
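As a rough sketch of why the original placement worked: when the retry wraps the function that launches the task, each retry launches a new task. The decorator below is a hypothetical stand-in written for illustration, not the implementation of AwsBaseHook.retry or should_retry_eni, and `EniError`, `should_retry_infra` and `ToyOperator` are made-up names.

```python
import functools


class EniError(Exception):
    """Hypothetical stand-in for a retryable infrastructure error."""


def retry(should_retry, max_attempts=3):
    """Hypothetical retry decorator: re-run fn while should_retry(exc) is true."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors.
                    if attempt == max_attempts or not should_retry(exc):
                        raise

        return wrapper

    return decorator


def should_retry_infra(exc):
    """Retry only on the toy infrastructure error."""
    return isinstance(exc, EniError)


class ToyOperator:
    """Toy model; each call to start_and_wait launches a new 'task'."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.launch_count = 0
        self.arn = None

    @retry(should_retry_infra)
    def start_and_wait(self):
        self.launch_count += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise EniError("network interface provisioning timed out")
        self.arn = f"toy-task-{self.launch_count}"
        return self.arn
```

Here a `ToyOperator(2)` fails twice and succeeds on the third launch, whereas a `ToyOperator(3)` exhausts its attempts and the error propagates. Once the wrapper sits on a check function instead of the launch, no new task is started on retry.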

Regarding the breaking change: the retry has not worked since apache-airflow-providers-amazon 8.6.0, which was added in Airflow 2.7.1. Reimplementing the retry with deferrable mode is a complicated task, as it requires a separate retry variable to be passed around the system.

Contributor

@o-nikolas o-nikolas left a comment

Okay, you've convinced me :) I think we can merge this as a bug fix.

@jason810496 You were looking into the other PR for this as well, do you approve of this one?

@vincbeck thoughts?

Member

@jason810496 jason810496 left a comment

Nice! LGTM as well

Contributor

@vincbeck vincbeck left a comment

LGTM!

@o-nikolas o-nikolas merged commit e00771d into apache:main Jul 15, 2025
75 checks passed
Development

Successfully merging this pull request may close these issues.

Airflow task is marked as succeeded on the EcsTaskFailToStart exception