
Conversation

@fweilun
Contributor

@fweilun fweilun commented Jul 9, 2025

Closes: #52943

Draft PR.

This PR adds an EcsCannotPullContainerError exception to handle scenarios where ECS tasks fail to start due to image pull issues (e.g., CannotPullContainerError).

Test added: ✅ test_should_retry_eni_false_for_pull_failure
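For context, a minimal sketch of the mechanism this PR describes (based on the snippets quoted in the review below; the base class and the standalone helper are illustrative assumptions, not the PR's exact diff):

class EcsCannotPullContainerError(Exception):
    """Raised when an ECS task stops because its container image cannot be pulled."""


def raise_if_cannot_pull(task: dict) -> None:
    # Hypothetical helper for illustration; in the PR the check lives inside
    # the operator's _check_success_task and feeds the should_retry predicate,
    # which returns False for this exception so the task fails fast.
    stopped_reason = task.get("stoppedReason", "")
    if "CannotPullContainerError" in stopped_reason:
        raise EcsCannotPullContainerError(f"The task failed to start due to: {stopped_reason}")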

@fweilun fweilun requested review from eladkal and o-nikolas as code owners July 9, 2025 09:43
@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jul 9, 2025
@fweilun fweilun changed the title Fix ecs retry condition Fix ecs/EcsRunTaskOperator retry condition Jul 9, 2025
@jason810496 jason810496 self-requested a review July 9, 2025 09:49
Member

@jason810496 jason810496 left a comment


Nice! Thanks for the PR.

How about adding a new test to check whether _check_success_task raises the new EcsCannotPullContainerError exception?

Here are some good references for adding a test case against _check_success_task:

@mock.patch.object(EcsBaseOperator, "client")
@mock.patch("airflow.providers.amazon.aws.utils.task_log_fetcher.AwsTaskLogFetcher")
def test_check_success_tasks_raises_cloudwatch_logs(self, log_fetcher_mock, client_mock):
    self.ecs.arn = "arn"
    self.ecs.task_log_fetcher = log_fetcher_mock
    log_fetcher_mock.get_last_log_messages.return_value = ["1", "2", "3", "4", "5"]
    client_mock.describe_tasks.return_value = {
        "tasks": [{"containers": [{"name": "foo", "lastStatus": "STOPPED", "exitCode": 1}]}]
    }
    with pytest.raises(Exception) as ctx:
        self.ecs._check_success_task()
    assert str(ctx.value) == (
        "This task is not in success state - last 10 logs from Cloudwatch:\n1\n2\n3\n4\n5"
    )
    client_mock.describe_tasks.assert_called_once_with(cluster="c", tasks=["arn"])

Comment on lines 34 to 39
def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM)."""
    if isinstance(exception, EcsCannotPullContainerError):
        return False

    if isinstance(exception, EcsOperatorError):
Member


IMO, letting EcsCannotPullContainerError fail fast instead of retrying should be fine, right?

Based on the documentation (CannotPullContainer task errors in Amazon ECS), it's more of a user configuration error than system instability.

cc @o-nikolas , @eladkal

Member


I think there are cases suitable for retry with a reasonable wait time, e.g.:

ERROR: toomanyrequests: Too Many Requests or You have reached your pull rate limit.
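If retrying some pull failures is worthwhile, one hedged option (everything here is illustrative, not existing provider code) is to match on the failure text rather than treating every CannotPullContainerError the same:

# Sketch: retry only pull failures that look transient (e.g. registry rate limiting)
# and fail fast on everything else. The substrings are examples, not an exhaustive list.
TRANSIENT_PULL_ERRORS = ("toomanyrequests", "pull rate limit")


def should_retry_pull_failure(stopped_reason: str) -> bool:
    reason = stopped_reason.lower()
    if "cannotpullcontainererror" not in reason:
        return False
    return any(marker in reason for marker in TRANSIENT_PULL_ERRORS)


# A Docker Hub rate-limit message would be retried; a missing image would not.
assert should_retry_pull_failure(
    "CannotPullContainerError: toomanyrequests: You have reached your pull rate limit."
)
assert not should_retry_pull_failure("CannotPullContainerError: manifest unknown")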

@fweilun
Contributor Author

fweilun commented Jul 9, 2025

Nice! Thanks for the PR.

How about adding a new test to check whether _check_success_task raises the new EcsCannotPullContainerError exception?


Done! Test case added.



def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM)."""
Contributor

@dominikhei dominikhei Jul 9, 2025


Just a tiny nit, but maybe adjust the docstring to incorporate the new behavior?
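For example, the wording could cover both branches (this is only a suggestion, not the PR's actual text):

def should_retry(exception: Exception):
    """Check whether the exception is retryable.

    Resource-quota failures (CPU, MEM) are retried; EcsCannotPullContainerError,
    i.e. image pull failures, is not, so those tasks fail fast.
    """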

Contributor

@o-nikolas o-nikolas left a comment


Hey, thanks for this PR. The ECS operator is very popular, so it gets a lot of users and a lot of niche failure modes come up.

Some things I'm confused or wary about with this PR:

  1. Wasn't #52943 about more than just container pull errors? It's about any configuration error that prevents the container from starting up initially, rather than a failure within the runtime (i.e. Airflow) code. Why are we focusing on just one case here?
  2. We seem to be leaning towards a complicated control flow with custom exceptions within the ECS operator. I'm not sure things really need to be this complicated.
  3. Why are we updating a retry function that was meant to just be for ENIs with container pull exception handling? Does this retry need to be more generic? Do we need to rethink things?

Member

@Lee-W Lee-W left a comment


Looks good IMO, but would like to wait for the existing comments to be resolved


@Lee-W
Member

Lee-W commented Jul 10, 2025

Also, CI needs to be fixed.

@fweilun
Contributor Author

fweilun commented Jul 10, 2025

I wonder whether removing the retry mechanism might make more sense here.
#53083

@jason810496
Member

I wonder whether removing the retry mechanism might make more sense here. #53083

+0 for leaving retries to Airflow's task-level retry instead of retrying at the _check_success_task function level. Both work; I have no strong opinion either way.

It seems simpler. However, as mentioned in #53083 (review), removing the function-level retry mechanism will introduce a breaking change.

IMO, if we go for removing the function-level retry mechanism (#53083), then we might need to close this one.

Or, if we decide to retain the function-level retry for compatibility, I think adding the CannotPullContainerError string to the should_retry_eni function

def should_retry_eni(exception: Exception):
    """Check if exception is related to ENI (Elastic Network Interfaces)."""
    if isinstance(exception, EcsTaskFailToStart):
        return any(
            eni_reason in exception.message
            for eni_reason in ["network interface provisioning", "ResourceInitializationError"]
        )

and removing the following explicit error handling along with the EcsCannotPullContainerError exception

          if "CannotPullContainerError" in task.get("stoppedReason", ""):
                raise EcsCannotPullContainerError(
                    f"The task failed to start due to: {task.get('stoppedReason', '')}"
                )

might be simpler and more generic.

Then we can go through the possible stoppedReason errors and add any missing retryable error (CannotPullContainerError in this case) to the ["network interface provisioning", "ResourceInitializationError"] list to make it more robust (maybe make this list an enum or a global constant).
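A minimal sketch of that constant-based approach (the constant name and the tuple contents are illustrative assumptions; the stand-in EcsTaskFailToStart is defined here only so the snippet is self-contained):

# Retryable stoppedReason fragments collected in one place instead of an
# inline list; CannotPullContainerError is the newly added entry.
RETRYABLE_STOPPED_REASONS: tuple[str, ...] = (
    "network interface provisioning",
    "ResourceInitializationError",
    "CannotPullContainerError",
)


class EcsTaskFailToStart(Exception):
    # Stand-in for the provider's exception, shown here only so the sketch runs.
    def __init__(self, message: str):
        self.message = message
        super().__init__(message)


def should_retry_eni(exception: Exception) -> bool:
    """Check if the exception's stopped reason is one of the retryable ones."""
    if isinstance(exception, EcsTaskFailToStart):
        return any(reason in exception.message for reason in RETRYABLE_STOPPED_REASONS)
    return False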

@o-nikolas
Contributor


Just to +1 again (full reply here): I think we should make every effort to simplify this one. If the mechanism is truly broken, then we can just mark it as a bug fix and we don't need to worry about breaking changes. It's just a matter of determining whether it's broken in all/most cases, which perhaps needs confirming by @vladimirbinshtok (the original requester). Once we have that, we can go with this PR.

@jason810496
Member

jason810496 commented Jul 14, 2025

Sorry @fweilun, we have to close this one as #53083 will resolve the issue.
Big thanks for your help!


Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Airflow task is marked as succeeded on the EcsTaskFailToStart exception

5 participants