Fix ecs/EcsRunTaskOperator retry condition #53080
Conversation
jason810496
left a comment
Nice! Thanks for the PR.
How about adding a new test to check whether _check_success_task raises the new EcsCannotPullContainerError exception?
Here are some good references for adding a test case against _check_success_task:
airflow/providers/amazon/tests/unit/amazon/aws/operators/test_ecs.py
Lines 447 to 464 in 7f6bc14
```python
    @mock.patch.object(EcsBaseOperator, "client")
    @mock.patch("airflow.providers.amazon.aws.utils.task_log_fetcher.AwsTaskLogFetcher")
    def test_check_success_tasks_raises_cloudwatch_logs(self, log_fetcher_mock, client_mock):
        self.ecs.arn = "arn"
        self.ecs.task_log_fetcher = log_fetcher_mock

        log_fetcher_mock.get_last_log_messages.return_value = ["1", "2", "3", "4", "5"]

        client_mock.describe_tasks.return_value = {
            "tasks": [{"containers": [{"name": "foo", "lastStatus": "STOPPED", "exitCode": 1}]}]
        }

        with pytest.raises(Exception) as ctx:
            self.ecs._check_success_task()

        assert str(ctx.value) == (
            "This task is not in success state - last 10 logs from Cloudwatch:\n1\n2\n3\n4\n5"
        )

        client_mock.describe_tasks.assert_called_once_with(cluster="c", tasks=["arn"])
```
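A minimal sketch of what the suggested test could look like, modeled on the reference above and intended to sit in the same test class (so `mock`, `pytest`, `EcsBaseOperator`, and the `self.ecs` fixture come from that file); the import location of EcsCannotPullContainerError and the exact stoppedReason wiring are assumptions, not the PR's final code:

```python
    @mock.patch.object(EcsBaseOperator, "client")
    def test_check_success_tasks_raises_cannot_pull_container(self, client_mock):
        # Simulate a task that was stopped because its image could not be pulled.
        self.ecs.arn = "arn"
        client_mock.describe_tasks.return_value = {
            "tasks": [
                {
                    "stoppedReason": "CannotPullContainerError: pull access denied",
                    "containers": [{"name": "foo", "lastStatus": "STOPPED"}],
                }
            ]
        }

        # The operator should surface the dedicated exception rather than a
        # generic failure, so callers (and the retry predicate) can fail fast.
        with pytest.raises(EcsCannotPullContainerError):
            self.ecs._check_success_task()

        client_mock.describe_tasks.assert_called_once_with(cluster="c", tasks=["arn"])
```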
```python
def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM)."""
    if isinstance(exception, EcsCannotPullContainerError):
        return False

    if isinstance(exception, EcsOperatorError):
```
IMO, letting EcsCannotPullContainerError fail fast instead of retrying should be fine, right?
Based on the documentation (CannotPullContainer task errors in Amazon ECS), it's more of a configuration error on the user's side than system instability.
cc @o-nikolas, @eladkal
I think there are cases suitable for retry with a reasonable wait time, e.g.:
ERROR: toomanyrequests: Too Many Requests or You have reached your pull rate limit.
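A minimal sketch of that idea, not part of this PR: treat rate-limit pull failures as transient while still failing fast on other pull errors. The marker strings and the assumption that the stopped reason ends up in the exception message are illustrative only:

```python
# Hypothetical refinement, assuming the ECS stopped reason text is carried in
# the exception message; actual attribute names in the operator may differ.
TRANSIENT_PULL_MARKERS = ("toomanyrequests", "Too Many Requests", "pull rate limit")


def should_retry_pull_failure(exception: Exception) -> bool:
    """Retry rate-limited image pulls; fail fast on other pull errors."""
    message = str(exception)
    if "CannotPullContainerError" not in message:
        return False
    return any(marker in message for marker in TRANSIENT_PULL_MARKERS)
```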
Done! Test case added.
```python
def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM)."""
```
Just a tiny nit, but maybe adjust the docstring to incorporate the new behavior?
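For example, something along these lines (wording is only a suggestion):

```python
def should_retry(exception: Exception):
    """Check if the exception is retryable: resource-quota errors (CPU, MEM) are, image pull failures are not."""
```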
o-nikolas
left a comment
Hey, thanks for this PR. The ECS operator is very popular, so it gets a lot of users and a lot of niche failure modes come up.
Some things I'm confused or wary about with this PR:
- Wasn't #52943 about more than just container pull errors? It's about any configuration error that stops the container from starting up initially, rather than failing within the runtime (i.e., Airflow) code. Why are we focusing on just one case here?
- We seem to be leaning towards a complicated control flow with custom exceptions within the ECS operator. I'm not sure things really need to be this complicated.
- Why are we updating a retry function that was meant to just be for ENIs with container pull exception handling? Does this retry need to be more generic? Do we need to rethink things?
Lee-W
left a comment
Looks good IMO, but I'd like to wait for the existing comments to be resolved.
Also, CI needs to be fixed.
I wonder whether removing the retry mechanism might make more sense here.
+0 for leaving retry to Airflow's task-level retries instead of retrying at the function level; it seems simpler. However, as mentioned in #53083 (review), removing the function-level retry mechanism would introduce a breaking change. IMO, if we go for removing the function-level retry mechanism (#53083), then we might need to close this one. Or, if we decide to keep the function-level retry for compatibility, I think removing the following explicit error handling (and the dedicated exception it raises)

```python
if "CannotPullContainerError" in task.get("stoppedReason", ""):
    raise EcsCannotPullContainerError(
        f"The task failed to start due to: {task.get('stoppedReason', '')}"
    )
```

might be simpler and more generic. Then we can go through the possible errors in stoppedReason instead.
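A rough sketch of that more generic direction, purely illustrative and not from the PR: classify the stoppedReason text directly instead of introducing a dedicated exception class. The marker list is a placeholder that would need to be filled in from the AWS documentation on stopped-task error messages:

```python
# Hypothetical, illustrative classification of ECS stoppedReason values;
# the set of non-retryable markers below is not exhaustive.
NON_RETRYABLE_STOP_REASONS = ("CannotPullContainerError",)


def is_retryable_stop_reason(stopped_reason: str) -> bool:
    """Return False for configuration-style failures, True otherwise."""
    return not any(marker in stopped_reason for marker in NON_RETRYABLE_STOP_REASONS)
```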
Just to +1 again (full reply here): I think we should make every effort to simplify this one. If the mechanism is truly broken, then we can just mark it as a bug fix and we don't need to worry about breaking changes. It's just determining whether it's broken in all/most cases, which needs confirming, perhaps by @vladimirbinshtok (the original requester). Once we have that, we can go with this PR.
Closes: #52943
Draft PR.
This PR adds an EcsCannotPullContainerError exception to handle scenarios where ECS tasks fail to start due to image pull issues (e.g., CannotPullContainerError).
Test added: ✅ test_should_retry_eni_false_for_pull_failure
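For readers skimming the thread, a hedged sketch of the overall shape of the change, pieced together from the diff fragments quoted above; module placement and the helper name are assumptions, not the PR's actual structure:

```python
class EcsCannotPullContainerError(Exception):
    """Raised when an ECS task fails to start because its image cannot be pulled."""


def raise_if_cannot_pull(task: dict) -> None:
    """While checking task success, surface pull failures via the dedicated exception."""
    if "CannotPullContainerError" in task.get("stoppedReason", ""):
        raise EcsCannotPullContainerError(
            f"The task failed to start due to: {task.get('stoppedReason', '')}"
        )


def should_retry(exception: Exception):
    """Check if the exception is retryable; image pull failures are not."""
    if isinstance(exception, EcsCannotPullContainerError):
        return False
    # ... the existing quota-related (CPU, MEM) check for EcsOperatorError
    # would remain here unchanged ...
    return False
```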