
[ECS] One task prevents from all of the other in instance to change from PENDING to RUNNING #325

Closed
Alonreznik opened this issue Nov 1, 2018 · 47 comments
Labels: Coming Soon, ECS (Amazon Elastic Container Service), Proposed (Community submitted issue)

@Alonreznik

Hello There.
Lately we've been running into strange behavior from ECS, where stopping tasks prevent new tasks from running on an instance.

A little about our case:
We have tasks that need to finish their work and then exit on their own after a stopTask command. In other words, we have a graceful-shutdown process that sometimes takes a while to complete (more than a few seconds, and occasionally several minutes).

However, once stopTask is sent to these tasks, they no longer appear in the task list in the ECS console (which is fine), but they also block all other tasks on the same instance that are trying to change their state from PENDING to RUNNING.

Here is an example of the tasks on one instance when this happens:
[screenshot: the instance's task list]

Why does this happen? Why should one task prevent others from running next to it until it is done? This is poor resource management (we don't use the full capacity of our instances while tasks are pending).

Ideally, a stopped task would keep appearing in the console until it has actually stopped on the instance, and a task's transition from PENDING to RUNNING would not be affected by other tasks on the same instance.

I hope you can fix that behavior,

Thanks!

@petderek

petderek commented Nov 1, 2018

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources. For example, a webserver would still need its resource pool while waiting for all connections to terminate. The alternative would be to allocate fewer resources when a task transitions to stopping, but that's not a safe assumption to make across all workloads.

Would configurable behavior help your use case? Or is it sufficient to be more clear about this behavior in the console?

@Alonreznik

Alonreznik commented Nov 4, 2018

Hey @petderek.
Thank you for your response.

I understand that it works this way by design. However, I wonder why one task should prevent all the others from running.

The best configurable behavior for us would be per-task handling: keep accounting for the resources held by the stopping task (which is fine and reasonable), but don't prevent other tasks from running on the instance while it still has resources to give.

In our use case, the tasks run a long-poll workload as a service, not a web client. This behavior keeps our instances from filling up in time and can also get our process stuck during a new deployment, because instances wait for one task to end before the other tasks are allowed to run.

So the instance is effectively in a kind of "disabled" or "draining" state until the long workload is done (and that can take some time).

What can we do to make our use case work well on ECS?

Thanks

@Alonreznik

Hi!
Is there any update on this?
Is there a solution or workaround we could use to get the per-task mechanism?

Thank you in advance!

@FlorianWendel

Hi everyone,

We are facing the exact same issue. @Alonreznik is right: one task is blocking all other tasks, and in my opinion this does not make sense. Let me illustrate:

Assume we have one task with a 10 GB memory reservation running on a container instance that has registered with 30 GB. The container instance shows 20 GB of RAM available, and that is correct. Now this task is stopped (the ECS agent will make Docker send a SIGTERM), but the container keeps running to finish its calculations (it now shows under stopped tasks as "desired status = STOPPED" and "last status = RUNNING"). The container instance will now show 30 GB available in the AWS ECS console, which is nonsense; it should still be 20 GB, since the container is still using resources, as @petderek mentioned. Even worse, if we try to launch three new tasks with a 10 GB memory reservation each, they will all be pending until the still-running task transitions to "last status = STOPPED". The expected behavior would be that two of the three tasks can launch immediately.

I hope my example was understandable, else feel free to ask.
And thanks for looking into this :)
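
For anyone who wants to observe this accounting from the API, here is a minimal boto3 sketch (the cluster name is a placeholder, and it assumes the cluster has at least one registered instance). It prints each container instance's registered versus remaining memory; while a container is still draining you can watch remainingResources jump back up, exactly as described above:

  import boto3

  ecs = boto3.client("ecs")
  CLUSTER = "my-cluster"  # placeholder cluster name

  arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
  instances = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)

  def memory_mib(resources):
      # MEMORY is reported as an integer resource, in MiB
      return next(r["integerValue"] for r in resources if r["name"] == "MEMORY")

  for ci in instances["containerInstances"]:
      print(
          ci["ec2InstanceId"],
          "registered:", memory_mib(ci["registeredResources"]), "MiB,",
          "remaining:", memory_mib(ci["remainingResources"]), "MiB,",
          "running tasks:", ci["runningTasksCount"],
          "pending tasks:", ci["pendingTasksCount"],
      )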

@yumex93

yumex93 commented Nov 19, 2018

Hey! As a workaround, you can set ECS_CONTAINER_STOP_TIMEOUT to a smaller value. This setting configures the "time to wait for the container to exit normally before being forcibly killed". By default, it is set to 30s. More information can be found here. I have marked this issue as a feature request and we will work on it soon.
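
For reference, ECS_CONTAINER_STOP_TIMEOUT is a container agent setting, so it goes into /etc/ecs/ecs.config on each container instance; a minimal excerpt (the 2m value is only an example, and the agent has to be restarted to pick it up):

  # /etc/ecs/ecs.config
  ECS_CLUSTER=my-cluster
  ECS_CONTAINER_STOP_TIMEOUT=2m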

@Alonreznik

Alonreznik commented Nov 20, 2018

Hi @yumex93
Thank you for your response.
We will be very happy to have that feature as soon as it's out :)

About your workaround: in most cases we need our containers to complete a graceful shutdown before they die. Decreasing ECS_CONTAINER_STOP_TIMEOUT would therefore cause our workers to be killed before the shutdown completes, so the feature is very much needed :)

Thank you again for your help, we're waiting for updates about it.

Alon

@FlorianWendel

@Alonreznik , @yumex93 We have the same situation; some workers even take a few hours to complete their task, and we've leveraged ECS_CONTAINER_STOP_TIMEOUT to shut those down gracefully as well. Since ECS differentiates between a "desired status" and a "last status" for tasks, I believe it should be possible to handle tasks that are in the process of shutting down better than it works today. To illustrate what I mean, see this screenshot:

[screenshot: ecs-bug]

The tasks are still running and still consume resources, but the container instance does not seem to keep track of those resources. If this is more than just a confusing display, I would expect it to cause issues like the one described above.

@Alonreznik

Hi @yumex93, any update on this issue?

Thanks

Alon

@yhlee-aws

We are aware of this issue and are working on prioritizing it. We will keep this issue open to track it, and will provide an update when we have more solid plans.

@Alonreznik

Hi @yunhee-l.
Thank you for your last response.
We're still facing this issue, which forces us to launch more servers than we need for our deployments and gets our workloads stuck.
Any update in that case?

Thanks

@Alonreznik

Alonreznik commented Feb 27, 2019

Hi @yunhee-l @FlorianWendel
any update?

@yhlee-aws

We don't have any new updates at this point. We will update when we have more solid plans.

@yhlee-aws

Related: aws/amazon-ecs-agent#731

@tomotway

tomotway commented Mar 8, 2019

Hi,

Just wanted to add our experience with this in the hope that it can be bumped up in priority.

We need to run tasks that can be long-running. With this behaviour as it stands, it essentially locks up the EC2 instance so that it cannot take any more tasks until the first task has shut down (which could be a few hours). It wouldn't be quite so bad if ECS marked the host as unusable and placed tasks on other hosts, but it doesn't; it still sends them to the host that cannot start them. This has the potential to cause a service outage for us, in that we cannot create tasks to handle the workload (we tell the service to create tasks, but it can't due to the lock-up).

Thanks.

@Alonreznik

@petderek @yumex93
This really makes us pay for more resources than we need on every deployment. As you can see, more than one user is suffering from this design decision.

Do you have any ETA for implementing it or deploying it? This is a real blocker for our ongoing processes.

Thank you

Alon

@adnxn

adnxn commented Mar 11, 2019

@Alonreznik: thanks for following up again and communicating the importance of getting this resolved. This helps us prioritize our tasks.

We don't have an ETA right now, but we have identified the exact issue and have a path forward that requires changes to our scheduling system and the ECS agent. To give you some more context, as @petderek said earlier:

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources.

Changing this behavior will be a departure from our existing way of accounting for resources when we schedule tasks. Considering that the current approach has been in place since the beginning of ECS, the risks involved in changing it are significant, as there could be subtle rippling effects in the system. We plan to explore ways to validate the change and make sure we do not introduce regressions.

The original design traded towards oversubscribing resources for placement by releasing resources on the instance when tasks were stopped, but the side effect of that is the behavior you are describing. Additionally, now that we've added granular SIGKILL timeouts for containers with #1849, we can see this problem being exacerbated.

So, all that is to say: we're working on this issue and we will update this thread as we work towards deploying the changes.
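
For readers following along, the granular timeouts mentioned above correspond to the per-container stopTimeout setting in the task definition (the container-level counterpart of ECS_CONTAINER_STOP_TIMEOUT). A minimal boto3 sketch, with placeholder family, image, and timeout values, of registering a worker that gets extra time to drain after the SIGTERM:

  import boto3

  ecs = boto3.client("ecs")

  ecs.register_task_definition(
      family="long-poll-worker",  # placeholder family name
      requiresCompatibilities=["EC2"],
      containerDefinitions=[
          {
              "name": "worker",
              "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest",  # placeholder image
              "memoryReservation": 1024,   # MiB, example value
              "essential": True,
              "stopTimeout": 120,          # seconds between SIGTERM and SIGKILL, example value
          }
      ],
  )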

@Alonreznik

Alonreznik commented Mar 14, 2019

@adnxn
Thank you for your detailed explanation. It helps a lot in understanding the context of the situation.

We of course understand that this is built into the design, and we accept that.

However, we are not asking for a radical change to the core of the system (which is great!!). Our request concerns the ecs-agent's assumption that all of the previous tasks' resources on the instance must be released first; we are just asking for this to be handled per task (and also to have some indication that the task is still running on the instance after it got the SIGTERM).

As it stands today, resource handling and releasing are based on the entire instance, not on the individual tasks running on it. So if a task releases its resources, the ecs-agent should allow those resources to be scheduled for new tasks (if they meet the resource requirements).

Thank you for your help!
Much appreciated!

Please keep us posted,

Alon

@Halama

Halama commented May 23, 2019

Hello,
we are affected by exactly the same issue. An ECS service deploys long-poll workers with stopTimeout set to 2 hours. A task in RUNNING state with desired status STOPPED blocks all new tasks scheduled on the same instance, even though there are free resources available.

Adding new instances to the cluster helped us work around this situation, but it can get really costly if there are multiple deploys each day.

Are there any new updates on this issue, or possible workarounds?

It could definitely be solved by removing the long-poll service and switching to just calling ECS RunTask (process one job and terminate) without waiting for the result. But that would require more changes to our application architecture, and it would also be more tightly coupled to ECS.
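
For what it's worth, a rough boto3 sketch of that alternative, with placeholder cluster, task definition, and container names; the caller fires off one task per job and does not wait for the result:

  import boto3

  ecs = boto3.client("ecs")

  def submit_job(job_id: str):
      # One RunTask call per job; the container processes the job and then exits on its own.
      return ecs.run_task(
          cluster="my-cluster",               # placeholder
          taskDefinition="long-poll-worker",  # placeholder task definition family
          launchType="EC2",
          overrides={
              "containerOverrides": [
                  {
                      "name": "worker",  # must match the container name in the task definition
                      "environment": [{"name": "JOB_ID", "value": job_id}],
                  }
              ]
          },
      )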

thanks
Martin

@coultn coultn transferred this issue from aws/amazon-ecs-agent Jun 10, 2019
@coultn coultn added the ECS (Amazon Elastic Container Service) and Proposed (Community submitted issue) labels Jun 10, 2019
@coultn coultn changed the title from "One task prevents from all of the other in instance to change from PENDING to RUNNING" to "[ECS] One task prevents from all of the other in instance to change from PENDING to RUNNING" Jun 10, 2019
@Alonreznik

Hi @coultn @adnxn
Any update or ETA on this?

Thank you

Alon

@Alonreznik

Hi guys,
Can somebody take a look at this?
This is harming our business, because we're having problems deploying new versions to production. It is really problematic, and it casts a dark light on whether we should keep using ECS.

Thanks

@Alonreznik

@coultn

@coultn

coultn commented Oct 10, 2019

Hi, thank you for your feedback on this issue. We are aware of this behavior and are researching solutions. We will keep the github issue up to date as the status changes.

@Alonreznik

Hi @coultn .
Thanks for your reply.

We must say that this prevents our workloads from scaling in line with our tasks, and there are situations where this behavior actually gets our production servers stuck. Again, that can be a no-go (or a no-continue, in our case) for using ECS in prod.

For example, below you can see a typical desired/running gap for our production workloads.
[graph: desired vs. running tasks in the cluster]

The green layer is the gap between the desired tasks and the running tasks (the orange layer). The blue layer is the PENDING tasks in the cluster. You can see a constant gap between these two numbers. No deployment was made today; this is something we're hitting in the scale-up mechanism alone.

Think about the situation we're in. We have new jobs in our queue (SQS), so we ask ECS to run new tasks (meaning the desired task count increases).
Each workload is an ECS task, and they are all spread across the servers.
When a workload takes a long time to complete (and there are many of these, because we ask each worker to finish its current job before it dies), one workload blocks the entire instance from receiving new workloads, even though there are free resources on the instance.

The ECS agent schedules new workloads onto that instance and then hits the one task that is still working. From the agent's point of view it has done its job: it scheduled new tasks. But those tasks are stuck in the PENDING state, for hours in some cases, which makes the instance unusable because they're just not running yet. Now imagine that you need to launch 100 more tasks within a few hours to work through a backlog, and you have 5-6 instances each blocked by one task; it becomes a mess.

We should also say that we have only encountered this during the last year, after an agent upgrade a year or a year and a half ago.

Every day we need to ask for more instances for our workloads in order to clear the blockage. This is not how a production service on AWS should have to be maintained, and we are facing this again and again, every day.

Please help us keep using ECS as our production orchestrator. We love this product and want it to succeed, but as it stands, it doesn't fit long-running tasks.

Your team's help in expediting this would be much appreciated.

Thank you

Alon

@Halama

Halama commented Oct 31, 2019

I've discussed this with Nathan; he told me that they plan to fix this, but unfortunately there is no quick fix. We have similar issues with deployment and scaling, and because of it a lot of unnecessary rolling of new instances.

Meanwhile, we are experimenting with EKS (also for multi-cloud deployment), where this issue isn't present.

@Alonreznik


Hi @Halama.

Thanks for the reply and the update.

I understand that this is not something that can be solved quickly, but in the meantime the ECS team could provide workarounds, such as a binpack placement strategy targeting the newest instances, or a way to limit how long a task can stay in PENDING before it is tried on a new instance. This issue is not getting any real response even though many users are running into it. It has been open for more than a year and they can't give any reasonable ETA (even 3 months would be fine for us). It was only moved to "researching" in the last week.

Can you please share more about your migration process from ECS to EKS?

Thanks again

Alon

@pavneeta pavneeta self-assigned this Jan 6, 2021
@thom-vend

Hi @pavneeta, any update on this issue?

@AlexisConcepcion

AlexisConcepcion commented Oct 6, 2021

Any update on this?

@estoesto

estoesto commented Oct 14, 2021

I'm running 1 task per host, with autoscaling, but everything piles up in the MQ because of this one stopping task (which runs daily and should stop gracefully). CI/CD pipelines also fail, since I'm relying on aws ecs wait services-stable.
The only workaround that works for me is to modify the capacity provider to run extra instances. What a waste.

@coultn Your suggestion would solve it. Any ETA for this?
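
For reference, the same wait can be done from boto3, where the polling interval and number of attempts are configurable (cluster and service names below are placeholders); a longer waiter budget can at least keep a pipeline from giving up while a long-stopping task drains:

  import boto3

  ecs = boto3.client("ecs")

  waiter = ecs.get_waiter("services_stable")
  waiter.wait(
      cluster="my-cluster",     # placeholder
      services=["my-service"],  # placeholder
      # Poll every 15 seconds for up to 240 attempts (about an hour).
      WaiterConfig={"Delay": 15, "MaxAttempts": 240},
  )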

@AlexisConcepcion

We recently implemented Datadog and cAdvisor as daemons for ECS using CloudFormation. We have more than 20 stacks, a few of them running about 10 instances (the bigger ones). On the first try, the daemons took about 5 hours to reach the RUNNING state. The key to improving this and getting the new daemon tasks running was to set MinClusterSize=1 (it was not previously defined) and the following placement strategy in ECS-Service.yaml (after those modifications we deployed the daemons).

 PlacementStrategies:
 - Type: spread
   Field: instanceId
 - Type: binpack
   Field: memory

We are planning to apply it in prod soon. Keep in mind that changing the placement strategy performs a rollout of your running instances. I don't think it is a solution, but it could help!

@Alonreznik

Any update on this? We love ECS, but this use case is driving us to Kubernetes, which handles it easily.

@Alonreznik

BTW, it has been 3 years (!!!!) since this issue was opened, and many people are still facing this unexpected behaviour. I think that is a good reason to fix it once and for all.

@Alonreznik

Hi @petderek. Any update?

@markdascher

AWS seems to be overly cautious regarding a fix, and I think it's because the issue still isn't clearly understood by everyone involved. I'm not entirely sure that I understand it myself, but after reading the whole thread, here's what it seems to boil down to:

  1. A 40 GB host has a single 10 GB task. It can start three more 10 GB tasks just fine. Everyone is happy.
  2. The same 40 GB host has the same 10 GB task, but now that task is stopping. Suddenly we can't start any new tasks on this host, even though there are 30 GB available.

Scenario 2 makes no sense. It's clearly a bug. The phrase "by design" doesn't belong in this thread. I understand how it could've happened, though; it's perhaps an unfortunate workaround for an older bug:

  • Bug A: When tasks are stopping, the system calculates available resources incorrectly. Perhaps the calculation shows "40 GB free" instead of "30 GB free."
  • Bug B: Rather than fixing Bug A, the ECS Agent includes logic to know when the calculation is incorrect, and then decides to freeze (with potentially catastrophic consequences) during that timeframe.

Is that accurate? Are we actually worried about the unintended consequences of fixing Bug A?

In our case shortening stopTimeout isn't a viable option, and neither are placementConstraints. Every host may have tasks stopping at the same time, so placementConstraints would just continue making them all unusable. (And even in a best case, it would result in very suboptimal placement as everything gets squeezed onto a small number of usable hosts.)

Two possible fixes:

  • When tasks are stopping, continue calculating available resources correctly. In the example above, that means there are only 30 GB free until the container is actually gone.
  • If that's too drastic, then at least make the ECS Agent try harder. If tasks are stopping, make the ECS Agent correct the calculations locally, and continue if it's safe to do so. If you're unlucky and tasks get placed onto a host that's actually full, then you're out of luck. But that's still way better than where we are now, and at least isn't completely baffling behavior.

@Alonreznik

Hi everyone. It seems this just won't be prioritized, and the ECS team is effectively saying "we're living with the bug", while this bug prevents so many users from doing BASIC things on ECS, such as simply running tasks that work.
Can someone give this some attention?

@AbhishekNautiyal

AbhishekNautiyal commented Jun 30, 2023

We are excited to share that we've addressed the known issue in the ECS agent that caused tasks to be stuck in the PENDING state on instances that have stopping tasks with long timeouts. For details on the root cause, the fix, and other planned improvements, please see the What's New post, blog post, and documentation.

We'll be closing this issue. As always, happy to receive your feedback. Let us know if you face any other issues.

@Alonreznik

Alonreznik commented Jul 3, 2023

Holy moly!!! 5 years!! Amazing guys! I'm so excited! Thank you so much 🙏🙏
