
[ECS] One task prevents from all of the other in instance to change from PENDING to RUNNING #325

Closed
Alonreznik opened this issue Nov 1, 2018 · 47 comments
Labels: Coming Soon, ECS (Amazon Elastic Container Service), Proposed (Community submitted issue)

@Alonreznik

Hello There.
Lately we've been running into strange behavior from ECS, where stopping tasks prevent new tasks from running on an instance.

A little about our case:
We have tasks that need to finish their work and then exit on their own after a stopTask command. In other words, we have a graceful-shutdown process that sometimes takes a while to complete (more than a few seconds, and occasionally several minutes).

However, once stopTask is sent to these tasks, they no longer appear in the task list in the ECS console (which is fine), but they also block all other tasks on the same instance that are trying to change their state from PENDING to RUNNING.

Here is an example of the tasks on one instance when this happens:
[screenshot: the instance's task list]

Why does this happen? Why should one task prevent others from running next to it until it is done? This is poor resource management (we don't use the full capacity of our instances while tasks are pending).

Ideally, a stopped task would keep appearing in the console until it has actually stopped on the instance, and a task's transition from PENDING to RUNNING would not be affected by other tasks on the same instance.

I hope you can fix that behavior,

Thanks!

@petderek

petderek commented Nov 1, 2018

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources. For example, a webserver would still need its resource pool while waiting for all connections to terminate. The alternative would be to allocate fewer resources when a task transitions to stopping, but that's not a safe assumption to make across all workloads.

Would configurable behavior help your use case? Or is it sufficient to be more clear about this behavior in the console?

@Alonreznik

Alonreznik commented Nov 4, 2018

Hey @petderek.
Thank you for your response.

I understand that it works this way by design. However, I wonder why one task should prevent all the others from running.

The best configurable behavior for us would be per-task handling: keep accounting for the resources held by the stopping task (which is fine and reasonable), but don't prevent other tasks from running on the instance while it still has resources to give.

In our use case, the tasks run a long-poll workload as a service, not a web client. This behavior keeps our instances from filling up in time and can also get our process stuck during a new deployment, because instances wait for one task to end before the other tasks are allowed to run.

So the instance is effectively in a kind of "disabled" or "draining" state until the long workload is done (and that can take some time).

What can we do to make our use case work well on ECS?

Thanks

@Alonreznik

Hi!
Is there any update on this?
Is there a solution or workaround we could use to get the per-task mechanism?

Thank you in advance!

@FlorianWendel

Hi everyone,

We are facing the exact same issue. @Alonreznik is right: one task is blocking all other tasks, and in my opinion this does not make sense. Let me illustrate:

Assume we have one task with a 10 GB memory reservation running on a container instance that has registered with 30 GB. The container instance shows 20 GB of RAM available, and that is correct. Now this task is stopped (the ECS agent will make Docker send a SIGTERM), but the container keeps running to finish its calculations (it now shows under stopped tasks as "desired status = STOPPED" and "last status = RUNNING"). The container instance will now show 30 GB available in the AWS ECS console, which is nonsense; it should still be 20 GB, since the container is still using resources, as @petderek mentioned. Even worse, if we try to launch three new tasks with a 10 GB memory reservation each, they will all be pending until the still-running task transitions to "last status = STOPPED". The expected behavior would be that two of the three tasks can launch immediately.

I hope my example was understandable, else feel free to ask.
And thanks for looking into this :)
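
For anyone who wants to observe this accounting from the API, here is a minimal boto3 sketch (the cluster name is a placeholder, and it assumes the cluster has at least one registered instance). It prints each container instance's registered versus remaining memory; while a container is still draining you can watch remainingResources jump back up, exactly as described above:

  import boto3

  ecs = boto3.client("ecs")
  CLUSTER = "my-cluster"  # placeholder cluster name

  arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
  instances = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)

  def memory_mib(resources):
      # MEMORY is reported as an integer resource, in MiB
      return next(r["integerValue"] for r in resources if r["name"] == "MEMORY")

  for ci in instances["containerInstances"]:
      print(
          ci["ec2InstanceId"],
          "registered:", memory_mib(ci["registeredResources"]), "MiB,",
          "remaining:", memory_mib(ci["remainingResources"]), "MiB,",
          "running tasks:", ci["runningTasksCount"],
          "pending tasks:", ci["pendingTasksCount"],
      )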

@yumex93

yumex93 commented Nov 19, 2018

Hey! As a workaround, you can set ECS_CONTAINER_STOP_TIMEOUT to a smaller value. This setting configures the "time to wait for the container to exit normally before being forcibly killed". By default, it is set to 30s. More information can be found here. I have marked this issue as a feature request and we will work on it soon.
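
For reference, ECS_CONTAINER_STOP_TIMEOUT is a container agent setting, so it goes into /etc/ecs/ecs.config on each container instance; a minimal excerpt (the 2m value is only an example, and the agent has to be restarted to pick it up):

  # /etc/ecs/ecs.config
  ECS_CLUSTER=my-cluster
  ECS_CONTAINER_STOP_TIMEOUT=2m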

@Alonreznik

Alonreznik commented Nov 20, 2018

Hi @yumex93
Thank you for your response.
We will be very happy to have that feature as soon as it's out :)

About your workaround: in most cases we need our containers to complete a graceful shutdown before they die. Decreasing ECS_CONTAINER_STOP_TIMEOUT would therefore cause our workers to be killed before the shutdown completes, so the feature is very much needed :)

Thank you again for your help, we're waiting for updates about it.

Alon

@FlorianWendel

@Alonreznik , @yumex93 We have the same situation; some workers even take a few hours to complete their task, and we've leveraged ECS_CONTAINER_STOP_TIMEOUT to shut those down gracefully as well. Since ECS differentiates between a "desired status" and a "last status" for tasks, I believe it should be possible to handle tasks that are in the process of shutting down better than it works today. To illustrate what I mean, see this screenshot:

[screenshot: ecs-bug]

The tasks are still running and still consume resources, but the container instance does not seem to keep track of those resources. If this is more than just a confusing display, I would expect it to cause issues like the one described above.

@Alonreznik

Hi @yumex93, any update on this issue?

Thanks

Alon

@yhlee-aws

We are aware of this issue and are working on prioritizing it. We will keep this issue open to track it, and will provide an update when we have more solid plans.

@Alonreznik

Hi @yunhee-l.
Thank you for your last response.
We're still facing this issue, which forces us to launch more servers than we need for our deployments and gets our workloads stuck.
Any update in that case?

Thanks

@Alonreznik

Alonreznik commented Feb 27, 2019

Hi @yunhee-l @FlorianWendel
any update?

@yhlee-aws

We don't have any new updates at this point. We will update when we have more solid plans.

@yhlee-aws

Related: aws/amazon-ecs-agent#731

@tomotway

tomotway commented Mar 8, 2019

Hi,

Just wanted to add our experience with this in the hope that it can be bumped up in priority.

We need to run tasks that can be long-running. With this behaviour as it stands, it essentially locks up the EC2 instance so that it cannot take any more tasks until the first task has shut down (which could be a few hours). It wouldn't be quite so bad if ECS marked the host as unusable and placed tasks on other hosts, but it doesn't; it still sends them to the host that cannot start them. This has the potential to cause a service outage for us, in that we cannot create tasks to handle the workload (we tell the service to create tasks, but it can't due to the lock-up).

Thanks.

@Alonreznik

@petderek @yumex93
This really makes us pay for more resources than we need on every deployment. As you can see, more than one user is suffering from this design decision.

Do you have any ETA for implementing it or deploying it? This is a real blocker for our ongoing processes.

Thank you

Alon

@adnxn

adnxn commented Mar 11, 2019

@Alonreznik: thanks for following up again and communicating the importance of getting this resolved. This helps us prioritize our tasks.

We don't have an ETA right now, but we have identified the exact issue and have a path forward that requires changes to our scheduling system and the ECS agent. To give you some more context, as @petderek said earlier:

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources.

Changing this behavior will be a departure from our existing way of accounting for resources when we schedule tasks. Considering that the current approach has been in place since the beginning of ECS, the risks involved in changing it are significant, as there could be subtle rippling effects in the system. We plan to explore ways to validate the change and make sure we do not introduce regressions.

The original design traded towards oversubscribing resources for placement by releasing resources on the instance when tasks were stopped, but the side effect of that is the behavior you are describing. Additionally, now that we've added granular SIGKILL timeouts for containers with #1849, we can see this problem being exacerbated.

So, all that is to say: we're working on this issue and we will update this thread as we work towards deploying the changes.
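
For readers following along, the granular timeouts mentioned above correspond to the per-container stopTimeout setting in the task definition (the container-level counterpart of ECS_CONTAINER_STOP_TIMEOUT). A minimal boto3 sketch, with placeholder family, image, and timeout values, of registering a worker that gets extra time to drain after the SIGTERM:

  import boto3

  ecs = boto3.client("ecs")

  ecs.register_task_definition(
      family="long-poll-worker",  # placeholder family name
      requiresCompatibilities=["EC2"],
      containerDefinitions=[
          {
              "name": "worker",
              "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest",  # placeholder image
              "memoryReservation": 1024,   # MiB, example value
              "essential": True,
              "stopTimeout": 120,          # seconds between SIGTERM and SIGKILL, example value
          }
      ],
  )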

@Alonreznik

Alonreznik commented Mar 14, 2019

@adnxn
Thank you for your detailed explanation. It helps a lot in understanding the context of the situation.

We of course understand that this is built into the design, and we accept that.

However, we are not asking for a radical change to the core of the system (which is great!!). Our request concerns the ecs-agent's assumption that all of the previous tasks' resources on the instance must be released first; we are just asking for this to be handled per task (and also to have some indication that the task is still running on the instance after it got the SIGTERM).

As it stands today, resource handling and releasing are based on the entire instance, not on the individual tasks running on it. So if a task releases its resources, the ecs-agent should allow those resources to be scheduled for new tasks (if they meet the resource requirements).

Thank you for your help!
Much appreciated!

Please keep us posted,

Alon

@Halama

Halama commented May 23, 2019

Hello,
we are affected by exactly the same issue. An ECS service deploys long-poll workers with stopTimeout set to 2 hours. A task in RUNNING state with desired status STOPPED blocks all new tasks scheduled on the same instance, even though there are free resources available.

Adding new instances to the cluster helped us work around this situation, but it can get really costly if there are multiple deploys each day.

Are there any new updates on this issue, or possible workarounds?

It could definitely be solved by removing the long-poll service and switching to just calling ECS RunTask (process one job and terminate) without waiting for the result. But that would require more changes to our application architecture, and it would also be more tightly coupled to ECS.
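
For what it's worth, a rough boto3 sketch of that alternative, with placeholder cluster, task definition, and container names; the caller fires off one task per job and does not wait for the result:

  import boto3

  ecs = boto3.client("ecs")

  def submit_job(job_id: str):
      # One RunTask call per job; the container processes the job and then exits on its own.
      return ecs.run_task(
          cluster="my-cluster",               # placeholder
          taskDefinition="long-poll-worker",  # placeholder task definition family
          launchType="EC2",
          overrides={
              "containerOverrides": [
                  {
                      "name": "worker",  # must match the container name in the task definition
                      "environment": [{"name": "JOB_ID", "value": job_id}],
                  }
              ]
          },
      )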

thanks
Martin

@coultn coultn transferred this issue from aws/amazon-ecs-agent Jun 10, 2019
@coultn coultn added the ECS (Amazon Elastic Container Service) and Proposed (Community submitted issue) labels Jun 10, 2019
@coultn coultn changed the title from "One task prevents from all of the other in instance to change from PENDING to RUNNING" to "[ECS] One task prevents from all of the other in instance to change from PENDING to RUNNING" Jun 10, 2019
@Alonreznik

Hi @coultn @adnxn
Any update or ETA on this?

Thank you

Alon

@Alonreznik

Hi guys,
Can somebody take a look at this?
This is harming our business, because we're having problems deploying new versions to production. It is really problematic, and it casts a dark light on whether we should keep using ECS.

Thanks

@Alonreznik

@coultn

@coultn

coultn commented Oct 10, 2019

Hi, thank you for your feedback on this issue. We are aware of this behavior and are researching solutions. We will keep the github issue up to date as the status changes.

@Alonreznik

Hi @coultn .
Thanks for your reply.

We must say that this prevents our workloads from scaling in line with our tasks, and there are situations where this behavior actually gets our production servers stuck. Again, that can be a no-go (or a no-continue, in our case) for using ECS in prod.

For example, below you can see a typical desired/running gap for our production workloads.
[graph: desired vs. running tasks in the cluster]

The green layer is the gap between the desired tasks and the running tasks (the orange layer). The blue layer is the PENDING tasks in the cluster. You can see a constant gap between these two numbers. No deployment was made today; this is something we're hitting in the scale-up mechanism alone.

Think about the situation we're in. We have new jobs in our queue (SQS), so we ask ECS to run new tasks (meaning the desired task count increases).
Each workload is an ECS task, and they are all spread across the servers.
When a workload takes a long time to complete (and there are many of these, because we ask each worker to finish its current job before it dies), one workload blocks the entire instance from receiving new workloads, even though there are free resources on the instance.

The ECS agent schedules new workloads onto that instance and then hits the one task that is still working. From the agent's point of view it has done its job: it scheduled new tasks. But those tasks are stuck in the PENDING state, for hours in some cases, which makes the instance unusable because they're just not running yet. Now imagine that you need to launch 100 more tasks within a few hours to work through a backlog, and you have 5-6 instances each blocked by one task; it becomes a mess.

We should also say that we have only encountered this during the last year, after an agent upgrade a year or a year and a half ago.

Every day we need to ask for more instances for our workloads in order to clear the blockage. This is not how a production service on AWS should have to be maintained, and we are facing this again and again, every day.

Please help us keep using ECS as our production orchestrator. We love this product and want it to succeed, but as it stands, it doesn't fit long-running tasks.

Your team's help in expediting this would be much appreciated.

Thank you

Alon

@Halama

Halama commented Oct 31, 2019

I've discussed this with Nathan; he told me that they plan to fix this, but unfortunately there is no quick fix. We have similar issues with deployment and scaling, and because of it a lot of unnecessary rolling of new instances.

Meanwhile, we are experimenting with EKS (also for multi-cloud deployment), where this issue isn't present.

@Alonreznik


Hi @Halama.

Thanks for the reply and the update.

I understand that this is not something that can be solved quickly, but in the meantime the ECS team could provide workarounds, such as a binpack placement strategy targeting the newest instances, or a way to limit how long a task can stay in PENDING before it is tried on a new instance. This issue is not getting any real response even though many users are running into it. It has been open for more than a year and they can't give any reasonable ETA (even 3 months would be fine for us). It was only moved to "researching" in the last week.

Can you please share more about your migration process from ECS to EKS?

Thanks again

Alon

@pavneeta pavneeta self-assigned this Jan 6, 2021
@thom-vend

Hi @pavneeta, any update on this issue?

@AlexisConcepcion

AlexisConcepcion commented Oct 6, 2021

Any update on this?

@estoesto

estoesto commented Oct 14, 2021

I'm running 1 task per host, with autoscaling, but everything piles up in the MQ because of this one stopping task (which runs daily and should stop gracefully). CI/CD pipelines also fail, since I'm relying on aws ecs wait services-stable.
The only workaround that works for me is to modify the capacity provider to run extra instances. What a waste.

@coultn Your suggestion would solve it. Any ETA for this?
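
For reference, the same wait can be done from boto3, where the polling interval and number of attempts are configurable (cluster and service names below are placeholders); a longer waiter budget can at least keep a pipeline from giving up while a long-stopping task drains:

  import boto3

  ecs = boto3.client("ecs")

  waiter = ecs.get_waiter("services_stable")
  waiter.wait(
      cluster="my-cluster",     # placeholder
      services=["my-service"],  # placeholder
      # Poll every 15 seconds for up to 240 attempts (about an hour).
      WaiterConfig={"Delay": 15, "MaxAttempts": 240},
  )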

@AlexisConcepcion

We recently implemented Datadog and cAdvisor as daemons for ECS using CloudFormation. We have more than 20 stacks, a few of them running about 10 instances (the bigger ones). On the first try, the daemons took about 5 hours to reach the RUNNING state. The key to improving this and getting the new daemon tasks running was to set MinClusterSize=1 (it was not previously defined) and the following placement strategy in ECS-Service.yaml (after those modifications we deployed the daemons).

 PlacementStrategies:
 - Type: spread
   Field: instanceId
 - Type: binpack
   Field: memory

We are planning to apply it in prod soon. Keep in mind that changing the placement strategy performs a rollout of your running instances. I don't think it is a solution, but it could help!

@Alonreznik

Any update on this? We love ECS, but this use case is driving us to Kubernetes, which handles it easily.

@Alonreznik

BTW, it has been 3 years (!!!!) since this issue was opened, and many people are still facing this unexpected behaviour. I think that is a good reason to fix it once and for all.

@Alonreznik

Hi @petderek. Any update?

@markdascher

AWS seems to be overly cautious regarding a fix, and I think it's because the issue still isn't clearly understood by everyone involved. I'm not entirely sure that I understand it myself, but after reading the whole thread, here's what it seems to boil down to:

  1. A 40 GB host has a single 10 GB task. It can start three more 10 GB tasks just fine. Everyone is happy.
  2. The same 40 GB host has the same 10 GB task, but now that task is stopping. Suddenly we can't start any new tasks on this host, even though there are 30 GB available.

Scenario 2 makes no sense. It's clearly a bug. The phrase "by design" doesn't belong in this thread. I understand how it could've happened, though; it's perhaps an unfortunate workaround for an older bug:

  • Bug A: When tasks are stopping, the system calculates available resources incorrectly. Perhaps the calculation shows "40 GB free" instead of "30 GB free."
  • Bug B: Rather than fixing Bug A, the ECS Agent includes logic to know when the calculation is incorrect, and then decides to freeze (with potentially catastrophic consequences) during that timeframe.

Is that accurate? Are we actually worried about the unintended consequences of fixing Bug A?

In our case shortening stopTimeout isn't a viable option, and neither are placementConstraints. Every host may have tasks stopping at the same time, so placementConstraints would just continue making them all unusable. (And even in a best case, it would result in very suboptimal placement as everything gets squeezed onto a small number of usable hosts.)

Two possible fixes:

  • When tasks are stopping, continue calculating available resources correctly. In the example above, that means there are only 30 GB free until the container is actually gone.
  • If that's too drastic, then at least make the ECS Agent try harder. If tasks are stopping, make the ECS Agent correct the calculations locally, and continue if it's safe to do so. If you're unlucky and tasks get placed onto a host that's actually full, then you're out of luck. But that's still way better than where we are now, and at least isn't completely baffling behavior.

@Alonreznik

Hi everyone. It seems this just won't be prioritized, and the ECS team is effectively saying "we're living with the bug", while this bug prevents so many users from doing BASIC things on ECS, such as simply running tasks that work.
Can someone give this some attention?

@AbhishekNautiyal

AbhishekNautiyal commented Jun 30, 2023

We are excited to share that we've addressed the known issue in the ECS agent that caused tasks to be stuck in the PENDING state on instances that have stopping tasks with long timeouts. For details on the root cause, the fix, and other planned improvements, please see the What's New post, blog post, and documentation.

We'll be closing this issue. As always, happy to receive your feedback. Let us know if you face any other issues.

@Alonreznik

Alonreznik commented Jul 3, 2023

Holy moly!!! 5 years!! Amazing guys! I'm so excited! Thank you so much 🙏🙏
