Error getting message from ws backend #1292
Comments
@combor Can you share the logs with me via email: penyin (at) amazon.com? Thanks,
@combor How long did the disconnection last before you restarted the instance?
I restarted this one straight after it was reported as faulty, as I knew it wasn't going to reconnect. We can repeat this if you want and wait. How long do you think would be enough to wait?
@combor 5 minutes would be good enough to determine if/why the agent is not reconnecting.
Thanks @sharanyad. A brief background about my setup might also be helpful. The cluster consists of two EC2 instances and I've got 7 services, each with two instances of a task. If I scale one of the services to 12 it starts them, but after a while all services in the entire cluster are reported as unhealthy (by ALB) and killed. Later they stay in the PENDING state. I've got another set of logs, but this time I waited 15 mins for reconnection. There's no reconnection-to-ACS message at all, but the error is:
and some nice golang printf :)
Which email shall I use to send logs?
@combor Thanks for the additional information. Please send the logs to sharanyd at amazon.com.
Hello. I have the same issue and it is really critical. When the ECS agent disconnects, none of the services on that node can work anymore. Agent version: 1.17.2
Thanks for reporting the issue @KIVagant. If you could send the debug level logs to sharanyd at amazon.com, that would be really helpful. Meanwhile, we are working on the issue and will get back once there is an update.
@sharanyad, do I need to follow https://aws.amazon.com/premiumsupport/knowledge-center/debug-mode-ecs-agent-docker/ for this?
@KIVagant put
Also, you can set
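For reference (the exact setting got cut off above, so this is an assumption on my part rather than a quote from the thread): the usual way to get debug-level agent logs on an ECS-optimized AMI is to set the agent's log level in its config file and then restart the agent, for example:

```
# /etc/ecs/ecs.config -- example only, assumed to be what the truncated reply refers to
ECS_LOGLEVEL=debug
```

On the older Amazon Linux ECS-optimized AMI the agent runs under upstart (`sudo stop ecs && sudo start ecs`); on newer AMIs it is a systemd unit (`sudo systemctl restart ecs`).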
@combor Did you try increasing the
I didn't. I thought 32MB should be enough.
Hi @combor, please note that the default value of
Thanks,
Hi @combor, we've been trying to replicate this issue on our end. Here are some details about that:

Setup

Instance type: m5.large

{
  "containerDefinitions": [
    {
      "entryPoint": [
        "sh",
        "-c"
      ],
      "command": [
        "stress -m 1 --vm-bytes 500m --vm-keep"
      ],
      "cpu": 10,
      "memory": 512,
      "image": "xxxxx.dkr.ecr.us-west-2.amazonaws.com/stress:latest",
      "essential": true,
      "name": "stress"
    }
  ],
  "family": "memstress-500m"
}

The reason for choosing this kind of task definition to debug the issue was that we noticed low amounts of free memory on your host in the logs you sent. Hence our recommendation for setting

Findings

This has been going for a few days now and we haven't seen the disconnection issue that you're running into. We also noticed errors setting up timeouts on websocket connections in the logs you sent. So, we are working on code changes to better handle errors in places such as this: amazon-ecs-agent/agent/wsclient/client.go Line 322 in edc3e26
However, since we do not have a repro on our end, verifying whether those fixes actually resolve your issue could be tricky. Are you open to running a 'debug' build of the ECS agent on your cluster if we provide you one? Also, please let us know if the
Thanks,
Handle "connection closed" error in SetReadDeadline and SetWriteDeadline methods. The strategy is to treat these errors as terminal errors for the current connection so that a new connection gets established in such cases. This should help with issues such as aws#1292.
Handle "connection closed" error in SetReadDeadline and SetWriteDeadline methods. The strategy is to treat these errors as terminal errors for the current connection so that a new connection gets established in such cases. This should help with issues such as aws#1292.
@aaithal, could you advise, please, where I can set the ECS_RESERVED_MEMORY value? Should it be an environment variable on each ECS instance?
Hi @KIVagant, you can set it in the ECS agent's config file, on all of your ECS instances. Please refer to our documentation for more details.
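For illustration (the value below is made up, not a recommendation from this thread), the option goes into the same per-instance agent config file:

```
# /etc/ecs/ecs.config -- example only; reserves 256 MiB for non-ECS processes
ECS_RESERVED_MEMORY=256
```

Note that registered resources are reported when the container instance registers, so depending on the agent version the new value may only take effect after the instance (re)registers with the cluster.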
Hi @combor, @KIVagant, now that #1310 is merged, if you want to deploy an agent build containing this fix in your test cluster to verify whether it helps with the disconnect issue, please send your account IDs and the region where your cluster is deployed to aithal at amazon dot com. We can share a pre-release build of the ECS agent with you, which you can deploy in your test setup to validate the fix. Thanks,
@aaithal, will
I saw the following situation (before
Right now ECS_RESERVED_MEMORY is set and I need to wait for a while (a long time, in fact) to be sure that the issue is gone.
Hi @KIVagant, I have shared the pre-release build of the ECS agent with your account and sent you instructions for using/installing it over email. Please let us know if that, along with
Thanks,
@aaithal, ECS_RESERVED_MEMORY didn't help; we had the agent outage again yesterday. I'm thinking about installing the pre-release build that you mentioned. I will keep you posted.
Hi, I'm working on the same project with @KIVagant and we have hit the same issue a few times again. It may not have been clear from the start, but from what I can see we were heavily overusing memory (100% or close to it), to the point that some of our tasks were simply OOM-killed. That caused an OOM cycle in which a task tried to spawn but couldn't, which led to 100% CPU and later to the EBS credits being burned. As the ecs-agent also runs in Docker and queries the Docker API, I suspect the machine could be so heavily overloaded that the agent/Docker are unable to operate properly, which later causes the "agent disconnected" symptom. I started by upgrading the ecs agent using the ECS-optimized AMI as per the docs (at the moment of writing the current one is 2017.09 and we were using 2017.03), however later I noticed that
I guess someone may be misled by having
Reading this thread I found that setting/increasing
and from what I read in the ecs sources
Hi @pinepain,
The
The ECS agent registers all of the memory available on an instance as a resource with the ECS cluster by default. For example, a
Not having enough resources available for non-ECS-managed processes on the host (essentially everything outside of the containers managed by ECS), including free memory and EBS IOPS, can certainly lead to issues such as the ECS agent getting killed or being disconnected from the backend. As per my previous recommendations, I'd suggest setting
Thanks,
@aaithal thanks for the quick reply. Indeed, what I was trying to say is that it is close to the total memory, and it is absolutely not "free" memory. I see this misunderstanding from time to time and hopefully future readers may find this thread useful. To my understanding, this is the final formula for how memory gets calculated (a small sketch follows below): amazon-ecs-agent/agent/api/ecsclient/client.go Lines 178 to 179 in 1ca656c
I'm stuck at this as I see that
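To make the arithmetic behind that formula concrete, here is a tiny sketch (my own paraphrase of the discussion, not code taken from `client.go`): the agent advertises the instance's total memory minus `ECS_RESERVED_MEMORY` as schedulable memory.

```go
package main

import "fmt"

// registeredMemoryMiB paraphrases the relationship discussed above:
// the memory the agent registers with the cluster is the instance's
// total memory minus ECS_RESERVED_MEMORY.
func registeredMemoryMiB(totalMiB, reservedMiB int64) int64 {
	return totalMiB - reservedMiB
}

func main() {
	// Illustrative numbers only: an instance reporting 7982 MiB with
	// ECS_RESERVED_MEMORY=256 would register 7726 MiB of schedulable memory.
	fmt.Println(registeredMemoryMiB(7982, 256), "MiB registered with the cluster")
}
```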
Hi @pinepain, That just means that the resource consumption of the ECS agent is not taken into account when making placement decisions on the instance. Also, we've released a new version of the ECS agent, where we've enhanced the websocket error handling logic. That, combined with
Summary
ECS agent disconnects under heavy load.
Description
When I put my ECS instance under high load, for example when I scale my container instances from 2 to 12, the ECS agent disconnects with the following errors:
After that it's marked as
Agent Connected: False
in the ECS console until I restart the instance. I've got debug logs if you need more info.