
Error getting message from ws backend #1292

Closed
combor opened this issue Mar 12, 2018 · 28 comments

@combor

combor commented Mar 12, 2018

Summary

ECS agent disconnects under heavy load.

Description

When I put my ECS instance under high load, for example when I scale one of my services from 2 to 12 instances, the ECS agent disconnects with the following errors:

2018-03-12T22:58:52Z [DEBUG] ACS activity occurred
2018-03-12T22:58:52Z [WARN] Unable to extend read deadline for ACS connection: set tcp 10.0.2.236:52020: use of closed network connection
2018-03-12T22:58:52Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.0.2.236:52020: use of closed network connection for https://ecs-a-1.eu-west-1.amazonaws.com/ws?agentHash=edc3e260&agentVersion=1.17.2&clusterArn=redacted&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%3A936492824651%3Acontainer-instance%2Fca59a874-33ae-484b-9f51-654c5940b037&dockerVersion=DockerVersion%3A+17.12.0-ce&sendCredentials=true&seqNum=1
2018-03-12T22:58:52Z [ERROR] Error getting message from ws backend: error: [read tcp 10.0.2.236:52020->54.239.32.153:443: use of closed network connection], messageType: [-1]

After that it's marked as Agent Connected: False in the ECS console until I restart the instance.

I've got debug logs if you need more info.

@richardpen

@combor Can you share the logs with me via email: penyin (at) amazon.com

Thanks,
Peng

@sharanyad
Contributor

sharanyad commented Mar 13, 2018

@combor How long did the disconnection last before you restarted the instance?
From the logs, it looks like the agent is trying to reconnect after 274.540326ms.
2018-03-12T22:58:52Z [INFO] Reconnecting to ACS in: 274.540326ms
Immediately after, the agent receives a SIGTERM and shuts down, before restarting in ~30s.
2018-03-12T22:58:52Z [DEBUG] Termination handler received termination signal: terminated
...
2018-03-12T22:59:35Z [INFO] Loading configuration

@combor
Author

combor commented Mar 13, 2018

I restarted this one straight after it was reported as faulty, as I knew it wasn't going to reconnect. We can repeat this and wait if you want. How long do you think would be enough to wait?

@sharanyad
Contributor

@combor 5 minutes would be good enough to determine if/why the agent is not reconnecting.

@combor
Author

combor commented Mar 13, 2018

Thanks @sharanyad

A brief background on my setup might also be helpful. The cluster consists of two EC2 instances, and I've got 7 services, each running two instances of a task. If I scale one of the services to 12, it starts the tasks, but after a while all services in the entire cluster are reported as unhealthy (by the ALB) and killed. Afterwards they stay in the PENDING state. I've got another set of logs, but this time I waited 15 minutes for a reconnection. There's no reconnecting-to-ACS message at all, but the error is:

2018-03-13T20:53:04Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T20:53:15Z [WARN] Error disconnecting: write tcp 10.0.0.141:40894->54.239.36.166:443: i/o timeout

and some nice golang printf :)

2018-03-13T20:59:05Z [DEBUG] Managed task [arn:aws:ecs:eu-west-1:936492824651:task/3df034ea-246c-422b-bce9-a678f427ff39]: handling container change [{NONE { <nil> [] Could not transition to inspecting; timed out after waiting 30s map[] map[] 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {UNKNOWN <nil> 0 }} ContainerStatusChangeEvent}] for container [productpayments]

Which email shall I use to send logs?

@sharanyad
Contributor

@combor Thanks for the additional information. Please send the logs to sharanyd at amazon.com.

@KIVagant

Hello. I have the same issue and it is really critical. When the ECS agent is disconnected, none of the services on the node can work anymore.

Agent version: 1.17.2

2018-03-15T17:58:54Z [ERROR] Error getting message from ws backend: error: [read tcp 10.200.10.251:57936->54.***.***.***:443: use of closed network connection], messageType: [-1] 
2018-03-15T17:58:54Z [INFO] Begin building map of eligible unused images for deletion
2018-03-15T17:58:54Z [INFO] No eligible images for deletion for this cleanup cycle
2018-03-15T17:58:54Z [INFO] End of eligible images for deletion: No more eligible images for deletion; Still have 1 image states being managed
2018-03-15T17:58:54Z [INFO] Managed task [arn:aws:ecs:us-east-1:************:task/e07ae551-7961-447c-b7c9-6f65bf920569]: redundant container state change. internal to NONE, but already RUNNING
...
2018-03-15T17:58:54Z [INFO] Managed task [arn:aws:ecs:us-east-1:************:task/c437fc21-8f8b-4c46-bae7-c3782c4cc734]: redundant container state change. internal to RUNNING, but already RUNNING
2018-03-15T17:58:55Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.200.10.251:56132: use of closed network connection for https://ecs-a-2.us-east-1.amazonaws.com/ws?agentHash=edc3e260&agentVersion=1.17.2&clusterArn=*******&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A************%3Acontainer-instance%2F58935ec1-e196-***-***-5ad87586e2ee&dockerVersion=DockerVersion%3A+17.12.0-ce&sendCredentials=false&seqNum=1
2018-03-15T17:58:55Z [ERROR] Error getting message from ws backend: error: [read tcp 10.200.10.251:56132->54.***.***.***:443: use of closed network connection], messageType: [-1] 
2018-03-15T17:58:55Z [INFO] Reconnecting to ACS in: 274.824595ms
2018-03-15T17:58:55Z [INFO] Error from tcs; backing off: read tcp 10.200.10.251:57936->54.***.***.***:443: use of closed network connection
2018-03-15T17:58:55Z [WARN] Error publishing metrics: tls: use of closed connection
2018-03-15T17:58:57Z [INFO] Handling http request module="Handlers" method="GET" from="172.17.0.8:59214" uri="/v2/credentials/e89d56ec-c717-469c-8a95-c75e2a57e9be"
2018-03-15T17:59:11Z [INFO] Connected to TCS endpoint
2018-03-15T17:59:59Z [INFO] DockerGoClient: Unable to retrieve stats for container fc038c3e4fbd33d243a4b18482e31eadfa029c5bc0f2f7ec1a3a65a46950f910: unexpected EOF
...
2018-03-15T18:02:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 072a2df4f9f22b40a5b451887c8985be84c97200baff1f264574046f9d6b2a7f: inactivity time exceeded timeout
2018-03-15T18:02:11Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (************): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-03-15T18:02:38Z [INFO] TCS Connection hasn't had any activity for too long; disconnecting
2018-03-15T18:03:21Z [ERROR] Unable to discover poll endpoint, err: RequestError: send request failed
caused by: Post https://ecs.us-east-1.amazonaws.com/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018-03-15T18:03:43Z [INFO] DockerGoClient: Unable to retrieve stats for container fc038c3e4fbd33d243a4b18482e31eadfa029c5bc0f2f7ec1a3a65a46950f910: inactivity time exceeded timeout
...
2018-03-15T18:03:53Z [INFO] DockerGoClient: Unable to retrieve stats for container 53dbdb429567f34bf4fb8af2b1337e94b08def6ded231e814d02acc036ce414f: unexpected EOF
2018-03-15T18:03:53Z [INFO] Reconnecting to ACS in: 382.213797ms
2018-03-15T18:06:52Z [INFO] Error from tcs; backing off: websocket: close 1002 (************): Channel long idle: No message is received, close the channel
2018-03-15T18:11:35Z [INFO] DockerGoClient: Unable to retrieve stats for container 072a2df4f9f22b40a5b451887c8985be84c97200baff1f264574046f9d6b2a7f: unexpected EOF
...
2018-03-15T18:12:44Z [INFO] DockerGoClient: Unable to retrieve stats for container 04eba7e5092b024744d356c772f23bedf91381262f14b02b11f1be638b2c934c: inactivity time exceeded timeout
2018-03-15T18:12:52Z [INFO] Managed task [arn:aws:ecs:us-east-1:************:task/e47373b5-0207-4216-a90e-265b8f53685b]: redundant container state change. internal to NONE, but already RUNNING
2018-03-15T18:12:52Z [INFO] DockerGoClient: Unable to retrieve stats for container 4423a9e55e79feb70885427787e3b71455fe17c604653418ffb4454bf8e5d547: unexpected EOF
2018-03-15T18:12:52Z [INFO] DockerGoClient: Unable to retrieve stats for container ba16e7cc8c59befc1da526824c57fc3eda6bf36f6c007056521425f8207f5ee3: inactivity time exceeded timeout
2018-03-15T18:12:54Z [WARN] Error publishing metrics: tls: use of closed connection
2018-03-15T18:12:56Z [INFO] DockerGoClient: Unable to retrieve stats for container 255af44de63ab1f0c8baab5b49d017e2b4070a6f593a6c6b79cf3ee00e22f090: inactivity time exceeded timeout
2018-03-15T18:12:56Z [INFO] Managed task [arn:aws:ecs:us-east-1:************:task/c437fc21-8f8b-4c46-bae7-c3782c4cc734]: redundant container state change. internal to NONE, but already RUNNING
...
2018-03-15T18:13:12Z [INFO] Managed task [arn:aws:ecs:us-east-1:************:task/7e258ded-fbef-4d6f-ad25-4d2042e28c44]: redundant container state change. internal to RUNNING, but already RUNNING
2018-03-15T18:13:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 4423a9e55e79feb70885427787e3b71455fe17c604653418ffb4454bf8e5d547: inactivity time exceeded timeout
...
2018-03-15T18:13:23Z [INFO] DockerGoClient: Unable to retrieve stats for container dad16d2ef778257c070f93bdc41e3276c73a3f395b5639100c961d09b9bc03da: inactivity time exceeded timeout
2018-03-15T18:13:59Z [WARN] Error creating a websocket client: write tcp 10.200.10.251:45686->54.***.***.***:443: i/o timeout
2018-03-15T18:14:03Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-2.us-east-1.amazonaws.com response: : write tcp 10.200.10.251:45686->54.***.***.***:443: i/o timeout
2018-03-15T18:14:05Z [INFO] Reconnecting to ACS in: 633.808565ms
2018-03-15T18:14:47Z [INFO] DockerGoClient: Unable to retrieve stats for container dad16d2ef778257c070f93bdc41e3276c73a3f395b5639100c961d09b9bc03da: unexpected EOF
2018-03-15T18:14:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 53dbdb429567f34bf4fb8af2b1337e94b08def6ded231e814d02acc036ce414f: inactivity time exceeded timeout
# manually restarted

@sharanyad
Contributor

Thanks for reporting the issue @KIVagant . If you could send the debug level logs to sharanyd at amazon.com, that would be really helpful. Meanwhile, we are working on the issue and shall get back once there is an update.

@combor
Author

combor commented Mar 15, 2018

@KIVagant put ECS_LOGLEVEL=debug in your agent config and restart the agent.
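For reference, on the ECS-optimized AMI the agent config lives in /etc/ecs/ecs.config (adjust the path for your setup):

# /etc/ecs/ecs.config
ECS_LOGLEVEL=debug

and then restart the agent, e.g. sudo stop ecs && sudo start ecs on that AMI.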

@combor
Author

combor commented Mar 15, 2018

Also, you can set ECS_RESERVED_MEMORY to a value higher than the default to be on the safe side, to make sure you are not exhausting resources on the cluster and killing the agent.
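For example (the value is in MiB; the number below is only an illustration, pick one that leaves enough headroom on your hosts):

# /etc/ecs/ecs.config
ECS_RESERVED_MEMORY=100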

@sharanyad
Contributor

@combor Did you try increasing ECS_RESERVED_MEMORY and still observe the error? (Asking because the logs showed that memory on the instance was under high stress; I'm trying to rule out that concern here.)

@combor
Author

combor commented Mar 15, 2018

I didn't. I thought 32MB should be enough.

@aaithal
Contributor

aaithal commented Mar 19, 2018

Hi @combor, please note that the default value of ECS_RESERVED_MEMORY is 0 and not 32MB. Can you please try setting it to a non-zero value and check if that helps?

Thanks,
Anirudh

@aaithal
Contributor

aaithal commented Mar 20, 2018

Hi @combor, we've been trying to replicate this issue on our end. Here are some details about our setup:

Setup

Instance type: m5.large
ECS agent version: 1.17.2
Docker version: 17.12.0-ce
AMI: amzn-ami-2017.09.j-amazon-ecs-optimized
Number of instances in the cluster: 10
Number of services in the cluster: 1
Number of tasks in the service: 150 (average 15 tasks per instance)
Memory utilization per task: 500 MB
Free memory left on the instance once 15 tasks are scheduled: 10 MB
Task definition:

{
  "containerDefinitions": [
    {
      "entryPoint": [
        "sh",
        "-c"
      ],
      "command": [
        "stress -m 1 --vm-bytes 500m --vm-keep"
      ],
      "cpu": 10,
      "memory": 512,
      "image": "xxxxx.dkr.ecr.us-west-2.amazonaws.com/stress:latest",
      "essential": true,
      "name": "stress"
    }
  ],
  "family": "memstress-500m",
}

The reason for choosing this kind of task definition to debug the issue was that we noticed low amounts of free memory on your host in the logs you sent. Hence our recommendation to set ECS_RESERVED_MEMORY to a reasonably high value, so that your containers don't end up consuming all of the memory on your host.

Findings

This has been running for a few days now and we haven't seen the disconnection issue that you're running into. We also noticed errors setting up timeouts on websocket connections in the logs you sent. So, we are working on code changes to better handle errors in places such as this:

if err := cs.SetReadDeadline(time.Now().Add(cs.RWTimeout)); err != nil {

However, since we do not have a repro on our end for this, verifying if those fixes actually fix your issue could be tricky. Are you open to running a 'debug' build of the ECS agent on your cluster if we provide you one? Also, please let us know if the ECS_RESERVED_MEMORY setting helped you with this issue.
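To illustrate the strategy (a minimal sketch under our assumptions, not the actual agent change; the helper name and the error-string check are only illustrative):

package sketch

import (
	"net"
	"strings"
	"time"
)

// extendReadDeadline treats a failure to set the read deadline on an
// already-closed socket as a terminal error for this connection, so that
// the caller tears it down and establishes a fresh one.
func extendReadDeadline(conn net.Conn, rwTimeout time.Duration) error {
	if err := conn.SetReadDeadline(time.Now().Add(rwTimeout)); err != nil {
		if strings.Contains(err.Error(), "use of closed network connection") {
			conn.Close() // connection is unusable; force a reconnect upstream
		}
		return err
	}
	return nil
}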

Thanks,
Anirudh

aaithal added a commit to aaithal/amazon-ecs-agent that referenced this issue Mar 20, 2018
Handle "connection closed" error in SetReadDeadline and SetWriteDeadline
methods. The strategy is to treat these errors as terminal errors for
the current connection so that a new connection gets established in such
cases. This should help with issues such as aws#1292.
@KIVagant

@aaithal, could you please advise where I can set the ECS_RESERVED_MEMORY value? Should it be an environment variable on each ECS instance?

@aaithal
Contributor

aaithal commented Mar 21, 2018

Hi @KIVagant, you can set it in the ECS agent's config file on each of your ECS instances. Please refer to our documentation for more details.

@aaithal
Contributor

aaithal commented Mar 21, 2018

Hi @combor, @KIVagant, now that #1310 is merged, if you want to deploy an agent build containing this fix in your test cluster to verify that it helps with the disconnect issue, please send your account IDs and the region where your cluster is deployed to aithal at amazon dot com. We can share a pre-release build of the ECS agent with you, which you can deploy in your test setup to validate the fix.

Thanks,
Anirudh

@KIVagant

@aaithal, will ECS_RESERVED_MEMORY=100 be enough? There is no recommended value in the documentation.

@aaithal
Contributor

aaithal commented Mar 22, 2018

100 MB seems like a good value. But please note that it also depends on your use case (how many out-of-band daemons/processes are running on your instance, how many resources they consume, etc.) and on the instance type.

@KIVagant

KIVagant commented Mar 22, 2018

I observed the following situation (before ECS_RESERVED_MEMORY was set): when one node (N1) was disconnected and I restarted Docker and the ECS agent there, ECS decided that all the tasks from that node had died and started to recreate them on another node (N2). At that moment N2 got +10 tasks (probably because N1 wasn't in a "ready" state), and the spike of new tasks triggered a domino effect: the ECS agent died on N2 as well.

Right now ECS_RESERVED_MEMORY is set, and I need to wait a while (a long time, in fact) before I can be confident that the issue is gone.

@aaithal aaithal added this to the 1.17.3 milestone Mar 22, 2018
@aaithal
Contributor

aaithal commented Mar 26, 2018

Hi @KIVagant, I have shared the pre-release build of the ECS agent with your account and sent you instructions for using/installing it over email. Please let us know if that, along with ECS_RESERVED_MEMORY, alleviates the issue that you're running into.

Thanks,
Anirudh

@KIVagant

KIVagant commented Apr 2, 2018

@aaithal, ECS_RESERVED_MEMORY didn't help; we had another agent outage yesterday. I'm thinking about installing the pre-release build you mentioned. I will keep you posted.

@pinepain

pinepain commented Apr 2, 2018

Hi, I'm working on the same project as @KIVagant and we have hit the same issue a few more times.

It may not have been clear from the start, but from what I can see we were heavily overusing memory (100%, or close to it), to the point that some of our tasks were simply OOM-killed. That caused an OOM cycle where a task tried to spawn but couldn't, which led to 100% CPU and later to the EBS credits being burned. As the ecs-agent also runs in Docker and queries the Docker API, I suspect the machine can get so heavily overloaded that the agent and Docker are unable to operate properly, which later causes the "agent disconnected" symptom.

I initially started by upgrading the ECS agent via the ECS-optimized AMI as per the docs (at the moment of writing the current one is 2017.09, and we were using 2017.03), but later I noticed that the free -m output was really weird, as was the top output. So we were running on the edge, and when we hit some heavy load the issue triggered and our great cluster fell to pieces like the Tower of Babel. At least that is the best guess I have at the moment.

I guess someone may be misled by the Available memory value shown for an ECS instance, which is not the free memory reported by free -m; it is closer to the total memory from the same command's output, as described in the Container Instance Memory Management doc.

Reading this thread I found the recommendation to set/increase ECS_RESERVED_MEMORY, and I'm not sure I fully understand what impact it should have. From the docs:

ECS_RESERVED_MEMORY
Example Values: 32
Default Value: 0
The amount of memory, in MiB, to reserve for processes that are not managed by ECS.

From what I read in the ECS sources, ECS_RESERVED_MEMORY is indeed memory that is not managed by ECS, so ECS restricts itself to the total memory of the instance (the one from cat /proc/meminfo | grep MemTotal) minus ECS_RESERVED_MEMORY. That means if we have some third-party app which is supposed to use 50 MiB, we can set ECS_RESERVED_MEMORY=100 and ECS will try to leave 50 MiB to the system and 50 MiB to that third-party app. However, the ecs-agent runs within Docker and is itself an ECS-managed resource, so it seems less likely to fall under the criterion of "processes that are not managed by ECS"; at least I did not find evidence to the contrary.

@aaithal
Contributor

aaithal commented Apr 2, 2018

Hi @pinepain,

I guess someone may be misled by the Available memory value shown for an ECS instance, which is not the free memory reported by free -m; it is closer to the total memory from the same command's output, as described in the Container Instance Memory Management doc.

The Available Memory reported there is the memory available for placement and not the memory reported by free.

Reading this thread I found the recommendation to set/increase ECS_RESERVED_MEMORY, and I'm not sure I fully understand what impact it should have

By default, the ECS agent registers all of the memory available on an instance as a resource with the ECS cluster. For example, a t2.micro instance type has 1 GiB of memory, and the ECS agent registers that with ECS as an available resource in the cluster. If you set ECS_RESERVED_MEMORY to 100 MiB on this instance, then the ECS agent will only register 900 MiB with ECS as the available resource on this instance. This means that ECS will only place tasks that use up to 900 MiB on this instance, leaving 100 MiB for non ECS managed entities (including the ECS agent).
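As a sketch of that arithmetic (illustrative only; the agent's actual calculation is quoted further down in this thread):

package sketch

// registeredMemoryMiB is the amount of memory the agent registers with ECS
// for task placement: the instance's memory minus ECS_RESERVED_MEMORY.
func registeredMemoryMiB(instanceTotalMiB, reservedMiB int64) int64 {
	return instanceTotalMiB - reservedMiB
}

// e.g. on a ~1 GiB instance with ECS_RESERVED_MEMORY=100, roughly 900 MiB
// remains available for task placement.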

Not having enough resources available for non ECS managed processes on the hosts (essentially everything outside of containers managed by ECS), including free memory and EBS IOPS, can certainly lead to issues such as the ECS agent getting killed or being disconnected from the backend. As per my previous recommendation, I'd suggest setting ECS_RESERVED_MEMORY to a value that leaves enough headroom for your system processes. I'd also suggest either using larger EBS volumes or provisioned IOPS volumes to get around the EBS IO credits issue.

Thanks,
Anirudh

@pinepain

pinepain commented Apr 2, 2018

@aaithal thanks for the quick reply.

Indeed, what I was trying to say is that it is close to the total memory, and it is absolutely not the "free" memory. I see this misunderstanding from time to time, and hopefully future readers will find this thread useful. To my understanding, this is the final formula for how the memory gets calculated:

cpu, mem := getCpuAndMemory()
remainingMem := mem - int64(client.config.ReservedMemory)

leaving 100 MiB for non ECS managed entities (including the ECS agent).

I'm stuck on this, as I see that the ecs-agent is also a dockerized process. Does it mean that all dockerized processes (tasks) except the ecs-agent will use (as per the example) 900 MiB, while the ecs-agent plus the rest of the system will reside in that 100 MiB? Does it mean that the dockerized ecs-agent is not "managed" by ECS in the sense used in the memory management documentation?

@aaithal
Contributor

aaithal commented Apr 9, 2018

Hi @pinepain,

That just means that the resource consumption of the ECS agent is not taken into account when making placement decisions on the instance. Also, we've released a new version of the ECS agent, where we've enhanced the websocket error handling logic. That, combined with ECS_RESERVED_MEMORY, should fix this issue. Closing this for now. Please let us know if you have follow-up questions/comments.

@aaithal aaithal closed this as completed Apr 9, 2018