Change default docker metric gathering behavior #2452
Left comments about readme changes, LGTM 🚀
@@ -1369,56 +1369,56 @@ func (dg *dockerGoClient) Stats(ctx context.Context, id string, inactivityTimeou
}()
These changes were needed because functional tests caught that polling metrics took longer than streaming stats to populate the task stats API. The cause is that polling jitters its initial stats poll to avoid hammering the docker stats API on startup.
So this changes polling to gather docker stats immediately on startup. That is not ideal from a load perspective, but it is necessary to avoid unintentional behavior changes.
1. Change default docker metric gathering behavior from streaming metrics to polling.
2. Change the default polling interval to half of the TACS publishing interval (currently 20s), so that every publish interval we have two docker metrics.
3. Change the minimum polling interval to 5s to prevent customers from configuring polling to be just as resource-intensive as streaming metrics.
These changes are being made because we have found that docker streaming stats consumes considerable resources from the agent, dockerd daemon, and containerd daemon.
To avoid changing behavior from streaming stats, we need to populate the stats endpoint immediately when the stats engine starts. So instead of jittering the first stats gather, we do it immediately.
5/12 release broke our tests
Summary
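1. Change the default docker metric gathering behavior from streaming metrics to polling.
2. Change the default polling interval to half of the TACS publishing interval (a publishing interval of 20s), so that every publish interval we have two docker metrics (previously 15s).
3. Change the minimum polling interval to 5s to prevent customers from configuring resource-intensive polling (previously 1s).
4. If the configured polling interval is below the minimum, log a warning and set the interval to the minimum (previously set to default).
5. If the configured polling interval is above the maximum, log a warning and set the interval to the maximum (previously set to default).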
These changes are being made because we have found that docker streaming
stats consumes considerable resources from the agent, dockerd daemon, and
containerd daemon.
The graph below shows the improvement in CPU utilization on a single-instance cluster with an m5.large running 120 containers. On agent 1.39.0 with default settings, cluster utilization maxes out at 86% because agent/dockerd/containerd are using ~14% of the instance's CPU resources. With this change (1.40.0) it maxes out at 94.5%. This means we see a roughly 60% reduction in resources consumed by ECS daemons and a roughly 10% improvement in overall cluster utilization.
NOTE: higher CPU utilization in these graphs is a good thing, as it means the customer's containers/tasks are utilizing more of the cluster's CPU, and not daemons required for ECS.
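The percentages quoted above can be checked with a few lines of arithmetic. The sketch below assumes that daemon overhead is simply 100% minus the maximum cluster utilization, which matches the ~14% figure given for 1.39.0; the 5.5% figure for 1.40.0 is inferred the same way and is not stated in the description.

```go
package main

import "fmt"

// relativeChange returns the change from before to after as a
// percentage of before.
func relativeChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Maximum cluster utilization quoted for agent 1.39.0 and 1.40.0.
	maxUtilBefore, maxUtilAfter := 86.0, 94.5

	// Assumption: whatever the cluster cannot use is daemon overhead.
	daemonBefore := 100 - maxUtilBefore
	daemonAfter := 100 - maxUtilAfter

	fmt.Printf("daemon overhead: %.1f%% -> %.1f%% (%+.1f%%)\n",
		daemonBefore, daemonAfter, relativeChange(daemonBefore, daemonAfter))
	fmt.Printf("cluster utilization: %.1f%% -> %.1f%% (%+.1f%%)\n",
		maxUtilBefore, maxUtilAfter, relativeChange(maxUtilBefore, maxUtilAfter))
}
```

Under that assumption the overhead drops by about 60.7% and utilization rises by about 9.9%, consistent with the rounded figures in the description.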
Improvement gets more dramatic with more containers:
and not as dramatic but still substantial with fewer:
Testing
New tests cover the changes: no
Description for the changelog
Agent's default stats gathering is changing from docker streaming stats to polling. This should not affect the metrics that customers ultimately see in CloudWatch, but it does affect how the agent gathers the underlying metrics from docker. This change was made for considerable performance gains. Customers with high CPU loads may see their cluster utilization increase; this is a good thing because it means the containers are utilizing more of the cluster, and agent/dockerd/containerd are utilizing less.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.