Change default docker metric gathering behavior #2452

Merged
merged 3 commits into from
May 18, 2020


@sparrc sparrc commented May 13, 2020

Summary

  1. Change default docker metric gathering behavior from streaming
    metrics to polling.
  2. Change the default polling interval from 15s to 10s (half of the TACS
    publishing interval of 20s), so that each publish interval includes two
    docker metrics.
  3. Change the minimum polling interval to 5s to prevent customers from
    configuring resource-intensive polling (previously 1s).
  4. When a customer configures an interval below the minimum interval,
    log a warning and set the interval to the minimum (previously set to default).
  5. When a customer configures an interval above the maximum interval,
    log a warning and set the interval to the maximum (previously set to default).

These changes are being made because we have found that docker streaming
stats consumes considerable resources from the agent, dockerd daemon, and
containerd daemon.

The graph below shows the improvement in CPU utilization on a single-instance cluster with an m5.large running 120 containers. On agent 1.39.0 with default settings, cluster utilization maxes out at 86% because agent/dockerd/containerd are using ~14% of the instance's CPU resources. With this change (1.40.0) we max out at 94.5%, which leaves about 5.5% for the daemons. This means we see a roughly 60% reduction in resources consumed by ECS daemons and a roughly 10% relative improvement in overall cluster utilization.

NOTE: higher cpu utilization in these graphs is a good thing, as it means the customer's containers/tasks are utilizing more of the cluster's cpu, and not daemons required for ECS.

image

Improvement gets more dramatic with more containers:

image

and not as dramatic but still substantial with fewer:

image

Testing

New tests cover the changes: no

Description for the changelog

The agent's default stats gathering is changing from docker streaming stats to polling. This should not affect the metrics that customers ultimately see in CloudWatch, but it does affect how the agent gathers the underlying metrics from docker. This change was made for considerable performance gains. Customers with high CPU loads may see their cluster utilization increase; this is a good thing because it means the containers are utilizing more of the cluster, and agent/dockerd/containerd are utilizing less.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sparrc sparrc force-pushed the change-default-metric-gathering branch from f02e7c6 to 73bc2f0 Compare May 13, 2020 21:22
@sparrc sparrc changed the title [WIP] Change default docker metric gathering behavior Change default docker metric gathering behavior May 13, 2020
@sparrc sparrc added this to the 1.40.0 milestone May 13, 2020

@shubham2892 shubham2892 left a comment

Left comments about readme changes, LGTM 🚀

@sparrc sparrc force-pushed the change-default-metric-gathering branch from 800ca4f to 5704544 Compare May 14, 2020 17:51
@@ -1369,56 +1369,56 @@ func (dg *dockerGoClient) Stats(ctx context.Context, id string, inactivityTimeou
}()

@sparrc sparrc May 14, 2020

The changes here are because functional tests caught that polling metrics was taking longer to populate the task stats API versus streaming stats. This is because polling stats attempts to jitter the initial stats poll in order to avoid hammering the docker stats API on startup.

This change therefore makes polling mode gather docker stats immediately on startup instead of jittering the first poll. This is not ideal from a load perspective but is necessary to avoid unintentional behavior changes.

sparrc added 2 commits May 15, 2020 11:08
1. Change default docker metric gathering behavior from streaming
metrics to polling.
2. Change the default polling interval to half of the TACS publishing
interval (currently 20s), so that every publish interval we have two
docker metrics.
3. Change the minimum polling interval to 5s to prevent customers from
configuring polling to be just as resource-intensive as streaming
metrics.

These changes are being made because we have found that docker streaming
stats consumes considerable resources from the agent, dockerd daemon, and
containerd daemon.

To avoid changing behavior from streaming stats, we need to populate the
stats endpoint immediately when the stats engine starts. So instead of
jittering the first stats gather we need to just do it immediately.
@sparrc sparrc force-pushed the change-default-metric-gathering branch from 5704544 to c4871aa Compare May 15, 2020 18:10
@sparrc sparrc force-pushed the change-default-metric-gathering branch from 544d296 to 8bb1a6b Compare May 15, 2020 20:03
@sparrc sparrc force-pushed the change-default-metric-gathering branch from 8bb1a6b to 1d59927 Compare May 15, 2020 20:45
@sparrc sparrc force-pushed the change-default-metric-gathering branch from 1d59927 to 28ba4aa Compare May 15, 2020 23:10
@sparrc sparrc merged commit afa5b19 into aws:dev May 18, 2020
@sparrc sparrc deleted the change-default-metric-gathering branch May 18, 2020 19:07
@fenxiong fenxiong mentioned this pull request Jun 2, 2020