Testing - flaky TestTelemetry #1903

Closed
suneyz opened this issue Feb 28, 2019 · 4 comments

@suneyz
Contributor

suneyz commented Feb 28, 2019

Summary

Test failed intermittently

Description

=== RUN   TestTelemetry
--- FAIL: TestTelemetry (758.44s)
	utils_unix.go:130: Created directory /tmp/tmp.FAnqwdppda/ecs_integ_testdata942367773 to store test data in
	utils_unix.go:146: Launching agent with image: amazon/amazon-ecs-agent:latest
	utils_unix.go:233: Agent started as docker container: bb6db4e85aefc7989d0c51a4820783b1870e091be65ad2deb533439d3aff7dc5
	utils.go:175: Found agent metadata: {Cluster:ecstest-telemetry-00123be9-7e9a-4805-874a-828a984b86ec ContainerInstanceArn:0x44204a8120 Version:Amazon ECS Agent - v1.25.2 (a3d87cb4)}
	utils.go:196: Task definition: ecs-metrics-test-686599a94de2d6516a1a6c429a2e3054:1
	utils.go:216: Started task: arn:aws:ecs:us-west-2:535959970326:task/6bbc1d32-3b96-4d0a-aabc-a164e160780e
	functionaltests_test.go:559: 
			Error Trace:	functionaltests_test.go:559
			            				functionaltests_unix_test.go:380
			Error:      	Received unexpected error:
			            	non-zero utilization for idle cluster
			Test:       	TestTelemetry
			Messages:   	Task stopped: verify metrics for CPU utilization failed
	functionaltests_test.go:563: 
			Error Trace:	functionaltests_test.go:563
			            				functionaltests_unix_test.go:380
			Error:      	Received unexpected error:
			            	non-zero utilization for idle cluster
			Test:       	TestTelemetry
			Messages:   	Task stopped, verify metrics for memory utilization failed
	utils.go:183: Preserving test dir for failed test /tmp/tmp.FAnqwdppda/ecs_integ_testdata942367773

Expected Behavior

Observed Behavior

Environment Details

Supporting Log Snippets

@yumex93
Contributor

yumex93 commented Mar 4, 2019

This has happened multiple times, and the test failed for different reasons each time.

=== RUN   TestTelemetry
--- FAIL: TestTelemetry (752.16s)
	utils_unix.go:130: Created directory /var/log/functional-tests/ecs_integ_testdata034479138 to store test data in
	utils_unix.go:146: Launching agent with image: amazon/amazon-ecs-agent:latest
	utils_unix.go:233: Agent started as docker container: db93ba1d973cb40e33b9a5fbb7d3450f46e77e8634eb002424c06d0aa308842e
	utils.go:175: Found agent metadata: {Cluster:ecstest-telemetry-15e970a5-54f7-44e3-9d81-023f0b2db226 ContainerInstanceArn:0xc420283ff0 Version:Amazon ECS Agent - v1.26.0 (ebac2200)}
	utils.go:196: Task definition: ecs-metrics-test-686599a94de2d6516a1a6c429a2e3054:1
	utils.go:216: Started task: arn:aws:ecs:us-west-2:355690550782:task/d877ba44-ee3a-4a19-ae31-8aa476d066d4
	functionaltests_test.go:559: 
			Error Trace:	functionaltests_test.go:559
			            				functionaltests_unix_test.go:380
			Error:      	Received unexpected error:
			            	non-zero utilization for idle cluster
			Test:       	TestTelemetry
			Messages:   	Task stopped: verify metrics for CPU utilization failed
	functionaltests_test.go:563: 
			Error Trace:	functionaltests_test.go:563
			            				functionaltests_unix_test.go:380
			Error:      	Received unexpected error:
			            	non-zero utilization for idle cluster
			Test:       	TestTelemetry
			Messages:   	Task stopped, verify metrics for memory utilization failed
	utils.go:183: Preserving test dir for failed test /var/log/functional-tests/ecs_integ_testdata034479138
=== RUN   TestTelemetry
--- FAIL: TestTelemetry (754.69s)
	utils_unix.go:130: Created directory /var/log/functional-tests/ecs_integ_testdata041051029 to store test data in
	utils_unix.go:146: Launching agent with image: amazon/amazon-ecs-agent:latest
	utils_unix.go:233: Agent started as docker container: d2ece0f21f9ce7a47cb2d929484d327ab249f8085fbae702999aaaf35bfe4268
	utils.go:175: Found agent metadata: {Cluster:ecstest-telemetry-92e53e29-16b0-4126-ac27-6b050e724c93 ContainerInstanceArn:0xc420420070 Version:Amazon ECS Agent - v1.26.0 (ebac2200)}
	utils.go:196: Task definition: ecs-metrics-test-356c9a551fa37a7563b192ec719173eb:1
	utils.go:216: Started task: arn:aws:ecs:us-west-2:355690550782:task/00290c0e-6c4b-4322-9386-4dd658e49d78
	functionaltests_test.go:537: 
			Error Trace:	functionaltests_test.go:537
			            				functionaltests_unix_test.go:380
			Error:      	Max difference between 20.833333333333336 and 14.49838981816643 allowed is 5, but difference was 6.334943515166906
			Test:       	TestTelemetry

@fierlion
Member

fierlion commented Mar 5, 2019

src/amazon-ecs-agent/agent/functional_tests/util/utils.go:

          if idleCluster {
                  if *datapoint.Average != 0.0 {
                          return nil, fmt.Errorf("non-zero utilization for idle cluster")
                  }
          } 

So we are seeing either idleCluster = true with a non-zero utilization, or differences outside the expected delta of 5. Are these related in any way?
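To make the two failure modes concrete, here is a minimal sketch that reproduces both checks in isolation. `checkIdle` mirrors the strict comparison quoted above. `checkDelta` is a hypothetical stand-in for the delta assertion that produced the "Max difference" message in the log, not the test's actual helper. The 0.5 input is an illustrative value; the other numbers are taken from the failing run.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// checkIdle mirrors the strict check quoted above: any non-zero
// average utilization on an idle cluster is treated as a failure.
func checkIdle(average float64) error {
	if average != 0.0 {
		return errors.New("non-zero utilization for idle cluster")
	}
	return nil
}

// checkDelta is a hypothetical stand-in for the delta assertion seen in
// the log: it fails when |expected - actual| exceeds maxDelta.
func checkDelta(expected, actual, maxDelta float64) error {
	diff := math.Abs(expected - actual)
	if diff > maxDelta {
		return fmt.Errorf("max difference between %v and %v allowed is %v, but difference was %v",
			expected, actual, maxDelta, diff)
	}
	return nil
}

func main() {
	// Illustrative small non-zero reading on an idle cluster.
	fmt.Println(checkIdle(0.5))
	// Values from the third failing run in this issue.
	fmt.Println(checkDelta(20.833333333333336, 14.49838981816643, 5))
}
```

Both checks reject small deviations from an exact expectation, so measurement noise in the reported metrics could plausibly trigger either one.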

@fierlion
Member

#1955

docker stats has noise/jitter, which means that the reported CloudWatch metrics have the same noise. I ran the telemetry task and saw values for an idle cluster ranging from 0.0% to 2.7% CPU usage, as reported by docker stats via CloudWatch metrics. Once the compute was running, I saw peaks and valleys of at most +/- 5.6%.
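One way to absorb this kind of jitter is to compare the idle reading against a small tolerance instead of exactly zero. The sketch below is only an illustration of that idea and is not necessarily what the linked change does. The name `checkIdleWithTolerance` and the 3.0 threshold are assumptions, with the threshold chosen just above the 2.7% idle peak reported here.

```go
package main

import "fmt"

// idleNoiseThreshold is a hypothetical tolerance, in percent. It is set
// slightly above the 2.7% idle peak observed in this comment.
const idleNoiseThreshold = 3.0

// checkIdleWithTolerance accepts any idle reading in [0, threshold]
// and reports an error for readings outside that range.
func checkIdleWithTolerance(average, threshold float64) error {
	if average < 0 || average > threshold {
		return fmt.Errorf("utilization %.2f%% outside idle range [0, %.2f%%]", average, threshold)
	}
	return nil
}

func main() {
	// 0.0 and 2.7 are the idle extremes reported above; 5.6 is the
	// jitter magnitude seen under load, used here as an out-of-range case.
	for _, avg := range []float64{0.0, 2.7, 5.6} {
		fmt.Printf("%.1f%% -> %v\n", avg, checkIdleWithTolerance(avg, idleNoiseThreshold))
	}
}
```

The trade-off is that a looser idle check can hide a genuinely busy "idle" cluster, so the threshold should stay close to the observed noise floor.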

@fierlion
Member

fierlion commented May 9, 2019

A fix for this was shipped in agent 1.26

@fierlion fierlion closed this as completed May 9, 2019