Update to 1.26.0 #1916

petderek · 2019-02-28T23:24:45Z

Summary

Updating Agent to 1.26.0. Changelog:

Feature - Container Ordering #1904
Feature - Container level timeouts #1904
Feature - AWS Appmesh CNI plugin support #1898
Enhancement - Shutdown order is now observed #1904
Bug - Image cleanup errors fixed #1897

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Merge 'dev' branch into 'container-ordering' branch

* The start timeout is governed by the agent and is the context timeout for the StartContainer API call.
* The stop timeout is the parameter passed to StopContainer and is the time the Docker daemon waits after a StopContainer call before issuing a SIGKILL to the container.

Add log driver secret to LogConfig and agent capability

go vet fixes for unkeyed fields and copying lock value

Mongo images are replaced with redis and crux images. Using Mongo images was leading to unpredictable results in functional and integration tests every time the Mongo images were updated on Docker Hub.

A 'Success' condition on a dependency container allows a target container to start only when the dependency container has exited successfully with exit code 0. A 'Complete' condition on a dependency container allows a target container to start only when the dependency container has exited. A 'Healthy' condition on a dependency container allows a target container to start only when the dependency container is reported to be healthy.
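The three conditions can be expressed as a small predicate. The sketch below is illustrative only: the `containerState` type and `conditionMet` function are hypothetical names, not the agent's dependency graph code, and conditions other than these three are not modeled.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified view of a dependency container.
type containerState struct {
	Exited   bool
	ExitCode int
	Healthy  bool
}

// conditionMet reports whether the dependency satisfies the named condition.
// Only the three conditions described above are modeled; anything else
// is treated as unmet in this sketch.
func conditionMet(condition string, dep containerState) bool {
	switch condition {
	case "SUCCESS":
		return dep.Exited && dep.ExitCode == 0
	case "COMPLETE":
		return dep.Exited
	case "HEALTHY":
		return dep.Healthy
	default:
		return false
	}
}

func main() {
	failed := containerState{Exited: true, ExitCode: 1}
	fmt.Println(conditionMet("SUCCESS", failed))  // false: exited, but not with code 0
	fmt.Println(conditionMet("COMPLETE", failed)) // true: any exit satisfies COMPLETE
}
```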

Prior to this commit, we only tracked state we explicitly tried to change when the task was starting. We did not respond to the event stream or any other source of information from Docker. This meant that while we were waiting for certain dependency conditions ("SUCCESS", "COMPLETE", or "HEALTHY"), the task progression logic did not update the agent-internal model of container state. Since we rely on that state for determining when conditions are met, tasks would get stuck in infinite startup loops. This change adds a call to engine.checkTaskState(), which explicitly updates any changed container state. We only call this function if we know that we are waiting on the aforementioned subset of dependency conditions.
Co-authored-by: Utsa Bhattacharjya <[email protected]>

engine: adding poll function during progressTask

We now apply shutdown order in any dependency case, including dependsOn directives, links, or volumes. This means the agent will now make a best-effort attempt to shut down containers in the inverse of the order in which they were created. For example, a container using a link for communication will wait until the linked container has terminated before terminating itself. Likewise, a container named in another container's dependsOn directive will wait until that dependent container terminates.

One note about the current implementation: dependencies aren't assumed to be transitive, so if a chain exists such that:

    A -> B -> C

container "C" will shut down before "B", but it won't explicitly check its status against container "A". If A depends on C, we expect that dependency to be declared explicitly:

    A -> B -> C
    A -> C

The lack of transitive dependency logic is consistent with startup order as of this commit.
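A minimal Go sketch of the non-transitive check follows. It is an illustration, not the agent's dependency graph code: the `canStop` function, the `dependsOn` map, and the container names are all hypothetical. Here `dependsOn[x]` lists the containers that `x` directly depends on, and a container may stop only once every container that directly depends on it has stopped.

```go
package main

import "fmt"

// canStop reports whether `name` may stop: every container that DIRECTLY
// depends on it must already be stopped. Indirect (transitive) dependents
// are deliberately not consulted, mirroring the behavior described above.
func canStop(name string, dependsOn map[string][]string, stopped map[string]bool) bool {
	for dependent, deps := range dependsOn {
		for _, dep := range deps {
			if dep == name && !stopped[dependent] {
				return false
			}
		}
	}
	return true
}

func main() {
	// Hypothetical chain: app depends on proxy, proxy depends on db.
	chain := map[string][]string{"app": {"proxy"}, "proxy": {"db"}}
	stopped := map[string]bool{"proxy": true} // app is still running

	// db only checks its direct dependent (proxy), so it may stop
	// even though app has not stopped.
	fmt.Println(canStop("db", chain, stopped)) // true

	// Declaring the app -> db dependency explicitly makes db wait for app.
	explicit := map[string][]string{"app": {"proxy", "db"}, "proxy": {"db"}}
	fmt.Println(canStop("db", explicit, stopped)) // false
}
```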

The link / volume dependency tests are now affected by shutdown order, so the tests now take longer. Previously, they would take a max of 30s (the default docker stop timeout for agent). Now, since the containers stop in order, they will take a max of 30s * n, where n is the number of containers. Increasing the test timeout is a short-term fix until we have granular start/stop timeouts plumbed through the ECS service.

Instead of explicitly checking against many conditions, we now validate that the expected condition has progressed beyond started. This mirrors prior behavior in the codebase, and reduces cyclomatic complexity.

dependencygraph: Enforce shutdown order

The ‘StartTimeout’ now serves only as the duration after which a dependency will not be resolved, when a container has a dependency on another container and the condition is ‘SUCCESS’, ‘HEALTHY’, or ‘COMPLETE’. For example:
• If a container A has a dependency on container B and the condition is ‘SUCCESS’, the StartTimeout for container B will roughly be the time required for it to exit successfully with exit code 0.
• If a container A has a dependency on container B and the condition is ‘COMPLETE’, the StartTimeout for container B will roughly be the time required for it to exit.
• If a container A has a dependency on container B and the condition is ‘HEALTHY’, the StartTimeout for container B will roughly be the time required for it to emit a ‘HEALTHY’ status.
If the StartTimeout is exceeded in any of the above cases, container A will not be able to transition to ‘CREATED’ status. This effectively reverts the implementation of StartTimeout in commit 79bd517.

This is the first batch of integration tests for container ordering. The tests handle the basic use cases for each of the conditions that introduce new behavior into agent (HEALTHY, COMPLETE, SUCCESS).

Container ordering integ tests

* "START" Dependency condition has been changed to "CREATE" as it waits for the dependency to at least get created.
* "RUNNING" Dependency Condition has been changed to "START" as it waits for the dependency to get started.

Before resolving the dependency for a target container, we now check whether the time duration (StartTimeout) specified by the user for a container has expired. For example:
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'SUCCESS', then the dependency will not be resolved if B times out before exiting successfully with exit code 0.
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'COMPLETE', then the dependency will not be resolved if B times out before exiting.
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'HEALTHY', then the dependency will not be resolved if B times out before emitting a 'Healthy' status.
The advantage of this is that the user will know that something is wrong with the task if the task is stuck in pending.

Dependency Condition Naming change:

Remove the functionality of StartTimeout as Docker API Start Timeout

* Remove need to pull 'latest' server core
  By removing the :latest tag from all windowsservercore containers, we will have the tests use the container that's already baked into the AMI.
* Remove dependency on golang and python containers
  We are removing the need to use any containers other than servercore and nanoserver. This reduces the number of downloads needed and the number of builds that happen before the tests start running.
* Explicit timeouts on order tests
  The ordering tests are broken at the moment, so we are capping them with a fixed timeout.

Faster windows test

Checking dependency resolution after timeout and successful exit check

1. Add proxy config into acs model
2. Convert acs model to app mesh config in task
3. Pass app mesh config from task to app mesh plugin and invoke add, del command on network setup and clean

1. Add dockerfile to build amazon-vpc-cni-plugins
2. Add build target in makefile for amazon-vpc-cni-plugins

If the user didn't specify the agent's task metadata endpoint IP or the instance metadata endpoint IP in an App Mesh-enabled task, the agent should add these two default IPs to the App Mesh egressIgnoredIPs field, as we don't want traffic to them to be redirected to the Envoy proxy.
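The defaulting behavior can be sketched as follows. This is an illustration rather than the agent's code: the constant and function names are hypothetical, and the two addresses are assumed to be the well-known link-local endpoints (169.254.170.2 for the ECS task metadata endpoint and 169.254.169.254 for the EC2 instance metadata endpoint).

```go
package main

import "fmt"

// Assumed well-known link-local addresses; the constant names are hypothetical.
const (
	taskMetadataEndpointIP     = "169.254.170.2"
	instanceMetadataEndpointIP = "169.254.169.254"
)

// withDefaultIgnoredIPs returns a copy of the user-supplied egress-ignored IPs
// with the two default endpoints appended if they are not already present.
func withDefaultIgnoredIPs(ignored []string) []string {
	seen := make(map[string]bool, len(ignored))
	for _, ip := range ignored {
		seen[ip] = true
	}
	out := append([]string(nil), ignored...)
	for _, ip := range []string{taskMetadataEndpointIP, instanceMetadataEndpointIP} {
		if !seen[ip] {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	// The user listed one custom IP and the instance metadata endpoint,
	// so only the task metadata endpoint is added.
	fmt.Println(withDefaultIgnoredIPs([]string{"10.0.0.5", "169.254.169.254"}))
}
```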

This should eventually make Windows tests faster to run. It fixes a bug where cancelling the task context caused an infinite steady-state loop. Previously, if the context expired, waitSteady() would spin forever since the timeout no longer worked. This change introduces a check for context expiration earlier in the code.

ACS Model change for container ordering

The default helper function will now allocate 1024 cpu shares or 100% cpu-percent on Windows. This will enable the Windows-based tests to finish in more predictable ways. When Windows tests were constrained, simple tasks like "sleep 10" were taking much longer than the expected 10 seconds.

Adding Integ Tests for Granular Stop Timeout

If we set "prefer-cached" as the pull behavior, then the PullStartedAt and PullStoppedAt fields may not be present in the endpoint. This causes tests to fail. This change logs an error but prevents that specific failure case.

Merge branch 'container-ordering-feature' into dev

Change task server endpoint to 127.0.0.1

Fix windows validator test

Merge appmesh to dev branch

update nginx version

aws#1890 was merged using rebase and merge which is wrong, and dev is still "56 commits ahead, 1 commit behind master." Fixing by merging again.

Fix merge master to dev "update to 1.25.3"

See changelog entries for complete changes.

sharanyad · 2019-02-28T23:30:00Z

CHANGELOG.md

@@ -1,5 +1,12 @@
 # Changelog

+## 1.26.0


Can we make the changelog items more descriptive?

Something like:
* Enable start and stop ordering for containers in a task
* Container level configurable start and stop container timeouts

Not sure what this means: "Shutdown order is now observed"

ubhattacharjya and others added 30 commits January 30, 2019 13:31

Model changes for ACS for container ordering

e36e2c8

Adding adapter to have volumes/links use DependsOn field

5fbfab0

Refactor dependency graph code for volume/links dependency resolution

5c3fa51

Merge branch 'dev' into container-ordering feature

680db40

Add log driver secret to LogConfig and agent capability

4e818ef

Merge pull request aws#1853 from ubhattacharjya/mergeBranch

2cf4734

Merge 'dev' branch into 'container-ordering' branch

statemanager: add container start/stop statemanger

3265bf9

Merge pull request aws#1818 from tommyhahn/branch_logging_driver

e325804

Add log driver secret to LogConfig and agent capability

Merge pull request aws#1849 from adnxn/dev-granular-timeouts

f4da625

readme: update config level container timeouts

0e8a379

cleaned travis config format, and upgraded to go 1.11

9ab5eff

go vet fixes for unkeyed fields and copying lock value

Replace mongo docker images with crux and redis

7b9b9fc

Mongo images are replaced with redis and crux images. Using Mongo images was leading to unpredictable results in functional and integration tests every time the Mongo images were updated on Docker Hub.

Revert "Add log driver secret to LogConfig and agent capability" due …

9708e51

…to pending test process

Merge pull request aws#1876 from petderek/container-ordering-task-sync

3bc095b

engine: adding poll function during progressTask

dependencygraph: simplify container start logic

89bb2e8

Instead of explicitly checking against many conditions, we now validate that the expected condition has progressed beyond started. This mirrors prior behavior in the codebase, and reduces cyclomatic complexity.

Merge pull request aws#1866 from petderek/container-ordering-feature

582327c

dependencygraph: Enforce shutdown order

engine: add ordering integration tests

5bac35e

This is the first batch of integration tests for container ordering. The tests handle the basic use cases for each of the conditions that introduce new behavior into agent (HEALTHY, COMPLETE, SUCCESS).

Merge pull request aws#1881 from petderek/container-ordering-integ-tests

af9e1f5

Container ordering integ tests

Dependency Condition Naming change:

4ac84e4

* "START" Dependency condition has been changed to "CREATE" as it waits for the dependency to at least get created.
* "RUNNING" Dependency Condition has been changed to "START" as it waits for the dependency to get started.

Merge pull request aws#1882 from ubhattacharjya/Naming

7ba5947

Dependency Condition Naming change:

Merge pull request aws#1880 from ubhattacharjya/ChangeStartTimeout

83c928b

Remove the functionality of StartTimeout as Docker API Start Timeout

Merge pull request aws#1886 from petderek/fast-windows-test

ee7a419

Faster windows test

ubhattacharjya and others added 28 commits February 24, 2019 17:17

Merge pull request aws#1884 from ubhattacharjya/bugFix

314e658

Checking dependency resolution after timeout and successful exit check

ACS model change

6ef7266

Accommodate ACS model change

01b0505

Add app mesh support to agent.

25f450c

1. Add proxy config into acs model
2. Convert acs model to app mesh config in task
3. Pass app mesh config from task to app mesh plugin and invoke add, del command on network setup and clean

Add amazon-vpc-cni-plugins into agent

774d936

1. Add dockerfile to build amazon-vpc-cni-plugins
2. Add build target in makefile for amazon-vpc-cni-plugins

Adding shutdown order test

d986f36

Merge pull request aws#1885 from ubhattacharjya/ACS

fc2d930

ACS Model change for container ordering

Adding Integ Tests for Granular Stop Timeout

e34ceb7

Change task server endpoint to 127.0.0.1

a3d87cb

Fix Shutdown Order test

bc3f30c

Merge pull request aws#1887 from ubhattacharjya/StopTimeoutTest

95de818

Adding Integ Tests for Granular Stop Timeout

Update amazon-vpc-cni-plugins to latest sha to include CGO_ENABLED bu…

ea64925

…ild tag

Merge branch 'container-ordering-feature' into dev

4a67ab7

update to 1.25.3

6e98779

functional tests: validator pull image update

b36a2fa

If we set "prefer-cached" as the pull behavior, then the PullStartedAt and PullStoppedAt fields may not be present in the endpoint. This causes tests to fail. This change logs an error but prevents that specific failure case.

Merge pull request aws#1904 from ubhattacharjya/MergeContOrder

90793e4

Merge branch 'container-ordering-feature' into dev

Merge pull request aws#1901 from suneyz/iam

8529bd8

Change task server endpoint to 127.0.0.1

Merge pull request aws#1905 from petderek/fix-win-test

6d37bed

Fix windows validator test

Merge pull request aws#1898 from aws/app-mesh

b3629cc

Merge appmesh to dev branch

update nginx version

5848638

Merge pull request aws#1907 from yumex93/update_nginx

08c9e1d

update nginx version

Merge branch 'master' into dev

94ecd6a

aws#1890 was merged using rebase and merge which is wrong, and dev is still "56 commits ahead, 1 commit behind master." Fixing by merging again.

Merge pull request aws#1908 from fenxiong/merge-dev

bb0c31a

Fix merge master to dev "update to 1.25.3"

Fix error detection case when image that is being deleted does not exist

243ddda

Update to 1.26.0

741db5f

See changelog entries for complete changes.

petderek closed this Feb 28, 2019

sharanyad reviewed Feb 28, 2019
