Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update v4 metadata to return network despite nil #2719

Merged
merged 1 commit into from
Nov 12, 2020

Conversation

fierlion
Copy link
Member

Summary

This updates the v4 metadata endpoint to return network metadata for all available. Previously we threw a http.StatusInternalServerError (500) in the case of any container in the task not having available network metadata.

Implementation details

removed Error cases for individual missing metadata and added a Warn log level entry.

Testing

make gogenerate
make test
(will update integ/functional tests as necessary given automated tests)

New tests cover the changes: no

Description for the changelog

Update v4 metadata to return network despite nil

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fierlion fierlion changed the title [WIP] update v4 metadata to return network despite nil update v4 metadata to return network despite nil Nov 11, 2020
@sparrc
Copy link
Contributor

sparrc commented Nov 11, 2020

could you delete the error log on this line so this doesnt get double-logged?

seelog.Errorf("Unable to get container network response for container '%s'", containerID)
return nil, errors.Errorf("unable to generate network response for container '%s'", containerID)
,

also i would prefer to reword the error returned there^ to soemthing like no network settings found for container....

@fierlion fierlion merged commit 7a82ee7 into aws:dev Nov 12, 2020
@fierlion fierlion added this to the 1.48.0 milestone Nov 12, 2020
@fierlion fierlion deleted the metadata_running_check branch November 12, 2020 00:25
fierlion added a commit to fierlion/amazon-ecs-agent that referenced this pull request Nov 12, 2020
shubham2892 pushed a commit that referenced this pull request Nov 12, 2020
fenxiong pushed a commit to fenxiong/amazon-ecs-agent that referenced this pull request Nov 30, 2020
fenxiong pushed a commit that referenced this pull request Dec 2, 2020
sparrc added a commit that referenced this pull request Feb 5, 2021
* Update go version to 1.15 in travis.yml, github actions, dockerfiles

* Retry to rename files on error as os.Rename calls are prone to failure on Windows

* adding fsxwindowsfileserver task resource

* Release 1.46.0

* Fix gocyclo import

* fsxwindowsfileserver: fixing task concurrency issues with drive assignment

* modify Makefile to use different Golang version while building Agent for Windows

* Update unit test coverage minimum

* update README.md to add new environment variable ECS_FSX_WINDOWS_FILE_SERVER_SUPPORTED

* Explicitly initialize seelog

The logger package's initialization logic is init() which will be
executed by loading the package.

However some of our programs are using the agent as a library and
haven't configured seelog beforehand. Due to that, loading the package
logs

```
1600397429848419666 [Error] node must have children
```
which breaks the programs.

I think it would be safer to do less on init() and explicitly
initialize the logger from the agent's main().

Signed-off-by: Kazuyoshi Kato <[email protected]>

* Release 1.47.0

* stop container timeout: increase buffer from 30s to 2m

* Fix fluentbit integ test container to version 2.7.0

* improve logging for eni attachment issues

* Switch back to latest firelens image in test and update the test.

* save RCI client token to agent state

* integ tests: cleanup and minimize dockerhub pulls (#2714)

* Remove namespace tests, already converted to functional

* remove taskmetadata-validator

* remove GPU integ tests

* remove parallel-pull-fts

* remove awscli, squid

* cleanup gremlin, netkitten, pause-container builds, and some unused dockerfiles

* change tianon/true to gcr distroless static

* remove ubuntu:20.04 and nginx:stable

* remove redundant registry auth integ test

* update v4 metadata to return network stats despite missing stats for single container (#2719)

* removes redundant log statement (#2724)

* Fix jumbled min & max for engine connection retry delays

* add p4d instance type to list of GPU instances

* Revert "removes redundant log statement (#2724)"

This reverts commit 38ac07d.

* Revert "update v4 metadata to return network stats despite missing stats for single container (#2719)"

This reverts commit 7a82ee7.

* Introduce an environment variable ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT.
The environment variable default value is true. It allows dependent containers start
pulling images before dependsOn conditions in a task definition has been satisfied.

Containers have known status PULLED are defined as pulled containers. These containers
are stored in a pulled container map, and can be retrieved from Task Metadata Endpoint
version 4 (TMDEv4) path ${ECS_CONTAINER_METADATA_URI_V4}/task when the environment
variable is set to true.

* load pulled containers map from state

* Set ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT=false as default

* scripts: add ci-ecr-pull and ci-ecr-push (#2720)

These scripts are for pushing the dockerhub images that we need for our
ci system (integ test and build) to ECR, and for pulling those images
down within the CI system.

ci-ecr-pull will tag the ECR images like 'golang:1.15.4' so that users
who do not run this script on their laptop or elsewhere can still pull
from dockerhub, only if this script is run will the builds reference the
ECR images.

The reason behind this is that dockerhub is now rate-limiting anonymous
pulls, so in order to avoid being rate-limited we should pull our images
from ECR as much as we can.

This should also speed up builds because we will be pulling our large
build images (golang, gcc) from in-region s3 rather than over the
internet.

One side-effect of this change is that we have to reference full version
numbers in our dockerfiles (1.15.4 instead of 1.15), in order to avoid
querying docker for the latest version, and to avoid dockerhub pulls
happening when an image is updated.

* remove explicit docker pull of test registry images (#2735)

Only build fluentd container on x86_64

* Update ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT default value in README file

* Release 1.48.0

* Add structured logging on top of seelog

* Fix an edge case that causes container dependency deadlock.

Scenario: two containers container A and container B, container A is a non-essential container,  container B has a dependency on container A.

Workflow leading to deadlock:
1. We start to pull container A's mage;
2. We fail to pull container A's image;
3. We try to create container A anyway (this is intended as we want to try cached image if possible);
4. We fail to create the container A due to missing image;
5. We try to stop container A;
6. We can't stop container A because when container B has a dependency on A, we implicitly have a reversed dependency on shutdown, such that container A won't stop before container B stops;
7. However we can't move forward container B either, because it's waiting on container A. This ends up in a deadlock.

Fix: return an error when checking container B's dependency resolution when it sees that container A will never start.

* Revert "save RCI client token to agent state"

This reverts commit e21488b.

* Revert "Add structured logging on top of seelog"

This reverts commit e11ee0b.

* Release 1.48.1

* fixing agent path vulnerability on windows

* update v4 metadata to return network stats despite missing stats for single container (#2719)

* removes redundant log statement (#2724)

* Add unit test for task metadata endpoint with missing container network.

* update to latest aws-sdk-go (#2752)

* docker_client: Add missing error handling in getContainerStatsNotStreamed

* Fetch goversion for windows from a file

* Delete 'output' file (#2762)

* Improve error and info logging around v2 credentials requests (#2705)

also cleanup an unrelated warning message that could have previously
printed the full access key id of the container instance in plain text.

Signed-off-by: Cam <[email protected]>

* Remove last remaining references to travis (#2763)

also add static checks on 'push', so that we have test runs on the dev
branch which will enable us to use the build status badge.

* sync aws-sdk-go version (#2764)

* Update ECS_POLL_METRICS value to be default false (#2768)

* Make create container timeout configurable

Co-authored-by: Mythri Garaga Manjunatha <[email protected]>
Co-authored-by: Anuj Singh <[email protected]>
Co-authored-by: Meghna Srivastav <[email protected]>
Co-authored-by: amazon-ecs-bot <[email protected]>
Co-authored-by: Kazuyoshi Kato <[email protected]>
Co-authored-by: Cam <[email protected]>
Co-authored-by: Sharanya Devaraj <[email protected]>
Co-authored-by: Cameron Sparr <[email protected]>
Co-authored-by: Ray Allan <[email protected]>
Co-authored-by: Sai Prasad <[email protected]>
Co-authored-by: Shubham <[email protected]>
Co-authored-by: Chien-Han Lin <[email protected]>
Co-authored-by: Angel Velazquez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants