
ecs-agent: bump to 1.48.1 #1201

Merged
jahkeup merged 3 commits into bottlerocket-os:develop on Jan 12, 2021

Conversation

@jahkeup (Member) commented Nov 10, 2020

Issue number:

n/a

Description of changes:

Update ecs-agent to v1.48.1

Testing done:

@srgothi92 was able to run ECS tests against a build with these changes. The upgrade/downgrade test he ran failed on the downgrade portion, as we expected. The next release, which includes this change, will be a one-way-only upgrade: the underlying datastore is migrated and does not support a seamless rollback. Operators who want to roll back will need to take steps appropriate for their container instances and their resources.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@tjkirch (Contributor) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@jahkeup (Member, Author) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@tjkirch I'm not sure I know exactly which change you're asking after. Can you point me to an example in a prior bump?

FWIW: I'm not done with this change, not at all :) (which is why it's still in draft)

@tjkirch (Contributor) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@tjkirch I'm not sure I know exactly which change you're asking after. Can you point me to an example in a prior bump?

https://github.com/aws/amazon-ecs-agent/releases/tag/v1.44.0

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades. We've been holding off on this upgrade to make sure that we're not losing important state in either direction.

[edit] GAH, sorry, didn't mean to hit the "close and comment" button, just "comment"

@tjkirch tjkirch closed this Nov 10, 2020
@tjkirch tjkirch reopened this Nov 10, 2020
@samuelkarp (Contributor) commented:

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades.

I'm pretty sure it handles upgrades but not downgrades.

@jahkeup (Member, Author) commented Nov 11, 2020

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades.

I'm pretty sure it handles upgrades but not downgrades.

Gotcha, I haven't gotten to the point that I can test this yet.

@tjkirch, @samuelkarp were there some ideas floated on how an enforced, unidirectional update, if required, might be accomplished?

With the rollback image locally available, we'd expect that users could still downgrade, so the question is probably even wider: should, or can, we even try to prevent a downgrade when incoherent state is nearly certain?

My sense is that we're not in a position to even try to roll back while also backing out of altered/migrated state. That would mean we'd have to choose the older state file. Returning to the older state file, I strongly suspect, would be potentially dangerous and incorrect (e.g., awsvpc ENIs, outdated local task and container configurations, and I'm sure others). In other cases, especially for an automatic rollback, the older state file could probably be put back in place (if it was migrated during startup) and startup could continue.

@jahkeup (Member, Author) commented Nov 13, 2020

We might want to hold off on bumping until aws/amazon-ecs-agent#2708 is released. The change persists a registration token that an instance uses to register with a stable container instance ARN. Without this change, it appears that container instances will be re-registered with new container instance ARNs each time (a new random UUID is used on each boot). This would likely impact instance-specific associated resources (like ENIs), though I can't find docs that outline exactly what would happen.

@samuelkarp (Contributor) commented Nov 13, 2020

Without this change, it appears that container instances will be reregistered with new container instance ARNs each time (using a new random uuid each boot).

This is so that the ECS service registers the same container instance ARN for a container instance that issues quick back-to-back RCI requests.

It looks like that's there for retries, not reboots, but I don't have an objection to waiting.

@jahkeup (Member, Author) commented Nov 13, 2020

Additionally, we'll need to take care to handle rollback by also removing the BoltDB datastore that was migrated to (if it was). If we don't, the agent will come back up with the previously migrated state, which is problematic for cases where a rolled-back instance was used and then later upgraded again.

@samuelkarp (Contributor) commented Nov 24, 2020

aws/amazon-ecs-agent#2738 is another issue we should be aware of. We should check whether this impacts 1.47.0 as well.

@jahkeup (Member, Author) commented Nov 24, 2020

@tjkirch and I discussed some steps the migrations can take to eliminate the risks of upgrade and downgrade movements for the ECS agent. The working plan is that I'll add a migration that errs on the side of caution (sketched after the reasoning below).

Migration: On Upgrade
  • no-op: do nothing
Migration: On Downgrade
  • a) when state.json is present AND state.boltdb is not, leave it
  • b) when state.boltdb is present, remove it
    • i) if state.json is also present, remove it
Reasoning
  • no-op upgrade: there won't be a BoltDB datastore coming into this version.

  • a: the agent hasn't migrated from the state file - we can expect that the next agent start will pick up where the state says it left off at.

  • b: the agent has a BoltDB state datastore (either from migration or from a fresh run) - either way, a downgrade from the version of Bottlerocket which introduces the agent (with BoltDB) is going to need a different state file.

    Removing the BoltDB allows for a downgrade to an older agent where its json state file is used. Later on, the agent will migrate per its standard operating procedures.

  • b + i: the agent accessed the JSON state file and may have successfully migrated to the BoltDB datastore. At this point it isn't practical for Bottlerocket to determine if the state remained in sync (which is possible, but unlikely).

    We could build some :handwave: shenanigans to facilitate this upgrade path (rather than deleting ALL state altogether), but the implementation would need to be able to authoritatively conclude that both the BoltDB and the JSON state datastores are equivalent and okay to reuse.

    I'm not interested in implementing a clever migration that reaches into these datastores, in either direction. To me, a clever solution that tries to accurately detect "safe" datastore transitions would be error-prone, fragile, and would rely on data structures internal to the agent.
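
To make the branching concrete, here's a rough sketch of that downgrade migration, using the shorthand file names above and an illustrative data directory; this is not the actual Bottlerocket migration code, just the decision logic:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the proposed downgrade handling: a lone state.json is left in
/// place (case a); if state.boltdb exists it is removed, along with
/// state.json when both are present (cases b and b+i).
fn ecs_state_downgrade(data_dir: &Path) -> io::Result<()> {
    let json = data_dir.join("state.json");
    let boltdb = data_dir.join("state.boltdb");

    if boltdb.exists() {
        // Case b: a BoltDB datastore exists (migrated or fresh); remove it so
        // the older agent doesn't trip over a format it can't read.
        fs::remove_file(&boltdb)?;
        if json.exists() {
            // Case b + i: we can't tell whether the JSON file is still in
            // sync with the migrated state, so remove it as well.
            fs::remove_file(&json)?;
        }
    }
    // Case a (state.json alone) and the empty case need no action.
    Ok(())
}

fn main() -> io::Result<()> {
    // Illustrative path only; a real migration would get this from its framework.
    ecs_state_downgrade(Path::new("/var/lib/ecs/data"))
}
```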

@jahkeup (Member, Author) commented Nov 24, 2020

aws/amazon-ecs-agent#2738 is another issue we should be aware of. We should check whether this impacts 1.47.0 as well.

A recent comment suggests this primarily impacts baking an AMI where the state wasn't removed prior to launching instances from the baked image.

So it looks like the issue is coming up for folks baking AMIs that are going from 1.47.0 to 1.48.0 (or otherwise causing state to exist). This issue is now coming up because of the change I mentioned before that fixes UUID handling. In the context of AMI baking, there's at least one customer pointing out that they simply had to include more files in their post-bake image-cleanup step.

We can't exactly stop this from being an issue for customers that bake AMIs - @samuelkarp, do you have a method or check in mind that I could perform for you?

@samuelkarp (Contributor) commented:

We should still support folks baking AMIs based on Bottlerocket. What we used to do in the ECS agent was to check the stored instance ID in the state file and discard the existing state if the instance ID retrieved from IMDS didn't match. If this bug only affects 1.48.0, we should probably hold off on upgrading to that version.
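
For reference, the check described above might look roughly like this; the state file path, instance ID value, and string-matching shortcut are assumptions for illustration, not the agent's actual implementation:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the baked-AMI safeguard: compare the EC2 instance ID recorded in
/// the saved state with the one reported by IMDS, and discard the state if
/// they differ (i.e., the state was baked on a different instance).
fn discard_stale_state(state_path: &Path, imds_instance_id: &str) -> io::Result<()> {
    if !state_path.exists() {
        return Ok(());
    }
    let contents = fs::read_to_string(state_path)?;
    // Naive substring check; a real implementation would parse the JSON and
    // compare the stored instance ID field directly.
    if !contents.contains(imds_instance_id) {
        fs::remove_file(state_path)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Placeholder values; the instance ID would come from IMDS at runtime.
    discard_stale_state(Path::new("/var/lib/ecs/data/state.json"), "i-0123456789abcdef0")
}
```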

@samuelkarp (Contributor) commented:

b) when state.boltdb is present, remove it

i) if state.json is also present, remove it

This seems like the wrong approach to me. The state file is the only way the agent knows its own identity; if it gets removed completely, the next time the agent starts up it'll act as if it's starting up for the first time and register itself anew in the cluster. This can lead to multiple container instance IDs for a single EC2 instance: one that is active and connected and one (or more) that is active and disconnected, effectively a zombie record. The state file also tracks resources that should be garbage-collected later, including containers themselves (and their associated ephemeral write layers) as well as container images.

@samuelkarp (Contributor) commented:

aws/amazon-ecs-agent#2738 was resolved in v1.48.1 via aws/amazon-ecs-agent#2740.

@samuelkarp (Contributor) commented:

Migration: On Upgrade
  • no-op: do nothing
Migration: On Downgrade
  • a) when state.json is present AND state.boltdb is not, leave it
  • b) when state.boltdb is present, fail the migration
Reasoning

b) This case is not generally solvable by our migration system and operator intervention is needed. The ECS agent tracks a variety of state in its state file, most of which is important to its correct operation. For example, the agent tracks its own identity (container instance ARN), the tasks that have run and exited (for communication back up to the control plane and subsequent garbage collection of the ephemeral container write layers), and the container images that have been pulled (similarly for garbage collection once containers using those images have been removed and sufficient time has passed). Removing the state file will leave the instance with uncollected garbage and will result in both the creation of a new container instance record in ECS and a "zombie" record that will remain until the EC2 instance is terminated.
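
For contrast with the earlier proposal, a minimal sketch of this amended behavior, with the same caveats (illustrative path, shorthand file names, simplified error handling):

```rust
use std::path::Path;

/// Sketch of the amended downgrade handling: a lone state.json (case a) is
/// left untouched, and the presence of state.boltdb (case b) fails the
/// migration so an operator can intervene.
fn ecs_state_downgrade(data_dir: &Path) -> Result<(), String> {
    let boltdb = data_dir.join("state.boltdb");
    if boltdb.exists() {
        // Case b: fail loudly rather than guessing at the agent's state.
        return Err(format!(
            "found ECS agent BoltDB datastore at {}; downgrade requires operator intervention",
            boltdb.display()
        ));
    }
    Ok(())
}

fn main() {
    // Illustrative path only; a real migration would get this from its framework.
    if let Err(msg) = ecs_state_downgrade(Path::new("/var/lib/ecs/data")) {
        eprintln!("migration failed: {msg}");
        std::process::exit(1);
    }
}
```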

@tjkirch (Contributor) commented Dec 4, 2020

Removing the state file will leave the instance with uncollected garbage and will result in both the creation of a new container instance record in ECS and a "zombie" record that will remain until the EC2 instance is terminated.

I agree with the analysis, but I'm not sure trashing the instance is better than having it serve with a new container instance ID. Either way there's a zombie / useless instance, but if we clean up state, at least it's handling tasks. (What would you want to happen with a failed migration? Even if we rolled back, the ecs-agent state would be wrong, as you described.)

@samuelkarp (Contributor) commented:

Either way there's a zombie / useless instance, but if we clean up state, at least it's handling tasks.

My preference for failures is generally to fail loudly and let an operator intervene instead of attempting to recover. There are enough edge cases in the state that it seems wrong to me to try and solve this at the OS level: ephemeral layer cleanup, image cleanup, potential attachment (ENI & EBS volume) cleanup, Docker volume plugin cleanup, zombie records in the ECS API, etc. If an instance fails to successfully come back up after a roll-back, an operator can know that something's wrong by seeing the failure. On the other hand, if the instance attempts to recover itself it makes the failure less obvious (and might end up running new tasks that can be broken by the existing non-cleaned state on the instance).

@tjkirch (Contributor) commented Dec 4, 2020

I can understand that, but it does put us in a very bad place. "when state.boltdb is present, fail the migration" means we have to (loudly) mark this upgrade as one-way, something we try very hard to avoid in the project in general.

Is there any way to get support from the agent itself in downgrading state back to JSON? Something like... agent is updated to support both formats for a while, instead of the one-way cutoff we'd have to suffer now. Or... agent package provides a separate conversion tool, which we store in persistent storage and call from the migration.

@samuelkarp (Contributor) commented:

Is there any way to get support from the agent itself in downgrading state back to JSON? Something like... agent is updated to support both formats for a while, instead of the one-way cutoff we'd have to suffer now. Or... agent package provides a separate conversion tool, which we store in persistent storage and call from the migration.

We can ask, but the ECS agent has never supported backwards migrations even in the older JSON file store. (It's not even the format that's necessarily changing, but if a new version of the ECS agent bumps the schema and adjusts the fields in the state file, an older version of the agent will already refuse to load it.)

@jahkeup changed the title from "ecs-agent: bump to v1.47.0" to "ecs-agent: bump to 1.48.1" on Dec 16, 2020
Signed-off-by: Jacob Vallejo <[email protected]>
@gregdek gregdek added the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@tjkirch tjkirch removed the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@etungsten etungsten mentioned this pull request Jan 7, 2021
@jahkeup jahkeup marked this pull request as ready for review January 8, 2021 21:41
@webern (Contributor) left a comment

Sounds like testing is solid.

@@ -51,6 +50,7 @@ Requires: %{_cross_os}iptables
%prep
%autosetup -Sgit -n %{gorepo}-%{gover} -p1
%cross_go_setup %{gorepo}-%{gover} %{goproject} %{goimport}
cp %{S:9} "agent/version/version.go"
Contributor:

This makes me uncomfortable but 🤷‍♂️ I get why you're doing it.

Can you add a comment here to be more explicit?

@jahkeup (Member, Author):

Yeah, I'm not happy about the problem/solution either. I think this is an improvement over revving a patch with release data - we can revisit this if upstream later permits ldflags as distributed.

I'll add a comment to explain what's going on here.

@jahkeup (Member, Author):

Added a brief comment in the latest push.

packages/ecs-agent/ecs-agent.spec
The patch would need to be revved each time the commit and version
were bumped. Instead, this allows for the linker to provide build data
without needing to update the patch beforehand.
This bump includes changes [^1] that were made to a number of
container images built for the agent. The nested build that was used
to drive the build of `pause` no longer functioned. The build process
now builds the executable directly to sidestep the new upstream
build, which relies on Docker.

[^1]: aws/amazon-ecs-agent@7be883b#diff-6ef5a00ce2b6a82260b403d7c14a7ad519480b5f928012a5e3fd355a29310bbe

Signed-off-by: Jacob Vallejo <[email protected]>
Tested-By: Shailesh Gothi <[email protected]>
@jahkeup jahkeup merged commit bb71298 into bottlerocket-os:develop Jan 12, 2021
@jahkeup jahkeup deleted the ecs-agent/1.47.0 branch January 12, 2021 00:52