
ecs-agent: bump to 1.48.1 #1201

Merged
jahkeup merged 3 commits into bottlerocket-os:develop on Jan 12, 2021

Conversation

@jahkeup (Member) commented Nov 10, 2020

Issue number:

n/a

Description of changes:

Update ecs-agent to v1.48.1

Testing done:

@srgothi92 was able to run ECS tests against a build with these changes. The upgrade/downgrade test he ran failed on the downgrade portion, as we expected. The next release, which includes this change, will be a one-way-only upgrade: the underlying datastore is migrated and does not support a seamless rollback. Operators who want to roll back will need to take steps appropriate for their container instances and their resources.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@tjkirch (Contributor) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@jahkeup (Member, Author) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@tjkirch I'm not sure I know exactly which change you're asking after. Can you point me to an example in a prior bump?

FWIW: I'm not done with this change, not at all :) (which is why it's still in draft)

@tjkirch (Contributor) commented Nov 10, 2020

I don't see anything about the change in data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?

@tjkirch I'm not sure I know exactly which change you're asking after. Can you point me to an example in a prior bump?

https://github.com/aws/amazon-ecs-agent/releases/tag/v1.44.0

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades. We've been holding off on this upgrade to make sure that we're not losing important state in either direction.

[edit] GAH, sorry, didn't mean to hit the "close and comment" button, just "comment"

@tjkirch tjkirch closed this Nov 10, 2020
@tjkirch tjkirch reopened this Nov 10, 2020
@samuelkarp (Contributor) commented:

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades.

I'm pretty sure it handles upgrades but not downgrades.

@jahkeup (Member, Author) commented Nov 11, 2020

In 1.44.0 they changed from a json file to boltdb, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades.

I'm pretty sure it handles upgrades but not downgrades.

Gotcha, I haven't gotten to the point that I can test this yet.

@tjkirch, @samuelkarp were there some ideas floated on how an enforced, unidirectional update, if required, might be accomplished?

With the rollback image locally available, we'd expect that users could still downgrade, so the question is probably even wider: should, or can, we even try to prevent a downgrade when incoherent state is nearly certain?

My sense is that we're not in a position to even try to roll back while also backing out of altered/migrated state. That would mean we'd have to choose the older state file. Returning to the older state file, I strongly suspect, would be potentially dangerous and incorrect (e.g., awsvpc ENIs, outdated local task and container configurations, and I'm sure others). In other cases, especially for an automatic rollback, the older state file could probably be put back in place (if it was migrated during startup) and startup could continue.

@jahkeup (Member, Author) commented Nov 13, 2020

We might want to hold off on bumping until aws/amazon-ecs-agent#2708 is released. The change persists a registration token that an instance uses to register with a stable container instance ARN. Without this change, it appears that container instances will be re-registered with new container instance ARNs each time (a new random UUID is used on each boot). This would likely impact instance-specific associated resources (like ENIs), though I can't find docs that outline exactly what would happen.

@samuelkarp (Contributor) commented Nov 13, 2020

Without this change, it appears that container instances will be reregistered with new container instance ARNs each time (using a new random uuid each boot).

This is so that the ECS service registers the same container instance ARN for a container instance that issues quick back-to-back RCI requests.

It looks like that's there for retries, not reboots, but I don't have an objection to waiting.

@jahkeup (Member, Author) commented Nov 13, 2020

Additionally, we'll need to take care to handle rollback by also removing the BoltDB datastore that was migrated to (if it was). If we don't, the agent will come back up with the previously migrated state, which is problematic for cases where a rolled-back instance was used and then later upgraded again.

@samuelkarp (Contributor) commented Nov 24, 2020

aws/amazon-ecs-agent#2738 is another issue we should be aware of. We should check whether this impacts 1.47.0 as well.

@jahkeup (Member, Author) commented Nov 24, 2020

@tjkirch and I discussed some steps the migrations can take to eliminate the risks of upgrade and downgrade movements for the ECS agent. The working plan is that I'll add a migration that errs on the side of caution (sketched after the reasoning below).

Migration: On Upgrade
  • no-op: do nothing
Migration: On Downgrade
  • a) when state.json is present AND state.boltdb is not, leave it
  • b) when state.boltdb is present, remove it
    • i) if state.json is also present, remove it
Reasoning
  • no-op upgrade: there won't be a BoltDB datastore coming into this version.

  • a: the agent hasn't migrated from the state file - we can expect that the next agent start will pick up where the state says it left off at.

  • b: the agent has a BoltDB state datastore (either from migration or from a fresh run) - either way, a downgrade from the version of Bottlerocket which introduces the agent (with BoltDB) is going to need a different state file.

    Removing the BoltDB allows for a downgrade to an older agent where its json state file is used. Later on, the agent will migrate per its standard operating procedures.

  • b + i: the agent accessed the JSON state file and may have successfully migrated to the BoltDB datastore. At this point it isn't practical for Bottlerocket to determine if the state remained in sync (which is possible, but unlikely).

    We could build some :handwave: shenanigans to facilitate this upgrade path (rather than deleting ALL state altogether), but the implementation would need to be able to authoritatively conclude that both the BoltDB and the JSON state datastores are equivalent and okay to reuse.

    I'm not interested in implementing a clever migration that reaches into these datastores, in either direction. To me, a clever solution that tries to accurately detect "safe" datastore transitions would be error-prone, fragile, and would rely on data structures internal to the agent.
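
To make the branching concrete, here's a rough sketch of that downgrade migration, using the shorthand file names above and an illustrative data directory; this is not the actual Bottlerocket migration code, just the decision logic:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the proposed downgrade handling: a lone state.json is left in
/// place (case a); if state.boltdb exists it is removed, along with
/// state.json when both are present (cases b and b+i).
fn ecs_state_downgrade(data_dir: &Path) -> io::Result<()> {
    let json = data_dir.join("state.json");
    let boltdb = data_dir.join("state.boltdb");

    if boltdb.exists() {
        // Case b: a BoltDB datastore exists (migrated or fresh); remove it so
        // the older agent doesn't trip over a format it can't read.
        fs::remove_file(&boltdb)?;
        if json.exists() {
            // Case b + i: we can't tell whether the JSON file is still in
            // sync with the migrated state, so remove it as well.
            fs::remove_file(&json)?;
        }
    }
    // Case a (state.json alone) and the empty case need no action.
    Ok(())
}

fn main() -> io::Result<()> {
    // Illustrative path only; a real migration would get this from its framework.
    ecs_state_downgrade(Path::new("/var/lib/ecs/data"))
}
```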

@jahkeup (Member, Author) commented Nov 24, 2020

aws/amazon-ecs-agent#2738 is another issue we should be aware of. We should check whether this impacts 1.47.0 as well.

A recent comment suggests this primarily impacts baking an AMI where the state wasn't removed prior to launching instances from the baked image.

So it looks like the issue is coming up for folks baking AMIs that are going from 1.47.0 to 1.48.0 (or otherwise causing state to exist). This issue is now coming up because of the change I mentioned before that fixes UUID handling. In the context of AMI baking, there's at least one customer pointing out that they simply had to include more files in their post-bake image-cleanup step.

We can't exactly stop this from being an issue for customers that bake AMIs - @samuelkarp, do you have a method or check in mind that I could perform for you?

@samuelkarp (Contributor) commented:

We should still support folks baking AMIs based on Bottlerocket. What we used to do in the ECS agent was to check the stored instance ID in the state file and discard the existing state if the instance ID retrieved from IMDS didn't match. If this bug only affects 1.48.0, we should probably hold off on upgrading to that version.
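
For reference, the check described above might look roughly like this; the state file path, instance ID value, and string-matching shortcut are assumptions for illustration, not the agent's actual implementation:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the baked-AMI safeguard: compare the EC2 instance ID recorded in
/// the saved state with the one reported by IMDS, and discard the state if
/// they differ (i.e., the state was baked on a different instance).
fn discard_stale_state(state_path: &Path, imds_instance_id: &str) -> io::Result<()> {
    if !state_path.exists() {
        return Ok(());
    }
    let contents = fs::read_to_string(state_path)?;
    // Naive substring check; a real implementation would parse the JSON and
    // compare the stored instance ID field directly.
    if !contents.contains(imds_instance_id) {
        fs::remove_file(state_path)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Placeholder values; the instance ID would come from IMDS at runtime.
    discard_stale_state(Path::new("/var/lib/ecs/data/state.json"), "i-0123456789abcdef0")
}
```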

@samuelkarp (Contributor) commented:

b) when state.boltdb is present, remove it

i) if state.json is also present, remove it

This seems like the wrong approach to me. The state file is the only way the agent knows its own identity; if it gets removed completely, the next time the agent starts up it'll act as if it's starting up for the first time and register itself anew in the cluster. This can lead to multiple container instance IDs for a single EC2 instance: one that is active and connected and one (or more) that is active and disconnected, effectively a zombie record. The state file also tracks resources that should be garbage-collected later, including containers themselves (and their associated ephemeral write layers) as well as container images.

@samuelkarp (Contributor) commented:

aws/amazon-ecs-agent#2738 was resolved in v1.48.1 via aws/amazon-ecs-agent#2740.

@samuelkarp (Contributor) commented:

Migration: On Upgrade
  • no-op: do nothing
Migration: On Downgrade
  • a) when state.json is present AND state.boltdb is not, leave it
  • b) when state.boltdb is present, fail the migration
Reasoning

b) This case is not generally solvable by our migration system and operator intervention is needed. The ECS agent tracks a variety of state in its state file, most of which is important to its correct operation. For example, the agent tracks its own identity (container instance ARN), the tasks that have run and exited (for communication back up to the control plane and subsequent garbage collection of the ephemeral container write layers), and the container images that have been pulled (similarly for garbage collection once containers using those images have been removed and sufficient time has passed). Removing the state file will leave the instance with uncollected garbage and will result in both the creation of a new container instance record in ECS and a "zombie" record that will remain until the EC2 instance is terminated.
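
For contrast with the earlier proposal, a minimal sketch of this amended behavior, with the same caveats (illustrative path, shorthand file names, simplified error handling):

```rust
use std::path::Path;

/// Sketch of the amended downgrade handling: a lone state.json (case a) is
/// left untouched, and the presence of state.boltdb (case b) fails the
/// migration so an operator can intervene.
fn ecs_state_downgrade(data_dir: &Path) -> Result<(), String> {
    let boltdb = data_dir.join("state.boltdb");
    if boltdb.exists() {
        // Case b: fail loudly rather than guessing at the agent's state.
        return Err(format!(
            "found ECS agent BoltDB datastore at {}; downgrade requires operator intervention",
            boltdb.display()
        ));
    }
    Ok(())
}

fn main() {
    // Illustrative path only; a real migration would get this from its framework.
    if let Err(msg) = ecs_state_downgrade(Path::new("/var/lib/ecs/data")) {
        eprintln!("migration failed: {msg}");
        std::process::exit(1);
    }
}
```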

@tjkirch (Contributor) commented Dec 4, 2020

Removing the state file will leave the instance with uncollected garbage and will result in both the creation of a new container instance record in ECS and a "zombie" record that will remain until the EC2 instance is terminated.

I agree with the analysis, but I'm not sure trashing the instance is better than having it serve with a new container instance ID. Either way there's a zombie / useless instance, but if we clean up state, at least it's handling tasks. (What would you want to happen with a failed migration? Even if we rolled back, the ecs-agent state would be wrong, as you described.)

@samuelkarp (Contributor) commented:

Either way there's a zombie / useless instance, but if we clean up state, at least it's handling tasks.

My preference for failures is generally to fail loudly and let an operator intervene instead of attempting to recover. There are enough edge cases in the state that it seems wrong to me to try and solve this at the OS level: ephemeral layer cleanup, image cleanup, potential attachment (ENI & EBS volume) cleanup, Docker volume plugin cleanup, zombie records in the ECS API, etc. If an instance fails to successfully come back up after a roll-back, an operator can know that something's wrong by seeing the failure. On the other hand, if the instance attempts to recover itself it makes the failure less obvious (and might end up running new tasks that can be broken by the existing non-cleaned state on the instance).

@tjkirch (Contributor) commented Dec 4, 2020

I can understand that, but it does put us in a very bad place. "when state.boltdb is present, fail the migration" means we have to (loudly) mark this upgrade as one-way, something we try very hard to avoid in the project in general.

Is there any way to get support from the agent itself in downgrading state back to JSON? Something like... agent is updated to support both formats for a while, instead of the one-way cutoff we'd have to suffer now. Or... agent package provides a separate conversion tool, which we store in persistent storage and call from the migration.

@samuelkarp (Contributor) commented:

Is there any way to get support from the agent itself in downgrading state back to JSON? Something like... agent is updated to support both formats for a while, instead of the one-way cutoff we'd have to suffer now. Or... agent package provides a separate conversion tool, which we store in persistent storage and call from the migration.

We can ask, but the ECS agent has never supported backwards migrations even in the older JSON file store. (It's not even the format that's necessarily changing, but if a new version of the ECS agent bumps the schema and adjusts the fields in the state file, an older version of the agent will already refuse to load it.)

@jahkeup changed the title from "ecs-agent: bump to v1.47.0" to "ecs-agent: bump to 1.48.1" on Dec 16, 2020
Signed-off-by: Jacob Vallejo <[email protected]>
@gregdek gregdek added the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@tjkirch tjkirch removed the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@etungsten etungsten mentioned this pull request Jan 7, 2021
@jahkeup jahkeup marked this pull request as ready for review January 8, 2021 21:41
@webern (Contributor) left a comment

Sounds like testing is solid.

@@ -51,6 +50,7 @@ Requires: %{_cross_os}iptables
%prep
%autosetup -Sgit -n %{gorepo}-%{gover} -p1
%cross_go_setup %{gorepo}-%{gover} %{goproject} %{goimport}
cp %{S:9} "agent/version/version.go"
Contributor:

This makes me uncomfortable but 🤷‍♂️ I get why you're doing it.

Can you add a comment here to be more explicit?

@jahkeup (Member, Author):

Yeah, I'm not happy about the problem/solution either. I think this is an improvement over revving a patch with release data - we can revisit this if upstream later permits ldflags as distributed.

I'll add a comment to explain what's going on here.

@jahkeup (Member, Author):

Added a brief comment in the latest push.

packages/ecs-agent/ecs-agent.spec
The patch would need to be revved each time the commit and version
were bumped. Instead, this allows for the linker to provide build data
without needing to update the patch beforehand.
This bump includes changes [^1] that were made to a number of
container images built for the agent. The nested build that was used
to drive the build of `pause` no longer functioned. The build process
now builds the executable directly to sidestep the new upstream
build, which relies on Docker.

[^1]: aws/amazon-ecs-agent@7be883b#diff-6ef5a00ce2b6a82260b403d7c14a7ad519480b5f928012a5e3fd355a29310bbe

Signed-off-by: Jacob Vallejo <[email protected]>
Tested-By: Shailesh Gothi <[email protected]>
@jahkeup jahkeup merged commit bb71298 into bottlerocket-os:develop Jan 12, 2021
@jahkeup jahkeup deleted the ecs-agent/1.47.0 branch January 12, 2021 00:52