ecs-agent: bump to 1.48.1 #1201
Conversation
(force-pushed from bc7034d to 58147e8)
Review thread on packages/ecs-agent/0003-bottlerocket-remove-unsupported-capabilities.patch
I don't see anything about the change in the data store. How does this affect existing users of the ECS variant? What happens when they downgrade to a version without this change?
@tjkirch I'm not sure I know exactly which change you're asking after. Can you point me to an example in a prior bump? FWIW: I'm not done with this change, not at all :) (which is why it's still in draft)
https://github.com/aws/amazon-ecs-agent/releases/tag/v1.44.0 In 1.44.0 they changed from a JSON file to BoltDB, and we weren't sure whether ecs-agent itself handles upgrades and, in particular, downgrades. We've been holding off on this upgrade to make sure that we're not losing important state in either direction. [edit] GAH, sorry, didn't mean to hit the "close and comment" button, just "comment"
I'm pretty sure it handles upgrades but not downgrades.
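As a rough illustration of that upgrade/downgrade asymmetry, here is a minimal sketch of detecting which state store an installation is carrying. The data directory and file names are assumptions, not the agent's documented layout.

```go
// Detect which ecs-agent state store is on disk; paths are illustrative.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const dataDir = "/var/lib/ecs/data" // assumed agent data directory

func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	jsonState := filepath.Join(dataDir, "ecs_agent_data.json") // assumed pre-1.44.0 store
	boltState := filepath.Join(dataDir, "agent.db")            // assumed 1.44.0+ BoltDB store

	switch {
	case exists(boltState) && !exists(jsonState):
		fmt.Println("only BoltDB state present: an older agent cannot read it")
	case exists(boltState) && exists(jsonState):
		fmt.Println("both present: the JSON file is likely a stale pre-migration copy")
	case exists(jsonState):
		fmt.Println("only JSON state present: an upgrade will migrate it forward")
	default:
		fmt.Println("no saved state: either direction is safe")
	}
}
```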
Gotcha, I haven't gotten to the point that I can test this yet. @tjkirch, @samuelkarp, were there some ideas floated on how an enforced, unidirectional update, if required, might be accomplished? With the rollback image locally available, we'd expect that users could still downgrade... so the question is probably even wider: should, or can, we even try to prevent a downgrade when incoherent state is nearly certain? My sense is that we're not in a position to even try to roll back while also restoring altered/migrated state; that would mean we'd have to choose the older state file. Returning to the older state file, I strongly suspect, would be potentially dangerous and incorrect (e.g., awsvpc ENIs, outdated local task & container configurations, and I'm sure others). In other cases, especially for an automatic rollback, the older state file could probably be put back in place (if it was migrated during startup) and startup could continue.
We might want to hold off on bumping until aws/amazon-ecs-agent#2708 is released. The change persists a registration token which an instance uses to obtain a registration with a stable container instance ARN. Without this change, it appears that container instances will be re-registered with new container instance ARNs each time (using a new random UUID each boot). This would likely impact instance-specific associated resources (like ENIs), though I can't find docs that outline exactly what would happen.
It looks like that's there for retries, not reboots, but I don't have an objection to waiting.
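For illustration, a minimal sketch of the stable-identity idea behind aws/amazon-ecs-agent#2708: persist a token and reuse it on (re)registration so retries don't mint a fresh container instance. The file path and helper name are hypothetical; the agent's actual implementation differs.

```go
// Persist a registration token across boots so registration is idempotent.
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/google/uuid"
)

const tokenPath = "/var/lib/ecs/data/registration-token" // assumed location

// loadOrCreateRegistrationToken returns the persisted token, generating and
// saving a new one only if none exists yet.
func loadOrCreateRegistrationToken() (string, error) {
	if b, err := os.ReadFile(tokenPath); err == nil {
		return strings.TrimSpace(string(b)), nil
	}
	token := uuid.NewString()
	if err := os.WriteFile(tokenPath, []byte(token), 0o600); err != nil {
		return "", err
	}
	return token, nil
}

func main() {
	token, err := loadOrCreateRegistrationToken()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("registration token:", token) // same value on every boot
}
```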
Additionally, we'll need to take care to handle rollback by also removing the BoltDB that was migrated to (if it was). If we don't, the agent will come up with the prior version's migrated state, which is problematic for cases where a rolled-back instance was used and then later upgraded again.
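A minimal sketch of that rollback cleanup: remove the migrated BoltDB so a downgraded agent doesn't load state it can't own. The path is a hypothetical stand-in for the agent's real data layout.

```go
// Remove the migrated BoltDB store on rollback; path is an assumption.
package main

import (
	"errors"
	"fmt"
	"os"
)

func main() {
	const boltState = "/var/lib/ecs/data/agent.db" // assumed migrated-store path
	err := os.Remove(boltState)
	switch {
	case err == nil:
		fmt.Println("removed migrated BoltDB state")
	case errors.Is(err, os.ErrNotExist):
		fmt.Println("no migrated state found; nothing to do")
	default:
		fmt.Println("failed to remove migrated state:", err)
	}
}
```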
aws/amazon-ecs-agent#2738 is another issue we should be aware of. We should check whether this impacts 1.47.0 as well.
@tjkirch and I discussed some steps that the migrations can take to eliminate the risks of upgrade and downgrade movements for the ECS agent. The working plan is that I'll add a migration which will err on the side of caution (a sketch of the cautious downgrade check follows below).

Migration: On Upgrade

Migration: On Downgrade

Reasoning
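As referenced above, a minimal sketch of the err-on-the-side-of-caution downgrade check this thread converges on later ("when state.boltdb is present, fail the migration"). The state path is an assumption, and Bottlerocket's actual migration framework is not shown here.

```go
// Refuse to downgrade when migrated BoltDB state exists; path is illustrative.
package main

import (
	"fmt"
	"os"
)

func main() {
	const boltState = "/var/lib/ecs/data/agent.db" // assumed migrated-store path
	if _, err := os.Stat(boltState); err == nil {
		// Fail loudly so an operator can intervene rather than silently
		// recovering into incoherent state.
		fmt.Fprintln(os.Stderr, "ecs-agent state already migrated to BoltDB; refusing to downgrade")
		os.Exit(1)
	}
	fmt.Println("no migrated state present; downgrade can proceed")
}
```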
A recent comment suggests this primarily impacts baking an AMI where the state wasn't removed prior to launching instances from a baked image. So it looks like the issue is coming up for folks baking AMIs that carry agent state from one instance to the next.

We can't exactly stop this from being an issue for customers that bake AMIs. @samuelkarp, do you have a method or check in mind that I could perform for you?
We should still support folks baking AMIs based on Bottlerocket. What we used to do in the ECS agent was to check the stored instance ID in the state file and discard the existing state if the instance ID retrieved from IMDS didn't match. If this bug only affects 1.48.0, we should probably hold off on upgrading to that version.
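A sketch of that identity check: compare the instance ID stored with the agent state against the one IMDS reports, and treat mismatched state (e.g., baked into an AMI on a different instance) as discardable. It uses a plain IMDSv1-style request for brevity (IMDSv2 needs a session token first), and the stored ID is a placeholder rather than a real read from the agent's state store.

```go
// Compare a stored instance ID with the one IMDS reports for this instance.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func imdsInstanceID() (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://169.254.169.254/latest/meta-data/instance-id")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	stored := "i-0123456789abcdef0" // placeholder: would be read from the saved state
	current, err := imdsInstanceID()
	if err != nil {
		fmt.Println("could not query IMDS:", err)
		return
	}
	if stored != current {
		fmt.Println("state belongs to another instance; discard before the agent starts")
	} else {
		fmt.Println("state matches this instance; keep it")
	}
}
```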
This seems like the wrong approach to me. The state file is the only way the agent knows its own identity; if it gets removed completely, the next time the agent starts up it'll act as if it's starting up for the first time and register itself anew in the cluster. This can lead to multiple container instance IDs for a single EC2 instance: one that is active and connected and one (or more) that is active and disconnected, effectively a zombie record. The state file also tracks resources that should be garbage-collected later, including containers themselves (and their associated ephemeral write layers) as well as container images.
aws/amazon-ecs-agent#2738 was resolved in v1.48.1 via aws/amazon-ecs-agent#2740.
Migration: On Upgrade

Migration: On Downgrade

Reasoning

b) This case is not generally solvable by our migration system, and operator intervention is needed. The ECS agent tracks a variety of state in its state file, most of which is important to its correct operation. For example, the agent tracks its own identity (container instance ARN), the tasks that have run and exited (for communication back up to the control plane and subsequent garbage collection of the ephemeral container write layers), and the container images that have been pulled (similarly, for garbage collection once containers using those images have been removed and sufficient time has passed). Removing the state file will leave the instance with uncollected garbage and will result in both the creation of a new container instance record in ECS and a "zombie" record that will remain until the EC2 instance is terminated.
I agree with the analysis, but I'm not sure trashing the instance is better than having it serve with a new container instance ID. Either way there's a zombie / useless instance, but if we clean up state, at least it's handling tasks. (What would you want to happen with a failed migration? Even if we rolled back, the ecs-agent state would be wrong, as you described.)
My preference for failures is generally to fail loudly and let an operator intervene instead of attempting to recover. There are enough edge cases in the state that it seems wrong to me to try and solve this at the OS level: ephemeral layer cleanup, image cleanup, potential attachment (ENI & EBS volume) cleanup, Docker volume plugin cleanup, zombie records in the ECS API, etc. If an instance fails to successfully come back up after a rollback, an operator can tell that something's wrong by seeing the failure. On the other hand, if the instance attempts to recover itself, it makes the failure less obvious (and might end up running new tasks that can be broken by the existing non-cleaned-up state on the instance).
I can understand that, but it does put us in a very bad place. "When state.boltdb is present, fail the migration" means we have to (loudly) mark this upgrade as one-way, something we try very hard to avoid in the project in general. Is there any way to get support from the agent itself in downgrading state back to JSON? Something like: the agent is updated to support both formats for a while, instead of the one-way cutoff we'd have to suffer now. Or: the agent package provides a separate conversion tool, which we store in persistent storage and call from the migration.
We can ask, but the ECS agent has never supported backwards migrations, even in the older JSON file store. (It's not even the format that's necessarily changing, but if a new version of the ECS agent bumps the schema and adjusts the fields in the state file, an older version of the agent will already refuse to load it.)
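For a sense of what the "separate conversion tool" floated above could look like, here is a hedged sketch that dumps every bucket of the BoltDB store as JSON. The store path, the flat bucket layout, and the assumption that values are stored as JSON blobs are all illustrative; the agent's real schema is more involved.

```go
// Dump a BoltDB store's buckets as JSON; layout assumptions are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("/var/lib/ecs/data/agent.db", 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer db.Close()

	dump := map[string]map[string]json.RawMessage{}
	err = db.View(func(tx *bolt.Tx) error {
		// Walk every top-level bucket and copy its key/value pairs out.
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			bucket := map[string]json.RawMessage{}
			if err := b.ForEach(func(k, v []byte) error {
				if v == nil {
					return nil // nested bucket; skipped for brevity
				}
				bucket[string(k)] = json.RawMessage(v) // assumes values are JSON
				return nil
			}); err != nil {
				return err
			}
			dump[string(name)] = bucket
			return nil
		})
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := json.NewEncoder(os.Stdout).Encode(dump); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```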
(force-pushed from 58147e8 to 18253d8)
(force-pushed from 18253d8 to f48003f)
Sounds like testing is solid.
@@ -51,6 +50,7 @@ Requires: %{_cross_os}iptables
%prep
%autosetup -Sgit -n %{gorepo}-%{gover} -p1
%cross_go_setup %{gorepo}-%{gover} %{goproject} %{goimport}
cp %{S:9} "agent/version/version.go"
This makes me uncomfortable but 🤷‍♂️ I get why you're doing it.
Can you add a comment here to be more explicit?
Yeah, I'm not happy about the problem/solution either. I think this is an improvement over revving a patch with release data; we can revisit this if upstream later permits ldflags as distributed.
I'll add a comment to explain what's going on here.
Added a brief comment in the latest push.
The patch would need to be revved each time the commit and version were bumped. Instead, this allows for the linker to provide build data without needing to update the patch beforehand.
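For context, a hedged sketch of what the copied-in `agent/version/version.go` might contain: build metadata baked into the source instead of injected via `-ldflags`. The variable names and values here are illustrative, not necessarily upstream's.

```go
// Illustrative drop-in version file with baked-in build metadata.
package version

// Version is the agent version being packaged.
const Version = "1.48.1"

// GitShortHash identifies the commit the package was built from (placeholder).
const GitShortHash = "0000000"
```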
This bump includes changes [^1] that were made to a number of container images built for the agent. The nested build that was used to drive the build of `pause` no longer functioned. The build process now directly builds the executable to sidestep the new build, which relies directly on Docker.

[^1]: aws/amazon-ecs-agent@7be883b#diff-6ef5a00ce2b6a82260b403d7c14a7ad519480b5f928012a5e3fd355a29310bbe

Signed-off-by: Jacob Vallejo <[email protected]>
Tested-By: Shailesh Gothi <[email protected]>
(force-pushed from f48003f to 4b86f67)
Issue number: n/a

Description of changes: Update `ecs-agent` to `v1.48.1`.

Testing done: @srgothi92 was able to run ECS tests against a build with these changes. The upgrade/downgrade test he ran failed on the downgrade portion; we expected this. The next release, which includes the change, will be a one-way-only upgrade, as the underlying datastore is migrated and does not support a seamless rollback. Operators that want to roll back will need to take steps appropriate for their container instances and their resources.

Terms of contribution: By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.