CloudFormation signal program (issue #1581) #1728

Merged

mello7tre
Contributor

@mello7tre mello7tre commented Sep 1, 2021

Issue number:
#1581

Description of changes:
Created a new Rust program, cfsignal, to send a signal to a CloudFormation
stack.
The program is similar to cfn-signal
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html
but because cfn-signal needs Python, it cannot be used on Bottlerocket.

cfsignal reads its configuration from a cfsignal.toml file that is generated from
user data, so it depends on settings-applier.service.
It cannot send a signal for a failure that happens before
settings-applier.service and network-online.target have started.

It can send a failure signal for any other service, starting from
(and including):
activate-multi-user.service

It uses the systemctl action is-system-running with the --wait option.
This way we can tell whether any service is in a failed state after the
systemd boot process has finished.
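A minimal sketch of that check, using only the Rust standard library. The names `SystemStatus` and `system_running` echo identifiers that appear later in this review; `parse_status` is a hypothetical helper added here so the parsing can be exercised without systemd. This is an illustration under those assumptions, not the code in the PR.

```rust
use std::io;
use std::process::Command;

/// Result of asking systemd for the overall system state.
#[derive(Debug, PartialEq)]
pub struct SystemStatus {
    /// State word printed by systemctl, for example "running" or "degraded".
    pub status: String,
    /// Exit code of the systemctl process (-1 if no code was available).
    pub exit_code: i32,
}

/// Hypothetical helper: builds a SystemStatus from raw systemctl output.
/// Kept separate from the process call so it can be tested without systemd.
pub fn parse_status(stdout: &[u8], exit_code: Option<i32>) -> SystemStatus {
    SystemStatus {
        status: String::from_utf8_lossy(stdout).trim().to_string(),
        exit_code: exit_code.unwrap_or(-1),
    }
}

/// Runs `systemctl --wait is-system-running`, which blocks until the boot
/// process has finished and then prints the system state.
pub fn system_running() -> io::Result<SystemStatus> {
    let output = Command::new("systemctl")
        .arg("--wait")
        .arg("is-system-running")
        .output()?;
    Ok(parse_status(&output.stdout, output.status.code()))
}
```

Calling `system_running()` on a booted host would yield a status such as `running` with exit code 0; any other state comes back with a non-zero exit code.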

Detail
I have created a new setting named cloudformation.
It supports the keys:

signal: boolean
stack_name: string
logical_resource_id: string

If signal is false (the default), the program does nothing.
Otherwise it executes systemctl --wait is-system-running to find out the system state.
Then it sends the corresponding signal to the stack_name stack.
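The opt-in logic described above can be sketched as follows. The struct fields mirror the three keys listed in the description; `CloudFormationConfig`, `Signal`, and `choose_signal` are hypothetical names for illustration, and mapping every state other than `running` to FAILURE is an assumption based on the testing notes rather than a confirmed detail of the implementation.

```rust
/// Mirrors the three keys of the `cloudformation` setting.
/// `Default` gives `signal = false`, matching the documented default.
#[derive(Debug, Clone, Default)]
pub struct CloudFormationConfig {
    pub signal: bool,
    pub stack_name: String,
    pub logical_resource_id: String,
}

/// The two outcomes reported to the stack.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Signal {
    Success,
    Failure,
}

impl Signal {
    /// Status string as it appears in the logs of this PR.
    pub fn as_str(self) -> &'static str {
        match self {
            Signal::Success => "SUCCESS",
            Signal::Failure => "FAILURE",
        }
    }
}

/// Returns None when signaling is disabled; otherwise picks the signal
/// from the state word reported by `systemctl is-system-running`.
/// Assumption: only "running" counts as success.
pub fn choose_signal(config: &CloudFormationConfig, system_state: &str) -> Option<Signal> {
    if !config.signal {
        return None;
    }
    if system_state == "running" {
        Some(Signal::Success)
    } else {
        Some(Signal::Failure)
    }
}
```

Keeping the decision in a pure function makes it possible to log the system state even when no signal is sent.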

Testing done:
Created and updated an AutoScalingGroup configured to wait for signals from the instances it creates.
Tested that an instance sends a SUCCESS signal if all services start successfully.

bash-5.0# systemctl status cfsignal
● cfsignal.service - Send signal to CloudFormation Stack
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/cfsignal.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2022-02-24 03:20:40 UTC; 10min ago
    Process: 1578 ExecStart=/usr/bin/cfsignal (code=exited, status=0/SUCCESS)
   Main PID: 1578 (code=exited, status=0/SUCCESS)

Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] System status is: running [0]
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Connecting to IMDS
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Received meta-data/instance-id
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Received dynamic/instance-identity/document
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Region: "us-west-2" - InstanceID: "i-051f097c469cd6a7f" - Signal: "SUCCESS"
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal systemd[1]: cfsignal.service: Succeeded.

Tested that an instance sends a FAILURE signal if a service after activate-multi-user.service fails.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@webern
Contributor

webern commented Sep 3, 2021

Thank you for the PR @mello7tre! We are looking at it 👀 .

Contributor

@samuelkarp samuelkarp left a comment

@mello7tre thank you so much for putting this together! I took a look and had a few suggestions for changes. Please let me know if you have any questions about the suggestions. (And if you don't have the time to look at those suggestions, we can keep this open until you do or we can take a look at making those changes for you.)

packages/os/cfsignal.service
sources/cfsignal/Cargo.toml (outdated)
sources/cfsignal/README.tpl
sources/cfsignal/src/main.rs (outdated)
sources/models/shared-defaults/cf-signal.toml (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/service_check.rs (outdated)
sources/cfsignal/src/service_check.rs (outdated)
sources/cfsignal/src/service_check.rs (outdated)
sources/cfsignal/Cargo.toml (outdated)
@mello7tre
Contributor Author

Thanks @samuelkarp for the comments.
Give me some time and I will try to make the changes.
My only concern is about changing the systemd.service Type; I have explained it in more detail above.

sources/cfsignal/Cargo.toml (outdated)
sources/models/shared-defaults/cf-signal.toml (outdated)
sources/models/shared-defaults/cf-signal.toml (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/error.rs (outdated)
sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/system_check.rs
@mello7tre
Contributor Author

Just pushed the requested changes.
I need more time to work on:
#1728 (comment)
#1728 (comment)

I will come back to those later...

@jfrconley

Is there any movement on this issue? I am waiting to adopt Bottlerocket until I can be sure that my ASG will wait for instances to register with ECS.

@jpculp
Member

jpculp commented Dec 21, 2021

All, we're sorry this has been left open for quite some time. From what I can tell it seems to be on the right track. @mello7tre, would you be able to re-base, squash the commits, and request re-review?

@mello7tre
Contributor Author

Rebased as requested.

@etungsten etungsten self-assigned this Feb 8, 2022
@etungsten
Contributor

etungsten commented Feb 9, 2022

Hi @mello7tre, sorry about the delay. I'm gonna do my best and help you get this over the finish line. I took a look over the existing comments and resolved the ones that have been addressed.

The only comment that still needs addressing is #1728 (comment).

Another thing is that we want to make sure cfsignal only runs on first boot, as opposed to on every boot. We can achieve this by making cfsignal create a sentinel file at some path; the service unit can then run conditionally based on the file's presence. The early-boot-config service is an example that does this:

# We only want to run once, at first boot. This file is created by early-boot-config
# after a successful run.
ConditionPathExists=!/var/lib/bottlerocket/early-boot-config.ran

And the sentinel file is created like so:
fs::write(MARKER_FILE, "").unwrap_or_else(|e| {
    warn!(
        "Failed to create marker file {}, may unexpectedly run again: {}",
        MARKER_FILE, e
    )
});
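For illustration, the run-once pattern can be sketched end to end with the standard library. In the real setup the presence check is done by systemd through `ConditionPathExists`; here it is done in-process so the whole flow fits in one place. `already_ran` and `run_once` are hypothetical names, and the marker path is supplied by the caller.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// True if a previous successful run left its marker file behind.
pub fn already_ran(marker: &Path) -> bool {
    marker.exists()
}

/// Runs `task` only if the marker is absent, then creates the marker.
/// Returns Ok(true) if the task ran, Ok(false) if it was skipped.
/// A failure to write the marker is logged but not treated as fatal,
/// mirroring the warn-and-continue behavior of the snippet above.
pub fn run_once<F>(marker: &Path, task: F) -> io::Result<bool>
where
    F: FnOnce() -> io::Result<()>,
{
    if already_ran(marker) {
        return Ok(false);
    }
    task()?;
    if let Err(e) = fs::write(marker, "") {
        eprintln!(
            "Failed to create marker file {}, may unexpectedly run again: {}",
            marker.display(),
            e
        );
    }
    Ok(true)
}
```

Because the marker is written only after the task succeeds, a failed first attempt is retried on the next boot.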

Let me know what you think and how I can help.

And finally, since we're introducing new settings, we need to create a simple migration binary to migrate our settings datastore between OS versions. You can read more about migrations here and how to write them here. The closest example would be the add-network migration here:

/// We added a set of settings for configuring service network behavior and their associated
/// configuration file. Remove the whole `settings.network`, `configuration-files.proxy-env` prefix
/// if we downgrade.
fn run() -> Result<()> {
    migrate(AddPrefixesMigration(vec![
        "settings.network",
        "configuration-files.proxy-env",
    ]))
}
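To make the downgrade behavior concrete, here is a rough sketch of what removing prefixes from a key/value datastore amounts to. It does not use the real migration helpers: `has_prefix` and `remove_prefixes` are hypothetical, the datastore is modeled as a plain `BTreeMap`, and the rule that a prefix matches only on a `.` boundary is an assumption made for this example.

```rust
use std::collections::BTreeMap;

/// True if `key` equals `prefix` or lives underneath it ("prefix.<rest>").
/// Matching on the '.' boundary is an assumption of this sketch.
pub fn has_prefix(key: &str, prefix: &str) -> bool {
    key == prefix
        || key
            .strip_prefix(prefix)
            .map_or(false, |rest| rest.starts_with('.'))
}

/// Removes every key under any of the given prefixes, as a downgrade
/// would need to do for settings the older version does not know about.
/// Returns the number of keys removed.
pub fn remove_prefixes(datastore: &mut BTreeMap<String, String>, prefixes: &[&str]) -> usize {
    let before = datastore.len();
    datastore.retain(|key, _| !prefixes.iter().any(|&p| has_prefix(key, p)));
    before - datastore.len()
}
```

For this PR the prefix passed in would presumably be the new `cloudformation` settings tree; the exact key names are not shown in this conversation.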

Of course, we can take care of this for you if you'd prefer not to dabble with it. Just let us know. Thanks again for your contribution!

@mello7tre
Contributor Author

Hi @etungsten, I just pushed the requested changes.
Please have a look and check that everything is right.

Contributor

@etungsten etungsten left a comment

Looking good. We need to do a rebase to fix some minor conflicts and fix the migration crate path.

sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/cloudformation.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
Comment on lines 52 to 61
    let system_check = Box::new(SystemCheck {});
    let system_status = system_check.system_running()?;
    info!(
        "System status is: {} [{}]",
        system_status.status, system_status.exit_code
    );

    // run only if the opt-in flag is set
    if config.should_signal {

Contributor

@etungsten etungsten Feb 10, 2022

Since we only do anything with the system check status if should_signal is set, we can have the check inside the if block.

Suggested change
-    let system_check = Box::new(SystemCheck {});
-    let system_status = system_check.system_running()?;
-    info!(
-        "System status is: {} [{}]",
-        system_status.status, system_status.exit_code
-    );
-    // run only if the opt-in flag is set
-    if config.should_signal {
+    // run only if the opt-in flag is set
+    if config.should_signal {
+        let system_check = Box::new(SystemCheck {});
+        let system_status = system_check.system_running()?;
+        info!(
+            "System status is: {} [{}]",
+            system_status.status, system_status.exit_code
+        );

Contributor Author

@mello7tre mello7tre Feb 10, 2022

It could be useful to know what the system status is even without signaling it to CloudFormation, for example to try things out in a dry-run mode or to debug.

packages/os/cfsignal.service (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
@mello7tre
Contributor Author

mello7tre commented Feb 11, 2022

applied requested changes
rebased on aeec3aa
squashed commits

@mello7tre
Contributor Author

I made a bit of a mess while rebasing, so I fixed it and force-pushed the right "version".

Contributor

@etungsten etungsten left a comment

Nice! I'm gonna start asking others to take a look.

https://github.com/bottlerocket-os/bottlerocket/pull/1728/files#r803845979 still needs a quick fix.

Release.toml (outdated)
sources/cfsignal/README.md (outdated)
sources/cfsignal/src/system_check.rs (outdated)
sources/cfsignal/src/error.rs (outdated)
@mello7tre
Contributor Author

Hi @etungsten, I think that, to speed up development, it is better if you make all relevant changes directly from now on.
So feel free to change it as you prefer; I would just like this feature to be included in a future release.

Regards, Alberto.

@etungsten
Contributor

Push above addresses comments. Needed to label cfsignal with api_exec so it can write the sentinel file to a persistent storage location like /var/lib/bottlerocket.
Made sure cfsignal only runs on first boot. On subsequent boots, the cfsignal service unit never starts:

Feb 11 21:13:21 systemd[1]: Condition check resulted in Send signal to CloudFormation Stack being skipped.

@etungsten etungsten requested a review from cbgbt February 22, 2022 19:37
Contributor

@webern webern left a comment

🎸

README.md
sources/cfsignal/Cargo.toml
sources/cfsignal/src/system_check.rs
sources/cfsignal/src/main.rs
@etungsten etungsten dismissed their stale review February 23, 2022 01:04

i'm making the changes

@etungsten etungsten force-pushed the feat/cloudformation-signal branch 2 times, most recently from 4295732 to b4cd707 Compare February 23, 2022 01:16
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/system_check.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
sources/cfsignal/src/main.rs (outdated)
packages/os/cfsignal.service (outdated)
packages/os/cfsignal.service (outdated)
sources/cfsignal/src/error.rs
sources/cfsignal/src/main.rs
sources/models/shared-defaults/cf-signal.toml
@etungsten
Contributor

etungsten commented Feb 24, 2022

Push above and below addresses @arnaldo2792 's comments.

Tested changes and it works as expected:

bash-5.0# systemctl status cfsignal
● cfsignal.service - Send signal to CloudFormation Stack
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/cfsignal.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2022-02-24 03:20:40 UTC; 10min ago
    Process: 1578 ExecStart=/usr/bin/cfsignal (code=exited, status=0/SUCCESS)
   Main PID: 1578 (code=exited, status=0/SUCCESS)

Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] System status is: running [0]
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Connecting to IMDS
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Received meta-data/instance-id
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Received dynamic/instance-identity/document
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal cfsignal[1578]: 03:20:40 [INFO] Region: "us-west-2" - InstanceID: "i-051f097c469cd6a7f" - Signal: "SUCCESS"
Feb 24 03:20:40 ip-192-168-16-4.us-west-2.compute.internal systemd[1]: cfsignal.service: Succeeded.

Contributor

@arnaldo2792 arnaldo2792 left a comment

🦆

Contributor

@zmrow zmrow left a comment

🍨

Thanks!

Contributor

@bcressey bcressey left a comment

Nice! This looks ready to merge.

@mello7tre, I know this has been a long time coming! I really appreciate the contribution as well as your extraordinary patience. We're planning to ship it in the next release.

README.md (outdated)
@mello7tre
Contributor Author

> Nice! This looks ready to merge.
>
> @mello7tre, I know this has been a long time coming! I really appreciate the contribution as well as your extraordinary patience. We're planning to ship it in the next release.

Thanks, I am glad to have made a small contribution to the project.

@etungsten
Contributor

Push above rebases onto develop

@etungsten
Contributor

Push above addresses @bcressey 's comments.

Created a new Rust program, cfsignal, to send a signal to a CloudFormation
stack.
The program is similar to cfn-signal
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html
but because cfn-signal needs Python, it cannot be used on Bottlerocket.

cfsignal reads its configuration from a cfsignal.toml file that is generated from
user data, so it depends on settings-applier.service.
It cannot send a signal for a failure that happens before
settings-applier.service and network-online.target have started.

It can send a failure signal for any other service, starting from
(and including):
activate-multi-user.service

It uses the systemctl action is-system-running with the --wait option.
This way we can tell whether any service is in a failed state after the
systemd boot process has finished.

Requested changes:

* removed author
* signal parameter renamed to should_signal (it is more specific than
should_send)
* added README.md
* removed commented out lines
* use imdsclient in place of ec2_instance_metadata
* refactor service_check.rs and renamed to system_check.rs

use weak dependency (WantedBy) for cfsignal.service

use tokio LTS, only with needed features

restart command

some code refactor

* use directly signal_resource as function
* code simplification in system_check.rs
* use standard boilerplate for main function

semaphore file and migration

* Use semaphore file to only run on first boot
* Add migration file for downgrading
* client.signal_resource collapsed
* Fix to packages/os/os.spec: toml file is not copied (introduced during
  rebase)

Readme changes
@etungsten
Contributor

Push above moves the migration to the v1.6.1 to v1.6.2 migration chain since we're planning this for v1.6.2 release.

@etungsten
Contributor

Tested migration again and it still works as expected

@etungsten etungsten merged commit 92bdb47 into bottlerocket-os:develop Mar 5, 2022