Add duplicate check on ephemeral security group and key pair creation for buildhost #659 by onetechnical · Pull Request #1258 · algorand/go-algorand

onetechnical · 2020-07-17T13:26:33Z

Summary

The buildhost may fail to start if the security group or key pair already exists. This can happen after a failed startup, because the old security groups and key pairs were not cleaned up on error. A failed startup can also leave instances running on the account.

This change checks for pre-existing security groups and key pairs, and regenerates the INSTANCE_ID if either exists. The fall-through error condition was also modified to run the shutdown script, which deletes the security group, key pair, and instance.
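As a rough illustration of the duplicate check described above, the sketch below retries with a new random identifier until neither a security group nor a key pair of that name exists. It is not the code from start_ec2_instance.sh: the helper names, the "buildhost" prefix, and the retry limit are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only. The helper names and the "buildhost" prefix are hypothetical;
# they are not taken from start_ec2_instance.sh.

# Succeeds if a security group named $1 exists in region $2.
security_group_exists() {
    local name="$1" region="$2" found
    found=$(aws ec2 describe-security-groups --region "$region" \
        --filters "Name=group-name,Values=$name" \
        --query 'SecurityGroups[0].GroupId' --output text 2>/dev/null) || return 1
    [ -n "$found" ] && [ "$found" != "None" ]
}

# Succeeds if a key pair named $1 exists in region $2.
# Note: any CLI failure (including bad credentials) is treated as "not found".
key_pair_exists() {
    local name="$1" region="$2"
    aws ec2 describe-key-pairs --region "$region" --key-names "$name" >/dev/null 2>&1
}

# Prints an identifier that is free for both resource types, retrying with
# a new $RANDOM value on collision. Fails after $3 attempts (default 10).
pick_unused_id() {
    local region="$1" prefix="$2" max_tries="${3:-10}" tries=0 id
    while [ "$tries" -lt "$max_tries" ]; do
        id="${prefix}-${RANDOM}"
        if ! security_group_exists "$id" "$region" &&
           ! key_pair_exists "$id" "$region"; then
            printf '%s\n' "$id"
            return 0
        fi
        tries=$((tries + 1))
    done
    echo "no unused identifier found after $max_tries attempts" >&2
    return 1
}
```

A caller might use it as `INSTANCE_ID=$(pick_unused_id us-west-2 buildhost)`. Bounding the number of attempts is a design choice of this sketch, so that a persistent API failure cannot loop forever.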

Test Plan

Verify normal launch:

export ALGODIR=~/go/src/github.com/algorand/go-algorand/
mkdir /tmp/build1
cd /tmp/build1
$ALGODIR/scripts/buildhost/start_ec2_instance.sh us-west-2 ami-0c579621aaac8bade a1.2xlarge

Test duplicate INSTANCE_ID (tests security group). Modify start_ec2_instance.sh in place to hardcode the first $RANDOM call to the number in /tmp/build1/key-name, then run:

mkdir /tmp/build2
cd /tmp/build2
$ALGODIR/scripts/buildhost/start_ec2_instance.sh us-west-2 ami-0c579621aaac8bade a1.2xlarge

This should see the duplicate and fetch a new number before launching. Restore start_ec2_instance.sh.

Verify shutdown still works in build1 and build2:

cd /tmp/build1
$ALGODIR/scripts/buildhost/shutdown_ec2_instance.sh us-west-2
cd /tmp/build2
$ALGODIR/scripts/buildhost/shutdown_ec2_instance.sh us-west-2

- Added pipefail option
- On fallthrough error, delete sg, keypair, and instance
- Quote interpolated variables
- Remove useless uses of cat
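To illustrate why the pipefail option matters, here is a small sketch (not taken from the PR; `pipeline_status` is a hypothetical helper) showing how the exit status of a pipeline changes when its first stage fails.

```shell
#!/usr/bin/env bash
# Illustrative only; pipeline_status is a hypothetical helper.

# Prints the exit status of `false | cat` with pipefail "on" or "off".
# The function body is a subshell, so the option does not leak to the caller.
pipeline_status() (
    if [ "$1" = "on" ]; then
        set -o pipefail
    fi
    false | cat
    echo "$?"
)

pipeline_status off   # prints 0: only the status of cat is reported
pipeline_status on    # prints 1: the failure of false is no longer masked
```

Without pipefail, an error check placed after a pipeline only sees the last command, so a failing first stage can go unnoticed.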

tsachiherman · 2020-07-17T13:42:55Z

+aws ec2 delete-security-group --group-id "$(cat sgid)"
+aws ec2 delete-key-pair --key-name "$(cat key-name)"
+aws ec2-terminate-instances --instance-ids "$(cat instance-id)"


If any of these returns an error code, we might want to keep the respective files, so that shutdown_ec2_instance.sh would have a chance of trying to destroy these resources "slowly".

As Tsachi noted, we could keep the files to clean up later, but that can be done in a future PR. Because I'm switching to calling the shutdown script, it would involve parameterizing that script, but I think the build host might clean up the temp dirs anyway.
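The idea raised in this thread, keeping a state file whenever its delete call fails so that a later cleanup can retry, could be sketched as follows. This is not code from the PR: `delete_and_forget` and `cleanup_all` are hypothetical helpers, the region is assumed to come from the CLI's default configuration, and the waiting that a real script would need between terminating the instance and deleting its security group is omitted.

```shell
#!/usr/bin/env bash
# Sketch only: delete_and_forget and cleanup_all are hypothetical helpers.
# The state file names (instance-id, key-name, sgid) match the diff above.

# Usage: delete_and_forget <state-file> <command> [args...]
# Runs the command with the file's contents appended as the last argument,
# and removes the file only if the command succeeded.
delete_and_forget() {
    local state_file="$1"
    shift
    [ -f "$state_file" ] || return 0
    if "$@" "$(cat "$state_file")"; then
        rm -f "$state_file"
    else
        echo "keeping $state_file for a later cleanup attempt" >&2
        return 1
    fi
}

# Attempts every deletion, instance first, and reports failure if any
# resource could not be removed.
cleanup_all() {
    local rc=0
    delete_and_forget instance-id aws ec2 terminate-instances --instance-ids || rc=1
    delete_and_forget key-name aws ec2 delete-key-pair --key-name || rc=1
    delete_and_forget sgid aws ec2 delete-security-group --group-id || rc=1
    return "$rc"
}
```

Because each file survives only when its resource could not be deleted, the remaining files double as a list of what still needs cleaning up.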

tsachiherman

I haven't tested that, but it looks great. One small comment that could be addressed in a future PR.

btoll · 2020-07-17T18:58:30Z

When I first started building the legacy pipeline, I borrowed these scripts and wrote all the temp files they produce to a subfolder, which I then removed under all conditions (both success and failure). This also enabled consistent cleanup of the security groups and key pairs, since these are ephemeral instances.

My thinking was that it would be better to always clean up like that instead of checking for pre-existing groups or keys, since they are always generated by the script and are only intended to be short-lived.

onetechnical · 2020-07-17T20:22:47Z

The problem is it only cleans up the folder, not the ephemeral instance/security group/key pair if there are errors. Hopefully this PR will fix that for most cases.

btoll · 2020-07-17T20:41:27Z

It depends how you set it up. In the case of the legacy pipeline, I catch an error signal and call the shutdown script.

For example, I set this at the top of any shell script that could have error conditions:

trap 'bash ./scripts/release/common/ec2/shutdown.sh' ERR

That then cleans up all the ephemeral stuff that was created and finally removes the subdirectory. It only took a minor bit of refactoring, but it worked (works) really well.
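A minimal sketch of this pattern, under stated assumptions: `run_step` is a hypothetical wrapper, and removing a scratch directory stands in for calling a real shutdown script that would also delete the instance, security group, and key pair.

```shell
#!/usr/bin/env bash
# Sketch only: run_step is a hypothetical wrapper, and removing the scratch
# directory stands in for calling a real shutdown script.

# Usage: run_step <scratch-dir> <command> [args...]
# The function body is a subshell, so the trap and shell options stay local.
run_step() (
    set -Eeuo pipefail
    scratch_dir="$1"
    shift
    # On any failing command, tear down what was created; errexit then
    # stops the subshell with a non-zero status.
    trap 'echo "error: cleaning up $scratch_dir" >&2; rm -rf "$scratch_dir"' ERR
    "$@"
)
```

For example, `run_step "$(mktemp -d)" make build` would remove the directory only if the build fails. Note that bash does not run the ERR trap for commands tested by `if`, `&&`, or `||`, so `run_step` should be called as a plain command.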

btoll · 2020-07-17T21:03:19Z

I'm not requesting changes, by the way, just talking about another way to solve the problem.

onetechnical added 6 commits July 17, 2020 07:58

Add pipefail and fix error check for creating security group

e7346a0

- Added pipefail option
- On fallthrough error, delete sg, keypair, and instance
- Quote interpolated variables
- Remove useless uses of cat

Check for security group, and if it exists, fetch new number

c58a4b1

Remove unused variable, quote interpolated vars and remove useless cat

4e28b9a

Add quotes around parameter

08e1d62

Fix comment

8c23248

Add checking for duplicate key pair name as well

6748257

onetechnical requested review from bricerisingalgorand, btoll, egieseke and tsachiherman July 17, 2020 13:26

onetechnical self-assigned this Jul 17, 2020

Typo fix

d9d72c8

tsachiherman reviewed Jul 17, 2020

tsachiherman previously approved these changes Jul 17, 2020

onetechnical added 2 commits July 17, 2020 09:45

Wrong order for fall through error; delete instance first

422b4d8

Merge branch 'onetechnical/buildhost-sg-issue' of github.com:onetechnical/go-algorand into onetechnical/buildhost-sg-issue

8bd0768

onetechnical dismissed tsachiherman’s stale review via 8bd0768 July 17, 2020 13:48

tsachiherman previously approved these changes Jul 17, 2020

Modify fallthrough error to call shutdown script.

9f7ae69

This will use the shutdown routines and waits, and also cleans up the files.

onetechnical dismissed tsachiherman’s stale review via 9f7ae69 July 17, 2020 17:53

btoll reviewed Jul 17, 2020

Remove accidental tabs.

78bcb25

btoll approved these changes Jul 17, 2020

algojohnlee merged commit b364bdf into algorand:master Jul 20, 2020

onetechnical deleted the onetechnical/buildhost-sg-issue branch October 23, 2020 14:06

algojohnlee merged 11 commits into algorand:master from onetechnical:onetechnical/buildhost-sg-issue
