Skip to content

Add duplicate check on ephemeral security group and key pair creation for buildhost #659#1258

Merged
algojohnlee merged 11 commits intoalgorand:masterfrom
onetechnical:onetechnical/buildhost-sg-issue
Jul 20, 2020
Merged

Add duplicate check on ephemeral security group and key pair creation for buildhost #659#1258
algojohnlee merged 11 commits intoalgorand:masterfrom
onetechnical:onetechnical/buildhost-sg-issue

Conversation

@onetechnical
Copy link
Copy Markdown
Contributor

@onetechnical onetechnical commented Jul 17, 2020

Summary

The buildhost may fail if the security group or key pair is pre-existing. This may happen if startup fails, as the old security groups and key pairs were not cleaned up with errors. This will also result in instances left running on the account.

This change will check for pre-existing security groups and key pairs, and re-generate the INSTANCE_ID if either exist. The fall through error condition was also modified to run the shutdown script, to delete the security group, key pair, and instance.

Test Plan

Verify normal launch:

export ALGODIR=~/go/src/github.com/algorand/go-algorand/
mkdir /tmp/build1
cd /tmp/build1
$ALGODIR/scripts/buildhost/start_ec2_instance.sh us-west-2 ami-0c579621aaac8bade a1.2xlarge

Test duplicate INSTANCE_ID (tests security group). Modify in start_ec2_instance.sh in place to hardcode the first $RANDOM call to the number in /tmp/build1/key-name, then run:

mkdir /tmp/build2
cd /tmp/build2
$ALGODIR/scripts/buildhost/start_ec2_instance.sh us-west-2 ami-0c579621aaac8bade a1.2xlarge

This should see the duplicate and fetch a new number before launching. Restore start_ec2_instance.sh.

Verify shutdown still works in build1 and build2:

cd /tmp/build1
$ALGODIR/scripts/buildhost/shutdown_ec2_instance.sh us-west-2
cd /tmp/build2
$ALGODIR/scripts/buildhost/shutdown_ec2_instance.sh us-west-2

Comment thread scripts/buildhost/start_ec2_instance.sh Outdated
Comment thread scripts/buildhost/start_ec2_instance.sh Outdated
Comment on lines +127 to +129
aws ec2 delete-security-group --group-id "$(cat sgid)"
aws ec2 delete-key-pair --key-name "$(cat key-name)"
aws ec2-terminate-instances --instance-ids "$(cat instance-id)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of an error code in any of these, we might want to keep the respective files, so that the stop_ec2_instance.sh would have a chance of trying to destroy these resources "slowly".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Tsachi noted we could keep the files maybe to clean up later, but can do that in a future PR. Because I'm switching to calling the shutdown script it would involve parameterizing that, but I think the build host might clean up the temp dirs anyway.

tsachiherman
tsachiherman previously approved these changes Jul 17, 2020
Copy link
Copy Markdown
Contributor

@tsachiherman tsachiherman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested that, but it looks great. on small comment that could be improved in future PR.

tsachiherman
tsachiherman previously approved these changes Jul 17, 2020
This will use the shutdown routines and waits, and also cleans up the
files.
Comment thread scripts/buildhost/start_ec2_instance.sh
@btoll
Copy link
Copy Markdown

btoll commented Jul 17, 2020

When I first started building the legacy pipeline, I borrowed these scripts and wrote all the temp files it produces to a subfolder which I then removed on all conditions (both success and failure). This also enabled consistent cleanup of the security groups and key pairs, since these are ephemeral instances.

My thinking was it would be better to always cleanup like that instead of checking for pre-existing groups or keys, since they were always generated by the script and are only intended to be short-lived.

@onetechnical
Copy link
Copy Markdown
Contributor Author

When I first started building the legacy pipeline, I borrowed these scripts and wrote all the temp files it produces to a subfolder which I then removed on all conditions (both success and failure). This also enabled consistent cleanup of the security groups and key pairs, since these are ephemeral instances.

My thinking was it would be better to always cleanup like that instead of checking for pre-existing groups or keys, since they were always generated by the script and are only intended to be short-lived.

The problem is it only cleans up the folder, not the ephemeral instance/security group/key pair if there are errors. Hopefully this PR will fix that for most cases.

@btoll
Copy link
Copy Markdown

btoll commented Jul 17, 2020

When I first started building the legacy pipeline, I borrowed these scripts and wrote all the temp files it produces to a subfolder which I then removed on all conditions (both success and failure). This also enabled consistent cleanup of the security groups and key pairs, since these are ephemeral instances.
My thinking was it would be better to always cleanup like that instead of checking for pre-existing groups or keys, since they were always generated by the script and are only intended to be short-lived.

The problem is it only cleans up the folder, not the ephemeral instance/security group/key pair if there are errors. Hopefully this PR will fix that for most cases.

It depends how you set it up. In the case of the legacy pipeline, I catch an error signal and call the shutdown script.

For example, I set this at the top of any shell script that could have error conditions:

trap 'bash ./scripts/release/common/ec2/shutdown.sh' ERR

That then cleans up all the ephemeral stuff that was created and finally removes the subdirectory. It only took a minor bit of refactoring, but it worked (works) really well.

@btoll
Copy link
Copy Markdown

btoll commented Jul 17, 2020

I'm not requesting changes, by the way, just talking about another way to solve the problem.

@algojohnlee algojohnlee merged commit b364bdf into algorand:master Jul 20, 2020
@onetechnical onetechnical deleted the onetechnical/buildhost-sg-issue branch October 23, 2020 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants