Add duplicate check on ephemeral security group and key pair creation for buildhost #659#1258
Add duplicate check on ephemeral security group and key pair creation for buildhost #659#1258algojohnlee merged 11 commits intoalgorand:masterfrom onetechnical:onetechnical/buildhost-sg-issue
Conversation
- Added pipefail option - On fallthrough error, delete sg, keypair, and instance - Quote interpolated variables - Remove useless uses of cat
| aws ec2 delete-security-group --group-id "$(cat sgid)" | ||
| aws ec2 delete-key-pair --key-name "$(cat key-name)" | ||
| aws ec2-terminate-instances --instance-ids "$(cat instance-id)" |
There was a problem hiding this comment.
In case of an error code in any of these, we might want to keep the respective files, so that the stop_ec2_instance.sh would have a chance of trying to destroy these resources "slowly".
There was a problem hiding this comment.
As Tsachi noted we could keep the files maybe to clean up later, but can do that in a future PR. Because I'm switching to calling the shutdown script it would involve parameterizing that, but I think the build host might clean up the temp dirs anyway.
tsachiherman
left a comment
There was a problem hiding this comment.
I haven't tested that, but it looks great. on small comment that could be improved in future PR.
…ical/go-algorand into onetechnical/buildhost-sg-issue
This will use the shutdown routines and waits, and also cleans up the files.
|
When I first started building the legacy pipeline, I borrowed these scripts and wrote all the temp files it produces to a subfolder which I then removed on all conditions (both success and failure). This also enabled consistent cleanup of the security groups and key pairs, since these are ephemeral instances. My thinking was it would be better to always cleanup like that instead of checking for pre-existing groups or keys, since they were always generated by the script and are only intended to be short-lived. |
The problem is it only cleans up the folder, not the ephemeral instance/security group/key pair if there are errors. Hopefully this PR will fix that for most cases. |
It depends how you set it up. In the case of the legacy pipeline, I catch an error signal and call the shutdown script. For example, I set this at the top of any shell script that could have error conditions: That then cleans up all the ephemeral stuff that was created and finally removes the subdirectory. It only took a minor bit of refactoring, but it worked (works) really well. |
|
I'm not requesting changes, by the way, just talking about another way to solve the problem. |
Summary
The buildhost may fail if the security group or key pair is pre-existing. This may happen if startup fails, as the old security groups and key pairs were not cleaned up with errors. This will also result in instances left running on the account.
This change will check for pre-existing security groups and key pairs, and re-generate the INSTANCE_ID if either exist. The fall through error condition was also modified to run the shutdown script, to delete the security group, key pair, and instance.
Test Plan
Verify normal launch:
Test duplicate INSTANCE_ID (tests security group). Modify in start_ec2_instance.sh in place to hardcode the first $RANDOM call to the number in /tmp/build1/key-name, then run:
This should see the duplicate and fetch a new number before launching. Restore start_ec2_instance.sh.
Verify shutdown still works in build1 and build2: