feat(deploy): Disable nu-orca instances during deploy #635
Conversation
LGTM, with a few comments/questions.
String jobId = getJobExecutor().startJob(request);

// Wait for the proxy to spin up.
DaemonTaskHandler.safeSleep(TimeUnit.SECONDS.toMillis(5));
Maybe poll this a few times rather than waiting and trying once.
Tricky thing is that there is nothing clean to poll on (no clear success condition). If we try to open a connection to the local port, it's possible the container isn't accepting requests yet, or that something else is already bound to the local port (a false negative and a false positive, respectively).
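For illustration only, a minimal sketch of what a bounded port poll could look like, with the ambiguity above called out in comments. The class name, port, attempt count, and delay are hypothetical and not part of this PR:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

class ProxyPoller {
  // Polls the local port a bounded number of times. A successful connect only
  // proves *something* is listening (possibly another process: false positive),
  // and a refused connect may just mean the container isn't ready yet
  // (false negative) -- which is why the PR sleeps once instead of polling.
  static boolean waitForPort(int port, int attempts, long delayMillis)
      throws InterruptedException {
    for (int i = 0; i < attempts; i++) {
      try (Socket socket = new Socket()) {
        socket.connect(new InetSocketAddress("localhost", port), 1000);
        return true;
      } catch (IOException e) {
        TimeUnit.MILLISECONDS.sleep(delayMillis);
      }
    }
    return false;
  }
}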
if (!unknownVersions.isEmpty()) {
  String versions = String.join(", ", unknownVersions.stream()
      .map(orca::getVersionedName)
      .collect(Collectors.toList()));
  throw new HalException(new ProblemBuilder(Problem.Severity.ERROR,
      "The following orca versions (" + versions + ") could not safely be drained of work.")
      .setRemediation("Please make sure that no pipelines are running, and manually destroy the server groups at those versions.")
      .build());
}
What is the correct user action if Halyard can't reap the old Orcas? Is the remediation to manually drain the Orcas, destroy them, and then retry the deploy?
It should be exactly that help text: make sure that you aren't running pipelines, then delete those orcas by hand. Checking for no running pipelines is race-condition prone, since there is no guarantee that those nu-orca nodes won't pick up work again. The safest thing is to do this by hand.
This gracefully handles a mix of nu & old orcas by only disabling (and then flagging for either deletion or scaling down depending on the age) orca instances that implement the new /admin/instance/enabled endpoint. All other orcas are untouched & reported as "unkillable". See sample behavior here:
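For context, a hypothetical sketch of the capability probe the description implies: check whether an orca exposes /admin/instance/enabled before disabling it, and leave anything else untouched as "unkillable". The HTTP shape and helper names are assumptions for illustration, not the PR's actual code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class OrcaCapabilityProbe {
  // Returns true if this orca instance appears to implement the admin
  // enable/disable endpoint; a 404 suggests an old orca that should be
  // left untouched and reported as "unkillable".
  static boolean supportsDisable(String baseUrl) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(baseUrl + "/admin/instance/enabled").openConnection();
      conn.setRequestMethod("GET");
      int code = conn.getResponseCode();
      conn.disconnect();
      return code != 404;
    } catch (IOException e) {
      // An unreachable instance can't be safely drained either.
      return false;
    }
  }
}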