Fail fast on cloud node removal #372
The agent is never named `ghost`, rather `slave0`
@@ -114,14 +119,17 @@ public class ExecutorStepDynamicContextTest {
sessions.then(j -> {
// Start up a build and then reboot and take the node offline
assertEquals(0, j.jenkins.getLabel("ghost").getNodes().size()); // Make sure test impl is correctly deleted
assertNull(j.jenkins.getNode("ghost")); // Make sure test impl is correctly deleted
Useless assertion, since the node would just have the label `ghost` but would usually be named `slave0`, based on generation.
@@ -207,4 +215,36 @@ public class ExecutorStepDynamicContextTest {
});
}

@Test public void onceRetentionStrategyNodeDisappearance() throws Throwable {
Essentially a copy of `normalNodeDisappearance`, checking some behavioural differences.
Looks like some race condition, the
…this is causing some havoc.
Re-launching CI
I am confused by the doubled-up cause of interruption.
s.setRetentionStrategy(new OnceRetentionStrategy(0));
var run = p.scheduleBuild2(0).waitForStart();
j.waitForMessage("+ sleep infinity", run);
j.jenkins.removeNode(s);
IIUC this is what `Reaper` in kubernetes would do as soon as an agent pod is deleted.
Actually I do not understand why there is a new cause of interruption at all. The only change that should need to be made is for `RemovedNodeListener` to interrupt the build immediately rather than after a delay, right?
* Use only one cause
* When build is cancelled immediately, use `RemovedNodeCause`
* When build is cancelled after observing timeout, use `RemovedNodeTimeoutCause`
* Introduce a marker interface to simplify matching in `AgentErrorCondition`
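For readers following the commit message above, here is a minimal sketch of how a marker interface could simplify matching. It does not use the real Jenkins types: `CauseOfInterruption` is a simplified stand-in, the marker interface name `NodeRemoved` is hypothetical (the conversation does not name it), and `isAgentError` only approximates what a condition such as `AgentErrorCondition` might check.

```java
import java.util.List;

// Stand-in types only; the real classes live in Jenkins core and this plugin.
class CauseSketch {
    /** Simplified stand-in for Jenkins' CauseOfInterruption. */
    abstract static class CauseOfInterruption {
        abstract String getShortDescription();
    }

    /** Hypothetical marker interface meaning "the agent went away". */
    interface NodeRemoved {}

    /** Used when the build is cancelled immediately. */
    static final class RemovedNodeCause extends CauseOfInterruption implements NodeRemoved {
        @Override String getShortDescription() { return "Agent was removed"; }
    }

    /** Used when the build is cancelled only after the timeout was observed. */
    static final class RemovedNodeTimeoutCause extends CauseOfInterruption implements NodeRemoved {
        @Override String getShortDescription() { return "Agent was removed and did not come back in time"; }
    }

    /** An unrelated cause, for contrast. */
    static final class UserAbort extends CauseOfInterruption {
        @Override String getShortDescription() { return "Aborted by user"; }
    }

    /** One instanceof check against the marker instead of one check per cause class. */
    static boolean isAgentError(List<? extends CauseOfInterruption> causes) {
        return causes.stream().anyMatch(c -> c instanceof NodeRemoved);
    }

    public static void main(String[] args) {
        System.out.println(isAgentError(List.of(new RemovedNodeCause())));
        System.out.println(isAgentError(List.of(new RemovedNodeTimeoutCause())));
        System.out.println(isAgentError(List.of(new UserAbort())));
    }
}
```

With this shape, adding a third removal-related cause later would not require touching the matching code, as long as the new cause implements the marker.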
Simpler now, thanks. Some optional suggestions.
if (isOneShotAgent(node)) {
LOGGER.fine(() -> "Cancelling owner run for one-shot agent " + node.getNodeName() + " immediately");
cancelOwnerExecution(node, new RemovedNodeCause());
(The crucial part FTR.)
return node instanceof AbstractCloudSlave ||
(node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);
Unfortunately this heuristic does not match `EC2AbstractSlave extends Slave` nor `EC2RetentionStrategy extends RetentionStrategy`. I guess we need to hard-code support for those nonstandard implementations. CC @car-roll
Can we start introducing a marker interface for this usage?
Maybe, though I think it would suffice for `EC2AbstractSlave` to extend `AbstractCloudSlave` and `EC2Computer` to extend `AbstractCloudComputer`, with some minor refactoring to delete then-redundant logic. (CloudBees-internal reference)
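To make this thread concrete, the sketch below reproduces only the inheritance relationships mentioned above using stand-in classes; none of it is the real Jenkins or EC2 plugin code. `Ec2LikeSlave` and `Ec2LikeRetentionStrategy` model the "extends `Slave`" / "extends `RetentionStrategy`" case, and the `OneShot` marker interface and both "fixed" variants are hypothetical illustrations of the two proposals, not agreed designs.

```java
// Stand-ins mirroring only the class relationships discussed in the thread.
class OneShotSketch {
    static class RetentionStrategy {}
    static class OnceRetentionStrategy extends RetentionStrategy {}
    static class Ec2LikeRetentionStrategy extends RetentionStrategy {} // does not extend OnceRetentionStrategy

    static class Node {}
    static class Slave extends Node {
        private final RetentionStrategy strategy;
        Slave(RetentionStrategy strategy) { this.strategy = strategy; }
        RetentionStrategy getRetentionStrategy() { return strategy; }
    }
    static class AbstractCloudSlave extends Slave {
        AbstractCloudSlave(RetentionStrategy strategy) { super(strategy); }
    }

    /** Models an agent type that extends Slave directly, bypassing AbstractCloudSlave. */
    static class Ec2LikeSlave extends Slave {
        Ec2LikeSlave() { super(new Ec2LikeRetentionStrategy()); }
    }

    /** Proposal 1 (hypothetical): a marker interface that agent types opt into. */
    interface OneShot {}
    static class MarkedEc2LikeSlave extends Slave implements OneShot {
        MarkedEc2LikeSlave() { super(new Ec2LikeRetentionStrategy()); }
    }

    /** Proposal 2: re-parent the agent type under AbstractCloudSlave. */
    static class ReparentedEc2LikeSlave extends AbstractCloudSlave {
        ReparentedEc2LikeSlave() { super(new Ec2LikeRetentionStrategy()); }
    }

    /** Same shape as the heuristic under review. */
    static boolean isOneShotAgent(Node node) {
        return node instanceof AbstractCloudSlave ||
            (node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);
    }

    /** The heuristic extended with the hypothetical marker. */
    static boolean isOneShotAgentWithMarker(Node node) {
        return node instanceof OneShot || isOneShotAgent(node);
    }

    public static void main(String[] args) {
        System.out.println(isOneShotAgent(new Ec2LikeSlave()));                 // false: the miss described above
        System.out.println(isOneShotAgent(new ReparentedEc2LikeSlave()));       // true
        System.out.println(isOneShotAgentWithMarker(new MarkedEc2LikeSlave())); // true
    }
}
```

Re-parenting leaves the heuristic untouched but requires a change in the other plugin; the marker approach needs a change on both sides, which may be why it is described here as optional.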
This cancels the node block immediately, instead of waiting for the node to come back, in conditions where we are sure the node cannot come back: when using a cloud node or `OnceRetentionStrategy`.
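The behaviour described in this summary can be sketched as a small decision function. This is an illustration under assumptions, not the plugin's implementation: the `Action` enum, the `onNodeMissing` method, and the five-minute timeout in `main` are all hypothetical, chosen only to show the immediate path next to the existing wait-then-cancel path.

```java
import java.time.Duration;

class RemovalPolicySketch {
    /** Hypothetical outcomes for a build whose agent has disappeared. */
    enum Action { CANCEL_IMMEDIATELY, KEEP_WAITING, CANCEL_AFTER_TIMEOUT }

    /**
     * Decide what to do with a build whose agent is missing.
     * A one-shot agent (cloud node, or one with OnceRetentionStrategy) can never come back,
     * so there is nothing to wait for; any other agent gets the grace period.
     */
    static Action onNodeMissing(boolean oneShotAgent, Duration missingFor, Duration timeout) {
        if (oneShotAgent) {
            return Action.CANCEL_IMMEDIATELY;
        }
        return missingFor.compareTo(timeout) >= 0 ? Action.CANCEL_AFTER_TIMEOUT : Action.KEEP_WAITING;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(5); // assumed value, for illustration only
        System.out.println(onNodeMissing(true, Duration.ZERO, timeout));
        System.out.println(onNodeMissing(false, Duration.ofMinutes(1), timeout));
        System.out.println(onNodeMissing(false, Duration.ofMinutes(6), timeout));
    }
}
```

In terms of the causes discussed earlier in the conversation, the first outcome would correspond to `RemovedNodeCause` and the last to `RemovedNodeTimeoutCause`.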