Build fails to resume if controller crashes before queue is saved #408

Merged on Nov 22, 2024 (17 commits)

Conversation

Vlatombe
Member

@Vlatombe Vlatombe commented Nov 20, 2024

This covers the case where a queue item for a node block was scheduled and the controller then shut down without going through its normal cleanup.
Because the queue was never saved, the queue item is gone when the controller restarts.

This changes ExecutorStepExecution#onResume to detect this condition and schedule a new queue item if needed.
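
For readers unfamiliar with the failure mode, here is a minimal toy model of the recovery idea described above. It is not the plugin's code: ToyQueue, ToyNodeStep and every method on them are hypothetical names that use only the JDK, and the model ignores the case where an item left the queue because an executor already picked it up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for the build queue; NOT Jenkins' hudson.model.Queue. */
class ToyQueue {
    private final Map<Long, String> items = new LinkedHashMap<>();
    private long nextId = 1;

    /** Adds a task and returns the id of the new queue item. */
    long schedule(String task) {
        long id = nextId++;
        items.put(id, task);
        return id;
    }

    boolean contains(long id) {
        return items.containsKey(id);
    }

    /** Simulates a crash before the queue was saved: pending items vanish. */
    void loseUnsavedItems() {
        items.clear();
    }
}

/** Toy stand-in for a node step execution; all names are hypothetical. */
class ToyNodeStep {
    private final ToyQueue queue;
    private final String task;
    private long queueItemId; // recorded when the item is first scheduled

    ToyNodeStep(ToyQueue queue, String task) {
        this.queue = queue;
        this.task = task;
    }

    void start() {
        queueItemId = queue.schedule(task);
    }

    /** Returns true if the recorded item was missing and a new one was scheduled. */
    boolean onResume() {
        if (queue.contains(queueItemId)) {
            return false; // queue survived the restart, nothing to do
        }
        queueItemId = queue.schedule(task);
        return true;
    }

    long queueItemId() {
        return queueItemId;
    }
}
```

In this model, calling onResume() after a clean restart does nothing, while calling it after loseUnsavedItems() schedules a replacement item and records its id.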

Testing done

See provided unit test.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Vlatombe Vlatombe requested a review from a team as a code owner November 20, 2024 17:15
@Vlatombe Vlatombe marked this pull request as draft November 20, 2024 17:15
try (var tailLog = new TailLog(rjr, "p", 1)) {
iar.createAgent(rjr, InboundAgentRule.Options.newBuilder().name("J").label("mib").color(PrefixedOutputStream.Color.YELLOW).webSocket().build());
rjr.runRemotely(ExecutorStepDynamicContextRJRTest::setupJobAndStart);
rjr.stopJenkinsForcibly();
Member Author

The same test passes with

rjr.stopJenkins();

@@ -67,7 +67,7 @@
<properties>
<changelist>999999-SNAPSHOT</changelist>
<!-- TODO Until in plugin-pom -->
-<jenkins-test-harness.version>2182.v0138ccb_c0b_cb_</jenkins-test-harness.version>
+<jenkins-test-harness.version>2357.vf2a_982b_b_910f</jenkins-test-harness.version>
Member Author

@Vlatombe Vlatombe Nov 20, 2024

Need jenkinsci/jenkins-test-harness#876 for RealJenkinsRule#stopJenkinsForcibly

@jglick jglick added the bug label Nov 20, 2024
@Vlatombe
Member Author

The given test sometimes succeeds...

@Vlatombe
Member Author

Vlatombe commented Nov 21, 2024

The test succeeds when the stop happens before the build enters the second node step (while it is still in sleep), so I need a stronger assertion before stopping.
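
To illustrate the kind of guard this calls for, here is a minimal JDK-only sketch of polling for a precondition before triggering the forced stop. Poll and its until method are hypothetical; the test code quoted in this thread uses await() calls rather than a helper like this.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper, shown only to illustrate the idea. */
class Poll {
    /**
     * Re-evaluates the condition until it holds or the timeout elapses.
     * Returns true if the condition became true, false on timeout or interrupt.
     */
    static boolean until(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A test would wait on a condition that is only true once the build has entered the second node step, and stop the controller only after until returns true. That removes the window in which the stop lands during the sleep.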

@Vlatombe Vlatombe marked this pull request as ready for review November 21, 2024 15:39
@jglick jglick self-requested a review November 21, 2024 15:58
@jglick jglick changed the title Pipeline fails to resume if Jenkins was shutdown forcibly Build fails to resume if controller crashes before queue is saved Nov 22, 2024
@jglick
Member

jglick commented Nov 22, 2024

(For reference, this is a corner case which would previously trigger AnomalousStatus, as most recently touched in #405.)

if (flowNode == null) {
LOGGER.fine(() -> "No FlowNode found for node block " + getContext() + "; can't recover" );
} else {
var action = flowNode.getAction(QueueItemActionImpl.class);
Member

To remind myself: this would have been added right away in

public boolean start() throws Exception {
final PlaceholderTask task = new PlaceholderTask(getContext(), step.getLabel());
Queue.WaitingItem waitingItem = Queue.getInstance().schedule2(task, 0).getCreateItem();
if (waitingItem == null) {
// There can be no duplicates. But could be refused if a QueueDecisionHandler rejects it for some odd reason.
throw new IllegalStateException("failed to schedule task");
}
getContext().get(FlowNode.class).addAction(new QueueItemActionImpl(waitingItem.getId()));

await("Waiting for agent J to reconnect").atMost(Duration.ofSeconds(30)).until(() -> r.jenkins.getComputer("J").isOnline(), is(true));
var actions = await().until(() -> b.getActions(InputAction.class), allOf(iterableWithSize(1), hasItem(new InputActionWithId("Branch1"))));
proceed(actions, "Branch1", p.getName() + "#" + b.number);
// This is quicker than waitForMessage that can wait for up to 10 minutes
Member

🤔 maybe we should start developing reusable Matchers in JTH so you could write for example

await().until(() -> b, hasMessage("Complete branch 2 ?"));

which would be more flexible & composable than the current assertions.

Member Author

Yes, that would be much nicer than the current assertions available on JenkinsRule.
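
As a rough illustration of the composability being discussed, here is a JDK-only sketch in which java.util.function.Predicate stands in for a matcher. LogMatchers and this hasMessage are hypothetical and operate on plain log text; nothing like this is claimed to exist in JTH.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in for reusable matchers, built on Predicate. */
class LogMatchers {
    /** Matches when the log text contains the given message. */
    static Predicate<String> hasMessage(String message) {
        return log -> log.contains(message);
    }
}
```

Because the result is an ordinary Predicate, conditions combine with and, or and negate, for example hasMessage("Complete branch 2 ?").and(hasMessage("Finished").negate()), and the combined condition could be handed to any polling helper.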

Member

@jglick jglick left a comment

#408 (comment) seems like a real bug. Otherwise looks good.

@jglick jglick enabled auto-merge November 22, 2024 16:16
@jglick jglick merged commit f6c9e89 into jenkinsci:master Nov 22, 2024
17 checks passed

2 participants