Remove queue locking requirement when adding new nodes #5450

Merged
merged 3 commits into from
May 19, 2021

@res0nance res0nance commented Apr 28, 2021

See JENKINS-65501.

While investigating, I saw that addNode can technically be made lock-free, so this is a stab at it. This would greatly help cloud implementations when provisioning a large number of nodes. It will not help removal, which will still require locking.

It can be made non-locking because a Node with no computers does not have any executors, so it cannot confuse Queue#maintain(). However, because updateComputerList deals with other computers as well, that method needs to lock. Adding a separate method for a single new computer should remove this locking requirement.

References:
Queue#maintain()
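
The distinction drawn above, between a refresh that touches every computer and an update for one new node, can be sketched roughly as follows. This is only an illustration under assumptions: `NodeRegistrySketch`, `queueLock`, and `executorsOf` are hypothetical names, not Jenkins classes, and a `ConcurrentHashMap` stands in for the real computer map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in types; none of this is Jenkins code.
final class NodeRegistrySketch {
    /** Stand-in for the queue lock held while scheduling scans executors. */
    private final ReentrantLock queueLock = new ReentrantLock();
    /** Node name -> number of executors visible to the scheduler. */
    private final Map<String, Integer> computers = new ConcurrentHashMap<>();

    /** Full refresh: may remove or resize existing entries, so it takes the lock. */
    void updateComputerList(Map<String, Integer> desired) {
        queueLock.lock();
        try {
            computers.keySet().retainAll(desired.keySet()); // drop nodes that are gone
            computers.putAll(desired);                      // add or resize the rest
        } finally {
            queueLock.unlock();
        }
    }

    /** Single new node: only adds an entry, never touches existing ones, so no lock. */
    boolean updateNewComputer(String nodeName, int executors) {
        // putIfAbsent is atomic; an existing entry is left untouched (the sanity check).
        return computers.putIfAbsent(nodeName, executors) == null;
    }

    int executorsOf(String nodeName) {
        return computers.getOrDefault(nodeName, 0);
    }

    int size() {
        return computers.size();
    }
}
```

The sketch only shows why the single-node path has a smaller footprint than the full refresh; whether that is enough to drop the lock in the real code depends on everything else that reads executor state.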

Proposed changelog entries

  • Entry 1: Remove the requirement for locking the queue when adding a new node

Proposed upgrade guidelines

N/A

Submitter checklist

  • (If applicable) Jira issue is well described
  • Changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developers, depending on the change). Examples
    • Fill in the Proposed changelog entries section only if there are breaking changes or other changes which may require extra steps from users during the upgrade
  • Appropriate autotests or explanation of why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@mention

Maintainer checklist

Before the changes are marked as ready-for-merge:

  • There are at least 2 approvals for the pull request and no outstanding requests for change
  • Conversations in the pull request are over OR it is explicit that a reviewer does not block the change
  • Changelog entries in the PR title and/or Proposed changelog entries are correct
  • Proper changelog labels are set so that the changelog can be generated automatically
  • If the change needs additional upgrade steps from users, upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the PR title. (example)
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@res0nance res0nance marked this pull request as draft April 28, 2021 14:53
@res0nance res0nance changed the title from "Add lock free new computer update" to "Remove queue locking requirement when adding new nodes" Apr 29, 2021
@res0nance res0nance added the rfe and squash-merge-me labels Apr 29, 2021
}
fireOnFailure(f, cause);
}
Jenkins jenkins = Jenkins.get();
Contributor Author

This entire change only removes the locking and shifts the indentation by 8 spaces.

protected void updateNewComputer(final Node n, boolean automaticSlaveLaunch) {
    final String nodeName = n.getNodeName();
    final Map<Node, Computer> computers = getComputerMap();
    if (computers.containsKey(n)) {
Contributor Author

A sanity check, although technically not needed.

@res0nance res0nance marked this pull request as ready for review April 29, 2021 06:21
Member

@timja timja left a comment

Looks good, any specific testing been done?

@res0nance
Contributor Author

> Looks good, any specific testing been done?

I've only managed some basic testing; I don't have the facilities to do a good enough stress test.
I'm not sure what a good way to test it would be, since it involves multiple threads.
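
One common shape for a multi-threaded test is to line up several workers behind a latch, release them together, and check an invariant afterwards. The sketch below assumes that approach and does not exercise Jenkins itself: `ConcurrentAddStressSketch` and `run` are hypothetical names, and a concurrent set stands in for the node list.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical harness; the set below is a stand-in for a real node list.
final class ConcurrentAddStressSketch {
    /** Releases all workers at once; each adds perThread unique names. Returns the final size. */
    static int run(int threads, int perThread) {
        Set<String> nodes = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                try {
                    start.await(); // hold every worker until all are queued
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    nodes.add("agent-" + id + "-" + i);
                }
            });
        }
        start.countDown(); // release the workers together to maximise contention
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return nodes.size();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 1000));
    }
}
```

A passing run does not prove the absence of races; it only raises the chance of hitting one. The timeout matters because a hang, rather than a wrong count, is a typical symptom of a locking mistake.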

@timja timja requested review from jglick and a team May 3, 2021 20:20
@jglick jglick requested a review from stephenc May 3, 2021 20:47
Member

stephenc commented May 4, 2021

@jglick it has been years since I looked at this code, and I have forgotten almost everything I knew. Removing myself from the reviewers.

Member

@stephenc stephenc left a comment

No comment

Member

jglick commented May 4, 2021

🤷 Thought I would try!

@timja timja requested a review from a team May 4, 2021 14:22
Member

@timja timja left a comment

Looks sensible 🤷

@oleg-nenashev oleg-nenashev self-requested a review May 13, 2021 21:26
Member

@oleg-nenashev oleg-nenashev left a comment

I do not see anything wrong in this pull request. I suggest going ahead with the merge, but we should indeed monitor the Jenkins issue tracker for reported regressions. We have plenty of time until the next baseline, so it is a good time to take some risks.

We may merge it in 24 hours if there is no negative feedback. Please see the merge process documentation for more information.

@oleg-nenashev oleg-nenashev added the ready-for-merge label May 16, 2021
@timja timja merged commit b1aeef9 into jenkinsci:master May 19, 2021
});
old.set(nodes.put(node.getNodeName(), node));
jenkins.updateNewComputer(node);
jenkins.trimLabels();
// TODO there is a theoretical race whereby the node instance is updated/removed after lock release
Member

As there is no longer a lock above, is this comment obsolete?

@res0nance res0nance deleted the lock-free-node-add branch May 30, 2021 09:04
Member

basil commented Jul 28, 2021

With this PR, org.jenkinsci.plugins.pipeline.modeldefinition.AgentTest#agentOnGroup hangs. Without this PR, the test passes.

Member

jglick commented Jul 28, 2021

Indeed:

Still waiting to schedule task
Waiting for next available executor on ‘slave0’

Although the config screen for slave0 shows it as having 2 executors, the executor widget shows just one—which is occupied, hence the hung build I guess. If you Save that config screen, the build resumes and the test passes.

Member

jglick commented Jul 28, 2021

I suspect this somehow broke

/**
 * Calling path, *means protected by Queue.withLock
 *
 * Computer.doConfigSubmit -> Computer.replaceBy ->Jenkins.setNodes* ->Computer.setNode
 * AbstractCIBase.updateComputerList->Computer.inflictMortalWound*
 * AbstractCIBase.updateComputerList->AbstractCIBase.updateComputer* ->Computer.setNode
 * AbstractCIBase.updateComputerList->AbstractCIBase.killComputer->Computer.kill
 * Computer.constructor->Computer.setNode
 * Computer.kill is called after numExecutors set to zero(Computer.inflictMortalWound) so not need the Queue.lock
 *
 * @param n number of executors
 */
@GuardedBy("hudson.model.Queue.lock")
private void setNumExecutors(int n) {

@res0nance
Contributor Author

> I suspect this somehow broke
>
> /**
>  * Calling path, *means protected by Queue.withLock
>  *
>  * Computer.doConfigSubmit -> Computer.replaceBy ->Jenkins.setNodes* ->Computer.setNode
>  * AbstractCIBase.updateComputerList->Computer.inflictMortalWound*
>  * AbstractCIBase.updateComputerList->AbstractCIBase.updateComputer* ->Computer.setNode
>  * AbstractCIBase.updateComputerList->AbstractCIBase.killComputer->Computer.kill
>  * Computer.constructor->Computer.setNode
>  * Computer.kill is called after numExecutors set to zero(Computer.inflictMortalWound) so not need the Queue.lock
>  *
>  * @param n number of executors
>  */
> @GuardedBy("hudson.model.Queue.lock")
> private void setNumExecutors(int n) {

I'm not sure how this causes a problem, but the failing test seems to indicate as much.
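
An annotation such as `@GuardedBy` documents the locking contract but does not enforce it at run time, so a call path that reaches the method without the lock goes unnoticed. One way to surface such a path while debugging is an explicit ownership check. The following is a sketch of that idea only: `GuardedExecutorCountSketch`, `queueLock`, and `withLock` are hypothetical stand-ins, not the Jenkins `Queue` API.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration; this is not how Jenkins implements Queue locking.
final class GuardedExecutorCountSketch {
    private final ReentrantLock queueLock = new ReentrantLock();
    private int numExecutors; // intended to be guarded by queueLock

    /** Fails fast when a caller reaches the setter without holding the lock. */
    void setNumExecutors(int n) {
        if (!queueLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("setNumExecutors called without the queue lock");
        }
        numExecutors = n;
    }

    /** Runs the body while holding the lock, similar in spirit to a withLock helper. */
    void withLock(Runnable body) {
        queueLock.lock();
        try {
            body.run();
        } finally {
            queueLock.unlock();
        }
    }

    /** Reads under the same lock so the value is consistently visible. */
    int getNumExecutors() {
        queueLock.lock();
        try {
            return numExecutors;
        } finally {
            queueLock.unlock();
        }
    }
}
```

With a check like this in place, an unguarded path would fail with a stack trace pointing at the caller instead of leaving a stale executor count behind.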

Member

basil commented Sep 6, 2022

Appears to cause JENKINS-69534

@@ -141,17 +141,10 @@ public void addNode(final @NonNull Node node) throws IOException {

Node oldNode = nodes.get(node.getNodeName());
if (node != oldNode) {
// TODO we should not need to lock the queue for adding nodes but until we have a way to update the
// computer list for just the new node
AtomicReference<Node> old = new AtomicReference<>();
Member

With the inner class gone, there is no more need for an AtomicReference so this could be simplified.
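
The point about `AtomicReference` follows from a Java language rule: a local variable captured by a lambda or inner class must be effectively final, so a holder object is needed to carry a result out of a callback. Once the callback is gone, a plain local is enough. A minimal sketch, using hypothetical names (`OldValueCaptureSketch`, `putInsideCallback`, `putDirectly`) and a `Map<String, String>` in place of the real node map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the simplification; not the actual Jenkins code.
final class OldValueCaptureSketch {
    /** Inside a callback, a captured local cannot be reassigned, hence the holder. */
    static String putInsideCallback(Map<String, String> nodes, String name, String node) {
        AtomicReference<String> old = new AtomicReference<>();
        Runnable locked = () -> old.set(nodes.put(name, node));
        locked.run();
        return old.get();
    }

    /** Without the callback, an ordinary local variable does the same job. */
    static String putDirectly(Map<String, String> nodes, String name, String node) {
        String old = nodes.put(name, node);
        return old;
    }

    public static void main(String[] args) {
        Map<String, String> nodes = new HashMap<>();
        System.out.println(putDirectly(nodes, "agent-1", "first"));
        System.out.println(putInsideCallback(nodes, "agent-1", "second"));
    }
}
```

Both methods return the previous mapping for the key; the second form simply avoids the extra allocation and indirection.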
