Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

Merged
merged 2 commits into from
May 10, 2024
Merged

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

merged 2 commits into from
May 10, 2024

Conversation

cottsay
Copy link
Contributor

@cottsay cottsay commented May 8, 2024

See JENKINS-21605.

There are two commits in this pull request:

  1. Fix non-determinism in nested cause tests

    For deeplyNestedCauses, quickly triggering a jobs with no quiet period runs the possibility that a job is triggered while it is still in the queue. Triggering and waiting for the jobs to execute synchronously avoids this and also simplifies the code.

    The broadlyNesstedCauses test suffers from the same problem, but is additionally not structured in the way the original issue describes to reproduce the issue. Using the quiet period argument to scheduleBuild2, we can reliably produce a graph where each build is triggered by the other two jobs. An additional final build outside of the loop cuts off the 1 second quiet period from the last loop's iteration so that the test is never waiting for the quiet period of any jobs even though half of the schedule operations specify one.

  2. Re-work nested cause trimming algorithm

    This one is a little less straightforward. The algorithm was originally implemented to trim only based on the depth of the causal tree, and was later changed in an attempt to mitigate circumstances where the breadth of the causal tree was causing the same problem (excessive build.xml size). This latter change doesn't seem to be working properly, and I can't make sense of the way the code is written. Though it does sometimes reduce the number of recorded upstream causes, it doesn't really do so predictably and is inadequate for our real-world application.

    The revised algorithm has the following goals:

    • Record no more than 10 "upstream causes" deep into the tree.
    • Record no more than 25 "upstream causes" total in the tree.
    • Never duplicate any portion of the tree. If a sub-tree was already recorded, simply record no causes under that node.
    • Record a DeeplyNestedUpstreamCause anywhere the depth and breadth limits are exceeded and information is culled so that it is clear where information was lost.

Both the original implementation and the revised implementation presented here suffer from one known limitation: Each direct cause of a build is trimmed independently from each other. This means that the depth and cumulative limits in the trimming as well as the de-duplication, are applied to each direct cause with no knowledge of each other. While each sub-tree is therefore bounded to some maximum recorded size regardless of the de-facto causal tree, there is still no bound on the number of recorded direct causes (and their trimmed sub-trees). It will be left to a subsequent improvement to address that issue since it will likely involve API changes, but ideally we should at the very least stop recording sub-trees when some limit is exceeded.

I'd like to mention one additional tidbit I uncovered when working through this issue. The size on disk of the build.xml is not the only factor related to the recording of upstream causes that can cause Jenkins to perform poorly. Before any of the upstream cause trimming was implemented, duplicated sub-trees were handled in the XML representation using references. In a scenario where there is a lot of duplicated sub-trees in the overall causal tree, there may very well be an extremely compact XML representation on the order of ~300 KiB, but Jenkins may spend several minutes (!!!) serializing the data structures for recording in the build.xml. When the trimming was implemented, it eliminated the possibility that there could ever be "references" in the XML because even upstream cause trees which required no trimming are re-allocated during the trimming process, breaking the relationship between the duplicated sub trees as far as the XML serializer is concerned. It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.

Testing done

This change was deployed to the Jenkins instance in our staging environment and subjected to a worst-case scenario involving a causal tree of roughly ~2000 jobs with a single job on each end (one trigger job, one finalization job). In this real-world workload, the size on disk of the largest build.xml files were reduced by roughly 90%. The finalization job's build.xml in particular went from 80 MiB to 9.4 MiB. Of the intermediate jobs, all 92 of the jobs which previously had a build.xml over 1 MiB were reduced to less than 0.6 MiB.

The 9.4 MiB "finalization" build was triggered by ~1000 of the intermediate jobs and therefore has an immensely wide causal tree. The improved trimming of each sub-tree accounts for the huge reduction in build.xml size from the previous 80 MiB, but the lack of an overall trimming approach at the direct cause level results in the (still unbearable) 9.4 MiB build.xml. Again, addressing this issue is left to future work.

Proposed changelog entries

  • Further reduce number of recorded upstream causes.

Proposed upgrade guidelines

N/A

Submitter checklist

Desired reviewers

Before the changes are marked as ready-for-merge:

Maintainer checklist

cottsay added 2 commits May 7, 2024 15:32
For deeplyNestedCauses, quickly triggering a jobs with no quiet period
runs the possibility that a job is triggered while it is still in the
queue. Triggering and waiting for the jobs to execute synchronously
avoids this and also simplifies the code.

The broadlyNesstedCauses test suffers from the same problem, but is
additionally not structured in the way the original issue describes to
reproduce the issue. Using the quiet period argument to scheduleBuild2,
we can reliably produce a graph where each build is triggered by the
other two jobs. An additional final job outside of the loop cuts off the
1 second quiet period from the last loop's iteration so that the test is
never waiting for the quiet period of any jobs even though half of the
schedule operations specify one.
Copy link

welcome bot commented May 8, 2024

Yay, your first pull request towards Jenkins core was created successfully! Thank you so much!

A contributor will provide feedback soon. Meanwhile, you can join the chats and community forums to connect with other Jenkins users, developers, and maintainers.

Copy link
Member

@timja timja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, untested.

Thanks for the pull request.

@timja timja requested a review from jglick May 9, 2024 07:20
@timja timja added the bug For changelog: Minor bug. Will be listed after features label May 9, 2024
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not spend the time to dig into the revised algorithm but it sounds reasonable from the description.

I can't make sense of the way the code is written

If it was me, it was twelve years ago, so your guess is as good as mine!

It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.

Simpler to just tune the maximum retained size low enough that it never matters.

@@ -74,9 +74,6 @@ public class CauseTest {
assertTrue("keeps full history:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
buildXml = new XmlFile(Run.XSTREAM, new File(last.getRootDir(), "build.xml")).asString();
assertFalse("too big:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
for (QueueTaskFuture<FreeStyleBuild> future : futures) {
j.assertBuildStatusSuccess(j.waitForCompletion(future.waitForStart()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way we do require that all scheduled or started builds are completed (or canceled before they start) before the end of the test; otherwise there can be a mess cleaning up after the test’s temp dir. I think you are doing that here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 105 handles all of the builds in this test as well as the extra handling added by the looks of it

Copy link
Contributor Author

@cottsay cottsay May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For deeplyNestedCauses, every scheduled build is now passed to j.waitForCompletion immediately so this snippet is no longer necessary.

For broadlyNestedCauses, as @timja mentioned, line 105 will clean up any lingering scheduled builds, however every single build was either:

  1. scheduled with a quiet period of 1 and recorded in futures
  2. scheduled with a quiet period of 0 and immediately awaited for completion

Additionally, each case of (1) is followed by an instance of (2) so the aforementioned cleanup on line 105 should never actually do anything. Though some builds were scheduled without immediately awaiting, there is always a subsequent scheduling of that same job that does immediately await.

I left the cleanup only as a backstop.

@timja
Copy link
Member

timja commented May 9, 2024

/label ready-for-merge


This PR is now ready for merge, after ~24 hours, we will merge it if there's no negative feedback.

Thanks!

@comment-ops-bot comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label May 9, 2024
@MarkEWaite MarkEWaite merged commit 53b5e90 into jenkinsci:master May 10, 2024
16 checks passed
Copy link

welcome bot commented May 10, 2024

Congratulations on getting your very first Jenkins core pull request merged 🎉🥳

This is a fantastic achievement, and we're thrilled to have you as part of our community! Thank you for your valuable input, and we look forward to seeing more of your contributions in the future!

We would like to invite you to join the community chats and forums to meet other Jenkins contributors 😊
Don't forget to check out the participation page to learn more about how to contribute to Jenkins.


@cottsay cottsay deleted the nested-cause-rework branch May 13, 2024 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants