[JENKINS-21605] Further reduce number of recorded upstream causes #9248

cottsay · 2024-05-08T21:52:13Z

There are two commits in this pull request:

Fix non-determinism in nested cause tests

For deeplyNestedCauses, quickly triggering a jobs with no quiet period runs the possibility that a job is triggered while it is still in the queue. Triggering and waiting for the jobs to execute synchronously avoids this and also simplifies the code.

The broadlyNesstedCauses test suffers from the same problem, but is additionally not structured in the way the original issue describes to reproduce the issue. Using the quiet period argument to scheduleBuild2, we can reliably produce a graph where each build is triggered by the other two jobs. An additional final build outside of the loop cuts off the 1 second quiet period from the last loop's iteration so that the test is never waiting for the quiet period of any jobs even though half of the schedule operations specify one.
Re-work nested cause trimming algorithm

This one is a little less straightforward. The algorithm was originally implemented to trim only based on the depth of the causal tree, and was later changed in an attempt to mitigate circumstances where the breadth of the causal tree was causing the same problem (excessive build.xml size). This latter change doesn't seem to be working properly, and I can't make sense of the way the code is written. Though it does sometimes reduce the number of recorded upstream causes, it doesn't really do so predictably and is inadequate for our real-world application.

The revised algorithm has the following goals:
- Record no more than 10 "upstream causes" deep into the tree.
- Record no more than 25 "upstream causes" total in the tree.
- Never duplicate any portion of the tree. If a sub-tree was already recorded, simply record no causes under that node.
- Record a DeeplyNestedUpstreamCause anywhere the depth and breadth limits are exceeded and information is culled so that it is clear where information was lost.

Both the original implementation and the revised implementation presented here suffer from one known limitation: Each direct cause of a build is trimmed independently from each other. This means that the depth and cumulative limits in the trimming as well as the de-duplication, are applied to each direct cause with no knowledge of each other. While each sub-tree is therefore bounded to some maximum recorded size regardless of the de-facto causal tree, there is still no bound on the number of recorded direct causes (and their trimmed sub-trees). It will be left to a subsequent improvement to address that issue since it will likely involve API changes, but ideally we should at the very least stop recording sub-trees when some limit is exceeded.

I'd like to mention one additional tidbit I uncovered when working through this issue. The size on disk of the build.xml is not the only factor related to the recording of upstream causes that can cause Jenkins to perform poorly. Before any of the upstream cause trimming was implemented, duplicated sub-trees were handled in the XML representation using references. In a scenario where there is a lot of duplicated sub-trees in the overall causal tree, there may very well be an extremely compact XML representation on the order of ~300 KiB, but Jenkins may spend several minutes (!!!) serializing the data structures for recording in the build.xml. When the trimming was implemented, it eliminated the possibility that there could ever be "references" in the XML because even upstream cause trees which required no trimming are re-allocated during the trimming process, breaking the relationship between the duplicated sub trees as far as the XML serializer is concerned. It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.

Testing done

This change was deployed to the Jenkins instance in our staging environment and subjected to a worst-case scenario involving a causal tree of roughly ~2000 jobs with a single job on each end (one trigger job, one finalization job). In this real-world workload, the size on disk of the largest build.xml files were reduced by roughly 90%. The finalization job's build.xml in particular went from 80 MiB to 9.4 MiB. Of the intermediate jobs, all 92 of the jobs which previously had a build.xml over 1 MiB were reduced to less than 0.6 MiB.

The 9.4 MiB "finalization" build was triggered by ~1000 of the intermediate jobs and therefore has an immensely wide causal tree. The improved trimming of each sub-tree accounts for the huge reduction in build.xml size from the previous 80 MiB, but the lack of an overall trimming approach at the direct cause level results in the (still unbearable) 9.4 MiB build.xml. Again, addressing this issue is left to future work.

Proposed changelog entries

Further reduce number of recorded upstream causes.

Proposed upgrade guidelines

N/A

Submitter checklist

Give feedback

The Jira issue, if it exists, is well-described.
The changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developers, depending on the change) and are in the imperative mood (see examples). Fill in the Proposed upgrade guidelines section only if there are breaking changes or changes that may require extra steps from users during upgrade.
There is automated testing or an explanation as to why this change has no tests.
New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadocs, as appropriate.
New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO"), if applicable.
New or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content Security Policy (CSP) directives (see documentation).
For dependency updates, there are links to external changelogs and, if possible, full differentials.
For new APIs and extension points, there is a link to at least one consumer.
Options

Desired reviewers

Before the changes are marked as ready-for-merge:

Maintainer checklist

Give feedback

There are at least two (2) approvals for the pull request and no outstanding requests for change.
Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
Proper changelog labels are set so that the changelog can be generated automatically.
If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).
Options

For deeplyNestedCauses, quickly triggering a jobs with no quiet period runs the possibility that a job is triggered while it is still in the queue. Triggering and waiting for the jobs to execute synchronously avoids this and also simplifies the code. The broadlyNesstedCauses test suffers from the same problem, but is additionally not structured in the way the original issue describes to reproduce the issue. Using the quiet period argument to scheduleBuild2, we can reliably produce a graph where each build is triggered by the other two jobs. An additional final job outside of the loop cuts off the 1 second quiet period from the last loop's iteration so that the test is never waiting for the quiet period of any jobs even though half of the schedule operations specify one.

welcome · 2024-05-08T21:52:16Z

Yay, your first pull request towards Jenkins core was created successfully! Thank you so much!

A contributor will provide feedback soon. Meanwhile, you can join the chats and community forums to connect with other Jenkins users, developers, and maintainers.

timja

Looks reasonable, untested.

Thanks for the pull request.

jglick

Did not spend the time to dig into the revised algorithm but it sounds reasonable from the description.

I can't make sense of the way the code is written

If it was me, it was twelve years ago, so your guess is as good as mine!

It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.

Simpler to just tune the maximum retained size low enough that it never matters.

jglick · 2024-05-09T13:18:42Z

test/src/test/java/hudson/model/CauseTest.java

@@ -74,9 +74,6 @@ public class CauseTest {
        assertTrue("keeps full history:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
        buildXml = new XmlFile(Run.XSTREAM, new File(last.getRootDir(), "build.xml")).asString();
        assertFalse("too big:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
-        for (QueueTaskFuture<FreeStyleBuild> future : futures) {
-            j.assertBuildStatusSuccess(j.waitForCompletion(future.waitForStart()));


By the way we do require that all scheduled or started builds are completed (or canceled before they start) before the end of the test; otherwise there can be a mess cleaning up after the test’s temp dir. I think you are doing that here.

line 105 handles all of the builds in this test as well as the extra handling added by the looks of it

For deeplyNestedCauses, every scheduled build is now passed to j.waitForCompletion immediately so this snippet is no longer necessary.

For broadlyNestedCauses, as @timja mentioned, line 105 will clean up any lingering scheduled builds, however every single build was either:

scheduled with a quiet period of 1 and recorded in futures

scheduled with a quiet period of 0 and immediately awaited for completion

Additionally, each case of (1) is followed by an instance of (2) so the aforementioned cleanup on line 105 should never actually do anything. Though some builds were scheduled without immediately awaiting, there is always a subsequent scheduling of that same job that does immediately await.

I left the cleanup only as a backstop.

timja · 2024-05-09T13:43:08Z

/label ready-for-merge

This PR is now ready for merge, after ~24 hours, we will merge it if there's no negative feedback.

Thanks!

welcome · 2024-05-10T23:55:14Z

Congratulations on getting your very first Jenkins core pull request merged 🎉🥳

This is a fantastic achievement, and we're thrilled to have you as part of our community! Thank you for your valuable input, and we look forward to seeing more of your contributions in the future!

We would like to invite you to join the community chats and forums to meet other Jenkins contributors 😊
Don't forget to check out the participation page to learn more about how to contribute to Jenkins.

cottsay added 2 commits May 7, 2024 15:32

Re-work nested cause trimming algorithm

625abf4

timja approved these changes May 9, 2024

View reviewed changes

timja requested a review from jglick May 9, 2024 07:20

timja added the bug For changelog: Minor bug. Will be listed after features label May 9, 2024

jglick approved these changes May 9, 2024

View reviewed changes

comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label May 9, 2024

MarkEWaite merged commit 53b5e90 into jenkinsci:master May 10, 2024
16 checks passed

cottsay deleted the nested-cause-rework branch May 13, 2024 06:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

cottsay commented May 8, 2024 •

edited

Loading

Submitter checklist

Maintainer checklist

welcome bot commented May 8, 2024

timja left a comment

jglick left a comment

jglick May 9, 2024

timja May 9, 2024

cottsay May 9, 2024 •

edited

Loading

timja commented May 9, 2024

welcome bot commented May 10, 2024

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

[JENKINS-21605] Further reduce number of recorded upstream causes #9248

Conversation

cottsay commented May 8, 2024 • edited Loading

Testing done

Proposed changelog entries

Proposed upgrade guidelines

Submitter checklist

Desired reviewers

Maintainer checklist

welcome bot commented May 8, 2024

timja left a comment

Choose a reason for hiding this comment

jglick left a comment

Choose a reason for hiding this comment

jglick May 9, 2024

Choose a reason for hiding this comment

timja May 9, 2024

Choose a reason for hiding this comment

cottsay May 9, 2024 • edited Loading

Choose a reason for hiding this comment

timja commented May 9, 2024

welcome bot commented May 10, 2024

cottsay commented May 8, 2024 •

edited

Loading

cottsay May 9, 2024 •

edited

Loading