[JENKINS-21605] Further reduce number of recorded upstream causes #9248
Conversation
For deeplyNestedCauses, quickly triggering jobs with no quiet period runs the risk that a job is triggered while it is still in the queue. Triggering the jobs and waiting for them to execute synchronously avoids this and also simplifies the code. The broadlyNestedCauses test suffers from the same problem, but is additionally not structured in the way the original issue describes to reproduce it. Using the quiet period argument to scheduleBuild2, we can reliably produce a graph where each build is triggered by the other two jobs. An additional final job outside of the loop cuts off the 1 second quiet period from the last loop's iteration, so the test never waits out the quiet period of any job even though half of the schedule operations specify one.
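For readers unfamiliar with the test harness, here is a minimal sketch of the synchronous pattern described above, assuming a JenkinsRule field j; the job names, iteration count, and method name are illustrative and not the actual CauseTest code.

@Test
public void deeplyNestedCausesSketch() throws Exception {
    FreeStyleProject a = j.createFreeStyleProject("a");
    FreeStyleProject b = j.createFreeStyleProject("b");
    FreeStyleBuild last = j.buildAndAssertSuccess(a);
    for (int i = 1; i <= 10; i++) {
        FreeStyleProject next = (i % 2 == 0) ? a : b;
        // Schedule with no quiet period and wait for completion before the next
        // trigger, so a job can never be re-triggered while it is still queued.
        last = j.assertBuildStatusSuccess(next.scheduleBuild2(0, new Cause.UpstreamCause(last)));
    }
}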
Yay, your first pull request towards Jenkins core was created successfully! Thank you so much!
Looks reasonable, untested.
Thanks for the pull request.
Did not spend the time to dig into the revised algorithm but it sounds reasonable from the description.
I can't make sense of the way the code is written
If it was me, it was twelve years ago, so your guess is as good as mine!
It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.
Simpler to just tune the maximum retained size low enough that it never matters.
@@ -74,9 +74,6 @@ public class CauseTest {
assertTrue("keeps full history:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
buildXml = new XmlFile(Run.XSTREAM, new File(last.getRootDir(), "build.xml")).asString();
assertFalse("too big:\n" + buildXml, buildXml.contains("<upstreamBuild>1</upstreamBuild>"));
for (QueueTaskFuture<FreeStyleBuild> future : futures) {
    j.assertBuildStatusSuccess(j.waitForCompletion(future.waitForStart()));
By the way we do require that all scheduled or started builds are completed (or canceled before they start) before the end of the test; otherwise there can be a mess cleaning up after the test’s temp dir. I think you are doing that here.
Line 105 handles all of the builds in this test, as well as the extra ones added here, by the looks of it.
For deeplyNestedCauses, every scheduled build is now passed to j.waitForCompletion immediately, so this snippet is no longer necessary.
For broadlyNestedCauses, as @timja mentioned, line 105 will clean up any lingering scheduled builds; however, every single build was either:
1. scheduled with a quiet period of 1 and recorded in futures, or
2. scheduled with a quiet period of 0 and immediately awaited for completion.
Additionally, each case of (1) is followed by an instance of (2) so the aforementioned cleanup on line 105 should never actually do anything. Though some builds were scheduled without immediately awaiting, there is always a subsequent scheduling of that same job that does immediately await.
I left the cleanup only as a backstop.
/label ready-for-merge
This PR is now ready for merge; after ~24 hours, we will merge it if there's no negative feedback. Thanks!
Congratulations on getting your very first Jenkins core pull request merged 🎉🥳
See JENKINS-21605.
There are two commits in this pull request:
Fix non-determinism in nested cause tests
For deeplyNestedCauses, quickly triggering jobs with no quiet period runs the risk that a job is triggered while it is still in the queue. Triggering the jobs and waiting for them to execute synchronously avoids this and also simplifies the code.
The broadlyNestedCauses test suffers from the same problem, but is additionally not structured in the way the original issue describes to reproduce it. Using the quiet period argument to scheduleBuild2, we can reliably produce a graph where each build is triggered by the other two jobs. An additional final build outside of the loop cuts off the 1 second quiet period from the last loop's iteration, so the test never waits out the quiet period of any job even though half of the schedule operations specify one.
Re-work nested cause trimming algorithm
This one is a little less straightforward. The algorithm was originally implemented to trim only based on the depth of the causal tree, and was later changed in an attempt to mitigate circumstances where the breadth of the causal tree was causing the same problem (excessive build.xml size). This latter change doesn't seem to be working properly, and I can't make sense of the way the code is written. Though it does sometimes reduce the number of recorded upstream causes, it doesn't do so predictably and is inadequate for our real-world application.
The revised algorithm has the following goals:
- A DeeplyNestedUpstreamCause is inserted anywhere the depth and breadth limits are exceeded and information is culled, so that it is clear where information was lost.
Both the original implementation and the revised implementation presented here suffer from one known limitation: each direct cause of a build is trimmed independently of the others. This means that the depth and cumulative limits in the trimming, as well as the de-duplication, are applied to each direct cause with no knowledge of the others. While each sub-tree is therefore bounded to some maximum recorded size regardless of the de facto causal tree, there is still no bound on the number of recorded direct causes (and their trimmed sub-trees). Addressing that issue is left to a subsequent improvement, since it will likely involve API changes, but ideally we should at the very least stop recording sub-trees when some limit is exceeded.
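To make the culling idea concrete, here is a purely illustrative sketch of depth-plus-cumulative-budget trimming with an explicit marker node; the class, field names, and limits below are hypothetical and do not reproduce the actual UpstreamCause trimming code in this change.

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in type; not the Jenkins Cause/UpstreamCause classes.
final class CauseNode {
    final String build;
    final List<CauseNode> upstream;
    CauseNode(String build, List<CauseNode> upstream) {
        this.build = build;
        this.upstream = upstream;
    }
}

final class CauseTreeTrimmer {
    static final int MAX_DEPTH = 10;   // hypothetical depth limit
    static final int MAX_NODES = 25;   // hypothetical cumulative limit per direct cause

    // Marker playing the role of DeeplyNestedUpstreamCause: it records exactly
    // where information was culled rather than silently dropping it.
    static final CauseNode TRIMMED = new CauseNode("(trimmed)", List.of());

    // Each direct cause of a build is trimmed independently, so each call gets
    // its own budget; this mirrors the known limitation described above.
    static CauseNode trimDirectCause(CauseNode root) {
        return trim(root, 0, new int[] {MAX_NODES});
    }

    private static CauseNode trim(CauseNode node, int depth, int[] budget) {
        if (depth >= MAX_DEPTH || budget[0] <= 0) {
            return TRIMMED;
        }
        budget[0]--;
        List<CauseNode> kept = new ArrayList<>();
        for (CauseNode up : node.upstream) {
            kept.add(trim(up, depth + 1, budget));
        }
        return new CauseNode(node.build, kept);
    }
}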
I'd like to mention one additional tidbit I uncovered while working through this issue. The size on disk of the build.xml is not the only factor related to the recording of upstream causes that can cause Jenkins to perform poorly. Before any of the upstream cause trimming was implemented, duplicated sub-trees were handled in the XML representation using references. In a scenario where there are a lot of duplicated sub-trees in the overall causal tree, there may very well be an extremely compact XML representation on the order of ~300 KiB, yet Jenkins may spend several minutes (!!!) serializing the data structures for recording in the build.xml. When the trimming was implemented, it eliminated the possibility that there could ever be "references" in the XML, because even upstream cause trees which required no trimming are re-allocated during the trimming process, breaking the relationship between the duplicated sub-trees as far as the XML serializer is concerned. It might be possible for us to restore the XML referencing at some point to further reduce the size of the build.xml, but only after the performance trade-offs are better understood.
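As an aside, the reference behaviour is easy to demonstrate with plain XStream (this is a stand-alone illustration, not the Jenkins Run.XSTREAM configuration): references are only emitted when the very same object instance appears more than once in the graph, which is exactly the relationship that re-allocating sub-trees during trimming destroys.

import com.thoughtworks.xstream.XStream;
import java.util.Arrays;
import java.util.List;

public class ReferenceDemo {
    static class Tree {
        List<Object> children;
        Tree(List<Object> children) { this.children = children; }
    }

    public static void main(String[] args) {
        XStream xstream = new XStream();
        Tree shared = new Tree(List.of("leaf"));
        // Same instance twice: the second occurrence serializes as a compact reference.
        System.out.println(xstream.toXML(new Tree(Arrays.asList(shared, shared))));
        // Equal but re-allocated copy: written out in full, no reference emitted,
        // which is what happens once trimming rebuilds every sub-tree.
        Tree copy = new Tree(List.of("leaf"));
        System.out.println(xstream.toXML(new Tree(Arrays.asList(shared, copy))));
    }
}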
Testing done
This change was deployed to the Jenkins instance in our staging environment and subjected to a worst-case scenario involving a causal tree of roughly ~2000 jobs with a single job on each end (one trigger job, one finalization job). In this real-world workload, the size on disk of the largest build.xml files was reduced by roughly 90%. The finalization job's build.xml in particular went from 80 MiB to 9.4 MiB. Of the intermediate jobs, all 92 of the jobs which previously had a build.xml over 1 MiB were reduced to less than 0.6 MiB.
The 9.4 MiB "finalization" build was triggered by ~1000 of the intermediate jobs and therefore has an immensely wide causal tree. The improved trimming of each sub-tree accounts for the huge reduction in build.xml size from the previous 80 MiB, but the lack of an overall trimming approach at the direct-cause level results in the (still unbearable) 9.4 MiB build.xml. Again, addressing this issue is left to future work.
Proposed changelog entries
Proposed upgrade guidelines
N/A
Submitter checklist
Desired reviewers
Before the changes are marked as ready-for-merge:
Maintainer checklist