-
-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-73835] Do not allow builds to be deleted while they are still running and ensure build discarders run after builds are fully complete #9810
Conversation
…mplete and always call Job.logRotate after build finalization
|
||
try { | ||
getParent().logRotate(); | ||
} catch (Exception e) { | ||
LOGGER.log(Level.SEVERE, "Failed to rotate log", e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this into GlobalBuildDiscarderListener
so that isLogUpdated() == true
when log rotation runs. It still runs synchronously, it just runs a little bit later now down in onEndBuilding
. This allows us to check isLogUpdated()
elsewhere to avoid race conditions with Pipeline builds.
@@ -250,7 +250,7 @@ private boolean shouldKeepRun(Run r, Run lsb, Run lstb) { | |||
LOGGER.log(FINER, "{0} is not to be removed or purged of artifacts because it’s the last stable build", r); | |||
return true; | |||
} | |||
if (r.isBuilding()) { | |||
if (r.isLogUpdated()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to avoid race conditions involving Pipeline builds. Previously it was possible for log rotation to delete builds which had not yet fully completed, which meant that a build.xml
file could be written back out into the build directory after log rotation had deleted the build. I will try to demonstrate this downstream in workflow-job
once I have an incremental build here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you confident that this is safe for freestyle builds? I.e., that !logUpdated → !building
or conversely that building → logUpdated
? The lifecycles for AbstractBuild
and WorkflowRun
are pretty different and the Javadoc has always been vague.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure it is safe, but if you prefer we can switch to isBuilding() || isLogUpdated()
like the check for the UI in Jelly here (or maybe we should simplify that to only check isLogUpdated
as well). See the non-Pipeline implementations here and the state enum here.
For !logUpdated → !building
,!isLogUpdated
is only true in State.COMPLETED
, which ensures !isBuilding
since COMPLETED
is after POST_PRODUCTION
.
The converse building → logUpdated
should be ok as well, since if isBuilding
then we must be in State.NOT_STARTED
or State.BUILIDING
, both of which are not COMPLETED
, so isLogUpdated
will be true.
For Pipeline things are simpler since WorkflowRun.isLogUpdated
delegates to WorkflowRun.isBuilding
.
Also just as a data point, if isLogUpdated
could not be used on its own for non-pipelines, I think we would have seen it cause issues in tests due to the usage in JenkinsRule.waitForCompletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And as far as I know the states for a Run
can only move forward, although looking at the code I guess a subclass that calls some of the protected methods incorrectly could cause random state changes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I know the states for a
Run
can only move forward
That looks right.
like the check for the UI in Jelly
Not sure what you mean.
<j:if test="${!it.building and !it.keepLog}"> |
building
and ought to be switched to only check logUpdated
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 I misread keepLog
as logUpdated
. I'll update that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed that in f676be5.
} catch (Exception e) { | ||
LOGGER.log(Level.WARNING, e, () -> "Failed to rotate log for " + run); | ||
} | ||
// Avoid calling Job.logRotate twice in case JobGlobalBuildDiscarderStrategy is configured globally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing, but essentially the old behavior was that Run.run
called Job.logRotate
while the build was in POST_PRODUCTION
state, but also ever since #4368 if the configuration of GlobalBuildDiscarderConfiguration
included JobGlobalBuildDiscarderStrategy
(which it does by default), we also called Job.logRotate
here while the build was in COMPLETED
state. For backwards compatibility I think we need to ensure it gets called at least once regardless of the global configuration, and it's preferable to call it here in onFinalized
so that we can check Run.isLogUpdated
in LogRotator.shouldKeepRun
and Run.delete
to avoid race conditions with Pipeline builds. We avoid the redundant call by filtering out the relevant strategy when processing the globally-configured discarders.
Also, the old behavior is why I am not concerned about removing this call in workflow-job
and also why I am not worried about making the above call asynchronous for the sake of Pipeline builds. BackgroundGlobalBuildDiscarder.processJob
has been (redundantly) calling logRotate
synchronously for years now in default configurations, so it seems that the asynchronous behavior in jenkinsci/workflow-job-plugin@63fdbe8 is no longer critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(jenkinsci/workflow-job-plugin#70 for better linking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, after some digging into the behavior of globalBuildDiscarder I ended here.
As a Jenkins admin, I expected the specific (or simple) global discarder to take precedence over any project build discarder, especially if I remove the global project build discarder. This is also what I implied from reading the descriptions here and here (even though it's not explicitly mentioned there and does not contradict the actual behavior). I also found multiple occurrences on SO and forums mentioning that global specific job discarder will take precedence, which is not the case. This is now obvious with the change line 51 here but was already the behavior before as I tested with 2.462.2.
What makes it even more confusing, is the way the discarders are merged and applied. The most aggressive policy takes over. This is confirmed to be an expected behavior from the tests. However, from an admin perspective I'd see the least aggressive discarder to be a safer approach (e.g. for compliance reasons).
I'm curious however, to get your input on the topic. Also let me know if I should raise this somewhere else.
For context, this investigation started as a result of the security eng. telling me that he can erase his traces by overriding build discarder in a job. Example: normal run -> evil run (set buildDiscarder to 1) -> normal run. Then we can't see what happened in the evil run anymore. We also have some compliance policy that require us to keep production pipelines history up to a year which we would like to enforce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current behavior is as designed; build discarders are mainly meant to trim disk space, so it would be normal to have a blanket policy that builds over a month old are not kept, while a particular job that runs every five minutes for some automation is configured to discard all but the last build since history is trash. If your interest is in auditing, you would better use one of the many plugins which stream events or even whole build logs from Jenkins to external systems, which could be configured to only accept “create/append”-type operations and reject attempts to delete anything even if the Jenkins controller were to be compromised somehow. After all, even without any discarders, your white-hat engineer could simply delete the job after one build, along with all of its builds. You could also make the entire controller view-only to regular developers, using various “as code” systems.
@@ -96,6 +101,17 @@ public void successVsFailureWithRemoveLastBuild() throws Exception { | |||
assertEquals(2, numberOf(project.getLastFailedBuild())); | |||
} | |||
|
|||
@Test | |||
public void ableToDeleteCurrentBuild() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this was to check that the switch to isLogUpdated
in LogRotator.shouldKeepRun
did not break anything, since previously the call in Run.run
would not have been able to delete the build with the switch. Because of the redundant call via GlobalBuildDiscarderListener
though it worked anyway, the only difference was a log warning from the deleted code in Run.run
, which we can't assert against any more. So maybe this test is no longer checking anything useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, in principle, someone could want to keep zero builds if they were using publishers (GitHub Checks, Slack, etc.) to track failures?
…essEvenTheBuildIsRunning to match new behavior
@@ -139,24 +136,18 @@ public class DeleteBuildsCommandTest { | |||
assertThat(result.stdout(), containsString("Deleted 0 builds")); | |||
} | |||
|
|||
@Test public void deleteBuildsShouldSuccessEvenTheBuildIsRunning() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not notice this test before. It was added in #2310 along with all of the other tests for delete-builds
.
I still think the previous behavior was undesirable - deleting builds that are still running and reachable in memory does not make sense, can lead to tons of errors in the Jenkins logs, and any builds deleted while running are very likely to come back after a Jenkins restart given that build completion will cause build.xml
to be rewritten.
I can see the argument though that Run.delete
should work like Job.delete
and cancel the build if it is running, waiting up to 15 seconds and only throwing an exception if the build still hasn't completed at that point. @jglick suggested this approach as well, but I was hoping to avoid adding complexity here unless we really think that behavior is necessary. Any opinions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion. I think the proposed error behavior is intuitive enough. You can explicitly abort the build first if that is what you intended.
/label ready-for-merge This PR is now ready for merge, after ~24 hours, we will merge it if there's no negative feedback. Thanks! |
|
I'll take a look at |
jenkinsci/lockable-resources-plugin#716 deletes the test. |
…l running and ensure build discarders run after builds are fully complete (jenkinsci#9810) * [JENKINS-73835] Do not allow builds to be deleted while they are still running * [JENKINS-73835] Avoid redundant calls to Job.logRotate when builds complete and always call Job.logRotate after build finalization * [JENKINS-73835] Add issue reference to RunTest.buildsMayNotBeDeletedWhileRunning * [JENKINS-73835] Adjust DeleteBuildsCommandTest.deleteBuildsShouldSuccessEvenTheBuildIsRunning to match new behavior * [JENKINS-73835] Run/delete.jelly should check Run.isLogUpdated, not Run.isBuilding
…l running and ensure build discarders run after builds are fully complete (jenkinsci#9810) * [JENKINS-73835] Do not allow builds to be deleted while they are still running * [JENKINS-73835] Avoid redundant calls to Job.logRotate when builds complete and always call Job.logRotate after build finalization * [JENKINS-73835] Add issue reference to RunTest.buildsMayNotBeDeletedWhileRunning * [JENKINS-73835] Adjust DeleteBuildsCommandTest.deleteBuildsShouldSuccessEvenTheBuildIsRunning to match new behavior * [JENKINS-73835] Run/delete.jelly should check Run.isLogUpdated, not Run.isBuilding (cherry picked from commit d34b17e)
Ever since this PR was merged,
|
Yes, looks like the test needs to forcibly stop the build. I will file a PR to fix it in a bit. |
I filed #9876 to stabilize that test. |
logRotator.setRemoveLastBuild(true); | ||
p.setBuildDiscarder(logRotator); | ||
j.buildAndAssertStatus(Result.SUCCESS, p); | ||
assertNull(p.getBuildByNumber(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure in https://github.com/jenkinsci/jenkins/pull/9921/checks?check_run_id=32166988156 (#9921) looks like a flake? Missing await
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what we would await
for, because log rotation runs synchronously during build completion. Unless this is a test-specific timing issue due to JenkinsRule.buildAndAssertSuccess
just waiting for QueueTaskFuture.get
instead of !Run.isLogUpdated
, then this test probably just needs to be skipped on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From some testing, j.buildAndAssertStatus(Result.SUCCESS, p)
is guaranteed to block until GlobalBuildDiscarderListener.onFinalized
has completed. Perhaps the BuildWatcher
background thread tried to copy the logs at an inopportune time and caused log rotation to fail. Seems best to delete the BuildWatcher
here in case it made any of the other tests flaky on Windows and then probably skip this test explicitly on Windows as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the AccessDeniedException
seems strange. Either way, I filed #9923.
See JENKINS-73835 and jenkinsci/workflow-job-plugin#470. I noticed this while investigating #9790, but it is a distinct issue.
This PR makes
Run.delete
throw an exception if it is called on a build which has not yet completed. It also adjusts some logic related toLogRotator
, which is the main programmatic caller ofRun.delete
to avoid some race conditions and related issues.Testing done
See new automated PRs.
Proposed changelog entries
Proposed upgrade guidelines
N/A
Submitter checklist
Desired reviewers
Before the changes are marked as
ready-for-merge
:Maintainer checklist