Extending AnomalousStatus to also kill sh steps #405

Merged: 4 commits into jenkinsci:master from the AnomalousStatus branch on Nov 7, 2024

Conversation

@jglick (Member) commented Nov 5, 2024

In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS Replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata, which prevents node block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing program.dat, sometimes a corrupted log file, sometimes a missing FlowNode, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow because program.dat is so opaque, the node block seems to have received an Outcome.abnormal with the expected FlowInterruptedException from ExecutorStepDynamicContext.resume; there are also some suppressed exceptions, and AnomalousStatus just adds to this list, without causing the build to proceed. Calling CpsStepContext.scheduleNextRun() from the script console does not help either. However in most of these cases it does seem to work to abort the sh step running inside: somehow that “wakes up” the program, which then fails the node block in the expected way, letting the retry step kick in and ultimately letting the build run to completion.
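
For reference, the manual workaround looks roughly like the following when run from the script console. This is a minimal sketch, not the exact code used in these incidents: the agent name 'some-agent' is a placeholder, and the interruption cause passed to onFailure is only illustrative.

// Script-console sketch (Groovy): interrupt any running sh steps on a given agent,
// mimicking the manual "abort the sh step" workaround described above.
import hudson.model.Result
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
import org.jenkinsci.plugins.workflow.steps.StepExecution
import org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

StepExecution.applyAll(DurableTaskStep.Execution) { exec ->
    if (exec.node == 'some-agent') { // placeholder agent name
        // Aborting the sh step "wakes up" the program, which then fails the node block
        // and lets the surrounding retry step take over.
        exec.context.onFailure(new FlowInterruptedException(Result.ABORTED, false))
    }
    return null
}.get()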

@jglick (Member, Author) commented Nov 6, 2024

Ineffective.

@jglick jglick closed this Nov 6, 2024
@jglick jglick deleted the AnomalousStatus branch November 6, 2024 17:51
@jglick jglick restored the AnomalousStatus branch November 6, 2024 18:46
@jglick jglick reopened this Nov 6, 2024
@jglick jglick requested a review from dwnusbaum November 6, 2024 23:51
@jglick jglick marked this pull request as ready for review November 6, 2024 23:51
@jglick jglick requested a review from a team as a code owner November 6, 2024 23:51
// Also abort any shell steps running on the same node(s):
if (!affectedNodes.isEmpty()) {
    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        if (affectedNodes.contains(exec.node)) {
A reviewer (Member) commented on this diff:

Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check exec.state.cookie or something more precise than just the node name instead?

@jglick (Member, Author) replied:

In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional on one executor and node block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent node blocks in the same build on the same agent.)

@jglick jglick merged commit 6a3e903 into jenkinsci:master Nov 7, 2024
17 checks passed
@jglick jglick deleted the AnomalousStatus branch November 7, 2024 16:16
@Spanish93 commented:

Hi @jglick,

This merge is causing significant issues. When there is a brief disconnection between the master and the slave, I now encounter the error: "also cancelling shell steps."

As a workaround, I had to revert to version 371.vb_7cec8f3b_95e, which works fine. However, I'm currently blocked on this version due to this problem.

Could you please provide guidance or suggest a solution to resolve this issue?

@jglick (Member, Author) commented Dec 2, 2024

@Spanish93 I suspect something else is wrong in your system. A temporary disconnection of an agent should not trigger this logic, and should be logged if you have the corresponding logger enabled at a fine level. Prior to this PR, if this periodic task detected an issue—typically caused either by a queue item being lost due to a crash before any agent even began running a node block (since corrected in #408), or some sort of unclassified corruption of build metadata—then the build would still have been aborted:
    ctx.get(TaskListener.class).error("node block still appears to be neither running nor scheduled; cancelling");
} catch (IOException | InterruptedException x) {
    LOGGER.log(Level.WARNING, null, x);
}
ctx.onFailure(new FlowInterruptedException(Result.ABORTED, false, new QueueTaskCancelled()));

Perhaps your problem actually already existed, but was less frequently observed due to the longer recurrence period between checks? Do you have any means of reliably reproducing your problem in a self-contained environment?

@Spanish93 commented:

Hi @jglick,

After extensive testing, I can confirm that the update is causing issues. When I update the plugin, the job consistently fails. However, reverting to the previous version resolves the problem, and the job completes successfully every time.

@Sunny-Anand commented:

> When there is a brief disconnection between the master and the slave, I now encounter the error: "also cancelling shell steps."
>
> As a workaround, I had to revert to version 371.vb_7cec8f3b_95e, which works fine. However, I'm currently blocked on this version due to this problem.
>
> Could you please provide guidance or suggest a solution to resolve this issue?

This fix is also breaking us with the same error. We have a multi-step pipeline setup where the steps are executed on separate nodes, and when one node finishes, the other node, where the other step is still in progress, gets terminated.
The solution is to revert to the previous version, 1371.vb_7cec8f3b_95e.

@jglick (Member, Author) commented Dec 9, 2024

@Spanish93 exactly how long is

> a brief disconnection

? This plugin already included code to terminate sh steps after (by default) 5m of being unable to contact the agent.

And @Sunny-Anand

> the steps are executed on separate nodes and when one node finishes, the other node where the other step is in progress gets terminated

similarly, there is no reason that I can see for the change in this PR to cause such a bug unless there was already a problem in your environment, since it is only

  • reducing the amount of time between successive runs of an existing check
  • making a failed check (which already printed an error to the build log and sent an interrupt to the node step, normally aborting the build) also send an interrupt to the corresponding sh step

Again, if someone has steps to reproduce a problem from scratch I will do my best to investigate. Enabling FINE logging may also offer some clues.
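
As a concrete starting point, a FINE log recorder can be created via Manage Jenkins » System Log; the same idea can be sketched from the script console as below. The package names are assumptions about where the relevant classes live; adjust them if your plugin versions differ.

// Script-console sketch (Groovy): raise the JUL level for the packages involved in this check.
// Package names are assumptions; verify them against the installed plugin sources.
import java.util.logging.Level
import java.util.logging.Logger

['org.jenkinsci.plugins.workflow.support.steps',     // ExecutorStepExecution and its AnomalousStatus monitor
 'org.jenkinsci.plugins.workflow.steps.durable_task' // DurableTaskStep, i.e. the sh step
].each { name ->
    Logger.getLogger(name).level = Level.FINE
}
// A log recorder (Manage Jenkins » System Log) is still needed to actually capture and display FINE records.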

@Spanish93 commented Dec 9, 2024

Hi @jglick,

Less than 1 second. Since I reverted to the version before this commit, there have been no more problems.

@jglick (Member, Author) commented Dec 9, 2024

How I tried to reproduce:

  1. made a temp dir /tmp/jdh with a+w
  2. docker run --rm --name jenkins -p 8080:8080 -v /tmp/jdh:/var/jenkins_home jenkins/jenkins:lts
  3. logged in with initial admin password, selected no plugins
  4. installed workflow-durable-task-step, workflow-basic-steps, workflow-job, workflow-cps
  5. created a job
parallel a: {
    node('agent1') {
        sh 'set +x; for x in `seq 00 99`; do echo a$x; sleep 6; done'
    }
}, b: {
    node('agent2') {
        sh 'set +x; for x in `seq 00 99`; do echo b$x; sleep 6; done'
    }
}
  6. created an agent1 with remote FS /home/jenkins/agent using inbound launch method; created agent2 as a clone of it
  7. launched docker run --rm --init --link jenkins jenkins/inbound-agent -url http://jenkins:8080/ -name agent1 -webSocket -secret … and similarly for agent2
  8. triggered a build, watched console until it got up to about a8 / b8
  9. docker network disconnect bridge <name-of-agent1-container>
  10. waited until a few b… messages were printed without corresponding a… messages; also saw Cannot contact agent1: java.lang.InterruptedException
  11. docker network connect bridge <name-of-agent1-container>
  12. batched-up a… messages printed as expected, and both branches resumed printing
  13. waited a few more minutes, and build completed normally

Not sure if @Spanish93 & @Sunny-Anand are discussing some fundamentally different scenario. The whole AnomalousStatus extension can of course be disabled, but I am guessing something is fundamentally wrong in these controllers and that would just be masking the issue.
