Extending `AnomalousStatus` to also kill `sh` steps #405
Conversation
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java (outdated):

Ineffective.
```java
// Also abort any shell steps running on the same node(s):
if (!affectedNodes.isEmpty()) {
    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        if (affectedNodes.contains(exec.node)) {
```
Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check `exec.state.cookie` or something more precise than just the node name instead?
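To illustrate the concern, here is a minimal standalone sketch (hypothetical classes, not the plugin's real ones: `Execution` and its `cookie` field are simplified stand-ins) contrasting a filter keyed on the node name with one keyed on a per-`node`-block cookie. On a node with two executors, the node-name filter would abort both running steps, while the cookie filter aborts only the one belonging to the anomalous block.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CookieFilterSketch {
    // Simplified stand-in for a running durable-task step execution.
    static final class Execution {
        final String node;    // agent name
        final String cookie;  // assumed to be unique per `node` block
        Execution(String node, String cookie) { this.node = node; this.cookie = cookie; }
    }

    // Abort only executions whose cookie matches the anomalous node block,
    // rather than every execution on the same agent.
    static List<Execution> toAbort(List<Execution> all, String anomalousCookie) {
        return all.stream()
                  .filter(e -> e.cookie.equals(anomalousCookie))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Execution> running = List.of(
            new Execution("agent1", "cookie-A"),
            new Execution("agent1", "cookie-B")); // second executor, same node
        // Filtering by node name ("agent1") would match both executions;
        // filtering by cookie matches exactly one.
        System.out.println(toAbort(running, "cookie-A").size()); // prints 1
    }
}
```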
In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional in one executor and `node` block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent `node` blocks in the same build on the same agent.)
Hi @jglick, this merge is causing significant issues. When there is a brief disconnection between the controller and the agent, I now encounter the error "also cancelling shell steps". As a workaround, I had to revert to version 371.vb_7cec8f3b_95e, which works fine, but I am currently blocked on this version because of this problem. Could you please provide guidance or suggest a solution?
@Spanish93 I suspect something else is wrong in your system. A temporary disconnection of an agent should not trigger this logic (line 308 in 4043ebf; line 334 in 4043ebf) … `node` block (since corrected in #408), or some sort of unclassified corruption of build metadata, then the build would still have been aborted (lines 322 to 326 in 4043ebf).
Hi @jglick, after extensive testing, I can confirm that the update is causing issues. When I update the plugin, the job consistently fails. Reverting to the previous version resolves the problem, and the job completes successfully every time.
This fix is also breaking us with the same error. We have a multi-step pipeline setup where the steps are executed on separate nodes, and when one node finishes, the other node, where the other step is still in progress, gets terminated.
@Spanish93 exactly how long is …? This plugin already included code to terminate … And @Sunny-Anand, similarly, there is no reason that I can see for the change in this PR to cause such a bug unless there was already a problem in your environment, since it is only … Again, if someone has steps to reproduce a problem from scratch, I will do my best to investigate. Enabling …
Hi @jglick, less than 1 second.
How I tried to reproduce:

```groovy
parallel a: {
    node('agent1') {
        sh 'set +x; for x in `seq 00 99`; do echo a$x; sleep 6; done'
    }
}, b: {
    node('agent2') {
        sh 'set +x; for x in `seq 00 99`; do echo b$x; sleep 6; done'
    }
}
```
Not sure if @Spanish93 & @Sunny-Anand are discussing some fundamentally different scenario. The whole …
In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata which prevents the `node` block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing `program.dat`, sometimes a corrupted log file, sometimes a missing `FlowNode`, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow, because `program.dat` is so opaque, the `node` block seems to have received an `Outcome.abnormal` with the expected `FlowInterruptedException` from `ExecutorStepDynamicContext.resume`; there are also some suppressed exceptions, and `AnomalousStatus` just adds to this list, without causing the build to proceed. Calling `CpsStepContext.scheduleNextRun()` from the script console does not help either. However, in most of these cases it does seem to work to abort the `sh` step running inside: somehow that “wakes up” the program, which then fails the `node` block in the expected way, letting the `retry` step kick in and ultimately letting the build run to completion.
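The "wake up" behavior described above can be modeled with a toy sketch (assumptions only, none of the plugin's real classes appear here): the outer `node` block is represented as a future that stays pending while the inner `sh` step is stuck; completing the inner step exceptionally is what finally lets the outer block's failure handling run, analogous to how aborting the `sh` step lets the `retry` step take over.

```java
import java.util.concurrent.CompletableFuture;

public class WakeUpSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the stuck inner sh step: a future nobody completes.
        CompletableFuture<Void> shStep = new CompletableFuture<>();

        // Stand-in for the outer node block: it cannot make progress
        // (succeed or fail) until the inner step finishes one way or another.
        CompletableFuture<String> nodeBlock = shStep.handle((ok, err) ->
            err != null ? "node block failed, retry can kick in"
                        : "node block completed");

        System.out.println(nodeBlock.isDone()); // prints false: still stuck

        // Aborting the sh step (modeled as an exceptional completion)
        // unblocks the outer block, which now fails in the expected way.
        shStep.completeExceptionally(new InterruptedException("aborted sh step"));
        System.out.println(nodeBlock.get()); // prints node block failed, retry can kick in
    }
}
```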