Extending AnomalousStatus to also kill sh steps #405

Merged: 4 commits into jenkinsci:master from the AnomalousStatus branch on Nov 7, 2024

Conversation

@jglick (Member) commented Nov 5, 2024

In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS Replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata, which prevents node block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing program.dat, sometimes a corrupted log file, sometimes a missing FlowNode, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow because program.dat is so opaque, the node block seems to have received an Outcome.abnormal with the expected FlowInterruptedException from ExecutorStepDynamicContext.resume; there are also some suppressed exceptions, and AnomalousStatus just adds to this list, without causing the build to proceed. Calling CpsStepContext.scheduleNextRun() from the script console does not help either. However in most of these cases it does seem to work to abort the sh step running inside: somehow that “wakes up” the program, which then fails the node block in the expected way, letting the retry step kick in and ultimately letting the build run to completion.
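
For reference, the manual workaround looks roughly like the following when run from the script console. This is a minimal sketch, not the exact code used in these incidents: the agent name 'some-agent' is a placeholder, and the interruption cause passed to onFailure is only illustrative.

// Script-console sketch (Groovy): interrupt any running sh steps on a given agent,
// mimicking the manual "abort the sh step" workaround described above.
import hudson.model.Result
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
import org.jenkinsci.plugins.workflow.steps.StepExecution
import org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

StepExecution.applyAll(DurableTaskStep.Execution) { exec ->
    if (exec.node == 'some-agent') { // placeholder agent name
        // Aborting the sh step "wakes up" the program, which then fails the node block
        // and lets the surrounding retry step take over.
        exec.context.onFailure(new FlowInterruptedException(Result.ABORTED, false))
    }
    return null
}.get()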

@jglick (Member, Author) commented Nov 6, 2024

Ineffective.

@jglick jglick closed this Nov 6, 2024
@jglick jglick deleted the AnomalousStatus branch November 6, 2024 17:51
@jglick jglick restored the AnomalousStatus branch November 6, 2024 18:46
@jglick jglick reopened this Nov 6, 2024
@jglick jglick requested a review from dwnusbaum November 6, 2024 23:51
@jglick jglick marked this pull request as ready for review November 6, 2024 23:51
@jglick jglick requested a review from a team as a code owner November 6, 2024 23:51
// Also abort any shell steps running on the same node(s):
if (!affectedNodes.isEmpty()) {
    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        if (affectedNodes.contains(exec.node)) {
A reviewer (Member) commented on this diff:

Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check exec.state.cookie or something more precise than just the node name instead?

@jglick (Member, Author) replied:

In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional on one executor and node block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent node blocks in the same build on the same agent.)

@jglick jglick merged commit 6a3e903 into jenkinsci:master Nov 7, 2024
17 checks passed
@jglick jglick deleted the AnomalousStatus branch November 7, 2024 16:16
@Spanish93 commented:

Hi @jglick,

This merge is causing significant issues. When there is a brief disconnection between the master and the slave, I now encounter the error: "also cancelling shell steps."

As a workaround, I had to revert to version 371.vb_7cec8f3b_95e, which works fine. However, I'm currently blocked on this version due to this problem.

Could you please provide guidance or suggest a solution to resolve this issue?

@jglick (Member, Author) commented Dec 2, 2024

@Spanish93 I suspect something else is wrong in your system. A temporary disconnection of an agent should not trigger this logic, and should be logged if you have the corresponding logger enabled at a fine level. Prior to this PR, if this periodic task detected an issue—typically caused either by a queue item being lost due to a crash before any agent even began running a node block (since corrected in #408), or some sort of unclassified corruption of build metadata—then the build would still have been aborted:
    ctx.get(TaskListener.class).error("node block still appears to be neither running nor scheduled; cancelling");
} catch (IOException | InterruptedException x) {
    LOGGER.log(Level.WARNING, null, x);
}
ctx.onFailure(new FlowInterruptedException(Result.ABORTED, false, new QueueTaskCancelled()));

Perhaps your problem actually already existed, but was less frequently observed due to the longer recurrence period between checks? Do you have any means of reliably reproducing your problem in a self-contained environment?

@Spanish93 commented:

Hi @jglick,

After extensive testing, I can confirm that the update is causing issues. When I update the plugin, the job consistently fails. However, reverting to the previous version resolves the problem, and the job completes successfully every time.

@Sunny-Anand commented:

> When there is a brief disconnection between the master and the slave, I now encounter the error: "also cancelling shell steps."
>
> As a workaround, I had to revert to version 371.vb_7cec8f3b_95e, which works fine. However, I'm currently blocked on this version due to this problem.
>
> Could you please provide guidance or suggest a solution to resolve this issue?

This fix is also breaking us with the same error. We have a multi-step pipeline setup where the steps are executed on separate nodes, and when one node finishes, the other node, where the other step is still in progress, gets terminated.
The solution is to revert to the previous version, 1371.vb_7cec8f3b_95e.

@jglick (Member, Author) commented Dec 9, 2024

@Spanish93 exactly how long is

> a brief disconnection

? This plugin already included code to terminate sh steps after (by default) 5m of being unable to contact the agent.

And @Sunny-Anand

> the steps are executed on separate nodes and when one node finishes, the other node where the other step is in progress gets terminated

similarly, there is no reason that I can see for the change in this PR to cause such a bug unless there was already a problem in your environment, since it is only

  • reducing the amount of time between successive runs of an existing check
  • making a failed check (which already printed an error to the build log and sent an interrupt to the node step, normally aborting the build) also send an interrupt to the corresponding sh step

Again, if someone has steps to reproduce a problem from scratch I will do my best to investigate. Enabling FINE logging may also offer some clues.
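
As a concrete starting point, a FINE log recorder can be created via Manage Jenkins » System Log; the same idea can be sketched from the script console as below. The package names are assumptions about where the relevant classes live; adjust them if your plugin versions differ.

// Script-console sketch (Groovy): raise the JUL level for the packages involved in this check.
// Package names are assumptions; verify them against the installed plugin sources.
import java.util.logging.Level
import java.util.logging.Logger

['org.jenkinsci.plugins.workflow.support.steps',     // ExecutorStepExecution and its AnomalousStatus monitor
 'org.jenkinsci.plugins.workflow.steps.durable_task' // DurableTaskStep, i.e. the sh step
].each { name ->
    Logger.getLogger(name).level = Level.FINE
}
// A log recorder (Manage Jenkins » System Log) is still needed to actually capture and display FINE records.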

@Spanish93 commented Dec 9, 2024

Hi @jglick,

Less than 1 second. Since I reverted to the version before this commit, there have been no more problems.

@jglick (Member, Author) commented Dec 9, 2024

How I tried to reproduce:

  1. made a temp dir /tmp/jdh with a+w
  2. docker run --rm --name jenkins -p 8080:8080 -v /tmp/jdh:/var/jenkins_home jenkins/jenkins:lts
  3. logged in with initial admin password, selected no plugins
  4. installed workflow-durable-task-step, workflow-basic-steps, workflow-job, workflow-cps
  5. created a job
parallel a: {
    node('agent1') {
        sh 'set +x; for x in `seq 00 99`; do echo a$x; sleep 6; done'
    }
}, b: {
    node('agent2') {
        sh 'set +x; for x in `seq 00 99`; do echo b$x; sleep 6; done'
    }
}
  6. created an agent1 with remote FS /home/jenkins/agent using inbound launch method; created agent2 as a clone of it
  7. launched docker run --rm --init --link jenkins jenkins/inbound-agent -url http://jenkins:8080/ -name agent1 -webSocket -secret … and similarly for agent2
  8. triggered a build, watched console until it got up to about a8 / b8
  9. docker network disconnect bridge <name-of-agent1-container>
  10. waited until a few b… messages were printed without corresponding a… messages; also saw Cannot contact agent1: java.lang.InterruptedException
  11. docker network connect bridge <name-of-agent1-container>
  12. batched-up a… messages printed as expected, and both branches resumed printing
  13. waited a few more minutes, and build completed normally

Not sure if @Spanish93 & @Sunny-Anand are discussing some fundamentally different scenario. The whole AnomalousStatus extension can of course be disabled, but I am guessing something is fundamentally wrong in these controllers and that would just be masking the issue.
