Stalled processes not cleared on IBM i #2937

Closed
richardlau opened this issue Apr 29, 2022 · 19 comments

Comments

@richardlau
Member

richardlau commented Apr 29, 2022

IBM i builds have been failing on test-iinthecloud-ibmi73-ppc64_be-1 since https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/ due to a dangling node process.
i.e. https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/consoleFull

10:22:35 ps awwx | grep Release/node | grep -v grep | cat
10:22:35  38123848      - A     0:25 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99) 
10:22:36 gmake[1]: *** [Makefile:532: test-ci] Error 1

This process is leftover from https://ci.nodejs.org/job/node-test-commit-ibmi/742/nodes=ibmi73-ppc64/ where parallel/test-child-process-exec-abortcontroller-promisified timed out -- the test spawns the process in https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/test/parallel/test-child-process-exec-abortcontroller-promisified.js#L15

The Node.js Makefile is supposed to be able to clear stalled/dangling out/Release/node processes in clear-stalled: https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/Makefile#L460-L466

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		echo $${PS_OUT} | xargs kill -9; \
	fi

but it looks like on IBM i this isn't killing the process:

-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
 38123848      - A     1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$

If I add some debug into the Makefile I can see that xargs gets the process ID but it looks like kill -9 isn't terminating the process?

-bash-5.1$ git diff
diff --git a/Makefile b/Makefile
index a6549a8474..5bf612a70d 100644
--- a/Makefile
+++ b/Makefile
@@ -463,6 +463,7 @@ clear-stalled:
        @PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
        if [ "$${PS_OUT}" ]; then \
                echo $${PS_OUT} | xargs kill -9; \
+               echo $${PS_OUT} | xargs echo =; \
        fi

 .PHONY: test-build
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
 38123848      - A     1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
= 38123848
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$

@ThePrez Any ideas?

@richardlau
Member Author

(I'm assuming we can manually clear the stalled process to get the CI passing but it would be good if the automation in the build scripts just worked.)

@ThePrez
Contributor

ThePrez commented Apr 29, 2022

This is very strange, indeed!
The phenomenon is easily repeatable by simply running node -e "setInterval(()=>{}, 99)" in a background job.

Strangely:

  • kill -9 from a bash shell works
  • kill -9 from sh works
  • kill -9 from xargs inside a Makefile does NOT work 👎
  • kill -KILL from a bash shell works
  • kill -KILL from sh works
  • kill -KILL from xargs inside a Makefile works

So an easy fix would be to simply change the Makefile to use -KILL instead of -9. I can't imagine that would cause any issue on other platforms.

Regardless, I'm still trying to figure out root cause. IBM i has two different types of signals: ILE and PASE (Node.js runs in PASE), and the numerical representations differ:

  • PASE SIGKILL = 9
  • ILE SIGIO = 9
  • ILE SIGKILL = 12

But a kill -12 from xargs in the Makefile also fails, so I think that's a "red herring."
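
One thing the list above suggests checking is which `kill` actually runs in each case. A shell normally handles a bare `kill` with its own builtin, whereas xargs cannot run builtins and executes whatever `kill` program is found first on PATH. The following is a minimal sketch for comparing the two; `external_kill_path` is a hypothetical helper name, not something from the Node.js Makefile.

```shell
#!/bin/sh
# Print the first executable file named `kill` found on PATH, i.e. the
# program that xargs (which cannot run shell builtins) would execute.
# external_kill_path is a hypothetical helper for illustration only.
external_kill_path() {
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    if [ -f "$dir/kill" ] && [ -x "$dir/kill" ]; then
      IFS=$old_ifs
      echo "$dir/kill"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# What the shell itself runs for a bare `kill` (typically a builtin).
type kill

# What xargs would run instead.
external_kill_path || echo "no external kill found on PATH"
```

If the two differ, a `kill -9` typed at a prompt and a `kill -9` launched by xargs are not exercising the same code.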

@ThePrez
Contributor

ThePrez commented Apr 29, 2022

Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?
oops, no you don't. Disregard!

@richardlau
Member Author

Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?

👍

@ThePrez
Contributor

ThePrez commented Apr 29, 2022

this works

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		kill -9 $${PS_OUT}; \
	fi

as does (as mentioned)

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		echo $${PS_OUT} | xargs -t kill -KILL; \
	fi

In my experimentation, it seems that xargs and -9 together are needed to reproduce the failure. This makes no sense.
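
The combinations tried above can be exercised outside the Makefile with a small harness. This is only a sketch under stated assumptions: `try_kill_via_xargs` is a hypothetical function name, and `sleep 5` stands in for the stalled `node -e "setInterval(()=>{}, 99)"` process so that a failed kill costs a five-second wait rather than leaving a process behind.

```shell
#!/bin/sh
# Hypothetical harness: start a short-lived stand-in process, pipe its PID
# to xargs followed by the given command words, and report how it ended.
# `sleep 5` is an assumption standing in for the stalled node process.
try_kill_via_xargs() {
  sleep 5 &
  pid=$!
  # xargs appends the PID as the final argument of the command in "$@".
  echo "$pid" | xargs "$@" 2>/dev/null
  # wait returns 128 + signal number if the child was killed by a signal,
  # otherwise the child's own exit status once it finishes.
  wait "$pid"
  status=$?
  if [ "$status" -gt 128 ]; then
    echo "terminated by signal $((status - 128))"
  else
    echo "not killed (exit status $status)"
  fi
}
```

For example, `try_kill_via_xargs kill -9` and `try_kill_via_xargs kill -KILL` run the two variants from the Makefile recipes through xargs so their results can be compared side by side on the affected machine.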

@kadler
Contributor

kadler commented Jul 12, 2022

We debugged this today and discovered that the root cause is a bug in GNU kill, i.e. /QOpenSys/pkgs/bin/kill. https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=900b5621e685df7ffd001fc64bc9d44b06b13900

This affects using GNU kill with pretty much any numeric value, not just kill -9. As a "workaround", you could use the correct bit pattern for signal 9 on AIX, i.e. /QOpenSys/pkgs/bin/kill -589825 pid 😂😂😂 Otherwise, you can use the system version of kill at /QOpenSys/usr/bin/kill or use kill -KILL.

I'm working on an update with the fix, but due to some infrastructure issues this won't be available for a while.
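
The thread does not spell out how the value 589825 is derived. The sketch below only checks the arithmetic: the number is 9 shifted left by 16 bits, plus 1. Reading the upper bits as a signal field is an assumption made here for illustration, not something confirmed above.

```shell
#!/bin/sh
# Decompose the workaround value quoted in the thread. Only the arithmetic
# is certain; how AIX or GNU kill interprets these bits is not shown here.
n=589825
printf '0x%X\n' "$n"      # hexadecimal form: 0x90001
echo $(( n >> 16 ))       # bits above the low 16: 9
echo $(( n & 0xFFFF ))    # low 16 bits: 1
```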

@github-actions

github-actions bot commented May 9, 2023

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label May 9, 2023
@mhdawson
Member

@abmusse is this something you could take a look at?

@mhdawson mhdawson removed the stale label May 16, 2023
@abmusse
Contributor

abmusse commented May 16, 2023

Yes, I'll take a look at this one

@abmusse
Contributor

abmusse commented May 16, 2023

@mhdawson

Looks like we pushed the fix in coreutils-gnu 8.25-9. We have an outdated version on the build system. Likely we just need to run yum upgrade coreutils-gnu on the build system.

@richardlau
Member Author

@abmusse On test-iinthecloud-ibmi73-ppc64_be-1:

-bash-5.1$ yum info coreutils-gnu
Installed Packages
Name        : coreutils-gnu
Arch        : ppc64
Version     : 8.25
Release     : 6
Size        : 118 M
Repo        : installed
From repo   : ibm
Summary     : GNU coreutils
URL         : https://www.gnu.org/software/coreutils
License     : GPL-3.0-or-later
Description : The GNU Core Utilities are the basic file, shell and text manipulation utilities
            : of the GNU operating system. These are the core utilities which are expected to
            : exist on every operating system.

-bash-5.1$ yum upgrade coreutils-gnu
Setting up Upgrade Process
No Packages marked for Update
-bash-5.1$
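
The output above reports Version 8.25, Release 6, while the fix was said to be in coreutils-gnu 8.25-9. A minimal sketch of that comparison follows; `has_kill_fix` is a hypothetical helper, and it assumes the version stays at 8.25 with only an integer release number changing.

```shell
#!/bin/sh
# Hypothetical check: does a coreutils-gnu "version-release" string meet the
# 8.25-9 level mentioned in the thread? Assumes the version part is exactly
# 8.25 and the release part is a plain integer; anything else is "unknown".
has_kill_fix() {
  case "$1" in
    8.25-*) release=${1#8.25-} ;;
    *) echo "unknown"; return 2 ;;
  esac
  case "$release" in
    ''|*[!0-9]*) echo "unknown"; return 2 ;;
  esac
  if [ "$release" -ge 9 ]; then
    echo "fixed"
  else
    echo "needs upgrade"
  fi
}

has_kill_fix 8.25-6   # the level shown by yum info above
has_kill_fix 8.25-9   # the level said to contain the fix
```

On a machine where rpm is available, the argument could be supplied with `rpm -q --queryformat '%{VERSION}-%{RELEASE}' coreutils-gnu`; that this works unchanged on the IBM i build host is an assumption.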

@abmusse
Contributor

abmusse commented May 17, 2023

What repos does this box have?

yum repolist all

We migrated base repos last year. This box may need the ibmi-repos upgrade.

https://ibmi-oss-docs.readthedocs.io/en/latest/yum/IBM_REPOS.html#transition

@richardlau
Member Author

-bash-5.1$ yum repolist all
repo id                                                                                            repo name                                                                                        status
ibm                                                                                                ibm                                                                                              enabled: 1002
ibm-7.3                                                                                            ibm-7.3                                                                                          disabled
ibmi-base                                                                                          IBM i base                                                                                       enabled: 1002
ibmi-release                                                                                       IBM i 7.3                                                                                        enabled:   67
repolist: 2071
-bash-5.1$

@abmusse
Contributor

abmusse commented May 17, 2023

What url does ibmi-base point to?

cat /QOpenSys/etc/yum/repos.d/ibmi-base.repo

I suspect it's outdated and the baseurl does not point to https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/

@abmusse
Contributor

abmusse commented May 17, 2023

We need to upgrade ibmi-repos package.

yum upgrade ibmi-repos

Then we should also disable the old ibm repo

yum-config-manager --disable ibm

After that the latest coreutils-gnu should be installable!

@mhdawson
Member

@abmusse thanks for taking a look, and great to see you and @richardlau moving it forward.

@abmusse
Contributor

abmusse commented May 17, 2023

@richardlau
I upgraded ibmi-repos and coreutils-gnu on iOSSBld1.iInTheCloud.com

richardlau added a commit to richardlau/build that referenced this issue May 17, 2023
Update to use current yum repositories for IBM i 7.3.
Install Python 3.9, and use it to install `tap2junit`.
Do not set group on the `.ssh` directory on platforms
such as IBM i and z/OS where we do not create a group.

Refs: nodejs#2937
Refs: https://ibmi-oss-docs.readthedocs.io/en/latest/yum/IBM_REPOS.html
@richardlau
Member Author

Ansible changes, including using the correct yum repositories: #3358

richardlau added a commit that referenced this issue May 19, 2023
targos pushed a commit to targos/nodejs-build that referenced this issue May 21, 2023
@richardlau
Member Author

We are now using the correct IBM i yum repositories and coreutils-gnu package.
