[pacman] getting stuck on clang-aarch64 #4340

Open
1 task done
Wormnest opened this issue Jan 11, 2024 · 13 comments

@Wormnest

Description / Steps to reproduce the issue

For about a month, GIMP's aarch64 CI runner has been getting stuck when running pacman --noconfirm -Suy.
The last job that succeeded ran on Dec 11; another job from the same day and every later one has failed.

Most of the time (e.g. here) it already seems to stop before the databases are updated:

$ C:\msys64\usr\bin\bash -lc "bash -x ./build/windows/gitlab-ci/1_build-deps-msys2.sh"
+ set -e
+ [[ aarch64 == \a\a\r\c\h\6\4 ]]
+ export ARTIFACTS_SUFFIX=-a64
+ ARTIFACTS_SUFFIX=-a64
+ [[ CI_NATIVE != \C\I\_\N\A\T\I\V\E ]]
+ pacman --noconfirm -Suy
WARNING: Failed to terminate process: 1 error occurred:
	* failed to attach the runner process to the console of its parent process: The handle is invalid.
WARNING: Timed out waiting for the build to finish

Sometimes it gets a little further:

$ C:\msys64\usr\bin\bash -lc "bash -x ./build/windows/gitlab-ci/1_build-deps-msys2.sh"
+ set -e
+ [[ aarch64 == \a\a\r\c\h\6\4 ]]
+ export ARTIFACTS_SUFFIX=-a64
+ ARTIFACTS_SUFFIX=-a64
+ [[ CI_NATIVE != \C\I\_\N\A\T\I\V\E ]]
+ pacman --noconfirm -Suy
:: Synchronizing package databases...
 clangarm64 downloading...
 msys downloading...
:: Starting core system upgrade...
 there is nothing to do
:: Starting full system upgrade...
 there is nothing to do
+ pacman --noconfirm -S --needed base-devel mingw-w64-clang-aarch64-toolchain mingw-w64-clang-aarch64-meson mingw-w64-clang-aarch64-cairo mingw-w64-clang-aarch64-crt-git mingw-w64-clang-aarch64-glib-networking mingw-w64-clang-aarch64-gobject-introspection mingw-w64-clang-aarch64-json-glib mingw-w64-clang-aarch64-lcms2 mingw-w64-clang-aarch64-lensfun mingw-w64-clang-aarch64-libspiro mingw-w64-clang-aarch64-maxflow mingw-w64-clang-aarch64-openexr mingw-w64-clang-aarch64-pango mingw-w64-clang-aarch64-suitesparse mingw-w64-clang-aarch64-vala
WARNING: Failed to terminate process: 1 error occurred:
	* failed to attach the runner process to the console of its parent process: The handle is invalid.

When I tested on my MS Dev Kit just now, it did not get stuck (though I do remember seeing that occasionally in the past). However, Process Explorer shows that the pacman and conhost processes are still running after pacman has closed the terminal.

Expected behavior

Pacman finishes after doing its thing.

Actual behavior

Pacman gets stuck

Verification

Windows Version

MSYS64_NT-10.0-22621

Are you willing to submit a PR?

No response

Wormnest added the bug label Jan 11, 2024
@Biswa96
Member

Biswa96 commented Jan 11, 2024

Does the CI script terminate msys2 processes after update? See https://github.com/msys2/setup-msys2/blob/8b0d40b8912601756301a7b3de7752d5dba969cd/main.js#L408

In that main.js file, the sequence is: pacman -Syuu updates the base packages, then all MSYS2-related processes are terminated, then the remaining packages are updated.

@jeremyd2019
Member

jeremyd2019 commented Jan 11, 2024

There is also a bug (presumably in msys2-runtime), which I have never been able to debug, that manifests as processes hanging around when they should have exited. This usually seems to happen when pacman uses gpgme to attempt to validate signatures. As a workaround, I disable the validation of database signatures in pacman.conf, because the database signature verification seems to happen every time pacman is run, whereas package verification is only done when a package is being installed. I still see occasional hangups in package verification though.

See also msys2/msys2-autobuild#62 which I think is the closest thing to an existing bug tracking this.

Workarounds I currently apply:

REM https://github.com/msys2/msys2-autobuild/issues/62
CALL C:\msys64\msys2_shell.cmd -defterm -no-start -c "mkdir -p /etc/pacman.d/hooks && touch /etc/pacman.d/hooks/texinfo-{install,remove}.hook"
REM the caret is messing with CMD parsing, try it another way
C:\msys64\usr\bin\sed.exe -i -e 's/^^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' /etc/pacman.conf
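
For anyone who wants to check what that substitution does before touching a real installation, here is a minimal sketch that runs the same expression from bash against a throwaway file. The function name and the sample file contents are made up for illustration, and it assumes GNU sed and a SigLevel line of the form the regex expects. As far as I understand, the doubled caret is only needed because CMD treats the caret as an escape character, so a single caret is used here.

```shell
# Hypothetical helper (not part of MSYS2): apply the SigLevel substitution
# to the pacman.conf-style file given as the first argument.
disable_db_sig_check() {
    # A single caret anchors the match at line start when run from bash.
    sed -i -e 's/^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' "$1"
}

# Try it on a temporary copy rather than on /etc/pacman.conf.
conf="$(mktemp)"
printf '[options]\nSigLevel    = Required\n' > "$conf"
disable_db_sig_check "$conf"
cat "$conf"
```

Lines where SigLevel already carries extra options are left alone, because the pattern only matches when Required is the last word on the line.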

@Wormnest
Author

Does the CI script terminate msys2 processes after update?

We are now doing something similar, which seems to get us further, but it's finding a database lock.

On my own Arm Dev Kit, I just updated pacman, which went without problems. However, it is now stuck while compiling part of GIMP (I think I've seen this before too), so the hang is probably not specific to pacman. In Process Explorer, the innermost process is env.exe. Could it be related to reading or setting environment variables, which I seem to remember can be problematic when done from multiple threads?

@Biswa96
Member

Biswa96 commented Jan 15, 2024

The database lock file /var/lib/pacman/db.lck can be deleted before running the pacman command, though I am not sure that will fix the actual issue.
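
A minimal sketch of that cleanup step, in case it helps. The function name is hypothetical, and it assumes the caller has already made sure no other pacman instance is running, since removing the lock under a live pacman could corrupt the database.

```shell
# Hypothetical helper: remove a leftover pacman lock file if one exists.
# The path defaults to the lock file mentioned above; pass another path to override it.
clear_pacman_lock() {
    local lock="${1:-/var/lib/pacman/db.lck}"
    if [ -e "$lock" ]; then
        echo "removing stale pacman lock: $lock" >&2
        rm -f -- "$lock"
    fi
}
```

Calling clear_pacman_lock right before the pacman invocation in the CI script would be the intended use. It does nothing when there is no lock file.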

@brunvonlope

Hi. The database lock is being investigated externally (this is not an MSYS2 bug). The problem is that, even when the database lock is not a concern, we can't kill pacman easily. See: https://gitlab.gnome.org/GNOME/gimp/-/jobs/3458478#L157

@Biswa96
Member

Biswa96 commented Jan 15, 2024

The msys2 related processes should be terminated outside of msys2 environment. For example, taskkill command in a batch file. See setup-msys2 repository as mentioned.

@brunvonlope

brunvonlope commented Jan 15, 2024

The msys2 related processes should be terminated outside of msys2 environment. For example, taskkill command in a batch file. See setup-msys2 repository as mentioned.

I tried taskkill before, but its exit code makes the job fail.
Inkscape, for example, uses taskkill, but they rely on the "retry" key in their CI .yml.

@Biswa96
Member

Biswa96 commented Jan 15, 2024

OK, I am out of ideas then. By the way, @hmartinez82 has done some great work of porting apps to aarch64. He may suggest some ideas.

@jeremyd2019
Member

I wish somebody with good knowledge of low-level debugging on WoA (and/or of Cygwin) could debug this. I have tried and had no luck: every debugger I tried (windbg, gdb, lldb) gave an error when getting the context of the main thread.

@hmartinez82
Contributor

hmartinez82 commented Jan 16, 2024

I even see this happening, randomly, in my personal laptop when using pacman.

@Biswa96 GIMP's aarch64 CI runner is actually a Windows DevKit in my living room 😅. I joke you not.

I'm not a low-level debugger. Actually, I'm thinking about installing Tailscale in that VM and letting someone with more expertise take a look.

@Jehan

Jehan commented Feb 1, 2024

Hi all! We now have new runners contributed by Arm Ltd., in addition to the one by @hmartinez82, and this pacman hang is also happening randomly on their runners.

Is there anything we could tell the admins at Arm to look for in order to help you debug this issue?

@hmartinez82
Contributor

@Jehan I'm glad they are having it too, so we now know it's not just my runner. I don't know what the issue is.

@jeremyd2019
Member

https://cygwin.com/pipermail/cygwin-patches/2024q1/012617.html

gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still
being investigated by upstream projects, though anyway it's bad for us right
now, to the point that there are discussions to remove Aarch64 support from the
Windows installer (whereas it just got added recently!) in #10729.

This is an attempt at a workaround. Instead of getting stuck forever and waiting
until the whole job times out (per Gitlab CI settings), I time-out the pacman
command within our script and try again, up to 2 more times. Hopefully one of
the calls would succeed.

See: msys2/MSYS2-packages#4340
gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still
being investigated by upstream projects, though anyway it's bad for us right
now, to the point that there are discussions to remove Aarch64 support from the
Windows installer (whereas it just got added recently!) in #10729.

This is an attempt at a workaround. Instead of getting stuck forever and waiting
until the whole job times out (per Gitlab CI settings), I time-out (after 3
minutes) the pacman command within our script and try again, up to 2 more times.
Hopefully one of the calls would succeed.

I also send a SIGKILL through the timeout (though I have no idea how signals
translate to Windows processes) and run again taskkill after this, which may
seem overkill. Interestingly I get output for both, which seems to indicate that
the kill succeeds in both cases (because of several processes?).

Anyway clearly it's a bit of random code not completely understood, but the
inability to test this all locally clearly doesn't help so it's good enough for
the time being.

See: msys2/MSYS2-packages#4340
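
The time-out-and-retry idea described in that commit message could look roughly like the sketch below. This is not the actual GIMP script: the function name and its arguments are hypothetical, and it assumes a coreutils-style timeout command is available in the MSYS2 shell.

```shell
# Hypothetical helper: run a command under a time limit and retry it when it
# fails or is killed by the timeout.
#   $1 = time limit accepted by timeout(1), e.g. 180 for three minutes
#   $2 = total number of attempts, e.g. 3 for one try plus two retries
#   remaining arguments = the command to run
retry_with_timeout() {
    local limit="$1" attempts="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        # -s KILL makes timeout send SIGKILL once the limit is reached.
        if timeout -s KILL "$limit" "$@"; then
            return 0
        fi
        echo "attempt $i of $attempts failed or timed out: $*" >&2
        i=$((i + 1))
    done
    return 1
}
```

With this in place, the call would be something like retry_with_timeout 180 3 pacman --noconfirm -Suy. Because the command runs inside an if condition, a failed attempt does not abort a script that uses set -e; only the final return value does.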
gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
…64 jobs.

This is the command suggested by MSYS2 developers here:
msys2/MSYS2-packages#4340 (comment)

They also say to run it outside the MSYS2 environment, which is why it's in the
CI rules, not in the shell script.

Honestly at this point, it feels like we are just stacking weird workarounds to
get it to fail not too often. ;-(