arm64 builds hangs on install-info a lot #62

lazka · 2022-08-26T14:00:49Z

Just so we have an issue to link to and discuss maybe

jeremyd2019 · 2022-08-26T17:19:26Z

I believe this is a bug in cygwin/msys2-runtime, because it also happened with pacman frequently when it would verify sync db signatures. The workaround I had for that was to append DatabaseNever to SigLevel.

Add attempted workaround for install-info.exe hanging, by disabling the pacman hooks that call it. msys2/msys2-autobuild#62

dscho · 2023-02-16T20:35:42Z

I wonder whether this is still the case for v3.4.*...

jeremyd2019 · 2023-04-01T18:46:24Z

This hang seems to be happening much more often with the new 2023 dev kit machine compared to the qc710. It is now even happening when validating signatures on packages, where it would usually only happen validating database signatures before.

hmartinez82 · 2024-02-13T17:00:29Z

Me and the GIMP project have also seen this. There's a comment lost in a Merge Request (Gitlab) somewhere about it. They decided to stop pacman update as part of the builds and the runners are now doing this on a daily scheduled task with timeout/retry overnight :(

dscho · 2024-02-13T23:06:29Z

I gathered some information that may be helpful for analyzing this issue, and wrote it down here.

jeremyd2019 · 2024-02-15T18:40:49Z

I wonder if it might possibly be https://cygwin.com/pipermail/cygwin/2024-February/255431.html. Maybe we can try backporting that patch (msys2/msys2-runtime@4e77fa9b8bf4) and see if the issues go away?

jeremyd2019 · 2024-02-15T20:25:09Z

If anyone else wants to try, I built msys2-runtime and msys2-runtime-3.3 with that patch applied in https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7921543265. I am planning to try some things with it and see what happens.

UPDATE: that seems pretty broken. I'm guessing I didn't backport the fix correctly.

jeremyd2019 · 2024-02-16T01:09:34Z

https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7924206550 is at least not as immediately broken 😉. Will test that

jeremyd2019 · 2024-02-16T20:52:56Z

I built both 3.4 and 3.3, and 3.3 for 32-bit (which took some doing because any binutils later than 2.40 resulted in a broken msys-2.0.dll). I then set both a Windows 11 VM on the Dev Kit and a Windows 10 install on a Raspberry Pi 4 in a loop running pacman (without disabling db signature checking). The raspberry pi did hang up, but the debugger looks different than I remember. The dev kit vm is still going at last check.

jeremyd2019 · 2024-02-17T06:08:55Z

I think the 32-bit on the raspberry pi hung up in pinfo::release calling CloseHandle. Not entirely sure which handle. I'd probably try getting a debug build next, but not sure I care that much about 32-bit (just that I remembered the raspberry pi having hang issues more frequently so wanted to test it too)

jeremyd2019 · 2024-02-19T05:07:03Z

Looking back at the cygwin thread, it seems that patch was introduced after a report of a hang with 100% CPU usage, rather than the hang with 0 CPU usage that we see, so I'm not sure it's the same issue. I guess I'll keep looking into the pinfo::release hang I see on the raspberry pi as I have time.

jeremyd2019 · 2024-02-19T06:38:37Z

With debug build it hung up somewhere different, but doesn't make any more sense. This time it hung up apparently during process teardown, having called _exit(0), eventually getting to proc_terminate, and is hanging calling TerminateThread on what seems to be a valid handle to a thread. I don't see that thread currently running in the debugger though. I don't see any reason why TerminateThread should hang.

jeremyd2019 · 2024-02-20T21:08:54Z

The 64-bit msys2 on windows 11 did eventually hang too ☹️ looking at some of the other messages in the thread I wasn't too positive about that being the issue we were seeing.

When running `pacman` on Windows/ARM64, we frequently run into curious hangs (see msys2/msys2-autobuild#62 for more details). This commit aims to work around that by replacing the double-fork with a single-fork in `_gpgme_io_spawn()`. Signed-off-by: Johannes Schindelin <[email protected]>

dscho · 2024-05-04T22:14:51Z

I finally have some good news. While I am not even close to a fix, I have a work-around: msys2/MSYS2-packages#4583

Here is a run of Git for Windows' sync job that tries to update, commit & push the git-sdk-arm64 repository. Previously it consistently ran into those hangs, and replacing the pacman.exe built in that PR works around those hangs.

By manually observing the hangs (RDPing into those self-hosted runners) I figured out that there were typically around half a dozen hanging processes whose command-lines were identical to their respective parent processes' command-lines. I've tracked that down to libgpgme's _gpgme_io_spawn() function, which calls a fork() that is immediately followed by another fork() in its child process. GPGME's source code history calls this a "double-fork", and it is not really clear to me why this would be needed.

One thing that helped me tremendously while debugging this was the insight that calling that PowerShell script that runs pacman several times (to do both core and system update, followed by another pacman invocation to allow for Git for Windows' hack that is the post-install script of the git-extra package) works when run in a PowerShell window, but consistently hangs when being run in an SSH session. Or, for that matter, a VS Code terminal (which I ran, for convenience, via a VS Code tunnel).

So these are my thoughts how to proceed from here:

Deploying Work around pacman hangs on Windows/ARM64 MSYS2-packages#4583 should work around those Windows/ARM64 hangs.
We still need to dig into this a bit further, to actually understand what is going wrong.
It might be a Windows bug (or the promised Windows update might be just another work-around), we will see...
The work-around I propose points to the double-fork in GPGME as triggering the issue. It won't be trivial, and it will take time, but I am confident, at least, that this can be turned into a small and simple reproducer.
My hunch is that it is not a Windows bug, but that the pseudo terminal/pseudo console code in the MSYS2 runtime that runs every time a new process is forked is still not race-free.
There are other issues in this area of the MSYS2 runtime that might be related, and need to be addressed even if they are not related. For example, @tyan0 said that crashing sub-processes would likely lead to hangs, too.

jeremyd2019 · 2024-05-05T22:07:06Z

In that case, I wonder if there's a race between starting up the wait thread and shutting it down during process exit. Assuming the second fork is followed by an exec in the (grand)child, that could further complicate things because I think that there is some magic that shuffles around the process tree to try to make it look as though exec actually replaced the process instead of starting a new one. (I think that may even be involved in the wait thread).

I never did get a good understanding of locking around this code, either. This is why I was trying Interlocked operations, to see if maybe there was a race going on, because I was seeing things in the debugger like handles that were NULL in the struct, but the stack showed a non-NULL handle passed to functions like CloseHandle or TerminateThread. I think I was satisfied that they were moving the handle into a temp variable and nulling it in the struct before closing it, but it felt like it was trying to avoid a race in a not-horribly-effective manner.

As for a Windows bug, I couldn't see any good reason for TerminateThread to block. I was a little concerned that maybe terminating a thread could leave the emulation in a bad state.

jeremyd2019 · 2024-05-06T20:40:35Z

I've tracked that down to libgpgme's _gpgme_io_spawn() function, which calls a fork() that is immediately followed by another fork() in its child process. GPGME's source code history calls this a "double-fork", and it is not really clear to me why this would be needed.

I read the code, and I think I understand what it is trying to do. It has this comment:

/* Intermediate child to prevent zombie processes.  */

As I recall, there is a "rule" on *nix that a parent must wait on a child process (or ignore SIGCHLD), or the child will be kept around in the process tables (as a "zombie"). This code seems to not want to wait for the child, and as a library doesn't want to mess with global state like signal handlers. The intermediate process exits as soon as it forks the child that will exec, so is absolved from having to wait for the child, and the parent process waits on the intermediate process allowing it to be reaped.

Alovchin91 · 2024-05-07T11:33:02Z

@jeremyd2019 I believe you’re 100% correct, and TIL: http://stackoverflow.com/questions/10932592/why-fork-twice/16655124#16655124

And I think it’s exactly the reason why it un-hangs if I manually kill the “right” pacman process (the intermediate child apparently). Which probably means that it’s actually the intermediate child that hangs (maybe because the grandchild exits too soon?).

Hope this information helps in diagnosing the root cause 🤞

jeremyd2019 · 2024-11-12T18:29:09Z

I missed the ContextFlags in the docs. This was just a quick hack of a test, based on what was almost certainly a misreading of your prior comment about SuspendThread and GetThreadContext. I had at first interpreted it as GetThreadContext having some side-effect of 'synchronizing' the thread to make sure it was 'booted' from emulation, but re-reading it seems pretty obviously to mean calling GetThreadContext so the child would have the context needed to re-create the threads post-fork. In any event, it seems odd that adding that call seems to have helped: where just SuspendThread hung almost immediately for my fork test program, SuspendThread + GetThreadContext is still going after 12+ hours without a hang. Perhaps just a yield or small sleep would do as well? If not, I will initialize the ContextFlags for future tests.

jeremyd2019 · 2024-11-12T18:59:32Z

I've got a run going now with yield (aka Sleep(0)) instead of GetThreadContext: jeremyd2019/msys2-runtime@c65ed81. Hasn't hung immediately like with nothing between SuspendThread and TerminateThread

Alovchin91 · 2024-11-12T19:20:37Z

I've got a run going now with yield (aka Sleep(0)) instead of GetThreadContext

That's probably not what you're trying to achieve though:

https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743

Seems your original solution with GetThreadContext is the correct one.

jeremyd2019 · 2024-11-12T19:24:12Z

Yeah, seems I do need GetThreadContext for the synchronization side-effect. The yield does seem to be working though, but probably less reliably than GetThreadContext would.

Alovchin91 · 2024-11-12T19:26:59Z

Yep, that's pretty much the last two paragraphs:

If you want to make sure the thread really is suspended, you need to perform a synchronous operation that is dependent on the fact that the thread is suspended. This forces the suspend request to be processed since it is a prerequisite for your operation, and since your operation is synchronous, you know that by the time it returns, the suspend has definitely occurred.

The traditional way of doing this is to call GetThreadContext, since this requires the kernel to read from the context of the suspended thread, which has as a prerequisite that the context be saved in the first place, which has as a prerequisite that the thread be suspended.

The issue with Sleep is that it doesn't guarantee that the scheduler will process the suspension request before it gets back to your calling thread.

jeremyd2019 · 2024-11-12T19:50:33Z

Thanks. I thought I was just hacking around blindly trying to find something that helped, and here I'm getting code reviews and everything 😁. I'm going to test jeremyd2019/msys2-runtime@7863965 for a while, and if that works I'm going to try submitting jeremyd2019/msys2-runtime@8597665 to [email protected] and see what they say. (The only reason I'm testing one and submitting the other is that I already know the CancelSynchronousIo addition helps quite a bit, so would make any potential reproduction of a hang less likely).

dscho · 2024-11-12T21:02:07Z

@jeremyd2019 great work! May I suggest to expand the commit message a bit, though? Something like this:

cygthread: suspend thread before terminating.

This addresses an extremely difficult to debug deadlock when running
under emulation on ARM64.

A relatively easy way to trigger this bug is to call `fork()`, then within the
child process immediately call another `fork()` and then `exit()` the
intermediate process.

It would seem that there is a "code emulation" lock on the wait thread at
this stage, and if the thread is terminated too early, that lock still exists
albeit without a thread, and nothing moves forward.

It seems that a `SuspendThread()` combined with a `GetThreadContext()`
(to force the thread to _actually_ be suspended, for more details see
https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743)
makes sure the thread is "booted" from emulation before it is suspended.

Hopefully this means it won't be holding any locks or otherwise leave
emulation in a bad state when the thread is terminated.

Also, attempt to use `CancelSynchonousIo()` (as seen in `flock.cc`) to avoid
the need for `TerminateThread()` altogether.  This doesn't always work,
however, so was not a complete fix for the deadlock issue.

Addresses: https://cygwin.com/pipermail/cygwin-developers/2024-May/012694.html
Signed-off-by: Jeremy Drake <[email protected]>

jeremyd2019 · 2024-11-12T22:48:03Z

https://cygwin.com/pipermail/cygwin-patches/2024q4/012886.html

dscho · 2024-11-12T22:59:34Z

https://cygwin.com/pipermail/cygwin-patches/2024q4/012886.html

Thank you! I have a bias toward what I consider a nicer UI at https://inbox.sourceware.org/cygwin-patches/[email protected]/T/#u, maybe you like that one, too?

jeremyd2019 · 2024-11-13T00:31:00Z

this patch applies cleanly to 3.3 as well, so I've got it built for i686 and running my reproducer on win10/raspberry pi 4 since that seemed to be a very good machine to reproduce the hang on (along with the qc710, which I have used for the prior tests for this patch and is still going)

hmartinez82 · 2024-11-13T01:26:35Z

I can help test too. What should I download and do? In my arm64 pacman locks a lot.

jeremyd2019 · 2024-11-13T01:38:51Z

test x86_64 binaries can be found:

for upstream msys2: https://github.com/jeremyd2019/msys2-runtime/actions/runs/11806806336/artifacts/2179071260
for git for windows fork of msys2: https://github.com/git-for-windows/msys2-runtime/actions/runs/11805927624/artifacts/2178808019

in these zips, I just extract the usr/bin/msys-2.0.dll and replace (after backing up!) the one in msys2.

pmsjt · 2024-11-13T03:17:50Z

Nicely done @jeremyd2019 . Yes, the whole point with my original suggestion with Suspend + GetContext was so you can make sure exit from simulation is complete. Sleep/Yield would not give you that or any other API.

I'll pull these and give it a go too. This issue usually reproduces almost instantaneously on my Qualcomm 8380 machine. Thanks for sharing the engineering privates with the fix.

pmsjt · 2024-11-13T04:52:37Z

This test seems to be holding up

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef BINARY
#define BINARY "/bin/true"
#endif

#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
	while (1)
	{
		int pid;
		printf("Starting group of 100x " BINARY " " ARG "\n");
		for (int i = 0; i < 100; ++i)
		{
			pid = fork();
			if (pid == -1)
			{
				perror("fork error");
				return 1;
			}
			else if (pid == 0)
			{
				if ((pid = fork()) == 0)
				{
					char * const args[] = {BINARY, ARG, NULL};
					execv(BINARY, args);
					perror("execv failed");
					_exit(5);
				}
				if (pid == -1)
				{
					perror("inner fork error");
					_exit(1);
				}
				else
				{
					_exit(0);
				}
			}
			else
			{
				int status;
				if (waitpid(pid, &status, 0) == -1)
				{
					perror("waitpid error");
					return 2;
				}
				else if (status != 0)
				{
					fprintf(stderr, "subprocess exited non-zero: %d\n", status);
					return WEXITSTATUS(status);
				}
			}
		}
	}
	return 0;
}

pmsjt · 2024-11-13T04:53:42Z

This does suggest a change that I could potentially do to the emulator to protect cases such as this: A change to always exit simulation before completing cross-thread termination.

So far we haven't been considering because cross-thread termination is already such a Russian roulette, that it didn't seem to warrant the effort to un-simulate first... but Cygwin is giving me a new perspective.

dscho · 2024-11-13T16:14:31Z

for upstream msys2: https://github.com/jeremyd2019/msys2-runtime/actions/runs/11806806336/artifacts/2179071260

As I could confirm in the git-for-windows/msys2-runtime PR that your patch does what we hoped it would, do you maybe want to open a PR for msys2/msys2-runtime@msys2-3.5.4...jeremyd2019:msys2-runtime:msys2-3.5.4-suspendthread? I think we're good to go on that.

jeremyd2019 · 2024-11-13T18:03:15Z

msys2/msys2-runtime#234 (and msys2/msys2-runtime#235 for good measure)

jeremyd2019 · 2024-11-13T18:07:44Z

I ran my fork reproducer for over 12 hours, 4x instances on the QC710 running x86_64/msys2-runtime-3.5.4, and 2x instances on the Raspberry Pi 4 8GB running i686/msys2-runtime-3.3.6, and all instances are still going without deadlocks or crashes.

hmartinez82 · 2024-11-13T20:44:50Z

Salivating here.

Alovchin91 · 2024-11-13T20:49:34Z

Amazing, terrific work, everyone! ❤️ Thank you so much for this!

After msys2/msys2-runtime#234, we shouldn't hit msys2/msys2-autobuild#62 anymore, so remove the workarounds for it. Also remove the sed for enabling clangarm64 in pacman.conf, since it has been enabled by default for a couple of years now.

pmsjt · 2024-11-14T03:26:28Z

Thank you @jeremyd2019 and everyone!

On Windows/ARM64, running the 64-bit version of Git for Windows could infrequently cause deadlocked threads (see e.g. [this report](msys2/msys2-autobuild#62) or [this one](https://inbox.sourceware.org/cygwin-developers/[email protected]/)), [which was addressed](git-for-windows/msys2-runtime#73). Signed-off-by: gitforwindowshelper[bot] <[email protected]>

jeremyd2019 · 2024-11-15T18:02:30Z

Well. I just went to update my raspberry pi, which I left running 2x instances for however long it's been, and there was a totally different failure:

    185 [main] forky (536) C:\msys32\home\Administrator\forky.exe: *** fatal error in forked process - CreateFileMapping shared.5, Win32 error 1455.  Terminating.
  14439 [main] forky (536) cygwin_exception::open_stackdumpfile: Dumping stack trace to forky.exe.stackdump
     90 [main] forky 295 dofork: child -1 - forked process 536 died unexpectedly, retry 0, exit code 0xC0000142, errno 11
fork error: Resource temporarily unavailable

The stackdump was empty.

 C:\>net helpmsg 1455

The paging file is too small for this operation to complete.

Phew, maybe the scrollback in Windows Terminal exhausted the page file?

QC710 was still going without issue.

Alovchin91 · 2024-11-15T19:30:17Z

0xC0000142 is STATUS_DLL_INIT_FAILED and 11 is, among others, ERROR_BAD_FORMAT 🤔

1455 is indeed ERROR_COMMITMENT_LIMIT and it seems to be an ordinary out of memory exception.

Is there maybe a memory leak somewhere? Or does your test app just keep on creating forks without exiting the parent process? I wonder if you could've exhausted the amount of process handles, for example 🤔 It's a 32-bit app after all.

jeremyd2019 · 2024-11-15T19:42:23Z

In this case the 11 is the cygwin errno, EAGAIN: Resource temporarily unavailable. https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619 does keep creating forks without exiting the parent process... In any event, I don't think it's an error that needs worrying about

Alovchin91 · 2024-11-15T19:44:28Z

In this case the 11 is the cygwin errno, EAGAIN: Resource temporarily unavailable. https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619 does keep creating forks without exiting the parent process...

I would indeed assume a process handle exhaustion in this case 🤔

lazka added a commit to lazka/msys2-autobuild that referenced this issue Aug 26, 2022

arm64: disable texinfo hooks as a workaround

42c04bf

see msys2#62

lazka mentioned this issue Aug 26, 2022

arm64: disable texinfo hooks as a workaround #63

Closed

jeremyd2019 added a commit to jeremyd2019/winautoconfig that referenced this issue Aug 27, 2022

Uncomment clangarm64 section from pacman.conf.

11954f5

Add attempted workaround for install-info.exe hanging, by disabling the pacman hooks that call it. msys2/msys2-autobuild#62

jeremyd2019 mentioned this issue Sep 5, 2022

Update to Cygwin v3.3.6 msys2/msys2-runtime#100

Merged

dscho mentioned this issue Feb 16, 2023

Port git-artifacts pipeline to git-for-windows-automation git-for-windows/git-for-windows-automation#22

Merged

dennisameling mentioned this issue Jul 13, 2023

Tracking issue for Windows ARM64 support git-for-windows/git#3107

Open

jeremyd2019 mentioned this issue Jan 11, 2024

[pacman] getting stuck on clang-aarch64 msys2/MSYS2-packages#4340

Open

1 task

rimrul mentioned this issue Jan 12, 2024

clang: update to curent MSYS2 pkgrel 7 git-for-windows/MINGW-packages#105

Closed

jeremyd2019 mentioned this issue Jan 15, 2024

New installer hangs on clangarm64 msys2/setup-msys2#358

Open

rimrul mentioned this issue Jan 17, 2024

build-and-deploy: use update-via-pacman.ps1 instead of a home-grown pacman -Syyu git-for-windows/git-for-windows-automation#61

Merged

runeksvendsen mentioned this issue Feb 1, 2024

Windows/PowerShell arm64 installer hangs after "Installing Dependencies..." haskell/ghcup-hs#973

Closed

dscho mentioned this issue May 4, 2024

Work around pacman hangs on Windows/ARM64 msys2/MSYS2-packages#4583

Closed

jeremyd2019 mentioned this issue Nov 13, 2024

cygthread: suspend thread before terminating. msys2/msys2-runtime#234

Merged

lazka closed this as completed in msys2/msys2-runtime#234 Nov 13, 2024

dscho mentioned this issue Nov 14, 2024

msys2-runtime: update to 54a725237845135bbb4609e1194366dece79a0e3 git-for-windows/MSYS2-packages#200

Merged

arm64 builds hangs on install-info a lot #62

arm64 builds hangs on install-info a lot #62

Comments

lazka commented Aug 26, 2022

jeremyd2019 commented Aug 26, 2022 • edited Loading

dscho commented Feb 16, 2023

jeremyd2019 commented Apr 1, 2023

hmartinez82 commented Feb 13, 2024 • edited Loading

dscho commented Feb 13, 2024

jeremyd2019 commented Feb 15, 2024 • edited Loading

jeremyd2019 commented Feb 15, 2024 • edited Loading

jeremyd2019 commented Feb 16, 2024

jeremyd2019 commented Feb 16, 2024

jeremyd2019 commented Feb 17, 2024

jeremyd2019 commented Feb 19, 2024

jeremyd2019 commented Feb 19, 2024

jeremyd2019 commented Feb 20, 2024

dscho commented May 4, 2024

jeremyd2019 commented May 5, 2024 • edited Loading

jeremyd2019 commented May 6, 2024

Alovchin91 commented May 7, 2024 • edited Loading

jeremyd2019 commented Nov 12, 2024

jeremyd2019 commented Nov 12, 2024

Alovchin91 commented Nov 12, 2024 • edited Loading

jeremyd2019 commented Nov 12, 2024

Alovchin91 commented Nov 12, 2024

jeremyd2019 commented Nov 12, 2024 • edited Loading

dscho commented Nov 12, 2024 • edited Loading

jeremyd2019 commented Nov 12, 2024

dscho commented Nov 12, 2024

jeremyd2019 commented Nov 13, 2024

hmartinez82 commented Nov 13, 2024

jeremyd2019 commented Nov 13, 2024

pmsjt commented Nov 13, 2024

pmsjt commented Nov 13, 2024

pmsjt commented Nov 13, 2024 • edited Loading

dscho commented Nov 13, 2024

jeremyd2019 commented Nov 13, 2024

jeremyd2019 commented Nov 13, 2024

hmartinez82 commented Nov 13, 2024 • edited Loading

Alovchin91 commented Nov 13, 2024 • edited Loading

pmsjt commented Nov 14, 2024

jeremyd2019 commented Nov 15, 2024 • edited Loading

Alovchin91 commented Nov 15, 2024 • edited Loading

jeremyd2019 commented Nov 15, 2024

Alovchin91 commented Nov 15, 2024

jeremyd2019 commented Aug 26, 2022 •

edited

Loading

hmartinez82 commented Feb 13, 2024 •

edited

Loading

jeremyd2019 commented Feb 15, 2024 •

edited

Loading

jeremyd2019 commented Feb 15, 2024 •

edited

Loading

jeremyd2019 commented May 5, 2024 •

edited

Loading

Alovchin91 commented May 7, 2024 •

edited

Loading

Alovchin91 commented Nov 12, 2024 •

edited

Loading

jeremyd2019 commented Nov 12, 2024 •

edited

Loading

dscho commented Nov 12, 2024 •

edited

Loading

pmsjt commented Nov 13, 2024 •

edited

Loading

hmartinez82 commented Nov 13, 2024 •

edited

Loading

Alovchin91 commented Nov 13, 2024 •

edited

Loading

jeremyd2019 commented Nov 15, 2024 •

edited

Loading

Alovchin91 commented Nov 15, 2024 •

edited

Loading