[mono] Assertion at sgen-stw.c:77, condition not met #76805

Closed
ayakael opened this issue Oct 10, 2022 · 12 comments · Fixed by #76500

@ayakael
Contributor

ayakael commented Oct 10, 2022

Description

While building F# (commit 430d645d778ec0db10ad7ad0b02de9fab3ce5647) with the Mono-flavored runtime (commit 06aceb7, .NET 7 RC1), I ran into this error:

 Microsoft (R) F# Compiler version 12.0.5.0 for F# 7.0 (TaskId:260)
                     Copyright (c) Microsoft Corporation. All Rights Reserved. (TaskId:260)
                     Stack overflow in unmanaged: IP: 0x3ff8c9bc8c0, fault addr: 0x3ffcc5e7000 (TaskId:260)
                     Real:  0.4 Realdelta:  0.4 Cpu:  0.6 Cpudelta:  0.3 Mem:  88 G0:  10 G1:  1 G2:  1 [Import mscorlib and FSharp.Core.dll] (TaskId:260)
                     Real:  1.5 Realdelta:  1.1 Cpu:  2.1 Cpudelta:  1.5 Mem: 146 G0:  41 G1:  2 G2:  2 [Parse inputs] (TaskId:260)
                     Real:  1.5 Realdelta:  0.0 Cpu:  2.1 Cpudelta:  0.0 Mem: 146 G0:   0 G1:  0 G2:  0 [Import non-system references] (TaskId:260)
                     * Assertion at /var/build/dotnet7/tefsharp.logsting/dotnet7-stage0/src/dotnet-v7.0.100-rc.1.22431.12/src/runtime/src/mono/mono/metadata/sgen-stw.c:77, condition `info->client_info.stack_start >= info->client_info.info.stack_start_limit && info->client_info.stack_start < info->client_info.info.stack_end' not met (TaskId:260)

Reproduction Steps

Build fsharp with the Mono-flavored runtime in a linux-musl environment.

Expected behavior

Build should pass

Actual behavior

Build fails with error

Regression?

Not a regression, confirmed existing on dotnet 6.0.9

Known Workarounds

None found

Configuration

dotnet --info

.NET SDK:
 Version:   7.0.100-rc.1.22431.12
 Commit:    cb14812a5c

Runtime Environment:
 OS Name:     alpine
 OS Version:  3.17_alpha20220809
 OS Platform: Linux
 RID:         linux-musl-s390x
 Base Path:   /home/build/dotnet7/testing/dotnet7-stage0/src/bootstrap/sdk/7.0.100-rc.1.22431.12/

Host:
  Version:      7.0.0-rc.1.22426.10
  Architecture: s390x
  Commit:       N/A

.NET SDKs installed:
  7.0.100-rc.1.22431.12 [/home/build/dotnet7/testing/dotnet7-stage0/src/bootstrap/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 7.0.0-rc.1.22427.2 [/home/build/dotnet7/testing/dotnet7-stage0/src/bootstrap/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 7.0.0-rc.1.22426.10 [/home/build/dotnet7/testing/dotnet7-stage0/src/bootstrap/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download

Other information

Seems to be the same error as reported here

Full log: fsharp.log

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 10, 2022
@ghost

ghost commented Oct 10, 2022

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

Author: ayakael
Assignees: -
Labels:

untriaged, area-GC-mono

Milestone: -

@marek-safar marek-safar added this to the 8.0.0 milestone Oct 10, 2022
@marek-safar marek-safar removed the untriaged New issue has not been triaged by the area owner label Oct 10, 2022
@ayakael
Contributor Author

ayakael commented Oct 11, 2022

Note that this seems to be the only thing blocking s390x support on Alpine Linux (and any arch that uses Mono), since fsharp is one of the last components to build. I'm investigating workarounds, but nothing fruitful so far.

@ayakael
Contributor Author

ayakael commented Oct 15, 2022

@BrzVlad I see that you were the one who replied to that bug report in 2016. For your convenience, I've set up an aport that reproduces the bug on linux-musl-x64 by building the runtime with /p:PrimaryRuntimeFlavor=Mono. It is available here and you can run it via the following steps:

git clone https://gitlab.alpinelinux.org/ayakael/aports -b dotnet7/mono-fsharp
cd  aports/testing/dotnet7-stage0
abuild deps unpack prepare build

It should fail with error code 134, or error code 1.
The aport builds the minimum set of components needed to produce an SDK tarball, then uses that tarball, with its Mono-flavored runtime, to build fsharp. This aport is usually used to crossbuild to other platforms, but in this case I am using it to easily reproduce the Mono bug.

@BrzVlad
Member

BrzVlad commented Oct 18, 2022

I think I remember this issue now. When we register a thread with the runtime (register_thread), we fetch the thread's stack limits from the OS in mono_threads_platform_get_stack_bounds. From what I remember, on Alpine there was a particular problem with the main thread: the OS reported a small stack size (something like 128k) and, as the thread used more stack, this size grew dynamically, leading to assertions in the GC. New threads created via pthread had their stack size initialized from the start and didn't exhibit this problem.
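To see what the OS reports for a given thread, a small probe along the following lines can help. This is only a sketch and not Mono's implementation of mono_threads_platform_get_stack_bounds: the helper names query_stack_bounds and address_in_stack are hypothetical, and it assumes a libc that provides pthread_getattr_np (glibc and musl both do when _GNU_SOURCE is defined).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: asks pthreads for the calling thread's stack range.
 * On success returns 0 and stores the lowest stack address and the size. */
static int
query_stack_bounds (uint8_t **low, size_t *size)
{
	pthread_attr_t attr;
	void *addr = NULL;
	int res;

	assert (low != NULL && size != NULL);

	res = pthread_getattr_np (pthread_self (), &attr);
	if (res != 0)
		return res;

	/* pthread_attr_getstack yields the lowest address of the stack and its size. */
	res = pthread_attr_getstack (&attr, &addr, size);
	pthread_attr_destroy (&attr);
	if (res == 0)
		*low = (uint8_t *) addr;
	return res;
}

/* Hypothetical helper: returns 1 when p lies inside [low, low + size). */
static int
address_in_stack (const void *p, const uint8_t *low, size_t size)
{
	const uint8_t *a = (const uint8_t *) p;
	return a >= low && a < low + size;
}
```

Calling query_stack_bounds early in main and again after deep recursion is one way to check whether the reported size changes over the life of the main thread, which is the behaviour described above.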

A simple diff like the following fixed the fsharp build:

diff --git a/src/mono/mono/metadata/sgen-stw.c b/src/mono/mono/metadata/sgen-stw.c
index 29a19aef06a..8be47c731fa 100644
--- a/src/mono/mono/metadata/sgen-stw.c
+++ b/src/mono/mono/metadata/sgen-stw.c
@@ -74,6 +74,8 @@ update_current_thread_stack (void *start)
 
 	info->client_info.stack_start = align_pointer (&stack_guard);
 	g_assert (info->client_info.stack_start);
+	if (info->client_info.stack_start < info->client_info.info.stack_start_limit)
+		info->client_info.info.stack_start_limit = info->client_info.stack_start;
 	g_assert (info->client_info.stack_start >= info->client_info.info.stack_start_limit && info->client_info.stack_start < info->client_info.info.stack_end);
 
 #if !defined(MONO_CROSS_COMPILE) && MONO_ARCH_HAS_MONO_CONTEXT

I think the proper fix, if possible, would be to detect the real stack limit for the main thread and update mono_threads_platform_get_stack_bounds on Alpine. @ayakael Do you have any insight into how to achieve this, maybe by forcing the full reservation of the stack space for the main thread?
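The effect of the two added lines in the diff can be modelled outside the runtime. The sketch below uses a hypothetical StackInfo struct that stands in for the few fields of Mono's thread info that the assertion reads; it is an illustration of the clamping idea, not runtime code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields used by the assertion in sgen-stw.c. */
typedef struct {
	uint8_t *stack_start_limit; /* lowest address believed to belong to the stack */
	uint8_t *stack_end;         /* one past the highest stack address */
	uint8_t *stack_start;       /* current top of stack, sampled at GC time */
} StackInfo;

/* Records the sampled stack pointer and, like the workaround in the diff,
 * lowers the recorded limit when the stack has grown past it. */
static void
record_stack_start (StackInfo *info, uint8_t *sp)
{
	assert (info != NULL && sp != NULL);
	info->stack_start = sp;
	if (sp < info->stack_start_limit)
		info->stack_start_limit = sp;
}

/* The condition checked by the failing assertion. */
static int
stack_start_in_bounds (const StackInfo *info)
{
	return info->stack_start >= info->stack_start_limit &&
	       info->stack_start < info->stack_end;
}
```

With the clamp in place the bounds check holds even when the sampled pointer is below the limit that was recorded at registration time. It only silences the assertion, though; it does not give the thread more usable stack.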

@ayakael
Copy link
Contributor Author

ayakael commented Oct 18, 2022

I had started exploring the stack issues via #76805 and found that CoreCLR implements an EnsureStackSize function, guarded by ENSURE_PRIMARY_STACK_SIZE, that might prove useful for us:

#ifdef ENSURE_PRIMARY_STACK_SIZE
/*++
Function:
EnsureStackSize
Abstract:
This fixes a problem on MUSL where the initial stack size reported by the
pthread_attr_getstack is about 128kB, but this limit is not fixed and
the stack can grow dynamically. The problem is that it makes the
functions ReflectionInvocation::[Try]EnsureSufficientExecutionStack
to fail for real life scenarios like e.g. compilation of corefx.
Since there is no real fixed limit for the stack, the code below
ensures moving the stack limit to a value that makes reasonable
real life scenarios work.
--*/
__attribute__((noinline,NOOPT_ATTRIBUTE))
void
EnsureStackSize(SIZE_T stackSize)
{
    volatile uint8_t *s = (uint8_t *)_alloca(stackSize);
    *s = 0;
}
#endif // ENSURE_PRIMARY_STACK_SIZE

Granted, I'm a neophyte at C and C#, so I'm learning as I bugfix, but I feel like the comment on that function describes right smack what we're encountering here.
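For anyone wanting to try the same trick outside CoreCLR, here is a minimal standalone sketch. The name touch_stack is hypothetical, it uses plain alloca rather than CoreCLR's _alloca and NOOPT_ATTRIBUTE macros, and it assumes a downward-growing stack and a GCC or Clang compatible compiler.

```c
#include <alloca.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reserves `bytes` on the current stack and writes one
 * byte at the lowest address of the reservation, so the kernel has to map
 * the stack down to that depth. Returns the number of bytes reserved.
 * The caller must keep `bytes` well below the process stack limit. */
__attribute__((noinline))
static size_t
touch_stack (size_t bytes)
{
	volatile uint8_t *s;

	if (bytes == 0)
		return 0;

	/* alloca returns the lowest address of the block; on a downward-growing
	 * stack that is the deepest point of the reservation. */
	s = (volatile uint8_t *) alloca (bytes);
	*s = 0;
	return bytes;
}
```

The volatile write and the noinline attribute are there so the compiler does not drop the allocation or fold it into the caller's frame.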

@ayakael
Contributor Author

ayakael commented Oct 18, 2022

Testing your patch, it fails on s390x with:

(CoreCompile target) -> 
                     /home/build/dotnet6/community/dotnet6-stage0/src/dotnet-v6.0.110/src/fsharp/src/fsharp/FSharp.Core/prim-types.fs(2019,13): error FS0073: internal error: Expression is too large and/or complex to emit. Method name: 'GenericMinimum'. Recursive depth: 21. [/home/build/dotnet6/community/dotnet6-stage0/src/dotnet-v6.0.110/src/fsharp/src/fsharp/FSharp.Core/FSharp.Core.fsproj]

Edit: With dotnet7, the error appears as a StackOverflowException, so the issue does persist on s390x given its inherently larger stack needs. x64 does not seem to be affected on dotnet7, but the problem would likely come out of the woodwork pretty easily.

@BrzVlad
Member

BrzVlad commented Oct 20, 2022

@ayakael CoreCLR has a very simple approach. Does this fix all stack related issues you are encountering ? BrzVlad@08735ad

@ayakael
Contributor Author

ayakael commented Oct 20, 2022

> @ayakael CoreCLR has a very simple approach. Does this fix all stack related issues you are encountering ? BrzVlad@08735ad

The fix works on x64, but I'm still getting stack overflow exceptions on s390x. I'm playing with the limit to compensate for s390x's higher stack usage.

@ayakael
Contributor Author

ayakael commented Oct 21, 2022

Setting default_size to 6 * 1024 * 1024 did the trick for s390x on .NET 7. Building fsharp with runtime version 6.0.10, however, fails with:

/home/build/dotnet6/community/dotnet6-stage0/src/nuget/microsoft.dotnet.arcade.sdk/7.0.0-beta.21456.1/tools/BuildReleasePackages.targets(20,5): error MSB4062: The "Microsoft.DotNet.Tools.UpdatePackageVersionTask" task could not be loaded from the assembly /home/build/dotnet6/community/dotnet6-stage0/src/nuget/microsoft.dotnet.nugetrepack.tasks/7.0.0-beta.21456.1/tools/netcoreapp3.1/Microsoft.DotNet.NuGetRepack.Tasks.dll. Could not load file or assembly '/home/build/dotnet6/community/dotnet6-stage0/src/nuget/microsoft.dotnet.nugetrepack.tasks/7.0.0-beta.21456.1/tools/netcoreapp3.1/Microsoft.DotNet.NuGetRepack.Tasks.dll' or one of its dependencies. Confirm that the <UsingTask> declaration is correct, that the assembly and all its dependencies are available, and that the task contains a public class that implements Microsoft.Build.Framework.ITask. [/home/build/dotnet6/community/dotnet6-stage0/src/nuget/microsoft.dotnet.arcade.sdk/7.0.0-beta.21456.1/tools/AfterSolutionBuild.proj]

This likely has to do with the testing methodology needing to be different with the .NET 6 SDK, though. I'm going to test both cases in a full source-build build and report back.

@ayakael
Contributor Author

ayakael commented Oct 22, 2022

Indeed, it was the testing methodology. With your fix plus the default_size change, dotnet6 on linux-musl-s390x works. On dotnet7 I still have some issues, but they don't seem to be stack related.
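For reference, here is a rough sketch of how a default_size such as the 6 MiB value mentioned above might be combined with the process stack limit before the stack is pre-touched. The thread does not say whether the linked commit consults RLIMIT_STACK, so that part is an assumption, and DEFAULT_STACK_SIZE, clamp_to_limit and primary_stack_target are hypothetical names.

```c
#include <stddef.h>
#include <sys/resource.h>

/* 6 MiB is the value reported above to work for s390x; treat it as a
 * starting point for experiments rather than a universal constant. */
#define DEFAULT_STACK_SIZE ((size_t) 6 * 1024 * 1024)

/* Hypothetical helper: never ask for more stack than the soft limit allows. */
static size_t
clamp_to_limit (size_t wanted, rlim_t soft_limit)
{
	if (soft_limit != RLIM_INFINITY && (rlim_t) wanted > soft_limit)
		return (size_t) soft_limit;
	return wanted;
}

/* Hypothetical helper: stack size to aim for on the primary thread. */
static size_t
primary_stack_target (void)
{
	struct rlimit rl;

	/* Assumption: if the limit cannot be read, fall back to the default. */
	if (getrlimit (RLIMIT_STACK, &rl) != 0)
		return DEFAULT_STACK_SIZE;
	return clamp_to_limit (DEFAULT_STACK_SIZE, rl.rlim_cur);
}
```

In practice some headroom should be left below the limit, since the frames already in use at the time of the call count against it as well.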

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 26, 2022
@ayakael
Contributor Author

ayakael commented Oct 26, 2022

@BrzVlad I've included the fix in my mono musl PR here #76500

akoeplinger pushed a commit that referenced this issue Jan 27, 2023
Co-authored-by: Alexander Köplinger
Fixes #76805
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 27, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 26, 2023