@janvorli
Member

The current CheckActivationSafePoint uses thread local storage to
get the current Thread instance. But this function is called from an
async signal handler (the activation signal handler), where accessing
TLS variables is not allowed: the access can allocate, and if the
interrupted code was itself in the middle of an allocation, the
process could crash.
There was no problem with this since .NET 1.0, but a change in the
recent glibc version has broken this. We have received reports of
crashes in this code for the reason described above.

This change introduces an async-safe mechanism for accessing the
current Thread instance from async signal handlers. It uses a
segmented array that can grow but never shrinks. Entries for
threads are added when the runtime creates a thread or attaches to an
external thread, and removed when the thread dies.
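
The map described above could look roughly like the following. This is a minimal sketch under assumptions, not the code added to minipal: the class name, the segment size, the use of 0 as the "empty" thread id, and the linear probing are all hypothetical. It also assumes a thread only looks up its own id, so its entry cannot be removed while the lookup is running.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical sketch: names and layout are illustrative, not the minipal code.
class AsyncSafeThreadMap
{
    static constexpr size_t SlotsPerSegment = 256;

    struct Slot
    {
        std::atomic<void*> thread{nullptr};  // claimed first by the inserter
        std::atomic<uint64_t> osId{0};       // published last; 0 means "empty"
    };

    struct Segment
    {
        Slot slots[SlotsPerSegment];
        std::atomic<Segment*> next{nullptr};
    };

    Segment m_head;

public:
    ~AsyncSafeThreadMap()
    {
        // Segments are released only at teardown; the map never shrinks while in use.
        Segment* seg = m_head.next.load();
        while (seg != nullptr)
        {
            Segment* next = seg->next.load();
            delete seg;
            seg = next;
        }
    }

    // Called on thread creation/attach. May allocate, so NOT for signal handlers.
    bool Insert(uint64_t osId, void* thread)
    {
        assert(osId != 0 && thread != nullptr);
        const size_t start = osId % SlotsPerSegment;
        Segment* seg = &m_head;
        for (;;)
        {
            for (size_t i = 0; i < SlotsPerSegment; i++)
            {
                Slot& slot = seg->slots[(start + i) % SlotsPerSegment];
                void* expected = nullptr;
                if (slot.thread.compare_exchange_strong(expected, thread))
                {
                    slot.osId.store(osId, std::memory_order_release);
                    return true;
                }
            }
            Segment* next = seg->next.load(std::memory_order_acquire);
            if (next == nullptr)
            {
                Segment* fresh = new (std::nothrow) Segment();
                if (fresh == nullptr)
                    return false; // out of memory
                if (seg->next.compare_exchange_strong(next, fresh))
                    next = fresh;
                else
                    delete fresh; // another thread grew the map; 'next' is its segment
            }
            seg = next;
        }
    }

    // Called when the thread dies. The slot becomes reusable; segments stay.
    void Remove(uint64_t osId)
    {
        for (Segment* seg = &m_head; seg != nullptr; seg = seg->next.load(std::memory_order_acquire))
            for (Slot& slot : seg->slots)
                if (slot.osId.load(std::memory_order_acquire) == osId)
                {
                    slot.osId.store(0, std::memory_order_release);
                    slot.thread.store(nullptr, std::memory_order_release);
                    return;
                }
    }

    // Only atomic loads: no locks, no allocation, no TLS.
    void* Lookup(uint64_t osId) const
    {
        if (osId == 0)
            return nullptr;
        const size_t start = osId % SlotsPerSegment;
        for (const Segment* seg = &m_head; seg != nullptr; seg = seg->next.load(std::memory_order_acquire))
            for (size_t i = 0; i < SlotsPerSegment; i++)
            {
                const Slot& slot = seg->slots[(start + i) % SlotsPerSegment];
                if (slot.osId.load(std::memory_order_acquire) == osId)
                    return slot.thread.load(std::memory_order_acquire);
            }
        return nullptr;
    }
};
```

Because segments are never freed while the map is in use, a lookup running inside a signal handler can walk the chain without worrying that memory disappears underneath it. Only `Insert` can fail, and only when allocating a new segment.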

The check for safety of the activation injection was further enhanced
to make sure that the ScanReaderLock is not taken. In cases where it
would need to be taken, we simply reject the location.
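
The "reject instead of lock" idea can be illustrated with a three-way answer from a lock-free lookup. The sketch below is purely hypothetical: the enum, the range table, and the function names are invented for illustration and do not reflect the actual code manager data structures or the signature of IsManagedCodeNoLock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result of a lookup that is not allowed to take any lock.
enum class NoLockLookup { Managed, NotManaged, WouldNeedLock };

// Hypothetical stand-in for the runtime's code range bookkeeping.
struct CodeRange
{
    uintptr_t begin;
    uintptr_t end;
    bool managed;
    bool needsLockToResolve; // resolving this range would require the reader lock
};

static const CodeRange g_ranges[] = {
    { 0x1000, 0x2000, true,  false },
    { 0x2000, 0x3000, true,  true  },
    { 0x3000, 0x4000, false, false },
};

NoLockLookup ClassifyIpNoLock(uintptr_t ip)
{
    for (const CodeRange& range : g_ranges)
    {
        assert(range.begin < range.end);
        if (ip >= range.begin && ip < range.end)
        {
            if (range.needsLockToResolve)
                return NoLockLookup::WouldNeedLock;
            return range.managed ? NoLockLookup::Managed : NoLockLookup::NotManaged;
        }
    }
    return NoLockLookup::NotManaged;
}

// Inside the signal handler, anything other than a definite "managed" is rejected.
bool IsSafeToInjectAt(uintptr_t ip)
{
    return ClassifyIpNoLock(ip) == NoLockLookup::Managed;
}
```

Treating "would need the lock" the same as "not managed" errs on the side of rejecting a location that might have been usable, which is the trade-off the description accepts in exchange for never blocking in the handler.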

Since NativeAOT is subject to the same issue, the code that maintains
the thread id to thread instance map is placed in minipal and shared
between coreclr and NativeAOT.

Closes #121581

@janvorli janvorli added this to the 11.0.0 milestone Nov 28, 2025
@janvorli janvorli requested a review from jkotas November 28, 2025 23:10
@janvorli janvorli self-assigned this Nov 28, 2025
Copilot AI review requested due to automatic review settings November 28, 2025 23:10
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Copilot finished reviewing on behalf of janvorli November 28, 2025 23:13
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes crashes occurring when async signal handlers access Thread Local Storage (TLS) in recent glibc versions. The fix introduces an async-safe, lock-free segmented array to map OS thread IDs to Thread instances, avoiding TLS access in signal handlers. The implementation is shared between CoreCLR and NativeAOT through the minipal library.

Key changes:

  • New async-safe thread lookup mechanism using lock-free segmented arrays
  • Added minipal_get_current_thread_id_no_cache() to avoid TLS in signal handlers
  • Enhanced activation safe point checks to avoid taking ScanReaderLock
  • Integrated async-safe thread lookup in both CoreCLR and NativeAOT runtimes
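
To illustrate why a "no cache" thread id function matters, here is a minimal sketch contrasting a TLS-cached id with one obtained directly from the OS on every call. The function names are hypothetical and this is not the minipal implementation; the Linux branch uses the real `syscall(SYS_gettid)` call, and the fallback branch that derives an id from `pthread_self()` is an assumption made only to keep the sketch portable.

```cpp
#include <cassert>
#include <cstdint>
#include <pthread.h>
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Asks the OS every time; touches no thread_local variable of its own.
uint64_t GetCurrentThreadIdNoCache()
{
#if defined(__linux__)
    return static_cast<uint64_t>(syscall(SYS_gettid));
#else
    // Assumption for other platforms: use the pthread handle value as the id.
    return (uint64_t)(uintptr_t)pthread_self();
#endif
}

// Faster for normal code, but the thread_local access is exactly what an
// async signal handler should avoid.
uint64_t GetCurrentThreadIdCached()
{
    static thread_local uint64_t cachedId = 0;
    if (cachedId == 0)
    {
        cachedId = GetCurrentThreadIdNoCache();
        assert(cachedId != 0);
    }
    return cachedId;
}
```

A signal handler would call the uncached variant and use the result as the key for the async-safe thread lookup.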

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/native/minipal/thread.h | Adds async-safe API declarations and no-cache thread ID function |
| src/native/minipal/thread.c | Implements lock-free segmented array for async-safe thread lookup |
| src/native/minipal/CMakeLists.txt | Adds thread.c to build |
| src/coreclr/vm/threadsuspend.cpp | Updates CheckActivationSafePoint to use async-safe thread lookup |
| src/coreclr/vm/threads.h | Declares GetThreadAsyncSafe for Unix |
| src/coreclr/vm/threads.cpp | Implements GetThreadAsyncSafe and integrates with async-safe map |
| src/coreclr/vm/codeman.h | Adds IsManagedCodeNoLock and GetScanFlags parameter |
| src/coreclr/vm/codeman.cpp | Implements IsManagedCodeNoLock for use without reader lock |
| src/coreclr/nativeaot/Runtime/unix/PalUnix.cpp | Updates activation handler to use async-safe thread lookup |
| src/coreclr/nativeaot/Runtime/threadstore.inl | Conditionalizes GetCurrentThread and adds async-safe variant |
| src/coreclr/nativeaot/Runtime/threadstore.h | Declares GetCurrentThreadIfAvailableAsyncSafe |
| src/coreclr/nativeaot/Runtime/threadstore.cpp | Implements async-safe thread lookup and map integration |

@jkotas
Member

jkotas commented Nov 29, 2025

There was no problem with this since .NET 1.0, but a change in the recent glibc version has broken this

Have you been able to track down the change? I think we just got lucky that it has not shown up on the radar earlier.

@janvorli
Member Author

janvorli commented Dec 4, 2025

Looking at the NativeAOT handler, there is actually also a call to GetCurrentThreadIfAvailable in Thread::HijackCallback, so the NativeAOT fix is not complete yet.

On coreclr, forward the activation signal to a previously registered
handler (if any). On NativeAOT, that was already happening.

Also on NativeAOT, pass the Thread we get using the async safe mechanism
to the Thread::HijackCallback so that it doesn't try to get it itself.
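
Forwarding a signal to a previously registered handler can be sketched with plain POSIX `sigaction`. The handler and variable names below are hypothetical and the runtime's real handler does far more; the sketch only shows saving the old action at install time and calling it after our own work is done.

```cpp
#include <cassert>
#include <signal.h>

static struct sigaction g_previousAction;               // what was installed before us
static volatile sig_atomic_t g_runtimeHandlerCalls = 0;
static volatile sig_atomic_t g_previousHandlerCalls = 0;

// Stand-in for a handler some other component registered earlier.
static void PreviousHandler(int, siginfo_t*, void*)
{
    g_previousHandlerCalls = g_previousHandlerCalls + 1;
}

// Stand-in for the runtime's activation handler.
static void RuntimeActivationHandler(int sig, siginfo_t* info, void* context)
{
    g_runtimeHandlerCalls = g_runtimeHandlerCalls + 1;

    // Forward to the previously registered handler, if there was a real one.
    if (g_previousAction.sa_flags & SA_SIGINFO)
    {
        if (g_previousAction.sa_sigaction != nullptr)
            g_previousAction.sa_sigaction(sig, info, context);
    }
    else if (g_previousAction.sa_handler != SIG_DFL && g_previousAction.sa_handler != SIG_IGN)
    {
        g_previousAction.sa_handler(sig);
    }
}

// Installs 'handler' for 'sig'; the old action is stored in 'previous' if non-null.
static bool InstallHandler(int sig, void (*handler)(int, siginfo_t*, void*), struct sigaction* previous)
{
    assert(handler != nullptr);
    struct sigaction action = {};
    action.sa_sigaction = handler;
    action.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&action.sa_mask);
    return sigaction(sig, &action, previous) == 0;
}
```

For example, installing `PreviousHandler` for `SIGUSR1`, then installing `RuntimeActivationHandler` with `&g_previousAction` as the last argument, and then calling `raise(SIGUSR1)` runs both handlers, the runtime one first.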
@janvorli
Member Author

janvorli commented Dec 5, 2025

@jkotas I've tested this change by adding a test activation signal handler to corerun and verifying that this handler is called after the real coreclr one returns. I did it by writing a message to stderr in that handler, running all coreclr tests with it, and scanning the test logs afterwards for that message. I've also run all pri 1 tests with gcstress 3 locally to stress it. All seemed good.

#if defined(TARGET_UNIX) && !defined(TARGET_WASM)
    if (!InsertThreadIntoAsyncSafeMap(pAttachingThread->m_threadId, pAttachingThread))
    {
        ASSERT_UNCONDITIONALLY("Failed to insert thread into async-safe map due to OOM.");
Member

Suggested change
- ASSERT_UNCONDITIONALLY("Failed to insert thread into async-safe map due to OOM.");
+ PalPrintFatalError("\nFailed to insert thread into async-safe map due to out of memory.\n");

if (!InsertThreadIntoAsyncSafeMap(t->GetOSThreadId64(), t))
{
    // TODO: can we handle this OOM more gracefully?
    EEPOLICY_HANDLE_FATAL_ERROR_WITH_MESSAGE(COR_E_EXECUTIONENGINE, W("Failed to insert thread into async-safe map due to OOM."));
Member

Suggested change
- EEPOLICY_HANDLE_FATAL_ERROR_WITH_MESSAGE(COR_E_EXECUTIONENGINE, W("Failed to insert thread into async-safe map due to OOM."));
+ EEPOLICY_HANDLE_FATAL_ERROR_WITH_MESSAGE(COR_E_EXECUTIONENGINE, W("Failed to insert thread into async-safe map due to out of memory."));

#if defined(TARGET_UNIX) && !defined(TARGET_WASM)
    if (!InsertThreadIntoAsyncSafeMap(t->GetOSThreadId64(), t))
    {
        // TODO: can we handle this OOM more gracefully?
Member

We can - if it is one of our threads. However, I do not think it is worth it. The allocation of the thread statics above can crash due to OOM too and I do not think there is a good way to recover from it either. Recovering from corner-case OOMs is just not a thing on Unix.

Member

@jkotas jkotas left a comment

Thank you!

Development

Successfully merging this pull request may close these issues.

Address sanitizer: Crash/deadlock in inject signal handler on Linux with glibc >= 2.40

4 participants