PoC: Replace Thread::Local(T) with Thread::LocalStorage#16029
PoC: Replace Thread::Local(T) with Thread::LocalStorage#16029ysbaddaden wants to merge 7 commits intocrystal-lang:masterfrom
Thread::Local(T) with Thread::LocalStorage#16029Conversation
src/crystal/system/thread.cr
Outdated
| end | ||
|
|
||
| {% if LocalStorage.has_constant?(:Table) %} | ||
| property local_storage = Pointer(LocalStorage::Table).null |
There was a problem hiding this comment.
This pointer is to keep a live pointer to the per-thread table, so the GC won't collect it; we could spare it if we had a binding for GC_malloc_uncollectable.
There was a problem hiding this comment.
I've used GC_malloc_uncollectable in some libraries/experiments already and it's very slow
There was a problem hiding this comment.
Damn.
Is the call slow, or is having uncollectable memory slowing down the GC? If it's the former it might be acceptable since we expect to malloc every now and then.
I might revise this: with a thread pool (see #15885) and the ability to move an execution context scheduler to another thread (see #15871) we might want to attach data to the current scheduler rather than the current thread 🤔 That would make the custom implementation more legit. |
|
Note: scheduler-local-storage might be useful for io_uring where we'll need a local ring for each |
This is actually wrong: storing a Reference into a thread local can lead the GC to collect the reference —the GC can scan thread locals. Also, the compiler should set An issue is that the option is only available to the C++ API, and both Zig and Rust use custom wrappers. |
Implements the usual TLS/TSS solution on top of the `@[ThreadLocal]` annotation on targets where it's available and working, otherwise merely calls into the system API and skips the custom implementation as it would merely duplciate what the system does. The interface follows the system APIs that use void pointers so storing anything in TLS/TSS doesn't create a hard dependency. For example PCRE2 can default to register keys with destructors that call into LibPCRE2, without requiring to always link libpcre2 when we don't use regular expressions.
b7a22f9 to
bc9dc05
Compare
|
Updated.
|
The original implementation required to keep a reference to any GC memory stored in Thread::LocalStorage when using the TSS or FLS fallbacks because the GC can't scan the tables and will consider the memory to be unreachable and collect it (oops). This refactor always uses the custom table and only stores the pointer to the table in TSS/FLS which is safe because a pointer to the table is always kept in reachable memory.
bc9dc05 to
92e77ee
Compare
This comment was marked as outdated.
This comment was marked as outdated.
|
It's a nice approach, but we can do something much simpler: #16173. |
Alternative to #15616 that doesn't work in some situations on
x86_64-darwinas we found out in the macOS CI runs of #15955.After wasting hours of debug on CI, I can't get PCRE2 + TSS + stack unwind to work correctly on this target. Either PCRE2+TSS segfaults when creating the key, or it works and we can't raise anymore (unwinding the stack doesn't work). I have no idea where the problem lies.
This PoC merely reimplements the system TSS/TLS (thread specific/local storage) in pure Crystal. It's one GC malloc per thread that needs to store something, yet still allows dynamic key creation/deletion at runtime. A compile time static table as proposed by @BlobCodes might be more interesting (but no runtime dynamism).
On targets that don't support the
@[ThreadLocal]annotation the interface merely calls into the system API because that would otherwise do the same thing twice: 1. lookup the key in system TSS table and 2. lookup the key in the crystal TSS table).Performance wise, it should be as efficient as the system TSS; we could probably use the unsafe variant of
@[ThreadLocal]proposed by @BlobCodes too in #15889 since there's no fiber yield point within the API (though we might want to annotate some methods with@[NoInline]).NOTE: to be honest I'm not so sure about this solution, it seems really overkill. I'm not even sure we'll use it outside of PCRE2... but maybe there are other external libraries that could benefit from thread locals with destructors?
NOTE: another use case for thread-local-storage is #16157 (thread local RNG instance).