tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

jonmeow · 2024-07-29T15:51:12Z

We're trying to use tcmalloc in a sandboxed environment that hides /sys. This causes a crash in NumPossibleCPUsNoCache when trying to read "/sys/devices/system/cpu/possible".

It looks like tcmalloc used to use absl's implementation, which I think was calling std::thread::hardware_concurrency (here). What do you think of adding that as a fallback when the /sys info is unavailable?

For context, the specific crash is:

1 external/tcmalloc~/tcmalloc/internal/sysinfo.cc:124] CHECK in NumPossibleCPUsNoCache: cpus.has_value() (false) 
Program terminated with signal: SIGSEGV
Compiler returned: 139

The specific environment is compiler-explorer, so you can look at what's available pretty easily, e.g. here's some relevant Python queries:
https://python.compiler-explorer.com/z/jT6baYxba

std::thread::hardware_concurrency returns 2 in this environment (https://cpp.compiler-explorer.com/z/r4dfbxxMs)

The issue reported to us is carbon-language/carbon-lang#4176. It's temporarily reproducible at https://carbon.compiler-explorer.com/z/Ecfjh7bhs, but we'll be trying to fix that promptly and then I think the Python check above is the better reference.

The text was updated successfully, but these errors were encountered:

ckennelly · 2024-07-29T16:25:48Z

TCMalloc needs NumPossibleCPUsNocache to be non-allocating, since we use it to size some data structures during our very first allocation.

In the past, we've run into issues where std::thread::hardware_concurrency might allocate depending on C++ stdlib/libc implementation details. For example, libc++'s implementation seems to call sysconf in many cases, which is async-unsafe for heap allocations.

If you know where sysconf is sourcing the information from, it'd certainly be possible to directly fetch it in a way careful to not allocate.

jonmeow · 2024-07-29T18:25:25Z

In glibc, AFAICT sysconf calls get_nprocs (here), which calls get_nproc_fallback(here), which reads /proc/stat (here).

That said, get_nprocs is marked as:
Preliminary: | MT-Safe | AS-Safe | AC-Safe fd. Would that be an okay choice, avoiding a direct read of /proc/stat?

ckennelly · 2024-07-29T18:38:07Z

I think we need the _SC_NPROCESSORS_CONF case, since offlined CPUs can cause problems (see issue #188 ), which reads from /sys as well: https://github.com/bminor/glibc/blob/32328a5a1461ff88c0b1e04954e9c68b3fa7f56d/sysdeps/unix/sysv/linux/getsysstats.c#L231

jonmeow · 2024-07-29T18:51:14Z

To confirm, are you okay with get_nprocs_conf even though it's documented as unsafe? (note, this may reflect out-of-date documentation, it looks like the implementation changed a couple years ago)

Note, I'm happy to send you a change if you're okay with one of:

Use get_nprocs_conf in place of current code (since it reads "possible")
Keep the current code, but use get_nprocs as a fallback (since it's marked as safe)
Add a read of /proc/stat similar to the get_nprocs_fallback code (since get_nprocs_fallback isn't an API)

ckennelly · 2024-07-29T19:01:22Z

Let me clarify:

The logically "correct" value is sysconf(_SC_NPROCESSORS_CONF) (or get_nprocs_conf), but this is not malloc-free, so TCMalloc cannot use it. Our hand-rolled, allocation-free implementation is very similar, though, in that we read /sys to figure out how many CPUs might be in existence. I don't think this addresses your sandbox's limitations.
sysconf(_SC_NPROCESSORS_ONLN) / get_nprocs (which std::thread::hardware_concurrency uses) is not the correct value for this. If cores are offlined, for example, rseq might give us a CPU ID that exceeds the number of online CPUs, leading to memory corruption.
I'm not sure whether /proc/stat provides online or all CPUs, but given the two distinct glibc implementations, I'm inclined to believe it's only the former.

I'm not sure if there is an alternative to reading /sys that provides the number of possible CPU IDs that might be available.

jonmeow · 2024-07-29T21:30:22Z

We're going to see if we can get an exception in the compiler-explorer side, since it's not readily apparent whether the /proc fallback done by glibc provides a compatible alternative for /sys/devices/system/cpu/possible.

This returns std::nullopt when /sys cannot be read to determine the number of CPUs. In a subsequent change, we can check this first and turn off per-CPU caches when we cannot accurately determine the maximum possible CPU ID we will see. See issue #245. PiperOrigin-RevId: 667653028 Change-Id: I70e8126cb71b82d94bcd7da90e15ee16b0b69a8a

We use /sys to determine the number of CPUs we might see from restartable sequences. When this is not available, gracefully degrade rather than check-failing. Fixes #245. PiperOrigin-RevId: 667702663 Change-Id: Id78f557913f5936f7c4515dcbc2d445e162f68e3

jonmeow mentioned this issue Jul 29, 2024

Carbon Explorer Fails with Error CHECK in NumPossibleCPUsNoCache: cpus.has_value() (false) carbon-language/carbon-lang#4176

Closed

jonmeow closed this as not planned Won't fix, can't repro, duplicate, stale Jul 29, 2024

jonmeow mentioned this issue Jul 29, 2024

[BUG]: Carbon (tcmalloc) needs access to /sys/devices/system/cpu/possible compiler-explorer/compiler-explorer#6734

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

Comments

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024

ckennelly commented Jul 29, 2024

jonmeow commented Jul 29, 2024