Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

Closed
jonmeow opened this issue Jul 29, 2024 · 6 comments
Closed

Comments

@jonmeow
Copy link

jonmeow commented Jul 29, 2024

We're trying to use tcmalloc in a sandboxed environment that hides /sys. This causes a crash in NumPossibleCPUsNoCache when trying to read "/sys/devices/system/cpu/possible".

It looks like tcmalloc used to use absl's implementation, which I think was calling std::thread::hardware_concurrency (here). What do you think of adding that as a fallback when the /sys info is unavailable?

For context, the specific crash is:

1 external/tcmalloc~/tcmalloc/internal/sysinfo.cc:124] CHECK in NumPossibleCPUsNoCache: cpus.has_value() (false) 
Program terminated with signal: SIGSEGV
Compiler returned: 139

The specific environment is compiler-explorer, so you can look at what's available pretty easily, e.g. here's some relevant Python queries:
https://python.compiler-explorer.com/z/jT6baYxba

std::thread::hardware_concurrency returns 2 in this environment (https://cpp.compiler-explorer.com/z/r4dfbxxMs)

The issue reported to us is carbon-language/carbon-lang#4176. It's temporarily reproducible at https://carbon.compiler-explorer.com/z/Ecfjh7bhs, but we'll be trying to fix that promptly and then I think the Python check above is the better reference.

@ckennelly
Copy link
Collaborator

TCMalloc needs NumPossibleCPUsNocache to be non-allocating, since we use it to size some data structures during our very first allocation.

In the past, we've run into issues where std::thread::hardware_concurrency might allocate depending on C++ stdlib/libc implementation details. For example, libc++'s implementation seems to call sysconf in many cases, which is async-unsafe for heap allocations.

If you know where sysconf is sourcing the information from, it'd certainly be possible to directly fetch it in a way careful to not allocate.

@jonmeow
Copy link
Author

jonmeow commented Jul 29, 2024

In glibc, AFAICT sysconf calls get_nprocs (here), which calls get_nproc_fallback(here), which reads /proc/stat (here).

That said, get_nprocs is marked as:
Preliminary: | MT-Safe | AS-Safe | AC-Safe fd. Would that be an okay choice, avoiding a direct read of /proc/stat?

@ckennelly
Copy link
Collaborator

I think we need the _SC_NPROCESSORS_CONF case, since offlined CPUs can cause problems (see issue #188 ), which reads from /sys as well: https://github.com/bminor/glibc/blob/32328a5a1461ff88c0b1e04954e9c68b3fa7f56d/sysdeps/unix/sysv/linux/getsysstats.c#L231

@jonmeow
Copy link
Author

jonmeow commented Jul 29, 2024

To confirm, are you okay with get_nprocs_conf even though it's documented as unsafe? (note, this may reflect out-of-date documentation, it looks like the implementation changed a couple years ago)

Note, I'm happy to send you a change if you're okay with one of:

  1. Use get_nprocs_conf in place of current code (since it reads "possible")
  2. Keep the current code, but use get_nprocs as a fallback (since it's marked as safe)
  3. Add a read of /proc/stat similar to the get_nprocs_fallback code (since get_nprocs_fallback isn't an API)

@ckennelly
Copy link
Collaborator

Let me clarify:

  • The logically "correct" value is sysconf(_SC_NPROCESSORS_CONF) (or get_nprocs_conf), but this is not malloc-free, so TCMalloc cannot use it. Our hand-rolled, allocation-free implementation is very similar, though, in that we read /sys to figure out how many CPUs might be in existence. I don't think this addresses your sandbox's limitations.

  • sysconf(_SC_NPROCESSORS_ONLN) / get_nprocs (which std::thread::hardware_concurrency uses) is not the correct value for this. If cores are offlined, for example, rseq might give us a CPU ID that exceeds the number of online CPUs, leading to memory corruption.

  • I'm not sure whether /proc/stat provides online or all CPUs, but given the two distinct glibc implementations, I'm inclined to believe it's only the former.

I'm not sure if there is an alternative to reading /sys that provides the number of possible CPU IDs that might be available.

@jonmeow
Copy link
Author

jonmeow commented Jul 29, 2024

We're going to see if we can get an exception in the compiler-explorer side, since it's not readily apparent whether the /proc fallback done by glibc provides a compatible alternative for /sys/devices/system/cpu/possible.

@jonmeow jonmeow closed this as not planned Won't fix, can't repro, duplicate, stale Jul 29, 2024
copybara-service bot pushed a commit that referenced this issue Aug 26, 2024
This returns std::nullopt when /sys cannot be read to determine the number of
CPUs.  In a subsequent change, we can check this first and turn off per-CPU
caches when we cannot accurately determine the maximum possible CPU ID we will
see.   See issue #245.

PiperOrigin-RevId: 667653028
Change-Id: I70e8126cb71b82d94bcd7da90e15ee16b0b69a8a
copybara-service bot pushed a commit that referenced this issue Aug 26, 2024
We use /sys to determine the number of CPUs we might see from restartable
sequences.  When this is not available, gracefully degrade rather than
check-failing.

Fixes #245.

PiperOrigin-RevId: 667702663
Change-Id: Id78f557913f5936f7c4515dcbc2d445e162f68e3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants