Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macos: "failed to set thread exception port" in forked process #6785

Closed
thibaultcha opened this issue Jul 30, 2023 · 18 comments · Fixed by #6793
Closed

macos: "failed to set thread exception port" in forked process #6785

thibaultcha opened this issue Jul 30, 2023 · 18 comments · Fixed by #6793
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@thibaultcha
Copy link
Contributor

thibaultcha commented Jul 30, 2023

Hello,

Avid Wasmtime users in our Nginx embedding, we are facing an assertion failure when running it with Wasmtime on macos (x86 and arm64):

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `268435466`,
 right: `0`: failed to set thread exception port', crates/runtime/src/traphandlers/macos.rs:474:9

This assertion systematically fails with Wasmtime 8.0.1 (and probably earlier) to 11.0.1 when Nginx is configured with daemon on (which ends here in Nginx forking into a background process). It seems it occurs in our call to wasmtime_linker_instantiate after having initialized an engine and a single store.

Not forking the master process (daemon off) seems to be working fine, even in forked worker processes (managed by the master process in foreground); instances are created and behave as expected in the Nginx worker processes.
The assertion failure specifically occurs once the master process has forked itself into a background process with daemon on.

In summary:

  • Works with daemon off: master process (foreground) -> wasmtime_linker_instantiate (temporary instance) -> fork() -> worker processes -> wasmtime_linker_instantiate (worker instances processing requests).
  • Fails with daemon on: master process (foreground) -> fork() (background daemon) -> wasmtime_linker_instantiate (temporary instance) -> assertion failure, no instance created, and no chance to fork() into worker processes.

In the past we used to have a CI/CD pipeline with macos targets and Wasmtime used to work fine; but this CI/CD pipeline was taken down, and this bit us in the last few days. It seems like older macos work on Wasmtime has something to do with it, but I know nothing of the macos system interface...

So far the root cause of this assertion failure has eluded us; probably state that aren't being carried over to the forked process or something like this. Could someone shed some light on what may be at cause here?

Thanks!

@thibaultcha thibaultcha added the bug Incorrect behavior in the current implementation that needs fixing label Jul 30, 2023
@thibaultcha
Copy link
Contributor Author

@casimiro has a good example and complete stacktrace of the assertion in #6788.

One thing we have tried to mitigate this is to destroy the engine before forking, in the hope that it would remove all secondary state, so that the forked process could call engine_new again and start from a clean slate. Alas, it does not make the assertion pass either.

@peterhuene
Copy link
Member

The error code is MACH_SEND_INVALID_RIGHT (the message body specified a port right which the caller didn't possess).

Wasmtime allocates a Mach port name (WASMTIME_PORT), for receiving exceptions from the kernel, upon the very first construction of an Engine.

The assertion failure is from the per-thread initialization that occurs to setup the thread exception port from threads that are running wasm code.

Without a fork, this works because all threads in a process see the same Mach port namespace. However, Mach port namespaces aren't inherited in a child process.

So I suspect we're going to run into an issue whenever an Engine is created prior to a fork(); I'm not sure exactly what our policy is on supporting fork as there's lots of complex, OS-specific initialization in the engine, but this code is clearly not intended to be fork-aware.

Note that there is a compile-time feature (posix-signals-on-macos) which will forego the Mach exception ports integration in favor of traditional signal handlers; it might be worth seeing if you can build the Wasmtime C API with this feature flag and see if it helps with this issue.

@thibaultcha
Copy link
Contributor Author

thibaultcha commented Jul 31, 2023

Thanks @peterhuene; ok so just like we thought, our hypothesis was the same. Although we think that destroying the engine prior to the fork should remedy this, it feels like an issue in Wasmtime itself that it doesn't.

I'd say it's also a bummer that this cannot be disabled at runtime (in engine configuration) as now our users will have to compile Wasmtime themselves so they can disable this feature, they cannot just download one of the releases... But at least it is a path forward.

Thanks again!

@peterhuene
Copy link
Member

peterhuene commented Jul 31, 2023

I am not aware of a reason we couldn't "uninstall" the trap handlers or reset otherwise global state upon the destruction of the last Engine (i.e. to restore things the way they were prior to the construction of the first Engine).

I think it simply hasn't been a priority for us to do so as generally the majority use pattern for Wasmtime thus far has been to create a single Engine for the process and forego fork in favor of single-process event queues.

@thibaultcha
Copy link
Contributor Author

the majority use pattern for Wasmtime thus far has been to create a single Engine for the process and forego fork in favor of event queues.

Note that all Nginx is doing is forking into a background process which isn't so uncommon, I suspect more embeddings of Wasmtime might run into this issue in the long run.

@peterhuene
Copy link
Member

Is there a call to create the engine from within the initial process when the feature to fork the daemon is on?

If not, I can't really explain why that particular feature would be tripping things up if all it is doing is immediately forking.

@thibaultcha
Copy link
Contributor Author

thibaultcha commented Jul 31, 2023

Is there a call to create the engine from within the initial process when the feature to fork the daemon is on?

Yes there is or else we would have moved it already. We must open an engine and validate the .wasm bytecode being loaded before the fork() in order to potentially exit() if it is invalid. If we wait until after the fork() to do so, it is too late to ask Nginx not to start (i.e. go into background & fork() into worker processes).

It's ok though, for now we are looking into posix-signals-on-macos as you pointed out.

@peterhuene
Copy link
Member

peterhuene commented Jul 31, 2023

I see, thank you for the clarification.

I suspect even in the daemon off case, we've initialized the per-thread exception port successfully as it's before the fork() of the worker process, but the worker probably inherited the fact it was initialized so doesn't attempt to do so again (and hence bypasses the assertion failure); thus I would expect trap handling to be broken if the wasm traps in the worker process.

@zhongweiy
Copy link

For the posix-signals-on-macos feature, I find some error like:

cargo build --release -p wasmtime-c-api --lib --features posix-signals-on-macos
error: none of the selected packages contains these features: posix-signals-on-macos

And it can pass without wasmtime-c-api:
cargo build --release --lib --features posix-signals-on-macos

It seems the wasmtime-c-api package does not have this feature. Is it a bug?

@peterhuene
Copy link
Member

Yeah, it appears to be missing from the wasmtime-c-api crate, just needs to forward the feature to the wasmtime crate:

crates/c-api/Cargo.toml:

[features]
...
posix-signals-on-macos = ["wasmtime/posix-signals-on-macos"]

@bjorn3
Copy link
Contributor

bjorn3 commented Jul 31, 2023

If you don't want to modify the source --features wasmtime/posix-signals-on-macos may also work.

@thibaultcha
Copy link
Contributor Author

Thanks all for the tips;

With macOS x86, traps are properly handled with posix-signals-on-macos, but on macOS ARM64, we get this exception on a trap with Wasmtime v11.0.1:

thread '<unnamed>' panicked at 'misaligned pointer dereference: address must be a multiple of 0x10 but is 0x105327c18', crates/runtime/src/traphandlers/unix.rs:217:22
stack backtrace:
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: core::panicking::panic_misaligned_pointer_dereference
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:174:5
   3: wasmtime_runtime::traphandlers::unix::get_pc_and_fp
             at ./work/runtimes/libwasmtime/repos/wasmtime/crates/runtime/src/traphandlers/unix.rs:217:22
   4: wasmtime_runtime::traphandlers::unix::trap_handler::{{closure}}
             at ./work/runtimes/libwasmtime/repos/wasmtime/crates/runtime/src/traphandlers/unix.rs:94:24
   5: wasmtime_runtime::traphandlers::tls::with
             at ./work/runtimes/libwasmtime/repos/wasmtime/crates/runtime/src/traphandlers.rs:747:18
   6: wasmtime_runtime::traphandlers::unix::trap_handler
             at ./work/runtimes/libwasmtime/repos/wasmtime/crates/runtime/src/traphandlers/unix.rs:79:19
   7: __platform_memmove

This seems to be coming from Wasmtime's interaction with the libc binding...

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Aug 2, 2023
Currently the `libc` crate has an incorrect definition of `ucontext_t`
for this platform which is causing alignment issues when it's used. This
fixes [this issue][1] and the `posix-signals-on-macos` feature on this
platform.

[1]: bytecodealliance#6785 (comment)
@alexcrichton
Copy link
Member

That's a bug! I've posted a temporary fix for that at #6793

github-merge-queue bot pushed a commit that referenced this issue Aug 2, 2023
Currently the `libc` crate has an incorrect definition of `ucontext_t`
for this platform which is causing alignment issues when it's used. This
fixes [this issue][1] and the `posix-signals-on-macos` feature on this
platform.

[1]: #6785 (comment)
@alexcrichton
Copy link
Member

Oops didn't mean to close this

@alexcrichton alexcrichton reopened this Aug 2, 2023
@alexcrichton
Copy link
Member

I investigated this a bit recently and I'm not actually certain that we'll want to fix this. I think that the best solution here might be to recommend that macOS users who want to leverage fork should use signals via the Cargo feature rather than the default mach ports. If this is common then one option we could support is to make this a runtime configuration option instead of a compile-time configuration option too.

Otherwise though the issues on macOS that I found difficult to handle are:

  • Trap handling can't be Engine-local because the process's exception ports are, well, process-global.
  • It's technically possible to shut down the ports/handler thread when Engines all go away, but that IMO is kind of spooky action-at-a-distance where you have to be certain that no Engine is in existence which you may not necessarily always have control over (although I realize it's reasonable for this use case)
  • One option is to use pthread_atfork, but the problem with that is that it's unclear if all forks are intended for this sort of use case where the process continues to live of if it's something ephemeral such as a fork-then-exec combo.

I suppose that can all be mostly summarized as I think the only possibly-feasible thing to do here is to shut everything down when an Engine is destroyed, but I'm not personally all that comfortable doing that as it doesn't align with other platforms and I'm also not certain how robust we can make it.

Otherwise though @thibaultcha I'd ask you if having a custom build is difficult for y'all? If so we can move the signals-vs-ports to a runtime option instead. Additionally one other mitigation we could implement is to use pthread_atfork to force any wasm execution in the child to panic immediately rather than continue in a broken state, effectively having a more precise way of warning developers that "hey ports aren't compatible with fork please use this other thing to use signals instead"

@thibaultcha
Copy link
Contributor Author

Hi @alexcrichton, thanks for having a look at this too! It is ok for us to have a custom build, although a runtime configuration option would be even better since it means users can download any of the upstream Wasmtime releases which imho is ideal for a friction-less experience. How much work do you foresee that be? Perhaps it could be something one of us could contribute (a difficulty is that none of us on my team has access to a Macbook, but maybe we can write the patch off a box if it isn't too big).

Although, I do hope relying on this feature in the long run won't be a hassle for us (e.g. if Wasmtime on macos focuses more efforts & fixes on trap handling through ports at the detriment of signal dispositions); in which case having tighter control over the port handlers might be preferable, since we could shut it down before fork() and ensure all of our processes rely on the same mechanisms Wasmtime invests into...

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Aug 4, 2023
This commit adds a `Config::macos_use_mach_ports` configuration option
to replace the old `posix-signals-on-macos` compile-time Cargo feature.
This'll make Wasmtime a tad larger on macOS but likely negligibly so.
Otherwise this is intended to provide a resolution to bytecodealliance#6785 where
embedders will be able to use any build of Wasmtime and configure at
runtime how trap handling should happen.

Functionally this commit additionally registers a `pthread_atfork`
handler to cause any usage of Wasmtime in the child to panic. This
should help head off a known-invalid state in case anyone runs into it
in the future.
@alexcrichton
Copy link
Member

Ok sounds good! Not too much work but no worries I've done some initial bits in #6807 now. If you'd like though some help in implementing a corresponding option in the C API would be appreciated!

As for the long term you should be good. Unix signals are unlikely to be as battle-tested as Mach ports because they're off-by-default for macOS, but they're turn on-by-default for Linux for example which means they're not completely untested. Y'all aren't the first to want to use signals instead of mach ports as well so it's something that I'd at least personally like to see continued support for in Wasmtime. Basically that's to say that, yes, you may run into future issues. I think we'll be interested in fixing them promptly, though, if they arise.

github-merge-queue bot pushed a commit that referenced this issue Aug 9, 2023
* Configure Mach ports vs signals via `Config`

This commit adds a `Config::macos_use_mach_ports` configuration option
to replace the old `posix-signals-on-macos` compile-time Cargo feature.
This'll make Wasmtime a tad larger on macOS but likely negligibly so.
Otherwise this is intended to provide a resolution to #6785 where
embedders will be able to use any build of Wasmtime and configure at
runtime how trap handling should happen.

Functionally this commit additionally registers a `pthread_atfork`
handler to cause any usage of Wasmtime in the child to panic. This
should help head off a known-invalid state in case anyone runs into it
in the future.

* Fix build on non-macOS
bors pushed a commit to rust-lang/libc that referenced this issue Aug 15, 2023
This commit effectively reverts #2817. Currently `ucontext_t` has both
the wrong size and the wrong alignment for aarch64-apple-darwin which
causes problems for users referencing the structure [1]. The issue
linked from #2817 claimed that it fixed #2812 but that's still an issue
where FFI warnings are still emitted for usage of `ucontext_t` due to
its transitive usage of `u128`. I'm not sure how to fix #2812 myself,
but given that #2817 doesn't appear to solve its original intent and
additionally the size/align are currently wrong this commit reverts in
the meantime.

[1]: bytecodealliance/wasmtime#6785 (comment)
bors added a commit to rust-lang/libc that referenced this issue Aug 15, 2023
…ohnTitor

Fix size/align of `ucontext_t` on aarch64-apple-darwin

This commit effectively reverts #2817. Currently `ucontext_t` has both the wrong size and the wrong alignment for aarch64-apple-darwin which causes problems for users referencing the structure [1]. The issue linked from #2817 claimed that it fixed #2812 but that's still an issue where FFI warnings are still emitted for usage of `ucontext_t` due to its transitive usage of `u128`. I'm not sure how to fix #2812 myself, but given that #2817 doesn't appear to solve its original intent and additionally the size/align are currently wrong this commit reverts in the meantime.

[1]: bytecodealliance/wasmtime#6785 (comment)
bors added a commit to rust-lang/libc that referenced this issue Aug 17, 2023
…ohnTitor

Fix size/align of `ucontext_t` on aarch64-apple-darwin

This commit effectively reverts #2817. Currently `ucontext_t` has both the wrong size and the wrong alignment for aarch64-apple-darwin which causes problems for users referencing the structure [1]. The issue linked from #2817 claimed that it fixed #2812 but that's still an issue where FFI warnings are still emitted for usage of `ucontext_t` due to its transitive usage of `u128`. I'm not sure how to fix #2812 myself, but given that #2817 doesn't appear to solve its original intent and additionally the size/align are currently wrong this commit reverts in the meantime.

[1]: bytecodealliance/wasmtime#6785 (comment)
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this issue Aug 18, 2023
…6793)

Currently the `libc` crate has an incorrect definition of `ucontext_t`
for this platform which is causing alignment issues when it's used. This
fixes [this issue][1] and the `posix-signals-on-macos` feature on this
platform.

[1]: bytecodealliance#6785 (comment)
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this issue Aug 18, 2023
* Configure Mach ports vs signals via `Config`

This commit adds a `Config::macos_use_mach_ports` configuration option
to replace the old `posix-signals-on-macos` compile-time Cargo feature.
This'll make Wasmtime a tad larger on macOS but likely negligibly so.
Otherwise this is intended to provide a resolution to bytecodealliance#6785 where
embedders will be able to use any build of Wasmtime and configure at
runtime how trap handling should happen.

Functionally this commit additionally registers a `pthread_atfork`
handler to cause any usage of Wasmtime in the child to panic. This
should help head off a known-invalid state in case anyone runs into it
in the future.

* Fix build on non-macOS
xevor11 pushed a commit to xevor11/libc that referenced this issue Aug 25, 2023
This commit effectively reverts rust-lang#2817. Currently `ucontext_t` has both
the wrong size and the wrong alignment for aarch64-apple-darwin which
causes problems for users referencing the structure [1]. The issue
linked from rust-lang#2817 claimed that it fixed rust-lang#2812 but that's still an issue
where FFI warnings are still emitted for usage of `ucontext_t` due to
its transitive usage of `u128`. I'm not sure how to fix rust-lang#2812 myself,
but given that rust-lang#2817 doesn't appear to solve its original intent and
additionally the size/align are currently wrong this commit reverts in
the meantime.

[1]: bytecodealliance/wasmtime#6785 (comment)
@alexcrichton
Copy link
Member

I think everything is now handled here with a combo of what I mentioned above:

I'm going to close this now but @thibaultcha if there's anything else please comment here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants