Fix iOS termination crash in ProvisionalDispatcher#2059
Conversation
When iOS is terminating a process, it may release some static memory, which could cause the `event_dispatcher_` variable to point to invalid memory, leading to a crash. A crash that occurs when an app is already in the process of being terminated isn't too bad, considering it's already shutting down, but this matters to some extent, considering that crash reporters will report this as an app crash, and the OS may choose to penalize your app by not proactively suggesting it. This should fix a crash with a backtrace such as this one: ``` EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000010. 0 Envoy::Event::ProvisionalDispatcher::post(std::__1::function<void ()>) 1 Envoy::EngineHandle::runOnEngineDispatcher(long, std::__1::function<void (Envoy::Engine&)>) 2 start_stream 3 -[EnvoyHTTPStreamImpl initWithHandle:callbacks:explicitFlowControl:] 4 -[EnvoyEngineImpl startStreamWithCallbacks:explicitFlowControl:] 5 StreamPrototype.start(queue:) ``` Signed-off-by: JP Simard <jp@jpsim.com>
ProvisionalDispatcher
|
Same caveat as #2056 applies here: I haven't been able to reproduce this crash, nor validate that this would actually fix it. I'm just making an educated guess based on the backtrace at the time of the crash. |
Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: JP Simard <jp@jpsim.com>
|
Thanks @jpsim! A couple thoughts: |
Does the dispatcher have any way to detect if we've terminated the engine? |
It could be explicitly informed. Let me know if you'd like to chat about this change synchronously. |
* main: (59 commits) Bump Lyft Support Rotation (#2156) add specifying more maven deps (#2151) update envoy@e4eaf1b97 (#2146) bazel: create symbol mapping file (#2126) Bump Lyft Support Rotation (#2148) bazel: remove sandbox disable build: export flatbuffer jvm dep (#2147) Bump Lyft Support Rotation (#2143) bazel: Add flatbuffers Swift hack key_value: structure for prefs based key value store (#2120) build: add flatbuffers (#2133) Bump Lyft Support Rotation (#2131) envoy: bump upstream Envoy to 419e237 (#2132) stats: enable more metrics (#2130) Use the right type (envoy_network_t) (#2125) Bump Lyft Support Rotation (#2118) Update CONTRIBUTING.md to include updating subrepos (#2023) ci: create baseline and experimental test app pipelines (#2075) config: temporarily hardcode h2 max concurrent streams to 100 (#2124) ... Signed-off-by: JP Simard <jp@jpsim.com>
This reverts commit d7ef467. Signed-off-by: JP Simard <jp@jpsim.com>
This reverts commit 424ed0e. Signed-off-by: JP Simard <jp@jpsim.com>
To prevent any further work from being enqueued. Signed-off-by: JP Simard <jp@jpsim.com>
|
Something like this is also needed for multi-engine support (#332). |
snowp
left a comment
There was a problem hiding this comment.
Few questions but conceptually this makes sense I think
I would think that we could add a simple test that verifies that once we call terminate we no longer delegate to the inner dispatcher
| // Don't perform any work on the dispatcher if marked as terminated. | ||
| if (terminated_) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Can this ever happen? Can we terminate before we've drained?
There was a problem hiding this comment.
I'm not sure, but it seems safest to check here considering engine termination is exposed in the public API so a user could invoke it at any time.
We've also already acquired the lock here, which is the expensive part, the boolean check should add effectively no overhead.
| main_thread_.join(); | ||
| } | ||
|
|
||
| dispatcher_->terminate(); |
There was a problem hiding this comment.
Should this be before the join? I'm not that familiar with this but isn't the dispatcher running on main_thread_, so if we're calling it here then the dispatcher is already not running?
There was a problem hiding this comment.
I don't know. I can move it and see if the tests in #2129 still pass.
There was a problem hiding this comment.
Actually, this causes a data race, caught by the TSan tests: https://github.com/envoyproxy/envoy-mobile/runs/6026097993?check_suite_focus=true
WARNING: ThreadSanitizer: data race (pid=22)
Read of size 8 at 0x7b5400021280 by main thread:
#0 memcpy ??:? (main_interface_test+0x19060ae)
#1 Envoy::Thread::ThreadId::isEmpty() const ??:? (main_interface_test+0x1b9d8ee)
#2 Envoy::Event::DispatcherImpl::runFatalActionsOnTrackedObject(std::__cxx11::list<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> >, std::allocator<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> > > > const&) const ??:? (main_interface_test+0x3eac80c)
Reverting.
| std::list<Event::PostCb> init_queue_ GUARDED_BY(state_lock_); | ||
| Event::Dispatcher* event_dispatcher_{}; | ||
| Thread::ThreadSynchronizer synchronizer_; | ||
| bool terminated_ GUARDED_BY(state_lock_){}; |
There was a problem hiding this comment.
Doesn't matter a ton but if you made this atomic it could be checked outside of the lock
There was a problem hiding this comment.
Considering that this only needs to be checked in places where we've already acquired the lock, it seems we should leverage that.
I'll take a stab at writing a test that does this, but for what it's worth the existing tests fail with #2129 without this, so it's soon to be implicitly covered by our existing tests. |
Based on code review from Snow: #2059 (comment) Signed-off-by: JP Simard <jp@jpsim.com>
…rash * origin/main: bazel: Allow multiple definitions for armeabi android links internal: pass engine handles throughout API (#2149) Signed-off-by: JP Simard <jp@jpsim.com>
To validate dispatcher termination. Signed-off-by: JP Simard <jp@jpsim.com>
… thread" This reverts commit 27c9e25. Signed-off-by: JP Simard <jp@jpsim.com>
snowp
left a comment
There was a problem hiding this comment.
I'm not convinced this solves it 100% (since we set this after we call event_dispatcher_.exit(), it feels racy), but will definitely make things better so this works for me
|
I’m also not convinced the termination crash will be completely fixed but I do think this will help. |
…unroutable-families * origin/main: (25 commits) Update Envoy to c96f711 (envoyproxy#2168) Bump Lyft Support Rotation (envoyproxy#2162) Update Envoy to 0e8899c (envoyproxy#2166) Update rules_apple & rules_swift (envoyproxy#2167) bazel: set inmemory remote exec flags globally http client: add cancel log and limit callback to open streams (envoyproxy#2165) bump envoy to 5181d2355f208061688922572727fe06ba8b3a07 (envoyproxy#2157) pin maven dependencies (envoyproxy#2161) Fix iOS termination crash in `ProvisionalDispatcher` (envoyproxy#2059) bazel: Allow multiple definitions for armeabi android links internal: pass engine handles throughout API (envoyproxy#2149) Bump Lyft Support Rotation (envoyproxy#2156) add specifying more maven deps (envoyproxy#2151) update envoy@e4eaf1b97 (envoyproxy#2146) bazel: create symbol mapping file (envoyproxy#2126) Bump Lyft Support Rotation (envoyproxy#2148) bazel: remove sandbox disable build: export flatbuffer jvm dep (envoyproxy#2147) Bump Lyft Support Rotation (envoyproxy#2143) bazel: Add flatbuffers Swift hack ... Signed-off-by: JP Simard <jp@jpsim.com>
I believe #2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`. So moving this to before the `main_thread_.join()` may help. Signed-off-by: JP Simard <jp@jpsim.com>
I believe #2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`. So moving this to before the `main_thread_.join()` may help. Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated. Release Notes: Added Co-authored-by: Mike Schore <mike.schore@gmail.com> Signed-off-by: JP Simard <jp@jpsim.com>
I believe envoyproxy/envoy-mobile#2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`. So moving this to before the `main_thread_.join()` may help. Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated. Release Notes: Added Co-authored-by: Mike Schore <mike.schore@gmail.com> Signed-off-by: JP Simard <jp@jpsim.com>
When iOS is terminating a process, it may release some static memory, which could cause the
event_dispatcher_variable to point to invalid memory, leading to a crash.A crash that occurs when an app is already in the process of being terminated isn't too bad, considering it's already shutting down, but this matters to some extent, considering that crash reporters will report this as an app crash, and the OS may choose to penalize your app by not proactively suggesting it.
This should fix a crash with a backtrace such as this one:
This is a follow up to #2056 after discussion with Snow.
Signed-off-by: JP Simard jp@jpsim.com