Fix iOS termination crash in `ProvisionalDispatcher` by jpsim · Pull Request #2059 · envoyproxy/envoy-mobile

jpsim · 2022-02-15T20:57:09Z

When iOS is terminating a process, it may release some static memory, which could cause the `event_dispatcher_` variable to point to invalid memory, leading to a crash.

A crash that occurs while an app is already being terminated isn't too bad in itself, since the app is shutting down anyway. It still matters, though: crash reporters will report it as an app crash, and the OS may choose to penalize your app by not proactively suggesting it.

This should fix a crash with a backtrace such as this one:

EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000010.

0 Envoy::Event::ProvisionalDispatcher::post(std::__1::function<void ()>)
1 Envoy::EngineHandle::runOnEngineDispatcher(long, std::__1::function<void (Envoy::Engine&)>)
2 start_stream
3 -[EnvoyHTTPStreamImpl initWithHandle:callbacks:explicitFlowControl:]
4 -[EnvoyEngineImpl startStreamWithCallbacks:explicitFlowControl:]
5 StreamPrototype.start(queue:)

This is a follow-up to #2056 after discussion with Snow.

Signed-off-by: JP Simard <jp@jpsim.com>

jpsim · 2022-02-15T20:58:12Z

Same caveat as #2056 applies here: I haven't been able to reproduce this crash, nor validate that this would actually fix it. I'm just making an educated guess based on the backtrace at the time of the crash.

Signed-off-by: JP Simard <jp@jpsim.com>

goaway · 2022-02-16T18:01:18Z

Thanks @jpsim!

A couple of thoughts:
- This looks like it's probably not thread-safe, and races seem likely in these sorts of circumstances.
- Would this state make sense to be contained within the Dispatcher instead of the Engine? After all, we return the result of the `post`, which could be false if things are shut down. (This is partly why `post` returns a value.)

jpsim · 2022-02-16T21:04:52Z

> Would this state make sense to be contained within the Dispatcher instead of the Engine?

Does the dispatcher have any way to detect if we've terminated the engine?

goaway · 2022-02-23T08:07:06Z

> > Would this state make sense to be contained within the Dispatcher instead of the Engine?
>
> Does the dispatcher have any way to detect if we've terminated the engine?

It could be explicitly informed. Let me know if you'd like to chat about this change synchronously.

jpsim · 2022-04-12T20:21:19Z

Something like this is also needed for multi-engine support (#332).

snowp

Few questions but conceptually this makes sense I think

I would think that we could add a simple test that verifies that once we call terminate we no longer delegate to the inner dispatcher

snowp · 2022-04-14T14:03:09Z

library/common/event/provisional_dispatcher.cc

+  // Don't perform any work on the dispatcher if marked as terminated.
+  if (terminated_) {
+    return;
+  }


Can this ever happen? Can we terminate before we've drained?

I'm not sure, but it seems safest to check here considering engine termination is exposed in the public API so a user could invoke it at any time.

We've also already acquired the lock here, which is the expensive part, so the boolean check should add effectively no overhead.
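
For readers following this thread, here is a minimal sketch of the guard being discussed. It is not the actual `ProvisionalDispatcher` code: `InnerDispatcher`, `ProvisionalDispatcherSketch`, the `bool` return type, and the use of `std::mutex` are all assumptions made for illustration. It only shows the idea that the `terminated_` flag is read and written while the state lock is already held, so both `drain` and `post` stop delegating once `terminate` has been called.

```cpp
#include <functional>
#include <list>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real event dispatcher: it only records callbacks.
struct InnerDispatcher {
  std::vector<std::function<void()>> pending;
  void post(std::function<void()> cb) { pending.push_back(std::move(cb)); }
};

// Hypothetical, simplified model of a dispatcher that queues work until the
// inner dispatcher is ready and refuses work after termination.
class ProvisionalDispatcherSketch {
public:
  // Flushes queued callbacks to the inner dispatcher once it is available.
  void drain(InnerDispatcher& dispatcher) {
    std::lock_guard<std::mutex> lock(state_lock_);
    // Don't perform any work on the dispatcher if marked as terminated.
    if (terminated_) {
      return;
    }
    inner_ = &dispatcher;
    for (auto& cb : init_queue_) {
      inner_->post(std::move(cb));
    }
    init_queue_.clear();
  }

  // Returns false if the callback was dropped because of termination.
  bool post(std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(state_lock_);
    if (terminated_) {
      return false;
    }
    if (inner_ != nullptr) {
      inner_->post(std::move(cb));
    } else {
      init_queue_.push_back(std::move(cb));
    }
    return true;
  }

  // After this call, the inner dispatcher is never touched again.
  void terminate() {
    std::lock_guard<std::mutex> lock(state_lock_);
    terminated_ = true;
  }

private:
  std::mutex state_lock_;
  std::list<std::function<void()>> init_queue_; // guarded by state_lock_
  InnerDispatcher* inner_ = nullptr;            // guarded by state_lock_
  bool terminated_ = false;                     // guarded by state_lock_
};
```

Because the flag is only checked in places that already hold the lock, the check costs one extra branch. A test along the lines suggested in review would call `terminate()` and then confirm that a later `post()` leaves the inner dispatcher untouched.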

snowp · 2022-04-14T14:03:11Z

library/common/engine.cc

    main_thread_.join();
  }

+  dispatcher_->terminate();


Should this be before the join? I'm not that familiar with this but isn't the dispatcher running on main_thread_, so if we're calling it here then the dispatcher is already not running?

I don't know. I can move it and see if the tests in #2129 still pass.

Tests still pass with #2129, pushed in 27c9e25.

Actually, this causes a data race, caught by the TSan tests: https://github.com/envoyproxy/envoy-mobile/runs/6026097993?check_suite_focus=true

WARNING: ThreadSanitizer: data race (pid=22)
  Read of size 8 at 0x7b5400021280 by main thread:
    #0 memcpy ??:? (main_interface_test+0x19060ae)
    #1 Envoy::Thread::ThreadId::isEmpty() const ??:? (main_interface_test+0x1b9d8ee)
    #2 Envoy::Event::DispatcherImpl::runFatalActionsOnTrackedObject(std::__cxx11::list<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> >, std::allocator<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> > > > const&) const ??:? (main_interface_test+0x3eac80c)

Reverting.

snowp · 2022-04-14T14:05:34Z

library/common/event/provisional_dispatcher.h

  std::list<Event::PostCb> init_queue_ GUARDED_BY(state_lock_);
  Event::Dispatcher* event_dispatcher_{};
  Thread::ThreadSynchronizer synchronizer_;
+  bool terminated_ GUARDED_BY(state_lock_){};


Doesn't matter a ton but if you made this atomic it could be checked outside of the lock

Considering that this only needs to be checked in places where we've already acquired the lock, it seems we should leverage that.
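
For comparison, the atomic alternative raised above might look roughly like the sketch below. This is an illustration only: `AtomicFlagDispatcherSketch`, its queue, and the `bool` return type are hypothetical and not part of the PR, which kept the lock-guarded `bool`. The atomic flag allows an early exit before taking the lock, at the cost of a second check under the lock to cover a `terminate()` that lands between the two.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a dispatcher whose termination flag is atomic.
class AtomicFlagDispatcherSketch {
public:
  // Returns false if the callback was rejected because of termination.
  bool post(std::function<void()> cb) {
    // Fast path: termination can be observed without taking the lock.
    if (terminated_.load(std::memory_order_acquire)) {
      return false;
    }
    std::lock_guard<std::mutex> lock(state_lock_);
    // Re-check under the lock in case terminate() ran after the first check.
    if (terminated_.load(std::memory_order_acquire)) {
      return false;
    }
    queue_.push_back(std::move(cb));
    return true;
  }

  void terminate() {
    std::lock_guard<std::mutex> lock(state_lock_);
    terminated_.store(true, std::memory_order_release);
  }

  // Number of callbacks accepted so far.
  std::size_t queued() {
    std::lock_guard<std::mutex> lock(state_lock_);
    return queue_.size();
  }

private:
  std::mutex state_lock_;
  std::vector<std::function<void()>> queue_; // guarded by state_lock_
  std::atomic<bool> terminated_{false};
};
```

When every reader of the flag already holds the lock, as in the diff under review, the plain guarded `bool` is simpler and the atomic buys nothing, which matches the reasoning given in the reply above.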

jpsim · 2022-04-14T14:39:33Z

> I would think that we could add a simple test that verifies that once we call terminate we no longer delegate to the inner dispatcher

I'll take a stab at writing a test that does this, but for what it's worth the existing tests fail with #2129 without this, so it's soon to be implicitly covered by our existing tests.

snowp

I'm not convinced this solves it 100% (since we set this after we call event_dispatcher_.exit(), it feels racy), but will definitely make things better so this works for me

jpsim · 2022-04-15T01:32:00Z

I’m also not convinced the termination crash will be completely fixed but I do think this will help.

I believe #2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`. So moving this to before the `main_thread_.join()` may help.

Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated.
Release Notes: Added
Co-authored-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>

jpsim changed the title ~~Fix iOS termination crash~~ Fix iOS termination crash in ProvisionalDispatcher Feb 15, 2022

jpsim requested review from goaway and snowp February 15, 2022 20:58

jpsim added 2 commits February 15, 2022 16:10

fixup! Fix iOS termination crash

d7ef467

Signed-off-by: JP Simard <jp@jpsim.com>

fixup! Fix iOS termination crash

885f3d6

Signed-off-by: JP Simard <jp@jpsim.com>

jpsim added 4 commits April 12, 2022 16:11

Revert "fixup! Fix iOS termination crash"

aab4a18

This reverts commit d7ef467. Signed-off-by: JP Simard <jp@jpsim.com>

Revert "Fix iOS termination crash"

3aca27f

This reverts commit 424ed0e. Signed-off-by: JP Simard <jp@jpsim.com>

Terminate dispatcher when terminating engine

bc46d84

To prevent any further work from being enqueued. Signed-off-by: JP Simard <jp@jpsim.com>

snowp reviewed Apr 14, 2022

jpsim added 4 commits April 14, 2022 10:56

Move dispatcher termination to happen before joining the main thread

27c9e25

Based on code review from Snow: #2059 (comment) Signed-off-by: JP Simard <jp@jpsim.com>

Merge remote-tracking branch 'origin/main' into fix-ios-termination-c…

71873fa

…rash * origin/main: bazel: Allow multiple definitions for armeabi android links internal: pass engine handles throughout API (#2149) Signed-off-by: JP Simard <jp@jpsim.com>

Split off termination from destruction

c67f1f4

To validate dispatcher termination. Signed-off-by: JP Simard <jp@jpsim.com>

Revert "Move dispatcher termination to happen before joining the main…

781512e

… thread" This reverts commit 27c9e25. Signed-off-by: JP Simard <jp@jpsim.com>

jpsim requested a review from snowp April 14, 2022 15:40

snowp approved these changes Apr 15, 2022

jpsim merged commit e4e01e0 into main Apr 15, 2022

jpsim deleted the fix-ios-termination-crash branch April 15, 2022 14:07

jpsim mentioned this pull request May 13, 2022

iOS: fix termination crash in ProvisionalDispatcher (again) #2276

Merged
