Skip to content

Fix iOS termination crash in ProvisionalDispatcher#2059

Merged
jpsim merged 11 commits intomainfrom
fix-ios-termination-crash
Apr 15, 2022
Merged

Fix iOS termination crash in ProvisionalDispatcher#2059
jpsim merged 11 commits intomainfrom
fix-ios-termination-crash

Conversation

@jpsim
Copy link
Contributor

@jpsim jpsim commented Feb 15, 2022

When iOS is terminating a process, it may release some static memory, which could cause the event_dispatcher_ variable to point to invalid memory, leading to a crash.

A crash that occurs when an app is already in the process of being terminated isn't too bad, considering it's already shutting down, but this matters to some extent, considering that crash reporters will report this as an app crash, and the OS may choose to penalize your app by not proactively suggesting it.

This should fix a crash with a backtrace such as this one:

EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000010.

0 Envoy::Event::ProvisionalDispatcher::post(std::__1::function<void ()>)
1 Envoy::EngineHandle::runOnEngineDispatcher(long, std::__1::function<void (Envoy::Engine&)>)
2 start_stream
3 -[EnvoyHTTPStreamImpl initWithHandle:callbacks:explicitFlowControl:]
4 -[EnvoyEngineImpl startStreamWithCallbacks:explicitFlowControl:]
5 StreamPrototype.start(queue:)

This is a follow up to #2056 after discussion with Snow.

Signed-off-by: JP Simard jp@jpsim.com

When iOS is terminating a process, it may release some static memory,
which could cause the `event_dispatcher_` variable to point to invalid
memory, leading to a crash.

A crash that occurs when an app is already in the process of being
terminated isn't too bad, considering it's already shutting down, but
this matters to some extent, considering that crash reporters will
report this as an app crash, and the OS may choose to penalize your app
by not proactively suggesting it.

This should fix a crash with a backtrace such as this one:

```
EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000010.

0 Envoy::Event::ProvisionalDispatcher::post(std::__1::function<void ()>)
1 Envoy::EngineHandle::runOnEngineDispatcher(long, std::__1::function<void (Envoy::Engine&)>)
2 start_stream
3 -[EnvoyHTTPStreamImpl initWithHandle:callbacks:explicitFlowControl:]
4 -[EnvoyEngineImpl startStreamWithCallbacks:explicitFlowControl:]
5 StreamPrototype.start(queue:)
```

Signed-off-by: JP Simard <jp@jpsim.com>
@jpsim jpsim changed the title Fix iOS termination crash Fix iOS termination crash in ProvisionalDispatcher Feb 15, 2022
@jpsim
Copy link
Contributor Author

jpsim commented Feb 15, 2022

Same caveat as #2056 applies here: I haven't been able to reproduce this crash, nor validate that this would actually fix it. I'm just making an educated guess based on the backtrace at the time of the crash.

@jpsim jpsim requested review from goaway and snowp February 15, 2022 20:58
Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: JP Simard <jp@jpsim.com>
@goaway
Copy link
Contributor

goaway commented Feb 16, 2022

Thanks @jpsim!

A couple thoughts:
This looks like it's probably not threadsafe, and races seem likely in these sorts of circumstances.
Would this state make sense to be contained within the Dispatcher instead of the Engine? After all, we return the result of the post which could be false if things are shut down. (This is partly why post returns a value.)

@jpsim
Copy link
Contributor Author

jpsim commented Feb 16, 2022

Would this state make sense to be contained within the Dispatcher instead of the Engine?

Does the dispatcher have any way to detect if we've terminated the engine?

@goaway
Copy link
Contributor

goaway commented Feb 23, 2022

Would this state make sense to be contained within the Dispatcher instead of the Engine?

Does the dispatcher have any way to detect if we've terminated the engine?

It could be explicitly informed. Let me know if you'd like to chat about this change synchronously.

jpsim added 4 commits April 12, 2022 16:11
* main: (59 commits)
  Bump Lyft Support Rotation (#2156)
  add specifying more maven deps (#2151)
  update envoy@e4eaf1b97 (#2146)
  bazel: create symbol mapping file (#2126)
  Bump Lyft Support Rotation (#2148)
  bazel: remove sandbox disable
  build: export flatbuffer jvm dep (#2147)
  Bump Lyft Support Rotation (#2143)
  bazel: Add flatbuffers Swift hack
  key_value: structure for prefs based key value store (#2120)
  build: add flatbuffers (#2133)
  Bump Lyft Support Rotation (#2131)
  envoy: bump upstream Envoy to 419e237 (#2132)
  stats: enable more metrics (#2130)
  Use the right type (envoy_network_t) (#2125)
  Bump Lyft Support Rotation (#2118)
  Update CONTRIBUTING.md to include updating subrepos (#2023)
  ci: create baseline and experimental test app pipelines (#2075)
  config: temporarily hardcode h2 max concurrent streams to 100 (#2124)
  ...

Signed-off-by: JP Simard <jp@jpsim.com>
This reverts commit d7ef467.

Signed-off-by: JP Simard <jp@jpsim.com>
This reverts commit 424ed0e.

Signed-off-by: JP Simard <jp@jpsim.com>
To prevent any further work from being enqueued.

Signed-off-by: JP Simard <jp@jpsim.com>
@jpsim
Copy link
Contributor Author

jpsim commented Apr 12, 2022

Something like this is also needed for multi-engine support (#332).

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few questions but conceptually this makes sense I think

I would think that we could add a simple test that verifies that once we call terminate we no longer delegate to the inner dispatcher

Comment on lines +16 to +19
// Don't perform any work on the dispatcher if marked as terminated.
if (terminated_) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever happen? Can we terminate before we've drained?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but it seems safest to check here considering engine termination is exposed in the public API so a user could invoke it at any time.

We've also already acquired the lock here, which is the expensive part, the boolean check should add effectively no overhead.

main_thread_.join();
}

dispatcher_->terminate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be before the join? I'm not that familiar with this but isn't the dispatcher running on main_thread_, so if we're calling it here then the dispatcher is already not running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I can move it and see if the tests in #2129 still pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests still pass with #2129, pushed in 27c9e25.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this causes a data race, caught by the TSan tests: https://github.com/envoyproxy/envoy-mobile/runs/6026097993?check_suite_focus=true

WARNING: ThreadSanitizer: data race (pid=22)
  Read of size 8 at 0x7b5400021280 by main thread:
    #0 memcpy ??:? (main_interface_test+0x19060ae)
    #1 Envoy::Thread::ThreadId::isEmpty() const ??:? (main_interface_test+0x1b9d8ee)
    #2 Envoy::Event::DispatcherImpl::runFatalActionsOnTrackedObject(std::__cxx11::list<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> >, std::allocator<std::unique_ptr<Envoy::Server::Configuration::FatalAction, std::default_delete<Envoy::Server::Configuration::FatalAction> > > > const&) const ??:? (main_interface_test+0x3eac80c)

Reverting.

std::list<Event::PostCb> init_queue_ GUARDED_BY(state_lock_);
Event::Dispatcher* event_dispatcher_{};
Thread::ThreadSynchronizer synchronizer_;
bool terminated_ GUARDED_BY(state_lock_){};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter a ton but if you made this atomic it could be checked outside of the lock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that this only needs to be checked in places where we've already acquired the lock, it seems we should leverage that.

@jpsim
Copy link
Contributor Author

jpsim commented Apr 14, 2022

I would think that we could add a simple test that verifies that once we call terminate we no longer delegate to the inner dispatcher

I'll take a stab at writing a test that does this, but for what it's worth the existing tests fail with #2129 without this, so it's soon to be implicitly covered by our existing tests.

jpsim added 4 commits April 14, 2022 10:56
Based on code review from Snow:
#2059 (comment)

Signed-off-by: JP Simard <jp@jpsim.com>
…rash

* origin/main:
  bazel: Allow multiple definitions for armeabi android links
  internal: pass engine handles throughout API (#2149)

Signed-off-by: JP Simard <jp@jpsim.com>
To validate dispatcher termination.

Signed-off-by: JP Simard <jp@jpsim.com>
… thread"

This reverts commit 27c9e25.

Signed-off-by: JP Simard <jp@jpsim.com>
@jpsim jpsim requested a review from snowp April 14, 2022 15:40
Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this solves it 100% (since we set this after we call event_dispatcher_.exit(), it feels racy), but will definitely make things better so this works for me

@jpsim
Copy link
Contributor Author

jpsim commented Apr 15, 2022

I’m also not convinced the termination crash will be completely fixed but I do think this will help.

@jpsim jpsim merged commit e4e01e0 into main Apr 15, 2022
@jpsim jpsim deleted the fix-ios-termination-crash branch April 15, 2022 14:07
jpsim added a commit to jpsim/envoy-mobile that referenced this pull request Apr 20, 2022
…unroutable-families

* origin/main: (25 commits)
  Update Envoy to c96f711 (envoyproxy#2168)
  Bump Lyft Support Rotation (envoyproxy#2162)
  Update Envoy to 0e8899c (envoyproxy#2166)
  Update rules_apple & rules_swift (envoyproxy#2167)
  bazel: set inmemory remote exec flags globally
  http client: add cancel log and limit callback to open streams (envoyproxy#2165)
  bump envoy to 5181d2355f208061688922572727fe06ba8b3a07 (envoyproxy#2157)
  pin maven dependencies (envoyproxy#2161)
  Fix iOS termination crash in `ProvisionalDispatcher` (envoyproxy#2059)
  bazel: Allow multiple definitions for armeabi android links
  internal: pass engine handles throughout API (envoyproxy#2149)
  Bump Lyft Support Rotation (envoyproxy#2156)
  add specifying more maven deps (envoyproxy#2151)
  update envoy@e4eaf1b97 (envoyproxy#2146)
  bazel: create symbol mapping file (envoyproxy#2126)
  Bump Lyft Support Rotation (envoyproxy#2148)
  bazel: remove sandbox disable
  build: export flatbuffer jvm dep (envoyproxy#2147)
  Bump Lyft Support Rotation (envoyproxy#2143)
  bazel: Add flatbuffers Swift hack
  ...

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 13, 2022
I believe #2059 helped
reduce these occurrences, but we're still seeing some termination
crashes where a different thread is waiting on `std::thread::join()`.

So moving this to before the `main_thread_.join()` may help.

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 17, 2022
I believe #2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`.

So moving this to before the `main_thread_.join()` may help.

Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated.
Release Notes: Added

Co-authored-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit to envoyproxy/envoy that referenced this pull request Nov 29, 2022
I believe envoyproxy/envoy-mobile#2059 helped reduce these occurrences, but we're still seeing some termination crashes where a different thread is waiting on `std::thread::join()`.

So moving this to before the `main_thread_.join()` may help.

Risk Level: Moderate, there's a possibility this will introduce a data race, but even if that's the case, it'll be when the engine is being terminated.
Release Notes: Added

Co-authored-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants