Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve unsafe memory accesses in ExchangeContext (#25023) #25032

Conversation

domichae-amazon
Copy link
Contributor

Fixes #25023

Problem

When an ExchangeMgr is shut down (ShutDown()), the SessionManager is set to nullptr and resources are released. However, if a child ExchangeContext is still in use on another thread (e.g. in the middle of SendMessage(...)), the ExchangeContext uses mExchangeManager->GetSessionManager() without any null safety checks. This can cause segmentation faults when the null SessionManager is dereferenced.

Solution

Add null checks on all usages of the SessionManager. Return error codes where appropriate / applicable, and fail gracefully / silently where it is safe to do so.

@CLAassistant
Copy link

CLAassistant commented Feb 13, 2023

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link

PR #25032: Size comparison from e7528bc to 6bac357

Increases (1 build for cc32xx)
platform target config section e7528bc 6bac357 change % change
cc32xx lock CC3235SF_LAUNCHXL (read only) 640361 640385 24 0.0
.debug_info 20180782 2018094 159 0.0
.debug_line 2649926 2650018 92 0.0
.debug_loc 2786017 2786273 256 0.0
.text 532604 532628 24 0.0
Full report (1 build for cc32xx)
platform target config section e7528bc 6bac357 change % change
cc32xx lock CC3235SF_LAUNCHXL 0 0 0 0.0
(read only) 640361 640385 24 0.0
(read/write) 204084 204084 0 0.0
.ARM.attributes 44 44 0 0.0
.ARM.exidx 8 8 0 0.0
.bss 197488 197488 0 0.0
.comment 194 194 0 0.0
.data 1476 1476 0 0.0
.debug_abbrev 928461 928461 0 0.0
.debug_aranges 87352 87352 0 0.0
.debug_frame 299840 299840 0 0.0
.debug_info 20180782 2018094 159 0.0
.debug_line 2649926 2650018 92 0.0
.debug_loc 2786017 2786273 256 0.0
.debug_ranges 280728 280728 0 0.0
.debug_str 3005287 3005287 0 0.0
.ramVecs 780 780 0 0.0
.resetVecs 64 64 0 0.0
.rodata 105633 105633 0 0.0
.shstrtab 232 232 0 0.0
.stab 204 204 0 0.0
.stabstr 441 441 0 0.0
.stack 2048 2048 0 0.0
.strtab 375902 375902 0 0.0
.symtab 255856 255856 0 0.0
.text 532604 532628 24 0.0

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, if a child ExchangeContext is still in use on another thread

Then you have a thread race. You can't be in the middle of SendMessage on one thread and shut things down on another thread without taking the relevant locks!

Comment on lines +447 to +448
// Verify that the Session Manager is still alive. Note that host applications may stop the chip::Server
// at any time, which will tear down the Session Manager, Exchange Manager, and Exchange Context. Note
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they're not allowed to stop it at any time, unless they are holding the Matter stack lock. We should add an assert to that effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share any references to this expectation in the codebase? It isn't enforced anywhere in ExchangeMgr or any other file in messaging/* except for a single usage in ExchangeContext at the beginning of SendMessage.

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share any references to this expectation in the codebase? I

Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs. Not all of them assert this, but that's been a matter of code size and time. If you call Matter APIs that mutate state on multiple threads without locking, that cannot possibly be safe and will lead to data/memory corruption and hence bugs.

For what it's worth, see #25041

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And note that the lock asserts are no-ops on various platforms. Again, the expectation is that SDK consumers do the right thing and only touch SDK APIs with the locks held, either directly (which also does not work on some lock-free platforms) or via posting whatever they are doing to the Matter event loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs.

It's surprising that this assertion would not be mentioned in all but one file in this module. As a newcomer, and having only read a few dozen files in the SDK, this is the only time that I've seen it mentioned. Is there a centralized document explaining this thread safety approach?

Again, the expectation is that SDK consumers do the right thing

This doesn't seem like a reasonable assumption for an SDK... public APIs should generally be defensive and assume that the caller can always make the wrong decision. Even if the SDK chooses not to fail gracefully (a valid choice), clients need immediate feedback to help them find bugs in their code. The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a centralized document explaining this thread safety approach?

I don't know of one... @mrjerryjohns did something ever get written here?

To be clear, this is not great, but that's the state of things right now. Not enough hands....

public APIs should generally be defensive and assume that the caller can always make the wrong decision.

Sure. Again:

  1. Not all platforms support checking the locking correctness.
  2. Adding the asserts to all the places that should have them has nonzero cost both in developer time and final codesize on the platforms that do have such checking.
  3. We do not have a clear distinction between private and public APIs right now (an issue in its own right).

We have been adding the asserts in an ad-hoc manner to high-value places that would have a good chance of catching bugs, but something more systematic would be welcome, obviously.

The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.

Completely agreed! I'm just saying that the right fix here is not in SendMessage; that's going to cover up a thread race that's likely corrupting data elsewhere, at least probabilistically. The right fix is to get rid of the thread race.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good discussion on this issue, I agree that we should fix the thread race instead. With #25041 merged, we should start seeing assertion failures now in the tv-casting-app example which would help pinpoint the threading problem. I'm happy to close out this PR, I'll just wait a few days to see if anyone chimes in with pointers on the thread safety documentation. Thanks!

Comment on lines +449 to +450
// that at the time of writing (2023/02/09) the SessionManager pointer, if non-null, is guaranteed to be
// valid because it is allocated in the parent Server class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you're not even a Server....

ExchangeContext * ec = ctx.NewExchangeToAlice(&mockSolicitedAppDelegate);

// Close the Exchange Context.
ec->Close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this ec is a dangling pointer....

ec->Close();

// Call public APIs to verify that they do not crash or fail the test.
NL_TEST_ASSERT(inSuite, ec->StartResponseTimer() == CHIP_ERROR_INTERNAL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This absolutely can crash. It's calling a method on a deleted object!

* Tests that the Exchange Context APIs do not crash if delayed calls are made after the Exchange Context is
* closed.
*/
void CheckExchangeContextDoesNotCrashWhenDelayedCallsOccurAfterClose(nlTestSuite * inSuite, void * inContext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this test does not compile, and if it compiled would not get run. Please make sure to verify that tests fail without the fix and pass with it...

@pullapprove pullapprove bot requested a review from kkasperczyk-no February 14, 2023 18:09
auto-merge was automatically disabled February 15, 2023 13:42

Merge queue setting changed

@domichae-amazon
Copy link
Contributor Author

I'm going to close this pull request and create a new one that addresses the root cause.

@domichae-amazon domichae-amazon deleted the feature/25023-bugfix branch February 17, 2023 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] tv-casting-app does not take the SDK lock prior to SDK interactions
4 participants