Resolve unsafe memory accesses in ExchangeContext (#25023) #25032
PR #25032: Size comparison from e7528bc to 6bac357 Increases (1 build for cc32xx)
> However, if a child ExchangeContext is still in use on another thread
Then you have a thread race. You can't be in the middle of SendMessage on one thread and shut things down on another thread without taking the relevant locks!
// Verify that the Session Manager is still alive. Note that host applications may stop the chip::Server
// at any time, which will tear down the Session Manager, Exchange Manager, and Exchange Context. Note
No, they're not allowed to stop it at any time, unless they are holding the Matter stack lock. We should add an assert to that effect.
Can you share any references to this expectation in the codebase? It isn't enforced anywhere in ExchangeMgr or any other file in messaging/* except for a single usage in ExchangeContext at the beginning of SendMessage.
> Can you share any references to this expectation in the codebase?
Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs. Not all of them assert this, but that's been a matter of code size and time. If you call Matter APIs that mutate state on multiple threads without locking, that cannot possibly be safe and will lead to data/memory corruption and hence bugs.
For what it's worth, see #25041
And note that the lock asserts are no-ops on various platforms. Again, the expectation is that SDK consumers do the right thing and only touch SDK APIs with the locks held, either directly (which also does not work on some lock-free platforms) or via posting whatever they are doing to the Matter event loop.
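A minimal sketch of the two access patterns described above (take the lock directly, or post the work to the event loop), written against a self-contained model rather than the real SDK. `StackModel`, `PostWork`, `DrainOnce`, `IsLockedByCurrentThread` and `SendMessage` here are hypothetical stand-ins chosen for illustration; the actual Matter APIs, and whether lock tracking exists at all, vary by platform.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical model of a stack with one lock guarding all of its state.
class StackModel
{
public:
    // Pattern 1: the caller takes the stack lock directly around API calls.
    void Lock()
    {
        mLock.lock();
        mOwner.store(std::this_thread::get_id());
    }

    void Unlock()
    {
        mOwner.store(std::thread::id());
        mLock.unlock();
    }

    bool IsLockedByCurrentThread() const { return mOwner.load() == std::this_thread::get_id(); }

    // Pattern 2: the caller queues work for the event loop instead of locking.
    void PostWork(std::function<void()> work)
    {
        std::lock_guard<std::mutex> guard(mQueueLock);
        mQueue.push(std::move(work));
    }

    // Runs on the event loop thread: executes each queued item with the stack lock held.
    void DrainOnce()
    {
        for (;;)
        {
            std::function<void()> work;
            {
                std::lock_guard<std::mutex> guard(mQueueLock);
                if (mQueue.empty())
                {
                    return;
                }
                work = std::move(mQueue.front());
                mQueue.pop();
            }
            Lock();
            work();
            Unlock();
        }
    }

    // Stand-in for a state-mutating API. The assert compiles out under NDEBUG,
    // loosely mirroring platforms where the lock check is a no-op.
    void SendMessage()
    {
        assert(IsLockedByCurrentThread());
        ++mMessagesSent;
    }

    int MessagesSent() const { return mMessagesSent; }

private:
    std::mutex mLock;
    std::atomic<std::thread::id> mOwner{ std::thread::id() };
    std::mutex mQueueLock;
    std::queue<std::function<void()>> mQueue;
    int mMessagesSent = 0;
};

// Exercises both patterns on one thread and returns the number of messages sent.
int DemoBothPatterns()
{
    StackModel stack;

    stack.Lock();
    stack.SendMessage();
    stack.Unlock();

    stack.PostWork([&stack] { stack.SendMessage(); });
    stack.DrainOnce();

    return stack.MessagesSent();
}
```

In this model, shutting the stack down would also have to happen under `Lock()` or via `PostWork`, so it could never interleave with a `SendMessage` call in progress.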
> Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs.
It's surprising that this expectation is mentioned in only one file in this module. As a newcomer who has read only a few dozen files in the SDK, this is the only place I've seen it mentioned. Is there a centralized document explaining this thread safety approach?
> Again, the expectation is that SDK consumers do the right thing
This doesn't seem like a reasonable assumption for an SDK... public APIs should generally be defensive and assume that the caller can always make the wrong decision. Even if the SDK chooses not to fail gracefully (a valid choice), clients need immediate feedback to help them find bugs in their code. The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.
> Is there a centralized document explaining this thread safety approach?
I don't know of one... @mrjerryjohns did something ever get written here?
To be clear, this is not great, but that's the state of things right now. Not enough hands....
> public APIs should generally be defensive and assume that the caller can always make the wrong decision.
Sure. Again:
- Not all platforms support checking the locking correctness.
- Adding the asserts to all the places that should have them has nonzero cost both in developer time and final codesize on the platforms that do have such checking.
- We do not have a clear distinction between private and public APIs right now (an issue in its own right).
We have been adding the asserts in an ad-hoc manner to high-value places that would have a good chance of catching bugs, but something more systematic would be welcome, obviously.
> The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.
Completely agreed! I'm just saying that the right fix here is not in SendMessage; that's going to cover up a thread race that's likely corrupting data elsewhere, at least probabilistically. The right fix is to get rid of the thread race.
Thanks for the good discussion on this issue. I agree that we should fix the thread race instead. With #25041 merged, we should now start seeing assertion failures in the tv-casting-app example, which would help pinpoint the threading problem. I'm happy to close out this PR; I'll just wait a few days to see if anyone chimes in with pointers on the thread safety documentation. Thanks!
// that at the time of writing (2023/02/09) the SessionManager pointer, if non-null, is guaranteed to be
// valid because it is allocated in the parent Server class.
Unless you're not even a Server....
ExchangeContext * ec = ctx.NewExchangeToAlice(&mockSolicitedAppDelegate);

// Close the Exchange Context.
ec->Close();
After this ec is a dangling pointer....
ec->Close();

// Call public APIs to verify that they do not crash or fail the test.
NL_TEST_ASSERT(inSuite, ec->StartResponseTimer() == CHIP_ERROR_INTERNAL);
This absolutely can crash. It's calling a method on a deleted object!
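To illustrate why the pointer dangles, here is a small self-contained sketch of an object that frees itself when its last reference is released. `ExchangeModel` and `CloseAndForget` are hypothetical names, and the single-reference assumption is a simplification made for this example, not a description of the real ExchangeContext internals.

```cpp
#include <cassert>

// Hypothetical model of a reference-counted object that deletes itself on Close().
class ExchangeModel
{
public:
    static int sLiveCount; // number of objects currently alive

    static ExchangeModel * Create() { return new ExchangeModel(); }

    // Releases the caller's reference. In this model the caller holds the only
    // reference, so the object is destroyed before Close() returns.
    void Close()
    {
        assert(mRefCount > 0);
        if (--mRefCount == 0)
        {
            delete this;
        }
    }

private:
    ExchangeModel() { ++sLiveCount; }
    ~ExchangeModel() { --sLiveCount; }

    int mRefCount = 1;
};

int ExchangeModel::sLiveCount = 0;

// Closes an exchange, drops the pointer, and returns how many objects are still alive.
int CloseAndForget()
{
    ExchangeModel * ec = ExchangeModel::Create();
    ec->Close();
    ec = nullptr; // the object is gone; any later ec->... call would be undefined behavior
    return ExchangeModel::sLiveCount;
}
```

No check placed inside a member function can make a call after `Close()` safe here, because the call itself already dereferences freed memory.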
 * Tests that the Exchange Context APIs do not crash if delayed calls are made after the Exchange Context is
 * closed.
 */
void CheckExchangeContextDoesNotCrashWhenDelayedCallsOccurAfterClose(nlTestSuite * inSuite, void * inContext)
Also, this test does not compile, and if it compiled would not get run. Please make sure to verify that tests fail without the fix and pass with it...
I'm going to close this pull request and create a new one that addresses the root cause.
Fixes #25023
Problem
When an ExchangeMgr is shut down (ShutDown()), the SessionManager is set to nullptr and resources are released. However, if a child ExchangeContext is still in use on another thread (e.g. in the middle of SendMessage(...)), the ExchangeContext uses mExchangeManager->GetSessionManager() without any null safety checks. This can cause segmentation faults when the null SessionManager is dereferenced.
Solution
Add null checks on all usages of the SessionManager. Return error codes where appropriate / applicable, and fail gracefully / silently where it is safe to do so.
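A rough sketch of the kind of null check this description proposes, using hypothetical model types (`SessionManagerModel`, `ExchangeManagerModel`, `SendMessageChecked`, `SketchError`) rather than the real classes or error codes. It only covers a call made after shutdown has fully completed on the same thread. As the review discussion points out, a check like this cannot close the window between the check and the use when shutdown runs concurrently on another thread.

```cpp
#include <cassert>

// Hypothetical error type standing in for the SDK's error codes.
enum class SketchError
{
    kNoError,
    kIncorrectState
};

// Hypothetical session manager that just counts sends.
struct SessionManagerModel
{
    int sent = 0;
    void Send() { ++sent; }
};

// Hypothetical exchange manager that drops its session manager on shutdown.
class ExchangeManagerModel
{
public:
    explicit ExchangeManagerModel(SessionManagerModel * sessionManager) : mSessionManager(sessionManager)
    {
        assert(sessionManager != nullptr); // a manager starts out with a valid session manager
    }

    void Shutdown() { mSessionManager = nullptr; }

    SessionManagerModel * GetSessionManager() const { return mSessionManager; }

private:
    SessionManagerModel * mSessionManager;
};

// Returns an error instead of dereferencing a null session manager.
SketchError SendMessageChecked(ExchangeManagerModel & manager)
{
    SessionManagerModel * sessionManager = manager.GetSessionManager();
    if (sessionManager == nullptr)
    {
        return SketchError::kIncorrectState;
    }
    sessionManager->Send();
    return SketchError::kNoError;
}
```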