Resolve unsafe memory accesses in ExchangeContext (#25023) #25032

domichae-amazon · 2023-02-13T23:44:29Z

Problem

When an ExchangeMgr is shut down (ShutDown()), the SessionManager is set to nullptr and resources are released. However, if a child ExchangeContext is still in use on another thread (e.g. in the middle of SendMessage(...)), the ExchangeContext uses mExchangeManager->GetSessionManager() without any null safety checks. This can cause segmentation faults when the null SessionManager is dereferenced.

Solution

Add null checks on all usages of the SessionManager. Return error codes where applicable, and fail silently where it is safe to do so.

CLAassistant · 2023-02-13T23:44:34Z

All committers have signed the CLA.

github-actions · 2023-02-14T00:04:22Z

PR #25032: Size comparison from e7528bc to 6bac357

Increases (1 build for cc32xx)

platform	target	config	section	`e7528bc`	`6bac357`	change
cc32xx	lock	CC3235SF_LAUNCHXL	(read only)	640361	640385	24
			.debug_info	20180782	20180941	159
			.debug_line	2649926	2650018	92
			.debug_loc	2786017	2786273	256
			.text	532604	532628	24

Full report (1 build for cc32xx)

platform	target	config	section	`e7528bc`	`6bac357`	change
cc32xx	lock	CC3235SF_LAUNCHXL		0	0	0
			(read only)	640361	640385	24
			(read/write)	204084	204084	0
			.ARM.attributes	44	44	0
			.ARM.exidx	8	8	0
			.bss	197488	197488	0
			.comment	194	194	0
			.data	1476	1476	0
			.debug_abbrev	928461	928461	0
			.debug_aranges	87352	87352	0
			.debug_frame	299840	299840	0
			.debug_info	20180782	20180941	159
			.debug_line	2649926	2650018	92
			.debug_loc	2786017	2786273	256
			.debug_ranges	280728	280728	0
			.debug_str	3005287	3005287	0
			.ramVecs	780	780	0
			.resetVecs	64	64	0
			.rodata	105633	105633	0
			.shstrtab	232	232	0
			.stab	204	204	0
			.stabstr	441	441	0
			.stack	2048	2048	0
			.strtab	375902	375902	0
			.symtab	255856	255856	0
			.text	532604	532628	24

bzbarsky-apple

However, if a child ExchangeContext is still in use on another thread

Then you have a thread race. You can't be in the middle of SendMessage on one thread and shut things down on another thread without taking the relevant locks!
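
A minimal sketch of the locking discipline described above, using hypothetical stand-in types rather than the Matter SDK classes. `SessionManagerStub`, `ExchangeManagerStub`, `gStackLock`, `SendWithStackLock`, and `ShutDownWithStackLock` are all assumptions made for illustration. The point is only that send and shutdown take the same lock, so a shutdown cannot interleave with a send that is already running.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-ins; these are NOT the Matter SDK types.
struct SessionManagerStub
{
    int sent = 0;
    void Send(const std::string &) { ++sent; }
};

class ExchangeManagerStub
{
public:
    explicit ExchangeManagerStub(SessionManagerStub * sm) : mSessionManager(sm) { assert(sm != nullptr); }

    // Not thread-safe by itself: callers are expected to hold the stack lock.
    bool SendMessage(const std::string & payload)
    {
        if (mSessionManager == nullptr)
        {
            return false; // already shut down
        }
        mSessionManager->Send(payload);
        return true;
    }

    void ShutDown() { mSessionManager = nullptr; }

private:
    SessionManagerStub * mSessionManager;
};

// Hypothetical stand-in for a single stack-wide lock.
std::mutex gStackLock;

// Every thread that touches the manager goes through the same lock.
bool SendWithStackLock(ExchangeManagerStub & mgr, const std::string & payload)
{
    std::lock_guard<std::mutex> guard(gStackLock);
    return mgr.SendMessage(payload);
}

void ShutDownWithStackLock(ExchangeManagerStub & mgr)
{
    std::lock_guard<std::mutex> guard(gStackLock);
    mgr.ShutDown();
}
```

With this discipline the null check inside `SendMessage` is an ordinary state check made under the lock, not a defense against a concurrent teardown.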

bzbarsky-apple · 2023-02-14T01:30:50Z

src/messaging/ExchangeContext.cpp

+    // Verify that the Session Manager is still alive. Note that host applications may stop the chip::Server
+    // at any time, which will tear down the Session Manager, Exchange Manager, and Exchange Context. Note


No, they're not allowed to stop it at any time, unless they are holding the Matter stack lock. We should add an assert to that effect.

Can you share any references to this expectation in the codebase? It isn't enforced anywhere in ExchangeMgr or any other file in messaging/* except for a single usage in ExchangeContext at the beginning of SendMessage.

Can you share any references to this expectation in the codebase? I

Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs. Not all of them assert this, but that's been a matter of code size and time. If you call Matter APIs that mutate state on multiple threads without locking, that cannot possibly be safe and will lead to data/memory corruption and hence bugs.

For what it's worth, see #25041

And note that the lock asserts are no-ops on various platforms. Again, the expectation is that SDK consumers do the right thing and only touch SDK APIs with the locks held, either directly (which also does not work on some lock-free platforms) or via posting whatever they are doing to the Matter event loop.
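
The "post it to the event loop" option can be sketched with a generic single-threaded work queue. `EventLoopStub` and its `Post`/`Stop` methods are hypothetical names for illustration and are not the SDK's event loop API. The idea is that other threads never touch stack state directly; they hand a closure to the one thread that owns that state.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-threaded event loop; not the Matter SDK implementation.
class EventLoopStub
{
public:
    EventLoopStub() : mThread([this] { Run(); }) {}
    ~EventLoopStub() { Stop(); }

    // Safe to call from any thread: the work runs later on the loop thread.
    void Post(std::function<void()> work)
    {
        {
            std::lock_guard<std::mutex> guard(mMutex);
            mQueue.push(std::move(work));
        }
        mCond.notify_one();
    }

    // Runs everything already queued, then joins the loop thread.
    void Stop()
    {
        {
            std::lock_guard<std::mutex> guard(mMutex);
            if (mStopping)
            {
                return;
            }
            mStopping = true;
        }
        mCond.notify_one();
        mThread.join();
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(mMutex);
        for (;;)
        {
            mCond.wait(lock, [this] { return mStopping || !mQueue.empty(); });
            if (mQueue.empty())
            {
                return; // stopping and nothing left to run
            }
            std::function<void()> work = std::move(mQueue.front());
            mQueue.pop();
            lock.unlock(); // never run user work while holding the queue lock
            work();
            lock.lock();
        }
    }

    std::mutex mMutex;
    std::condition_variable mCond;
    std::queue<std::function<void()>> mQueue;
    bool mStopping = false;
    std::thread mThread; // declared last so the members above exist before Run() starts
};
```

Because all posted work executes on one thread, state that is only touched from posted closures needs no further locking.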

Which expectation? That you only touch the Matter SDK while holding the Matter lock? It's a basic API expectation in all Matter APIs.

It's surprising that this assertion would not be mentioned in all but one file in this module. As a newcomer, and having only read a few dozen files in the SDK, this is the only time that I've seen it mentioned. Is there a centralized document explaining this thread safety approach?

Again, the expectation is that SDK consumers do the right thing

This doesn't seem like a reasonable assumption for an SDK... public APIs should generally be defensive and assume that the caller can always make the wrong decision. Even if the SDK chooses not to fail gracefully (a valid choice), clients need immediate feedback to help them find bugs in their code. The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.

Is there a centralized document explaining this thread safety approach?

I don't know of one... @mrjerryjohns did something ever get written here?

To be clear, this is not great, but that's the state of things right now. Not enough hands....

public APIs should generally be defensive and assume that the caller can always make the wrong decision.

Sure. Again:

Not all platforms support checking the locking correctness.

Adding the asserts to all the places that should have them has nonzero cost both in developer time and final codesize on the platforms that do have such checking.

We do not have a clear distinction between private and public APIs right now (an issue in its own right).

We have been adding the asserts in an ad-hoc manner to high-value places that would have a good chance of catching bugs, but something more systematic would be welcome, obviously.
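
For illustration, a lock-ownership assert of the kind discussed here could look roughly like the sketch below. `TrackedStackLock` and `ASSERT_STACK_LOCKED` are hypothetical names, not the SDK's lock tracker, and the sketch assumes a platform with real threads, which the comments above note is not always the case.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical lock that remembers which thread currently holds it.
class TrackedStackLock
{
public:
    void Lock()
    {
        mMutex.lock();
        mOwner.store(std::this_thread::get_id());
    }

    void Unlock()
    {
        mOwner.store(std::thread::id()); // default id means "no owner"
        mMutex.unlock();
    }

    bool IsHeldByCurrentThread() const { return mOwner.load() == std::this_thread::get_id(); }

private:
    std::mutex mMutex;
    std::atomic<std::thread::id> mOwner{ std::thread::id() };
};

// Hypothetical assert macro; compiles to nothing when NDEBUG is defined,
// which keeps the cost out of release builds.
#define ASSERT_STACK_LOCKED(lock) assert((lock).IsHeldByCurrentThread())
```

An API entry point that mutates shared state would call `ASSERT_STACK_LOCKED(lock)` first, so a caller on the wrong thread fails immediately at the call site instead of crashing later somewhere unrelated.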

The current segmentation fault in the middle of SendMessage being serviced by a different thread doesn't obviously indicate that the client (the tv-casting-app) is doing something wrong.

Completely agreed! I'm just saying that the right fix here is not in SendMessage; that's going to cover up a thread race that's likely corrupting data elsewhere, at least probabilistically. The right fix is to get rid of the thread race.

Thanks for the good discussion on this issue, I agree that we should fix the thread race instead. With #25041 merged, we should start seeing assertion failures now in the tv-casting-app example which would help pinpoint the threading problem. I'm happy to close out this PR, I'll just wait a few days to see if anyone chimes in with pointers on the thread safety documentation. Thanks!

bzbarsky-apple · 2023-02-14T01:31:11Z

src/messaging/ExchangeContext.cpp

+    // that at the time of writing (2023/02/09) the SessionManager pointer, if non-null, is guaranteed to be
+    // valid because it is allocated in the parent Server class.


Unless you're not even a Server....

bzbarsky-apple · 2023-02-14T01:31:29Z

src/messaging/tests/TestMessagingLayer.cpp

+    ExchangeContext * ec = ctx.NewExchangeToAlice(&mockSolicitedAppDelegate);
+
+    // Close the Exchange Context.
+    ec->Close();


After this ec is a dangling pointer....

bzbarsky-apple · 2023-02-14T01:31:46Z

src/messaging/tests/TestMessagingLayer.cpp

+    ec->Close();
+
+    // Call public APIs to verify that they do not crash or fail the test.
+    NL_TEST_ASSERT(inSuite, ec->StartResponseTimer() == CHIP_ERROR_INTERNAL);


This absolutely can crash. It's calling a method on a deleted object!
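
The lifetime problem in the two review comments above can be sketched with a hypothetical reference-counted object. `RefCountedStub` is an assumption for illustration and is not `ExchangeContext`; it only assumes that `Close()` drops the creator's reference and that the object deletes itself when the count reaches zero. Under that assumption, a test that wants to call methods after `Close()` must hold its own reference first.

```cpp
#include <cassert>

// Hypothetical self-deleting, reference-counted object.
class RefCountedStub
{
public:
    // destroyedFlag lets the caller observe destruction without touching the object.
    static RefCountedStub * Create(bool * destroyedFlag)
    {
        assert(destroyedFlag != nullptr);
        return new RefCountedStub(destroyedFlag);
    }

    void Retain() { ++mRefCount; }

    void Release()
    {
        if (--mRefCount == 0)
        {
            delete this;
        }
    }

    // Marks the object closed and drops the creator's reference. If nobody
    // else holds a reference, the object is gone when this returns.
    void Close()
    {
        mClosed = true;
        Release();
    }

    bool IsClosed() const { return mClosed; }

private:
    explicit RefCountedStub(bool * destroyedFlag) : mDestroyed(destroyedFlag) {}
    ~RefCountedStub() { *mDestroyed = true; }

    int mRefCount = 1;
    bool mClosed  = false;
    bool * mDestroyed;
};
```

Without the extra `Retain()`, any call made through the pointer after `Close()` is a use-after-free, so a passing test in that situation proves nothing.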

bzbarsky-apple · 2023-02-14T04:43:29Z

src/messaging/tests/TestMessagingLayer.cpp

+ * Tests that the Exchange Context APIs do not crash if delayed calls are made after the Exchange Context is
+ * closed.
+ */
+void CheckExchangeContextDoesNotCrashWhenDelayedCallsOccurAfterClose(nlTestSuite * inSuite, void * inContext)


Also, this test does not compile, and if it compiled would not get run. Please make sure to verify that tests fail without the fix and pass with it...

domichae-amazon · 2023-02-17T21:39:54Z

I'm going to close this pull request and create a new one that addresses the root cause.

Resolve unsafe memory accesses in ExchangeContext (project-chip#25023)

6bac357

pullapprove bot requested review from mlepage-google, mrjerryjohns, msandstedt, mspang, pjzander-signify, robszewczyk, saurabhst, selissia, tecimovic, tehampson, turon, vijs, vivien-apple, woody-apple, xylophone21 and yufengwangca February 13, 2023 23:48

pullapprove bot added the review - pending label Feb 13, 2023

sharadb-amazon approved these changes Feb 13, 2023

bzbarsky-apple requested changes Feb 14, 2023

bzbarsky-apple reviewed Feb 14, 2023

pullapprove bot requested a review from kkasperczyk-no February 14, 2023 18:09

woody-apple enabled auto-merge February 14, 2023 20:51

auto-merge was automatically disabled February 15, 2023 13:42
Merge queue setting changed

domichae-amazon closed this Feb 17, 2023

domichae-amazon deleted the feature/25023-bugfix branch February 17, 2023 21:40
