Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP32 NimBLE panic when closing connection #1172

Open
beckerzito opened this issue Jul 13, 2023 · 11 comments
Open

ESP32 NimBLE panic when closing connection #1172

beckerzito opened this issue Jul 13, 2023 · 11 comments

Comments

@beckerzito
Copy link

beckerzito commented Jul 13, 2023

Build environment: macOS, Windows, or Linux
Moddable SDK version: OS220805b
Target device: ESP32 platforms

Description
When using the NimBLE library, in some conditions I can see that if some BLE communication happens while the BLE connection is being closed, a panic exception is raised, with LoadProhibited. Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Steps to Reproduce

  1. Build any application with BLE communication using NimBLE library (for example test-server example), in which the BLE is closed after some time.
  2. Make sure you have a BLE client connected and writing/reading to characteristics of the BLE Server while the close is requested.
  3. See exception happening.

Expected behavior
The communication should close properly, without any exceptions raised.

Images
From an initial analysis, what I could observe is:
image

As soon as the close is called, the modBLEServer.c already sets the global instance gBLE to NULL. The same instance is used by the GATT callbacks when events happen on the BLE communication.

The module then posts a message to the machine to call the function bleServerCloseEvent that is going to call the function void xs_ble_server_destructor that properly deinitializes the NIBLE driver and closes all the current connections.
image

So, once the NimBLE has its own thread to receive BLE events, it seems a concurrence problem. If right before the close event is executed by the event loop, a message on BLE arrives, the NimBLE task would be running the callback waiting for the requestPending property to be set to False by the readEvent, which is also in the event queue. (see below).

image

As soon as the close is executed in the event loop, the gBLE is set to NULL and the while loop on the NimBLE callback will crash with load prohibited.

I imagine that the quick win here would be to set the gBLE to NULL only after resetting the gatt server. On the other hand, I can see also some opportunities to improve the protection against the gBLE usage by multiple threads on the long term.

Note: I checked that in newer versions we still have the same code.

@mkellner
Copy link
Collaborator

Hello, can you please provide a stack trace with the failure?

Can you please provide some BLE client code to reproduce the problem?

@beckerzito
Copy link
Author

Hi @mkellner

Here's the stack trace:

I (151016) NimBLE: GAP procedure initiated: stop advertising.

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4010df58  PS      : 0x00060e30  A0      : 0x801530a8  A1      : 0x3ffdd6e0  
0x4010df58: gatt_svr_chr_dynamic_value_access_cb at .../moddable/modules/network/ble/nimble/modBLEServer.c:890

A2      : 0x00000000  A3      : 0x00000001  A4      : 0x3ffdd750  A5      : 0x00000005  
A6      : 0x3ffc3940  A7      : 0x3ffd5704  A8      : 0x8010df55  A9      : 0x3ffdd6c0  
A10     : 0x00000005  A11     : 0x3ffbf428  A12     : 0x00000000  A13     : 0x4010e464  
0x4010e464: readEvent at .../moddable/modules/network/ble/nimble/modBLEServer.c:513

A14     : 0x3ffdd750  A15     : 0x00060820  SAR     : 0x00000010  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000076  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000  


Backtrace:0x4010df55:0x3ffdd6e00x401530a5:0x3ffdd710 0x40153139:0x3ffdd740 0x40156bcd:0x3ffdd780 0x40156e6e:0x3ffdd7c0 0x40157565:0x3ffdd7e0 0x4014eed5:0x3ffdd810 0x4014edab:0x3ffdd840 0x4014e6e0:0x3ffdd870 0x4014e6eb:0x3ffdd890 0x401559c2:0x3ffdd8b0 0x4010eec3:0x3ffdd8d0 
0x4010df55: gatt_svr_chr_dynamic_value_access_cb at .../moddable/modules/network/ble/nimble/modBLEServer.c:890

0x401530a5: ble_gatts_val_access at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_gatts.c:375
 (inlined by) ble_gatts_val_access at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_gatts.c:340

0x40153139: ble_gatts_chr_val_access at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_gatts.c:421

0x40156bcd: ble_att_svr_read at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_att_svr.c:398

0x40156e6e: ble_att_svr_read_handle at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_att_svr.c:473

0x40157565: ble_att_svr_rx_read at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_att_svr.c:1478

0x4014eed5: ble_att_rx at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_att.c:503

0x4014edab: ble_hs_hci_evt_acl_process at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs_hci_evt.c:927

0x4014e6e0: ble_hs_process_rx_data_queue at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:238

0x4014e6eb: ble_hs_event_rx_data at .../esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:532

0x401559c2: ble_npl_event_run at .../esp-idf/components/bt/host/nimble/nimble/porting/npl/freertos/include/nimble/nimble_npl_os.h:125
 (inlined by) nimble_port_run at .../esp-idf/components/bt/host/nimble/nimble/porting/nimble/src/nimble_port.c:78

0x4010eec3: nimble_host_task at .../moddable/modules/network/ble/nimble/modBLECommon.c:45
 (inlined by) ble_host_task at .../moddable/modules/network/ble/nimble/modBLECommon.c:51

About the BLE client, I don't have any example ready... I'm reproducing the issue with a special app I'm working on.
Can you let me know what is the way you test your BLE Server examples? Then I can try to create some BLE client code in the same platform you use.

Finally, I put together the idea of resetting the gBLE only after resetting the gatt server, but also fixed some concurrence conditions by using modCriticalSection. You can find a patch attached here. diff.patch

Let me know what you think!

Thanks!

@phoddie
Copy link
Collaborator

phoddie commented Jul 17, 2023

Thanks for the stack trace.

We found a way to reproduce the problem reliably by using the name-change-server example with a small change to have the onCharacteristicRead callback close the BLE server. Then, using nRF Connect on iOS, the crash triggers by connecting and waiting a second.

Thanks for sharing the patch. The use of critical sections may help in places, but some of them don't seem likely to have an effect.

+	modCriticalSectionBegin();
+	xsMachine * the = gBLE->the;
+	modCriticalSectionEnd();

We have a fix that we've successfully tested that uses a reference count to ensure that the BLE server isn't disposed while in use. That allows gatt_svr_chr_dynamic_value_access_cb to complete cleanly if the connection is closed during its execution.

@beckerzito
Copy link
Author

Hi @phoddie,

You're right, there are some cases unnecessarily protected.
But I'm glad you were able to find a fix!

Let me know how we can get that. Appreciate your support once more!

@mkellner
Copy link
Collaborator

Please try:

https://gist.github.com/mkellner/80a28e35da39d9d3f5e4f68db67b8182

Replace $MODDABLE/modules/network/ble/nimble/modBLEServer.c with this version. Let us know if it has solved your issue.

@mkellner
Copy link
Collaborator

mkellner commented Jul 20, 2023

Hello @beckerzito,

Did the modBLEServer.c provided to you in the gist above fix your problem?

@beckerzito
Copy link
Author

Hi @mkellner and @phoddie

I have tested the version lately and it seems that for the specific issue of communication during disconnection it solved the issue. But now, with these changes I'm getting a new problem, never faced before...

It's not a consistent issue but with a very high frequency (Android 13) and sometimes in (iOs and Android 11), the pairing PIN is not generated, then those clients end up closing the connection.

I wasn't able to find the root cause yet but just wanted to point out this issue that I'm supposing is related to this update, once it wasn't running before!

I'll keep you posted on the findings, but just wondering if sharing that you could have some insight what might be happening... Maybe some timeout protection on the BLE client? In which the BLE Server on firmware is waiting for something to continue the connection procedure?

@phoddie
Copy link
Collaborator

phoddie commented Aug 4, 2023

@beckerzito – thanks for the update.

Glad to hear that the changes resolve the problems at close.

It's not a consistent issue but with a very high frequency (Android 13) and sometimes in (iOs and Android 11), the pairing PIN is not generated, then those clients end up closing the connection.

That's unexpected. Do you have a way to reproduce that with one of the example apps or another project you can share? We can take a look using the BLE security-server example but there may be differences in your configuration.

pairing PIN is not generated

You might try enabling logging in the BLE server to see what is different between the success and failure cases. That is done by changing both LOG_GAP and LOG_GATT from 0 to 1.

Maybe some timeout protection on the BLE client?

That would be on the mobile device side (Android or iOS). I can't say. Presumably they timeout eventually, but there's no obvious reason that the ESP32 should be particularly slow.

@phoddie
Copy link
Collaborator

phoddie commented Aug 18, 2023

@beckerzito - any update here. Pairing works in my testing. You indicate that this isn't 100% reproducible, so maybe there's some special condition needed.

@beckerzito
Copy link
Author

hi @phoddie sorry for the long delay. Unfortunately, I couldn't work to understand why the proposed change is not working properly. But it's clear that it doesn't work during the integration tests in some Android platforms or even iOS versions. As I described, the pins are not generated and the clients end up closing the connection (at least this is the observed behavior).

I apologize for not being able to go deeper on that and creating the app to reproduce it, it's hard to reproduce in an example where the CPU usage is very low... My hypothesis is that it appears in critical CPU usages.... (high number of threads, interrupts, and so on).

My only point is that I don't suggest going forward with the proposed change. So far we have been working with the patch I sent on the issue, and all the tests performed pretty well.

@phoddie
Copy link
Collaborator

phoddie commented Nov 22, 2023

@beckerzito – Thank you for the update. As you will understand, we are most effective at fixing problems when we can reproduce them.

Still, these changes do fix some real problems. We'll have to revisit that to decide whether it makes sense to integrate them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants