Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unable to perform factory reset after removing fabric during ota #24329

Closed
fuxiaoming-lumi opened this issue Jan 9, 2023 · 17 comments · Fixed by #24818
Closed

[BUG] Unable to perform factory reset after removing fabric during ota #24329

fuxiaoming-lumi opened this issue Jan 9, 2023 · 17 comments · Fixed by #24818

Comments

@fuxiaoming-lumi
Copy link
Contributor

Reproduction steps

  1. Use chip-tool to commission DUT(Matter over Thread) and chip-ota-provider-app to the same fabric.
  2. Configure the OTA Provider with the access control list (ACL) that grants Operate privileges to all nodes in the fabric.
  3. Use chip-tool to send the Announce OTA Provider command to the DUT and trigger OTA progress.
  4. Use chip-tool to remove-fabric when the ota progress is about 10%.
  5. In this case, the factory reset can‘t be performed successfully ( API - chip::Server::GetInstance().ScheduleFactoryReset() ).
  6. Reboot the DUT and the factory reset can be performed successfully.

Bug prevalence

100% reproduce

GitHub hash of the SDK that was being used

83c3b6b

Platform

efr32, k32w

Platform Version(s)

No response

Anything else?

With NXP K32W Platform:
chip-ota-provider-app-with-nxp-k32w.txt
chip-tool-nxp-k32w.txt
dut-nxp-k32w.txt

With Silicon Labs MG24 Platform:
chip-ota-provider-app-silicon-mg24.txt
chip-tool-silicon-mg24.txt
dut_silicon-mg24.txt

It is necessary to verify whether other platform has the same problem.

@bzbarsky-apple
Copy link
Contributor

In this case, the factory reset can‘t be performed successfully ( API - chip::Server::GetInstance().ScheduleFactoryReset() ).

What happens when you call the API? Is the lambda just not scheduled, or does it for some reason not do the right thing @fuxiaoming-lumi ?

@marius-alex-tache
Copy link
Contributor

I can reproduce this on K32W0 contact sensor. What I'm seeing is that the app event queue becomes full because the events are never consumed. I'm not sure why this happens (edit: please see the note below). I've put a breakpoint inside AppTaskMain main loop, but it never triggers.

Actually, it seems that there is a failed assert, but somehow the button interrupts are still handled.
image

Is there a race condition somewhere?

@bzbarsky-apple
Copy link
Contributor

Does the OTA stuff keep running even though the fabric has been removed or something?

@marius-alex-tache
Copy link
Contributor

I saw that some packets sent by OTAP were still received after removing the fabric (at least I saw some UDP message received logs), but I'm not certain if I remember correctly. Will have to test more to confirm.

@marius-alex-tache
Copy link
Contributor

marius-alex-tache commented Jan 30, 2023

[117880[D][DMG]Command handler moving to [ Preparing]
[117885[D][DMG]Command handler moving to [AddingComm]
[117897[D][DMG]Command handler moving to [AddedComma]
----> [117903[D][IN]Expiring all sessions for fabric 0x1!!
[117912[D][IN]SecureSession[0x400da18]: MarkForEviction Type:2 LSID:21793
[117922[P][SC]SecureSession[0x400da18]: Moving from state 'kActive' --> 'kPendingEviction'
[117931[D][IN]SecureSession[0x400da18]: Released - Type:2 LSID:21793
[117941[D][IN]SecureSession[0x400dad0]: MarkForEviction Type:2 LSID:21796
[117950[P][SC]SecureSession[0x400dad0]: Moving from state 'kActive' --> 'kPendingEviction'
[117960[D][IN]SecureSession[0x400db88]: MarkForEviction Type:2 LSID:21795
[117970[P][SC]SecureSession[0x400db88]: Moving from state 'kActive' --> 'kPendingEviction'
[117981[D][IN]SecureSession[0x400db88]: Released - Type:2 LSID:21795
[117988[D][IN]SecureSession[0x400dc40]: MarkForEviction Type:2 LSID:21797
[117999[P][SC]SecureSession[0x400dc40]: Moving from state 'kActive' --> 'kPendingEviction'
[118010[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
----> [118019[D][IN]SecureSession[0x400dc40]: Released - Type:2 LSID:21797
[118030[D][DMG]Decreasing reference count for CommandHandler, remaining 0
[118037[P][EM]<<< [E:58927r M:118536181 (Ack:94750963)] (S) Msg TX to 1:000000000001B669 [0000] --- Type 0001:09 (IM:InvokeCommandResponse)
[118053[P][IN](S) Sending msg 118536181 on secure session with LSID: 21796
[118061[D][DMG]Command handler moving to [CommandSen]
[118070[D][DMG]Command handler moving to [AwaitingDe]
[118076[E][DL]Long dispatch time: 848 ms, for event type 3
[118084[P][EM]>>> [E:58926r M:199548407 (Ack:166991627)] (U) Msg RX from 0:8B8E894580E092EE [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[118103[D][EM]Found matching exchange: 58926r, Delegate: 0
[118110[D][EM]Rxd Ack; Removing MessageCounter:166991627 from Retrans Table on exchange 58926r
----> [118120[E][IN]Data received on an unknown session (LSID=21797). Dropping it!
[118127[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118136[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118144[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118159[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118174[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118188[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118209[D][IN]UDP Message Received packet nb : 42 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:cf57:8fd8:ccac:a903[5540] Payload Length 1062
[118229[D][IN]UDP Message Received packet nb : 43 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[44291] DestAddr : fd11:33:0:0:cf57:8fd8:ccac:a903[5540] Payload Length 34
Solitary ack for MessageCounter:41142393 on exchange 35046i:2
[118250[E][IN]Data received on an unknown session (LSID=21797). Dropping it!
[118258[P][EM]>>> [E:58927r M:94750964 (Ack:118536181)] (S) Msg RX from 1:000000000001B669 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[118271[D][EM]Found matching exchange: 58927r, Delegate: 0
[118277[D][EM]Rxd Ack; Removing MessageCounter:118536181 from Retrans Table on exchange 58927r
[118286[D][IN]SecureSession[0x400dad0]: Released - Type:2 LSID:21796
[118292[E][EM]Failed to send Solitary ack for MessageCounter:41142393 on exchange 35046i:2

Seems like a session is used after it's released (hence the assert in the stack trace above). Marked some entries with ---->.

Edit: this behavior is seen on contact-sensor app, which is a SED (ConnectivityManager::kThreadDeviceType_SleepyEndDevice), but is not seen on lighting-app, which is a MED (ConnectivityManager::kThreadDeviceType_MinimalEndDevice). Thus, the issue might be related to the polling mechanism of the SED.

@bzbarsky-apple
Copy link
Contributor

@marius-alex-tache Is a fabric removing another fabric, or is a fabric self-removing itself in this case?

@marius-alex-tache
Copy link
Contributor

marius-alex-tache commented Jan 31, 2023

The use case is done using only one controller, so it's self-removing.
I'm adding one more log that contains additional details:

[34112[D][IN]Expiring all sessions for fabric 0x1!!
[34117[D][IN]SecureSession[0x400da18]: MarkForEviction Type:2 LSID:27998
[34123[P][SC]SecureSession[0x400da18]: Moving from state 'kActive' --> 'kPendingEviction'
[34132[D][IN]SecureSession[0x400da18]: Released - Type:2 LSID:27998
[34138[D][IN]SecureSession[0x400dad0]: MarkForEviction Type:2 LSID:28001
[34145[P][SC]SecureSession[0x400dad0]: Moving from state 'kActive' --> 'kPendingEviction'
[34153[D][IN]SecureSession[0x400db88]: MarkForEviction Type:2 LSID:28000
[34160[P][SC]SecureSession[0x400db88]: Moving from state 'kActive' --> 'kPendingEviction'
[34169[D][IN]SecureSession[0x400db88]: Released - Type:2 LSID:28000
[34175[D][IN]SecureSession[0x400dc40]: MarkForEviction Type:2 LSID:28002
[34182[P][SC]SecureSession[0x400dc40]: Moving from state 'kActive' --> 'kPendingEviction'
---->[34190[E][BDX]exchange timed out
[34193[D][BDX]aborting due to timeout
[34198[E][SWU]Invalid output image file supplied
[34203[D][EVL]LogEvent event number: 0x0000000000000007 priority: 1, endpoint id:  0x0 cluster id: 0x0000[34219[D][IN]UDP Message Received packet nb : 37 SrcAddr : fd11:3_002A event id: 0x2 Sys timestamp: 0x000000000000859A
[34234[D][DMG]Endpoint 0,3:0:0:573e:df41:e3b0:afa6[36973] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 62
 Cluster 0x0000_002A update version to 905e3181
[34250[P][ZCL]MatterPostAttributeChangeCallback, value:1

[34256[P][ZCL]Unknown cluster ID: 0x0000_002A
[34261[D][DMG]Endpoint 0, Cluster 0x0000_002A update version to 905e3182
[34268[P][ZCL]MatterPostAttributeChangeCallback, value:255

[34273[P][ZCL]Unknown cluster ID: 0x0000_002A
[34278[D][EVL]LogEvent event number: 0x0000000000000008 priority: 1, endpoint id:  0x0 cluster id: 0x0000_002A event id: 0x0 Sys timestamp: 0x00000000000085E6
---->[34293[P][SWU]Unknown idle state reason so set the periodic timer for a next attempt
---->[34301[P][SWU]Stopping the watchdog timer
---->[34305[P][SWU]Starting the periodic query timer, timeout: 86400 seconds
[34311[D][IN]SecureSession[0x400dc40]: Released - Type:2 LSID:28002
[34318[D][DMG]Decreasing reference count for CommandHandler, remaining 0
[34326[P][EM]<<< [E:36885r M:65937972 (Ack:161090354)] (S) Msg TX to 1:000000000001B669 [0000] --- Type 0001:09 (IM:InvokeCommandResponse)
[34339[P][IN](S) Sending msg 65937972 on secure session with LSID: 28001
[34346[D][DMG]Command handler moving to [CommandSen]
[34351[D][DMG]Command handler moving to [AwaitingDe]
[34356[E][DL]Long dispatch time: 926 ms, for event type 3
[34362[P][EM]>>> [E:36884r M:252007273 (Ack:123487886)] (U) Msg RX from 0:52D97D1131185C1A [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[34378[D][EM]Found matching exchange: 36884r, Delegate: 0
[34383[D][EM]Rxd Ack; Removing MessageCounter:123487886 from Retrans Table on exchange 36884r
[34397[E][IN]Data received on an unknown session (LSID=28002). Dropping it!
[34409[D][IN]Received a duplicate message with MessageCounter:161090354 on exchange 36885r
[34419[P][EM]>>> [E:36885r M:161090354] (S) Msg RX from 1:000000000001B669 [0000] --- Type 0001:08 (IM:InvokeCommandRequest)
[34434[D][EM]Found matching exchange: 36885r, Delegate: 0
[34445[D][EM]Forcing tx of solitary ack for duplicate MessageCounter:161090354 on exchange 36885r
[34455[P][EM]<<< [E:36885r M:65937973 (Ack:161090354)] (S) Msg TX to 1:000000000001B669 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[34470[P][IN](S) Sending msg 65937973 on secure session with LSID: 28001
[34568[D][IN]UDP Message Received packet nb : 38 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 1062
[34589[E][IN]Data received on an unknown session (LSID=28002). Dropping it!
[34817[D][EM]Retransmitting MessageCounter:65937972 on exchange 3[34827[D][IN]UDP Message Received packet nb : 39 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 1062
[34846[D][IN]UDP Message Received packet nb : 40 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[36973] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 34
6885r Send Cnt 1
[34864[P][IN](S) Sending msg 65937972 on secure session with LSID: 28001
[34871[E][IN]Data received on an unknown session (LSID=28002). Dropping it!
[34879[P][EM]>>> [E:36885r M:161090355 (Ack:65937972)] (S) Msg RX from 1:000000000001B669 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[34892[D][EM]Found matching exchange: 36885r, Delegate: 0
[34898[D][EM]Rxd Ack; Removing MessageCounter:65937972 from Retrans Table on exchange 36885r
---->[34907[P][DL]OpenThread SED interval is 1000ms
[34911[D][IN]SecureSession[0x400dad0]: Released - Type:2 LSID:28001
[35014[D][DL]SRP update succeeded
[35199[D][IN]UDP Message Received packet nb : 41 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 1062
[35219[E][IN]Data received on an unknown session (LSID=28002). Dropping it!
[35471[D][EM]Retransmitting MessageCounter:19604937 on exchange 36121i Send Cnt 1
[37462[D][IN]UDP Message Received packet nb : 42 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 1062
[38607[D][IN]UDP Message Received packet nb : 43 SrcAddr : fd11:33:0:0:573e:df41:e3b0:afa6[5540] DestAddr : fd11:33:0:0:8add:7b25:7f49:86c1[5540] Payload Length 1062
---->At this point, the assert fails.

@bzbarsky-apple
Copy link
Contributor

@marius-alex-tache Can you please attach, not paste, a full log, not one starting with the "[34112[D][IN]Expiring all sessions for fabric 0x1!!" bit? I need to understand what happened before that bit, because for self-removing we should not really be getting there at all.

@marius-alex-tache
Copy link
Contributor

marius-alex-tache commented Feb 1, 2023

Hi, sorry about the logs. I've attached one: contact-sensor-remove-fabric.log. You can see the number of OTAP retries (8 - max retries number) in the last lines of the log.

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Feb 1, 2023

OK, so in that log at the point when we receive RemoveFabric we have three sessions around:

  • 38976, which was used to receive AnnounceOTAProvider from node 0x1B669.
  • 38977, which was used to receive RemoveFabric from node 0x1B669
  • 38978, which was used to send QueryImage to node 1 and is being used for the BDX transfer.

Then the following things happen:

  1. Session 38976 is destroyed immediately, since no one is currently using it (timestamp 23940 in the log)
  2. Session 38978 notifies its consumer that it's being evicted, that leads to the "[BDX]exchange timed out" log and the "[BDX]aborting due to timeout" log, and then the session is destroyed (timestamp 24087 in the log).
  3. Session 38977 is used to send the response to RemoveFabric and sticks around while that's getting acked and whatnot. But I also see that the client seems to not get the response and resends the request, and then we try to ack that new request, etc... I don't ever see session 38977 being actually released in this log, so I would expect its exchange to still be pointing to it.

Looking at the screenshot above, it's not 100% clear where we are aborting, because a lot of the line numbers on the right are cut off. Is it on this line in ReliableMessageMgr::ExecuteActions:

            // Don't check whether the session in the exchange is valid, because when the session is released, the retrans entry is
            // cleared inside ExchangeContext::OnSessionReleased, so the session must be valid if the entry exists.
            SessionHandle session = entry->ec->GetSessionHandle();

or somewhere else? If it's on that line, then what would be really useful would be a complete log and from the same run the value of entry->ec (the ExchangeContext *) at the crash point. Then we can try to figure out which exchange, and hence which session, is involved here. @marius-alex-tache

@marius-alex-tache
Copy link
Contributor

I've added a log at the beginning of the lambda function. Please see this log: contact-sensor-remove-fabric-print-exchange-context.log. It seems the exchange context is 6275i, which is the context related to BDX messages.

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Feb 1, 2023

Thanks, I think I know what's going on now. In that log, we tear down the session here:

[49100[P][SC]SecureSession[0x400fcb0]: Moving from state 'kActive' --> 'kPendingEviction'
[49113[E][BDX]exchange timed out
[49117[D][BDX]aborting due to timeout
...
[49202[D][IN]SecureSession[0x400fcb0]: Released - Type:2 LSID:19002

Now the exchange should have been removed from the MRP bits as part of that, but it's not because we end up doing the following:

  1. ExchangeContext::OnSessionReleased is called. We are not in the IsSendExpected state and are in the IsResponseExpected state, so we notify our delegate that we have timed out.
  2. This calls BDXMessenger::OnResponseTimeout (that's our delegate).
  3. That calls BDXDownloader::OnDownloadTimeout.
  4. That calls SetState(State::kIdle, OTAChangeReasonEnum::kTimeOut);
  5. BDXDownloader::SetState calls mStateDelegate->OnDownloadStateChanged().
  6. That lands us in DefaultOTARequestor::OnDownloadStateChanged, which sees we are entering the kIdle state and calls mBdxMessenger.Reset();
  7. BDXMessenger::Reset does mExchangeCtx->Close();
  8. This calls ExchangeContext::DoClose(false), so that we do not remove the exchange from the MRP tracking.
  9. Now the stack unwinds and we call Abort in ExchangeContext::OnSessionReleased, but we are already closed so the DoClose(true) call that makes ignored.
  10. Now we have a reference to a session-less exchange sitting around and that blows up.

The key part: Item 7 above is an API contract violation. You are NOT allowed to call Close from inside OnResponseTimeout. If we did not run into this other issue, we would run into the problem that we are going to underflow the exchange refcount when the stack unwinds and we do the Abort...

@Damian-Nordic @holbrookt @carol-apple what is the best way to fix this code to follow the expected contract for exchanges? Some thoughts:

  • It looks like we can't just rely on OTAChangeReasonEnum::kTimeOut because that's used for things other than exchange timeouts.
  • We could do something where we null out mExchangeCtx before calling OnDownloadTimeout in BDXMessenger::OnResponseTimeout, if we are sure nothing under OnDownloadTimeout will try to use the exchange.
  • We could keep some state in BDXMessenger where we flag that we are inside OnResponseTimeout and ignore Reset calls during that time, but that makes it use a bit more memory.

Other options?

@Damian-Nordic
Copy link
Contributor

Nulling out mExchangeCtx in OnDownloadTimeout sounds like the best option to me. mExchangeCtx is a private member of BDXMessenger, so the messenger can drop the exchange whenever it no longer needs it and knows that it will be destroyed properly.

@bzbarsky-apple
Copy link
Contributor

@marius-alex-tache Could you verify that the change in #24818 fixes the issue for you?

@marius-alex-tache
Copy link
Contributor

I confirm the fix works. Thank you! @fuxiaoming-lumi, could you also try to test on your side with this fix?

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Feb 2, 2023
@jrhees-cae
Copy link
Contributor

I can also confirm this bug and that the fix corrects the issue (on a Nordic/Thread platform).

@fuxiaoming-lumi
Copy link
Contributor Author

I also confirm this bug is fixed with our product. Thanks all your support!

kkasperczyk-no pushed a commit to kkasperczyk-no/sdk-connectedhomeip that referenced this issue Mar 15, 2023
kkasperczyk-no pushed a commit to kkasperczyk-no/sdk-connectedhomeip that referenced this issue Mar 15, 2023
lecndav pushed a commit to lecndav/connectedhomeip that referenced this issue Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants