Arduino LMIC stops after some time (6-9 hours) and doesn’t sends data #968
Comments
Hi, Additional information, as I think I have the same problem with an STM32L081CZ and an SX1272 radio. [2024-10-20 00:26:40.749] Current interrupts:1000 Or the next with a bit more detail: Current interrupts:1000 The ostime_t is a uint32_t, so it should neither overflow nor become negative, and I don't understand that behavior. |
Negative time values are expected and not a problem. The LMIC is coded taking that overflow into account. It's one of the two possible styles of handling time overflow. The classic one is using |
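For readers following along, the overflow-tolerant style described above can be sketched roughly as follows. This is not the LMIC's own code: `tick_t`, `ticks_until`, and `is_due` are hypothetical names invented for illustration. The idea is only that a deadline is tested by the sign of a wrapped difference, so it does not matter whether the raw timestamps are positive or negative.

```c
#include <stdint.h>

/* Hypothetical stand-in for a signed 32-bit tick counter such as ostime_t. */
typedef int32_t tick_t;

/* Ticks remaining until `deadline`, computed modulo 2^32.
 * The subtraction is done on unsigned values, where wraparound is well
 * defined; only the result is reinterpreted as a signed distance. */
static int32_t ticks_until(tick_t now, tick_t deadline) {
    return (int32_t)((uint32_t)deadline - (uint32_t)now);
}

/* Nonzero once `deadline` has been reached or passed. */
static int is_due(tick_t now, tick_t deadline) {
    return ticks_until(now, deadline) <= 0;
}
```

With `now` set to `INT32_MAX - 10` and a deadline 100 ticks later (a large negative number once the counter wraps), `ticks_until` still reports 100 and `is_due` stays false. The scheme only works while the two timestamps are less than 2^31 ticks apart.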
Thank you very much for your quick reply, I really appreciate it! |
Sorry you're having problems. Because other users don't have this problem (specifically, because MCCI has run many devices for many months at a time), I have to guess that it's a system problem (the combination of LMIC and the rest of the firmware on the system) rather than a bug that is localized to the LMIC. To solve these, one needs to look at the entire code base. I suggest that you file a separate ticket, and attach a complete code base that demonstrates the problem -- of course, if you don't want to disclose your full code, you may need to develop a toy version that shows the bug. Sometimes preparing the toy version makes the bug go away; and that also gives a clue as to what to look for. |
Thank You, I will do so! |
Please remember to tag me there, I am also having the problem. |
In the last few days I have been testing my node with the basic ttn-otaa example program, which I have also attached, but the problem still exists. The behavior of the node is the same. The following is from a screenshot of the ChirpStack server DeviceData: It can be seen that scheduled data transmissions start to be missed after some time (usually 7-10 hours), and the node then uploads data only once or twice a day at apparently random times. The original data transmission interval was 10 minutes, but the phenomenon is the same with 5 minutes as well. After a restart, everything starts from the beginning; that is the workaround applied by @mirhamza708 |
Once again. I'm sure everyone thinks they've given me an exact repro, but nobody has. It does not happen for me, with thousands of devices deployed and a variety of applications. In order to debug this, I need:
I know this might be too much work. But I won't be able to respond unless someone undertakes this. This is free software, folks: if you have a problem, the onus is partially on you to help replicate in order to get support. Best regards, |
I will try to share the requested information. I have not used TTN but I will set this up. I will try to share a complete example with the issue being properly elaborated. Thank you. |
Hi, Many thanks and Regards, |
Thank you for posting the files. I have reviewed the code. I see you have serial debug output enabled. Can you please also post the debug output when things start to malfunction? Best regards, --Terry |
Hi, Current interrupts:1000 Also, attached a whole debug log file containing when a node is starting until the malfunction happens. Thanks & Regards, |
Thank you for posting the log file. At line 1509, I see I see that ALL transmits in the recent past prior to the failure are on channel 867,700,000 Hz. This behavior started at around line 269; prior to that it was using the other channels. I don't know why the LMIC started to stick on one channel. The device is transmitting at SF12. Based on the times printed, it appears that your message takes about 2.3 seconds of airtime. The fact that there are no more launches of the transmission strongly indicates that It is possible that on your BSP, something bad is happening to the LMIC's idea of time -- however, that wouldn't account for the missing text between 1511 and 1512. In addition, by the way, you have an error in your
suggesting that the sent was scheduled, even though it was not. The Try fixing the |
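As a rough cross-check of the airtime figure above, here is a minimal sketch of the LoRa time-on-air calculation, based on the symbol-count formula published in the Semtech SX127x datasheets. The function name and parameters are hypothetical, and the 48-byte PHY payload used in the note below is purely an assumption chosen for illustration; the thread does not state the actual frame length.

```c
#include <stdint.h>

/* LoRa time on air in microseconds (integer arithmetic only).
 *   sf              spreading factor, 7..12
 *   bw_hz           bandwidth in Hz
 *   payload_len     PHY payload length in bytes
 *   cr              coding rate index, 1 = 4/5 ... 4 = 4/8
 *   preamble_syms   programmed preamble length in symbols
 *   explicit_header nonzero if the explicit header is sent
 *   crc_on          nonzero if the payload CRC is sent
 *   low_dr_opt      nonzero if low data rate optimization is on */
static uint64_t lora_airtime_us(unsigned sf, uint32_t bw_hz, unsigned payload_len,
                                unsigned cr, unsigned preamble_syms,
                                int explicit_header, int crc_on, int low_dr_opt) {
    /* Symbol duration: 2^SF / BW, truncated to whole microseconds. */
    uint64_t tsym_us = ((uint64_t)1 << sf) * 1000000u / bw_hz;

    int32_t num = 8 * (int32_t)payload_len - 4 * (int32_t)sf + 28
                  + (crc_on ? 16 : 0) - (explicit_header ? 0 : 20);
    int32_t den = 4 * ((int32_t)sf - (low_dr_opt ? 2 : 0));
    int32_t blocks = num > 0 ? (num + den - 1) / den : 0; /* ceil, clamped at 0 */

    uint64_t payload_syms = 8u + (uint64_t)blocks * (cr + 4u);

    /* Preamble lasts (preamble_syms + 4.25) symbols; scaled by 4 to stay integral. */
    uint64_t preamble_us = (4u * preamble_syms + 17u) * tsym_us / 4u;

    return preamble_us + payload_syms * tsym_us;
}
```

Under those assumptions (SF12, 125 kHz, coding rate 4/5, 8-symbol preamble, explicit header, CRC on, low data rate optimization on, 48-byte frame), the function returns 2301952 µs, which is in the same range as the roughly 2.3 seconds read from the log. At a 1% duty cycle, a frame that long keeps its band unavailable for several minutes afterwards.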
Sadly I have the same problem here: getting stuck in a loop after 6-9h. I added some code inside the engineUpdate function to help me tag the problem (basically printing NOIDLE) but it did not help.. the code is basically still the same as the original ttn-otaa example. |
Hi, I modified the do_send() function of my OTAA sketch to match the one in the ttn-otaa.ino example:
The test with this sketch ends with the same result; a new log file is uploaded. However, I have important feedback regarding the time when the problem occurs. [2024-11-14 20:55:49.721] Leaving onEvent There are no lost lines; logging is continuous when the system time changes its sign. Regards, |
@robepapp Thanks for the log. @Gooseman42 sorry, wanted to get v5 out, and so that took all my time this weekend. @robepapp Your code needs to capture (and print) the result of calling However, in this case, it probably didn't fail. The opmode shows that the LMIC has chosen a new channel ( I think the thing to do is confirm the path the code is taking. Please change line 2726 of LMIC_X_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP); to LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP); I'm pretty certain that this path is being taken. If you are feeling brave, please add a couple of extra items to that print: LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t
". globalDutyRate=%" LMIC_PRId_ostime_t
" globalDutyAvail=%" LMIC_PRId_ostime_t
" txend=%" LMIC_PRId_ostime_t
" txcChnl=%d "
"band=%d "
"band.avail=%" LMIC_PRId_ostime_t
"\n",
now, txbeg-TX_RAMPUP, LMIC.globalDutyRate, LMIC.globalDutyAvail, LMIC.txend,
LMIC.txChnl, LMIC.channelFreq[LMIC.txChnl] & 0x3,
LMIC.bands[LMIC.channelFreq[LMIC.txChnl] & 0x3].avail); This will show exactly what is going into the incorrect decision and will probably lead to a patch. |
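The multi-line format string above relies on adjacent string literals being concatenated by the compiler, with a format macro supplying the conversion for the time type. A minimal standalone sketch of the same pattern, assuming a 32-bit signed time type and using the standard `PRId32` in place of `LMIC_PRId_ostime_t` (the names `DEMO_PRId_TIME` and `format_debug_line` are hypothetical):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for LMIC_PRId_ostime_t, assuming a 32-bit time type. */
#define DEMO_PRId_TIME PRId32

/* Formats a short debug line into buf and returns snprintf's result.
 * Each piece of the format is its own string literal; every literal must be
 * closed before the macro or the next literal starts. */
static int format_debug_line(char *buf, size_t len,
                             int32_t now, int32_t next, int chnl) {
    return snprintf(buf, len,
                    "%" DEMO_PRId_TIME ": next engine update in %" DEMO_PRId_TIME
                    " txChnl=%d"
                    "\n",
                    now, next, chnl);
}
```

A stray quote after the macro opens a new literal that swallows the rest of the line, which is why each line of such a format needs balanced quotes.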
no worries about prioritizing v5, any help is very much appreciated.
|
there we are.. I think I see the problem but I am too far away from the code to propose a solution..
|
I actually left it running and the log showed a brief comeback but got stuck in what seems to be the same loop right away:
|
The key printout:
I cannot understand why "Next engine update" time matches I bet the problem is at line 2601. In signed 32-bit arithmetic, What version of GCC is being used on your port? I have seen exactly this kind of weird "optimization" problem -- arguably incorrect, but you can't win this kind of argument -- in newer compilers, in the name of either "efficiency" or "correctness". MCCI's BSPs use a relatively older version of the GCC (2017 or so). My first thing to try would be to change line 2601 as follows: if( (ostime_t) (txbeg - (now + TX_RAMPUP)) < 0 ) { (In other words, explicitly cast the computation before comparing to zero. If by some chance it's been widened to int64_t, that will force it to be re-narrowed.) If that doesn't work, I'd hit the compiler with a bigger hammer: volatile ostime_t nonce = txbeg - (now + TX_RAMPUP);
if (nonce < 0) { The If you are impatient, you can try the second fix first. But it would be good to try to narrow down exactly what is happening. |
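To illustrate the kind of divergence being discussed, here is a small sketch, not taken from the LMIC, that contrasts two ways of asking whether `txbeg` lies before a deadline. Signed overflow is undefined in C, so a compiler may treat `txbeg - deadline < 0` as if it were `txbeg < deadline`; the two only agree while the counter stays away from the sign change. The type and function names are hypothetical, and the thread does not confirm that this exact rewrite is what the ESP32 toolchain performed.

```c
#include <stdint.h>

/* Hypothetical stand-in for ostime_t. */
typedef int32_t ostime_like_t;

/* Intended meaning on a wrapping 32-bit counter: the sign of the wrapped
 * difference. The unsigned subtraction is well defined for all inputs. */
static int before_wrapped(ostime_like_t txbeg, ostime_like_t deadline) {
    return (int32_t)((uint32_t)txbeg - (uint32_t)deadline) < 0;
}

/* What an optimizer is allowed to substitute for `txbeg - deadline < 0`
 * when it assumes signed subtraction never overflows. */
static int before_direct(ostime_like_t txbeg, ostime_like_t deadline) {
    return txbeg < deadline;
}
```

With a deadline of `INT32_MAX - 5` and a `txbeg` 1000 ticks later (which wraps to a negative value), `before_wrapped` reports that `txbeg` is still in the future while `before_direct` reports that it has already passed. For values far from the wrap, both functions give the same answer. Writing the subtraction on unsigned operands, as in `before_wrapped`, is one way to avoid depending on compiler behavior; GCC's `-fwrapv` option is another.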
SoC is ESP32-S3, using Arduino 2.3.3. Per board manager and output window compile messages, the IDE is using the following tool (it also says version 2302 in the manager and compile window):
Implementing fix (1) now.. |
I suppose the others (who are using STM32 but not MCCI's BSP) are also using newer compilers. It will be interesting to see what happens. |
looks great, made it through the sign change:
|
Lovely, so it's a compiler "feature". It suggests that I need to review all the code that uses @Gooseman42 what board are you using for reference? Is it commercially available so I can add it to our menagerie and do additional testing? Others (@robepapp, @mirhamza708): I will push a patch on a branch with this change: if( (ostime_t) (txbeg - (now + TX_RAMPUP)) < 0 ) { at line 2601 of |
Great that you found the issue. let me put a device for test today and will get back to you tomorrow with updates. |
@mirhamza708 wrote:
Thanks. All, I have pushed a branch https://github.com/mcci-catena/arduino-lmic/tree/issue968, which contains the release candidate fix. Please test if you have a chance. |
returned to the lab this morning to see the code run through without problems, congrats so far. I have seen however a funny thing with the debug messages (only showing those now):
likely the same problem with txend but shows that other instances in the code might need the cast.
@terrillmoore custom-designed, but basically a plain ESP32-S3 connected to an RFM95W without anything else, optimized for low deep-sleep. I could send you one if you really are interested :) |
we are not completely out of the woods yet. I have not touched the system and yet we have a similar situation during the day. First element is a locally generated timestamp from the capture software, this is not coming from the code:
|
@Gooseman42 I suspect we got lucky the first time. I will update my patch to include the volatile trick and will advise when it passes CI tests. |
@Gooseman42 change pushed, but need to pass the CI tests before you try it. |
@Gooseman42 -- It passed CI, so it's ok to try it. Interestingly enough, looking at the logs, the other part of the LMIC knows that |
The pushed code seems to work. Code has been running since Friday without issues. |
@Gooseman42 Thanks for testing. I'll push a release later today. |
Hello guys,
I need help with this because I have been trying to fix it for a month now. I have a node device and I have done all the configuration according to MCCI's GitHub documentation.
I use an RFM96 chip with an STM32F103C8T6, and the pin map is as follows
my platformio.ini file is as follows
The device functions properly; only the LoRaWAN side of things doesn't seem right. Here is an image from the ChirpStack server.
Here is a screenshot from the serial port showing the join at 9:07 pm:
I don't understand why the time overflows. It's a signed integer in the library; can someone explain why a signed integer is used for timekeeping? I saw a comment in the source saying that a negative value means the time has already passed. It's confusing, so I would be really thankful if someone could explain.
Here is an image from the ChirpStack server:
I have added a reset of the LoRaWAN stack every 8 hours, just to avoid a complete stop:
Without the 10% clock error setting (the clock runs at 24 MHz, derived from the 8 MHz external oscillator through the PLL multiplier), the join procedure takes 10-15 minutes. The hardware was designed by someone else, so I can't say much about it, but the question is: if it runs for 7-8 hours, what makes it stall?
Thank you, and if any other information is required, please let me know.