Memory corruption when saving network settings #2313
Comments
In a quick and dirty test I made NetworkSettings execute enableAdminMode() and applyConfig() from inside its loop() (to have these functions execute synchronously to the main loop()), but it seems the issue is not fully resolved: Log
It seems I can't reproduce the issue when I wait for the device to print that it got its IP address before hitting the "Save" button again, at least with the change I described in this comment. Edit: I found this: #2298 (comment) which could be related. |
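The quick test described above (running the work from the main loop instead of from the web handler context) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `NetworkSettingsSketch`, the `requestApply()` method and the `applyCount()` helper are hypothetical; only `enableAdminMode()`, `applyConfig()` and `loop()` are names taken from the thread.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the settings class discussed above.
class NetworkSettingsSketch {
public:
    // Called from the web handler context: only records that work is pending.
    void requestApply() { _applyPending.store(true); }

    // Called from the main loop(): performs the work synchronously there.
    void loop() {
        // exchange() clears the flag and reports whether it was set,
        // so several requests collapse into a single apply.
        if (_applyPending.exchange(false)) {
            enableAdminMode();
            applyConfig();
        }
    }

    int applyCount() const { return _applyCount; }

private:
    void enableAdminMode() { /* placeholder for the real work */ }
    void applyConfig() { ++_applyCount; }

    std::atomic<bool> _applyPending{false};
    int _applyCount = 0;
};
```

With this shape, pressing "Save" twice before the next loop iteration results in one apply, which may or may not be the desired behavior.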
@schlimmchen yes I have read that issue #2298 too. Basically your fix sounds good, but there seems to be an RF task running in parallel when I look at the Example Logs you provided. Here the IRQ subroutine may be triggered which could “interrupt” [sic] your inline code path. |
It would be helpful to show us the log output from vscode, as it automatically integrates the exception parser and shows a proper stack trace with readable symbols. |
When testing the migration to arduino core 3 I also realized that it immediately crashes in |
@schlimmchen is this solved with your other fix from today that prevents allocating a |
IMHO this is a different issue as it only occurs when pressing save. |
see #2360 |
Nope... Got this on the first try using v24.10.15 including the write guard. The backtrace seems to suggest that the Async TCP Server can't handle the underlying network connection breaking, or to be more specific: Once the connection breaks, handling the disconnect triggers some kind of issue.
Hm, similar context. Maybe a double free?
One more for good measure. Again, similar context. Maybe @mathieucarbou is willing to take a look? |
Yeah maybe same context, the plot thickens as they say:
While the webrequest seems to cope with AsyncWebServer::_handleDisconnect() during AsyncWebServerRequest::_onDisconnect() Can we provoke this situation somehow? I assume maybe something like a WLAN disconnect / Application Level Firewall or a tcpdump may tell us a bit more what happens on the network and where in the code / process this breaks. |
I'm sorry, I won't be of great help on this at the moment (too little info)... Having a reproducible use case outside of the app would help. Although we see AsyncTCP and ESPAsyncWS in the stack following user interaction, it does not mean that the issue is there; the normal processing of ESPAsyncWS could be impacted by other things running. There are a lot of similar ones reported in https://github.com/espressif/esp-idf. Questions, in the process to try to isolate the cause:
|
These stack traces are nearly the same:
AsyncWebServerRequest::~AsyncWebServerRequest() {
_headers.clear();
_pathParams.clear();
if (_response != NULL) {
delete _response;
}
For that to happen, the created response pointer (in this case So the This is important to note that This is important to consider because some code like this is brittle and probably relied on the fact that the response was sent and received by the browser, which is not the case: In
WebApi.sendJsonResponse(request, response, __FUNCTION__, __LINE__);
Utils::removeAllFiles();
RestartHelper.triggerRestart();
The 2 last lines will be executed, and the request will be sent once the middleware chain and request handler have finished. Another one:
Here, the last lines will execute BEFORE the request is sent. Everything executed AFTER a response is attached to a request has to be written carefully so that it does not affect any pointer the response could still reference. In the case of a JSON response, in some use cases, ArduinoJson won't make a copy but will just point to the existing pointers (but this is not the issue we saw). In the issue we saw above, the lwip layer fails (wifi disconnect or something else), which triggers the response deletion.
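To make the ordering concrete, here is a small mock of the situation described above. It is only a sketch under assumptions: `QueuedRequest`, `flush()` and `runSaveHandler()` are hypothetical names that imitate a server which attaches the response in `send()` and writes it out after the handler returns. It does not use the real library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a request object. send() only attaches the
// response; nothing goes out on the network at that point.
class QueuedRequest {
public:
    explicit QueuedRequest(std::vector<std::string>& log) : _log(log) {}

    void send() {
        _attached = true;
        _log.push_back("response attached");
    }

    // Simulates the server writing the response after the handler returned.
    void flush() {
        if (_attached) {
            _log.push_back("response sent");
        }
    }

private:
    std::vector<std::string>& _log;
    bool _attached = false;
};

// Mirrors the shape of the handler quoted above: attach a response,
// then do destructive work in the same context.
std::vector<std::string> runSaveHandler() {
    std::vector<std::string> log;
    QueuedRequest request(log);

    request.send();
    log.push_back("files removed");
    log.push_back("restart triggered");

    // Only now, after the handler body has finished, is the response written.
    request.flush();
    return log;
}
```

In this mock the recorded order is: response attached, files removed, restart triggered, response sent. Anything the response still references must therefore stay valid until the last step.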
void AsyncWebServerRequest::_onPoll() {
// os_printf("p\n");
if (_response != NULL && _client != NULL && _client->canSend()) {
if (!_response->_finished()) {
_response->_ack(this, 0, 0);
} else {
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
_client->close();
}
}
}
void AsyncWebServerRequest::_onAck(size_t len, uint32_t time) {
// os_printf("a:%u:%u\n", len, time);
if (_response != NULL) {
if (!_response->_finished()) {
_response->_ack(this, len, time);
} else if (_response->_finished()) {
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
_client->close();
}
}
}
I suspect this might be the issue... |
In
AsyncWebServerRequest::~AsyncWebServerRequest() {
_headers.clear();
_pathParams.clear();
if (_response != NULL) {
delete _response;
_response = NULL;
}
or (better):
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
To be sure the response is not used by any of the callbacks above. |
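The "null first, delete after" idiom suggested above can be written compactly with `std::exchange`. The sketch below is self-contained and uses hypothetical types (`ResponseSketch`, `RequestSketch`), not the library's classes. Note the assumption: this only guards against code on the same task observing a stale pointer during or after deletion; it is not a substitute for a lock if two tasks can touch the pointer at the same time.

```cpp
#include <cassert>
#include <utility>

// Hypothetical response type that counts live instances for the demo.
struct ResponseSketch {
    static int liveCount;
    ResponseSketch() { ++liveCount; }
    ~ResponseSketch() { --liveCount; }
};
int ResponseSketch::liveCount = 0;

class RequestSketch {
public:
    RequestSketch() = default;
    RequestSketch(const RequestSketch&) = delete;
    RequestSketch& operator=(const RequestSketch&) = delete;
    ~RequestSketch() { releaseResponse(); }

    // Takes ownership of r, releasing any previously attached response.
    void attach(ResponseSketch* r) {
        releaseResponse();
        _response = r;
    }

    bool hasResponse() const { return _response != nullptr; }

    // Detach first, delete second: the member is already null while the
    // destructor of the response runs. Deleting nullptr is a no-op.
    void releaseResponse() {
        ResponseSketch* r = std::exchange(_response, nullptr);
        delete r;
    }

private:
    ResponseSketch* _response = nullptr;
};
```

Calling `releaseResponse()` twice is harmless in this sketch, which is the property the destructor and the callbacks would rely on.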
Thanks for looking into this. I see you spent quite some of your time, thanks! I could not yet fully understand your longer comment. I'll re-read it later. What I do understand is that one has to be careful when sending a response (actually queuing a response for sending) and then executing code in the same context, which runs before the response was actually sent. I know about the example you gave, where the ESP is restarted. I did not dare to question it, but indeed it seems that sending the response and restarting the ESP (be it with a delay or not) is something of a race. What we would actually like to do is wait for the response to be sent, then trigger the reboot. The reason I asked whether you would want to have a look is that I suspect that you are interested in making ESPAsyncWebServer robust against the issue we are looking at, even if we are using it in a questionable manner. Assuming that something can be done in the lib... The changes you proposed unfortunately don't prevent the issue from being triggered.
For the record: I edited the file
I assume that's okay for a quick check? |
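On the wish expressed above (wait for the response to be sent, then trigger the reboot), one possible shape is to register the restart as a callback that runs when the connection is closed, instead of calling it right after queuing the response. The following is a mock, not the library's API: `DisconnectAwareRequest`, `simulateDisconnect()` and `RestartSketch` are hypothetical, and whether the real server offers a suitable hook is an assumption that would need to be checked.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical request mock with a disconnect hook.
class DisconnectAwareRequest {
public:
    void send() { _attached = true; }

    void onDisconnect(std::function<void()> fn) { _onDisconnect = std::move(fn); }

    // Simulates the connection being closed, either after the response was
    // written or because of a network failure.
    void simulateDisconnect() {
        if (_onDisconnect) {
            _onDisconnect();
        }
    }

    bool attached() const { return _attached; }

private:
    bool _attached = false;
    std::function<void()> _onDisconnect;
};

// Hypothetical stand-in for the restart helper mentioned in the thread.
struct RestartSketch {
    bool restartRequested = false;
    void triggerRestart() { restartRequested = true; }
};

// The handler queues the response and defers the restart to the hook.
void handleSave(DisconnectAwareRequest& request, RestartSketch& restart) {
    request.send();
    request.onDisconnect([&restart]() { restart.triggerRestart(); });
}
```

A caveat of this design: the hook also fires when the connection breaks before the browser received anything, so it orders the two steps but does not prove delivery. The captured reference must also outlive the request.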
Yes! That will do. In Except if you know it already ? |
I have released Let me know when you have more logs to pinpoint what causes the issue :-) |
@mathieucarbou thanks for your pointers on where to look closely. This is my little bed-time crime story 🔍 🧐 🎩 for the evening. I am still trying to understand what exactly happens,
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
You do that exactly the same way in the following three locations:
But before you send it you do something else:
void AsyncWebServerRequest::send(AsyncWebServerResponse* response) {
if (_sent)
return;
if (_response)
delete _response;
_response = response;
if (_response == NULL) {
_client->close(true);
_onDisconnect();
_sent = true;
return;
}
if (!_response->_sourceValid())
send(500);
}
I also noticed that it always complains about being unable to free the heap memory for a
std::list<AsyncWebHeader> _headers;
Though I found only explicit code for clearing the memory of such a
Could it be that we are left with a dangling pointer to this, i.e. no longer a valid AsyncWebHeader list, because of the above NULL and delete operations? |
we are in the destructor, so the idea is to set the ref to the response to null ASAP in case we have some code elsewhere that could still see this pointer (this is the case for the 2 other callbacks). Then, once this is done, we can trigger the object deletion (which can take some time), but at least during this time the response in the request will be null.
Yes, this is to have the pointer set to null asap, then free after.
void AsyncWebServerRequest::send(AsyncWebServerResponse* response) {
if (_sent)
return;
if (_response)
delete _response;
_response = response;
if (_response == NULL) {
_client->close(true);
_onDisconnect();
_sent = true;
return;
}
if (!_response->_sourceValid())
send(500);
}
This code is just a feature to swap the response for another one. You are not using that. A middleware could decide to change a response that was set by a handler. So if a response was set, we delete it, then set the new one. This operation happens during the middleware chain processing (just after the handler and before the request is sent on the network). So it is not the cause of the issue here.
That is exactly what I also find strange...
Not in the case of this list: AsyncWebHeader does not need any destructor because it is a holder object of 2 strings, and the list stores objects by copying the value into a new instance held in the node (which is freed at node destruction). But the way a linked list works is by pointing to the next structure, so when the object is freed from memory, each node is freed, and this is a longer operation compared to just removing an array from memory if we had a vector instead. The issue with a vector is that it requires reallocation. @schlimmchen should try to add some logs to know what is being executed (which request when it fails, and also log before the delete calls on a response), and also point to the new version :-) We will have more info. It is possible that when the lwip layer sends an error (following a network issue, broken communication, etc.) there is a concurrency issue happening. |
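The container behavior described above can be checked with a small standalone sketch. `HeaderSketch` is a hypothetical stand-in for a holder of two strings; it is not the library's `AsyncWebHeader`. The functions show that a `std::list` stores its own copy of each element, that existing elements do not move when more are appended, and that `clear()` destroys every node.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical holder of two strings; the implicit destructor is enough
// because std::string releases its own memory.
struct HeaderSketch {
    std::string name;
    std::string value;
};

// The list stores a copy: changing the original afterwards has no effect.
bool listStoresCopies() {
    std::list<HeaderSketch> headers;
    HeaderSketch h{"Content-Type", "text/plain"};
    headers.push_back(h);
    h.value = "application/json";
    return headers.front().value == "text/plain";
}

// Appending to a std::list never relocates existing elements, unlike a
// std::vector, which may reallocate its storage when it grows.
bool listKeepsElementAddress() {
    std::list<HeaderSketch> headers;
    headers.push_back({"Content-Type", "application/json"});
    const HeaderSketch* first = &headers.front();
    for (int i = 0; i < 100; ++i) {
        headers.push_back({"X-Test", std::to_string(i)});
    }
    return first == &headers.front();
}

// clear() destroys each node (and the strings it holds) one by one.
bool clearDestroysAllNodes() {
    std::list<HeaderSketch> headers;
    headers.push_back({"A", "1"});
    headers.push_back({"B", "2"});
    headers.push_back({"C", "3"});
    headers.clear();
    return headers.empty();
}
```

None of this explains the crash by itself; it only illustrates why a failure while freeing the list points at earlier heap corruption or concurrent access rather than at the header type.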
What happened?
I noticed a bunch of ESP reboots after exceptions and started digging. One of them I could isolate and pin down to also occur in this project: When saving the network settings, there is some kind of memory corruption.
To Reproduce Bug
Save the network settings by clicking the save button in the web UI. Repeat until you observe a random crash.
Expected Behavior
Graceful application of network settings.
Install Method
Self-Compiled
What git-hash/version of OpenDTU?
3559007
What firmware variant (PIO Environment) are you using?
generic_esp32s3_usb
Relevant log/trace output
No response
Anything else?
Example 1
Example 2
Example 3
Please confirm the following