Skip to content

Conversation

@plieven
Copy link
Contributor

@plieven plieven commented Apr 8, 2025

  • dynamically adjust prometheus endpoint response buffer size

stream->addHeader("Cache-Control", "no-cache");
if (stream->available() > initialResponseBufferSize) {
initialResponseBufferSize = stream->available();
MessageOutput.printf("Increased /api/prometheus/metrics initialResponseBufferSize to %u bytes\r\n", initialResponseBufferSize);
Copy link
Owner

@tbnobody tbnobody Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use PRIu32 instead of %u? Otherwise we might have problems when migrating to newer gcc versions. (and force push)

e.g.
MessageOutput.printf("Increased /api/prometheus/metrics initialResponseBufferSize to %" PRId32 " bytes\r\n", initialResponseBufferSize);

@plieven plieven force-pushed the fix_esp_async_ws branch from d700e54 to b5f723a Compare April 8, 2025 18:12
@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

fixed, I should have known better ...

@tbnobody
Copy link
Owner

tbnobody commented Apr 8, 2025

Thank you. I've merged it into my local dev branch, and will release a new version tomorrow.

@tbnobody
Copy link
Owner

tbnobody commented Apr 8, 2025

I see a Failed to allocate on each request:

[374549][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
Increased /api/prometheus/metrics initialResponseBufferSize to 1863 bytes<\r><\n>
[384389][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
[385314][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
[386019][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>

I am pretty sure that's new.

@tbnobody
Copy link
Owner

tbnobody commented Apr 8, 2025

Thats interessting..... I output _content->size() using the log_e method which shows the error. Following result

AsyncResponseStream::AsyncResponseStream(const char *contentType, size_t bufferSize) {
  _code = 200;
  _contentLength = 0;
  _contentType = contentType;
  // internal buffer will be null on allocation failure
  _content = std::unique_ptr<cbuf>(new cbuf(bufferSize));
  if (_content->size() != bufferSize) {
#ifdef ESP32
    log_e("Failed to allocate %u", _content->size());
#endif
  }
}
Increased /api/prometheus/metrics initialResponseBufferSize to 1862 bytes<\r><\n>
[ 20341][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 29457][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 31403][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 32526][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 33352][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 34078][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 34721][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 35056][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>
[ 35392][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate 1863<\r><\n>

@mathieucarbou is there maybe something wrong with size vs. index?

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025 via email

@tbnobody
Copy link
Owner

tbnobody commented Apr 8, 2025

I think I found the reason...

cbuf::cbuf(size_t size) :
    next(NULL), _size(size+1), _buf(new char[size+1]), _bufend(_buf + size + 1), _begin(_buf), _end(_begin)
{
}

size_t cbuf::size()
{
    return _size;
}

They are allocating size+1 but return _size in cbuf.cpp.
Therefor the returned size is always one bigger then the size passed as parameter.

And I think size() is the wrong method as it always returns the requested amount of bytes but not the real allocated bytes.

@mathieucarbou
Copy link
Contributor

crap!

@mathieucarbou
Copy link
Contributor

Why I didn't get the warning when I run my MRE ?

@mathieucarbou
Copy link
Contributor

mathieucarbou commented Apr 8, 2025

@tbnobody : weird, I do not have the same cbuf impl. as you 🤯 I do not have the extra character.
You are using latest arduin core ?

the one I have , which works:

cbuf::cbuf(size_t size) : next(NULL), has_peek(false), peek_byte(0), _buf(xRingbufferCreate(size, RINGBUF_TYPE_BYTEBUF)) {
  if (_buf == NULL) {
    log_e("failed to allocate ring buffer");
  }
  CBUF_MUTEX_CREATE();
}

Oh... !
The code you have is from Arduino Core 2... That's why... I tested with latest Arduino Core 3. They have changed the implementation and broke compatibility on API for this!

@mathieucarbou
Copy link
Contributor

@mathieucarbou i think the check in ESPAsyncWebserver is wrong. I guess cbuf allocates an extra byte for a terminating \0. That check should be >= and not ==.

More like if (bufferSize && _content->size() < bufferSize) {

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

@mathieucarbou i think the check in ESPAsyncWebserver is wrong. I guess cbuf allocates an extra byte for a terminating \0. That check should be >= and not ==.

More like if (bufferSize && _content->size() < bufferSize) {

Yes, I replied via e-mail and haven't checked the source... The change you propose should work with both core versions. Just to be 100% sure if the initial allocation fails, cbuf isn't NULL or is it?

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

I see a Failed to allocate on each request:

[374549][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
Increased /api/prometheus/metrics initialResponseBufferSize to 1863 bytes<\r><\n>
[384389][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
[385314][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>
[386019][E][WebResponses.cpp:827] AsyncResponseStream(): Failed to allocate<\r><\n>

I am pretty sure that's new.

thats only visible on the serial console, right? I only checked the webconsole though.

@mathieucarbou
Copy link
Contributor

Yes, I replied via e-mail and haven't checked the source... The change you propose should work with both core versions. Just to be 100% sure if the initial allocation fails, cbuf isn't NULL or is it?

cbuf cannot be null: it is not a new (std::nothrow). If cbuf cannot be created, it will fail.

On Arduino Core 3, not likely because the call to xRingbufferCreate(size, RINGBUF_TYPE_BYTEBUF) in the ctor will return a null ring buffer, so the whole class will act as a no-op.

On Arduino Core 2, there is a new chat[] in the ctor, so the ctor will crahs the board on a failed allocation because of the usage of new. Nothing that we can not - this is arduino core code.

OpenDTU should switch to Arduino Core 3 with pioarduino:

platform = https://github.com/pioarduino/platform-espressif32/releases/download/54.03.20/platform-espressif32.zip

@mathieucarbou
Copy link
Contributor

thats only visible on the serial console, right? I only checked the webconsole though.

that's because of the arduino core 2 implementation not returning the same size as arduino core 3 implementation.

this is just an annoying error log: the cbuf is allocated behind.

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

Okay, understood. In Core 2 it will crash, but in Core 3 the buffer can be up to 3 bytes larger as @tbnobody mentioned from the docs. So we need the modified check?

@mathieucarbou
Copy link
Contributor

mathieucarbou commented Apr 8, 2025

Okay, understood. In Core 2 it will crash, but in Core 3 the buffer can be up to 3 bytes larger as @tbnobody mentioned from the docs. So we need the modified check?

Yes and no, the code he mentioned is arduino core 2 code.

the cbuf implementation in arduino core 3 is using a ring buffer and the size matches the requested size. The hack with the added byte is only in core 2 it seems.

  • on arduino core 3: everything is OK: size matches and failed allocation won't crash, but will log.
  • on arduino core 2: extra character added, and if allocation fails within the cbuf ctor, board will crash because they used new and not new (std::nothrow)

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

okay, what are the official requirements for ESPAsyncWebServer? If you allow arduino core 2, you should fix that bogus error message.

While looking at WebResponses.cpp I checked what happens in AsyncResponseStream::write if the cbuf::resize fails. It will return the orignal size. We do not detect this and just overwrite contents of the ring buffer if it wraps. Do not know what the right behaviour is. return 0 and do nothing, return the requested write size and do nothring or overwrite...

@mathieucarbou
Copy link
Contributor

ESPAsyncWebServer

We support latest released versions of Arduino Core 2 and 3, but we strongly advise to use 3 because at one point we won't be able to keep 2 compatibility.

Arduino Core team does not maintain 2 anymore: this is a EOL version.

Everybody should switch to 3 now....

@mathieucarbou
Copy link
Contributor

I have a PR ready that is building and will release a v3.7.6 in a few minutes: ESP32Async/ESPAsyncWebServer#152

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

I have a PR ready that is building and will release a v3.7.6 in a few minutes: ESP32Async/ESPAsyncWebServer#152

Yeah, the commit creeped in the conversation so I missed it.

@mathieucarbou
Copy link
Contributor

While looking at WebResponses.cpp I checked what happens in AsyncResponseStream::write if the resize fails. It will return the orignal size. We do not detect this and just overwrite contents of the ring buffer if it wrappes. Do not know what the right behaviour is. return 0 and do nothing, return the requested write size and do nothring or overwrite...

Oh !!! Yes that should be fixed... No overwrite should happen. Is it on Arduino core 2 ?

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025 via email

@mathieucarbou
Copy link
Contributor

Added a commit to log also the failed re-allocations: ESP32Async/ESPAsyncWebServer@bf9f457

plieven added 2 commits April 8, 2025 20:18
after OpenDTU v24.9.30 was released there was a subtle change in the ESPAsyncWebServer
which changed the internal data structure for AsyncResponseStream from cbuf to StringStream.
This introduced WDT resets when scraping at least the prometheus API endpoint [1].
ESPAsyncWebServer v3.7.6 addresses this issue by reverting the data structure back to cbuf.

[1] tbnobody#2535

Signed-off-by: Peter Lieven <[email protected]>
…buffer size

the current implementation allocates a 40kB buffer right from the start.
In most cases this is way too big. This patch dynamically learns the right
required size for the local setup and preallocates just as much space as needed.

Signed-off-by: Peter Lieven <[email protected]>
@plieven plieven force-pushed the fix_esp_async_ws branch from b5f723a to e6e1f3a Compare April 8, 2025 20:19
@plieven plieven changed the title bump ESPAsyncWebServer to v3.7.5 bump ESPAsyncWebServer to v3.7.6 Apr 8, 2025
@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

@tbnobody updated the PR to include ESPAsyncWebServer v3.7.6. The arduino core update to v3 is a different thing that should be thoughtfully tested.

@plieven
Copy link
Contributor Author

plieven commented Apr 8, 2025

@mathieucarbou just a question that was not trivial to answer for me because I am not that familiar with the ESPAsyncWebServer code. If I use a AsyncJsonResponse does it interally use an AsyncStreamResponse at some point? Becuase other users that are not using prometheus reported that they can trigger the ESP resets by subsequently calling API endpoints like /api/system/status which internally all use AsyncJsonResponse.

@mathieucarbou
Copy link
Contributor

AsyncJsonResponse

AsyncJsonResponse has its own implementation: it is not using AsyncStreamResponse. You can use AsyncStreamResponse to send a JsonDocument because AsyncStreamResponse is a Print, but AsyncJsonResponse has its own tricky implementation:

size_t PrettyAsyncJsonResponse::_fillBuffer(uint8_t *data, size_t len) {
  ChunkPrint dest(data, _sentLength, len);
#if ARDUINOJSON_VERSION_MAJOR == 5
  _root.prettyPrintTo(dest);
#else
  serializeJsonPretty(_root, dest);
#endif
  return len;
}

The ChunkPrint is a hack to skip the first bytes if _fillBuffer is called again (which triggers a new json serialization).

What I do not understand though in this code is why len is returned instead of return min(len, serializeJson(_root, dest));

@tbnobody
Copy link
Owner

tbnobody commented Apr 9, 2025

@tbnobody updated the PR to include ESPAsyncWebServer v3.7.6. The arduino core update to v3 is a different thing that should be thoughtfully tested.

Have merged it into my local dev branch. Works great and fast.. Will keep it running tomorrow during the day and release a new version at evening..

And yes, core v3 is a complete different topic. Some parts don't even compile at the moment (W5500 driver) some parts are already fixed.. But the biggest problem is the additional flash usage. currently 89% flash are used.. with core 3 it's 97%.

@mathieucarbou
Copy link
Contributor

And yes, core v3 is a complete different topic. Some parts don't even compile at the moment (W5500 driver) some parts are already fixed.. But the biggest problem is the additional flash usage. currently 89% flash are used.. with core 3 it's 97%.

You should look at https://github.com/mathieucarbou/MycilaSafeBoot.

It allows having a bootable recovery partition used specifically for updates (which is then small), and a > 3MB partition size for the app. I am using MycilaESPConnect for network manager, but you could easily reproduce the same mechanism by reloading the network info from safeboot so that it keeps same AP name / IP / etc.

This dual-partition system largely used to facilitate OTA is a waste of space. Nearly half of the space is always unused.

@tbnobody tbnobody merged commit b70af57 into tbnobody:master Apr 10, 2025
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants