Memory leak in MessageExchangeStore #429
Comments
While waiting, a possible workaround is to use a MessageExchangeStore backed by an expirable cache instead of a classic map (like Guava or cache2k).
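For illustration, a minimal sketch of that workaround idea, assuming a Guava expiring cache as the token-to-exchange map. The class and method names below are hypothetical, not Californium's actual MessageExchangeStore implementation:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

/**
 * Hypothetical sketch (not Californium's MessageExchangeStore): keep the
 * token-to-exchange entries in an expiring cache, so entries that are never
 * explicitly removed are evicted after a while instead of leaking forever.
 */
public class ExpiringExchangeMap<K, V> {

    private final Cache<K, V> exchangesByToken = CacheBuilder.newBuilder()
            // pick a lifetime longer than any expected exchange duration
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    public void put(K token, V exchange) {
        exchangesByToken.put(token, exchange);
    }

    public V get(K token) {
        return exchangesByToken.getIfPresent(token);
    }

    public void remove(K token) {
        exchangesByToken.invalidate(token);
    }
}
```

This only hides a leak rather than fixing it, but it bounds the memory growth while the root cause is investigated.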
I will spend more time on this next week.
Exactly, and we haven't succeeded in reproducing it. #418 is probably a good start. I should have time tomorrow to look at this more deeply. I will let you know.
Unsurprisingly, we also observe a memory leak in InMemoryRandomTokenProvider, which should be cleaned up when the exchange is complete. On our side, we use a stateless random token provider as a workaround. Our number of clients and requests doesn't expose us to a high risk of collisions. I still think that using an "endpoint identifier" + stateless random generation (#173 (comment)) would be better than this reserve/release pattern, but that is probably another debate.
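As an illustration of that workaround, a minimal sketch of a stateless provider that just draws random tokens and makes release a no-op. The class and method names are simplified and hypothetical, not Californium's actual token provider interface:

```java
import java.security.SecureRandom;

/**
 * Hypothetical, simplified sketch of a stateless token provider: tokens are
 * generated randomly and never tracked, so nothing can leak, at the price of
 * a small collision risk that stays negligible for a moderate request rate.
 */
public class StatelessRandomTokenProvider {

    private static final int TOKEN_LENGTH = 8; // CoAP tokens are 0..8 bytes

    private final SecureRandom random = new SecureRandom();

    /** Create a new random token; nothing is reserved or stored. */
    public byte[] createToken() {
        byte[] token = new byte[TOKEN_LENGTH];
        random.nextBytes(token);
        return token;
    }

    /** Releasing is a no-op because no state is kept per token. */
    public void releaseToken(byte[] token) {
        // intentionally empty
    }
}
```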
@sbernard31
I'm not sure yet. I have some metrics in production which allow me to monitor that, but we haven't deployed a version with #441 for now. I'd rather leave it open while I check this. But if you prefer, we can close it and I will reopen it only if this is not fixed.
Adding a "reverse observe" to the (Re-)Extending the logging in Until now, I didn't really find the root-cause for that, but a cure extending the I continue to analyse the root cause, but for me, the above cure, looks much simpler as the "current complete" approach. Therefore I would also go to clean up that. Does anyone have larger change in "core" open? |
Did you observe the leak in BlockwiseLayer too?
Currently I have only focused on the message exchange store.
It seems that a blockwise notify is sometimes processed with the status of a normal blockwise response. The notify uses a different token (
In the logs it seems that this happens if the processing of
So, I would go for using the extended ExchangeObserver and leave the status for now.
Do you have any news or any work in progress about this?
See PR #539. Currently I'm working on improving the fix.
Cleaning up the exchange housekeeping shows that there are too many race conditions when processing the same exchange in parallel. Therefore I started to add a striped executor to the CoapEndpoint. This looks much better, but unfortunately I'm not ready :-). I could provide an "early preview" of that work, but it would be too early to start discussing details.
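Roughly, the striping idea is to serialize all work for one exchange on the same single-threaded lane, so two threads can no longer race on the same exchange. A minimal, illustrative sketch of that idea (not Californium's actual implementation; names are made up):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative striped executor: tasks submitted with the same key (e.g. an
 * exchange's token) always land on the same single-threaded stripe and are
 * therefore executed serially.
 */
public class StripedExecutorSketch {

    private final ExecutorService[] stripes;

    public StripedExecutorSketch(int stripeCount) {
        stripes = new ExecutorService[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            // one single-threaded lane per stripe => per-key serialization
            stripes[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Execute the task on the stripe selected by the key's hash. */
    public void execute(Object key, Runnable task) {
        int index = Math.floorMod(key.hashCode(), stripes.length);
        stripes[index].execute(task);
    }

    public void shutdown() {
        for (ExecutorService stripe : stripes) {
            stripe.shutdown();
        }
    }
}
```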
With PR #551 my test didn't show any leaks.
Remove the Californium???.properties files before starting the test to ensure the right setup.
It tests blockwise notifies (NON and CON) and normal requests. After the clients finish, wait until the pending requests time out; then the stores should be empty (validated by logging the health status).
I tried to reproduce the leak using commit 05b12ed. I only succeeded 2 times in more than 15 tries, with
a few
and a lot of
(I didn't investigate.) Then I tried with commit b84062e. I don't see any leak either, but this is not so relevant, as I was not able to reproduce it often before and I only tested 2 times because the tests are far longer. I still face
I don't know if it's relevant, as the test code changed between b84062e and 05b12ed.
I would like to know if you observe that too? I will now try to read all the changes. 😓
05b12ed calls every 250ms. About the I didn't recognize the
Seems to be caused by a new observe with a different token.
Seems to be a race condition when sending a new current request while receiving a response (reusing the token in blockwise causes such issues). Even with the striped exchange execution, this section is not protected.
I pushed a fix for the MID issues.
It would be great if we could have a commit history that lets us test before and after easily with exactly the same tests. (By the way, I have trouble with the logback config, as ext-plugtest-* depends on cf-plugtest-* and both have a logback.xml config.)
Sorry, I will not be able to spend much more time on this issue than the 2 weeks already spent.
Yes, I recognized this also. But the solution hints I found on the net didn't work or would take too much time for now. If you know how to fix it, go for it :-).
Extract CleanupMessageObserver from ExchangeCleanupLayer and add CON response to cleanup into it. Add also a CleanupMessageObserver to blockwise notifies in BlockwiseLayer. Signed-off-by: Achim Kraus <[email protected]>
Signed-off-by: Simon Bernard <[email protected]> Also-by: Achim Kraus <[email protected]>
The idea is to keep the "cleaning exchange" code closer to the "adding exchange" one. Consequence: we don't need ExchangeCleanupLayer.
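The gist of that approach, as an illustrative sketch with simplified, hypothetical types (not the actual Californium classes): attach an observer to the message when the exchange is registered, and let it remove the exchange on every terminal event, so cleanup lives right next to the code that added the exchange.

```java
/**
 * Illustrative sketch only: a message observer that removes its exchange from
 * the store on every terminal event of the message lifecycle.
 */
public class CleanupObserverSketch {

    /** Minimal stand-in for the exchange store used in this sketch. */
    interface ExchangeStore {
        void remove(String token);
    }

    /** Minimal stand-in for message lifecycle callbacks. */
    interface MessageObserver {
        void onResponse();
        void onCancel();
        void onTimeout();
        void onReject();
    }

    static MessageObserver cleanupObserver(final ExchangeStore store, final String token) {
        return new MessageObserver() {
            private void complete() {
                store.remove(token); // one single cleanup path for all outcomes
            }
            public void onResponse() { complete(); }
            public void onCancel()   { complete(); }
            public void onTimeout()  { complete(); }
            public void onReject()   { complete(); }
        };
    }
}
```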
We integrated Californium 2.0.0-M11. It behaves better regarding the memory leak. To summarize the progress: I will investigate this last(?) leak, but for now, I don't have any clue :-/
Not that bad :-). Still in the exchange store?
Yep
DTLSConnector:
Does this "overflow" occur? |
No. I don't see a warning like this.
I have a lead! I'm logging the exchanges which are leaking. For now, I notice these are only exchanges for CoAP requests which are acknowledged but never get a response (no observe, no block mode). In our server application, we have LWM2M features and pure CoAP features. If the next logs confirm this, that means the issue is not in Californium! 🎉 🎊 🎉 I added a CoAP API in Leshan to be able to send pure CoAP requests to a device registered with LWM2M. This API handles response timeouts, so switching to this API should fix the leak.
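The important part of that fix, sketched with hypothetical names (this is not Leshan's actual API): give every plain CoAP request an application-level response timeout and cancel the request when the timeout fires, so an acknowledged-but-never-answered request does not stay in the exchange store forever.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: guard a pending request with an application-level
 * response timeout. If no response arrives in time, the request is cancelled,
 * which lets the stack complete and clean up the exchange.
 */
public class TimeoutGuard {

    /** Minimal stand-in for a cancellable pending request. */
    interface PendingRequest {
        void cancel();
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /**
     * Schedule a cancel for the given request; cancel the returned future
     * when the response arrives so the request is not cancelled needlessly.
     */
    public ScheduledFuture<?> guard(final PendingRequest request, long timeoutSeconds) {
        return scheduler.schedule(new Runnable() {
            public void run() {
                // acknowledged-but-never-answered requests no longer linger
                request.cancel();
            }
        }, timeoutSeconds, TimeUnit.SECONDS);
    }
}
```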
All the logs show that the leaks follow the pattern described above. So I'm pretty confident that the last remaining leaks are not caused by Californium issues. \o/
Any news on this?
Unfortunately, the new CoAP API in Leshan is still not used in our product, so I cannot check if my supposition is true. As I'm pretty confident, we can close this if you want, and I will reopen it if I was wrong.
We recently integrated the new CoAP API of Leshan (with response timeout) in production. Until now, 4 days without any leak observed.
No leak in 10 days now. I think we can close this!
Hi, when we are doing a performance test, we face the issue below:
2019-03-19T12:04:38.386Z org.eclipse.californium.core.network.CoapEndpoint SEVERE: Exception in protocol stage thread: message with already registered ID [51340] is not a re-transmission, cannot register exchange
I don't understand what is happening here. Can you please help me?
Please open a new issue.
It seems that memoryExchangeStore leaks.
Analyzing a heap dump shows that the leak is in the ExchangeByToken map and mainly concerns observe requests with block2.
The strange thing is that most of the exchanges have a response set, so it seems they are "complete" but not removed from the map...
Here is some old discussion about that:
https://dev.eclipse.org/mhonarc/lists/cf-dev/msg01404.html
#418: this PR aims to reproduce it via JUnit.