🐛 [Bug]: Cache stampede with caching middleware #2182
Comments
Thanks for opening your first issue here! 🎉 Be sure to follow the issue template! If you need help or want to chat with us, join us on Discord https://gofiber.io/discord
@demming can you support us with a pull request implementing what you consider the best solution to this problem?
Can I try this?
Of course
@ReneWerner87 Sure, I'll see what I can do. @sod-lol: I was going to suggest a PR this weekend. A trivial "fix" is to move the call; the spot I linked above is where the stampede occurs. I suggest we try adding a synchronized map of mutexes there. The synchronization might curb performance a bit, and its choreography might be a little difficult. On a side note, I've been looking into a different solution with proper thread-safe LFU and ARC policies for this middleware layer. There are a couple of popular Go implementations we could use underneath. But it's a deviation from the present implementation and would mean a return in spirit to the original implementation. I suggest we fix the stampede first and then discuss the rest. Quite some effort has been put into the current implementation, but I'm not so sure about the merits of maintaining a separate process-memory cache implementation.
Your comment hadn't appeared before I posted. I'll wait for @sod-lol's reply then.
Just to follow up: I've tried replacing the shared mutex with a synced mutex map, and it looks promising; performance is good.
Hi @demming.
Alright, I'll take a shot at it this weekend then. I had hoped to coordinate with the other contributors previously involved in it. I hope we still get to discuss it when I submit my PR so we can test it from different angles. Thank you too, I'll keep it in mind 😊
@demming any progress?
@ReneWerner87: apologies, I've been out of the loop for a while here; thanks for pinging me. Let me reassess the issue, happy to contribute back. Btw, I met a person just recently you might know.
Nice to hear that
Whom?
Bug Description
I've come across several instances of the so-called cache stampede in a variety of web app frameworks. Here are the issues I opened in the respective GitHub repos:
In those cases the cache stampede brings the GC to its knees. With Fiber the effects are less pronounced, but they still raise the typical 20 MB RSS to over 300 MB RSS within a couple of minutes for just 100 connections, which in relative terms is a lot; it even consumes up to 50% more memory than my Akka HTTP implementation running on OpenJDK 19 HotSpot (not even the low-profile Semeru OpenJ9).
It appears Fiber's caching middleware suffers from it too. It's a kind of cascading failure in distributed systems. Specifically, when I start several concurrent connections to a cached Fiber endpoint, multiple method invocations and (likely) cache populations take place, ranging from "for more than one connection" to "for each connection", which is incredibly inefficient.
There are three basic mitigation strategies, the most common of which introduces a mutex so that all but one of the (virtual) threads must wait.
For additional details, have a look at the issues I've linked above. Let me know if you need additional input and how I can help out here. I'd appreciate it if someone could confirm my findings by reproducing them.
How to Reproduce
One can set up a cached endpoint that does whatever one likes; I made it a bit less boring by having it sanitize remote HTML, which also lets me gauge the throughput and framework-induced latencies.
One can run
bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080"
or any other target URL. You can also run something less trivial, such as a chain of services, to assure data equivalence:
bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080/website?address=<remote_url>"
or even run a recursive request.
As you are certainly aware, this creates 100 concurrent connections and uses them for a short load-test burst over 10 seconds, with implicit connection reuse, and displays a histogram of observed latencies.
Expected Behavior
I expect only one cache population on the initial cache miss, not 100. Specifically,
getWebpageHandler
below should not be invoked more than once per cache cycle, regardless of the number of inbound concurrent connections.

Fiber Version
v2.39.0
Code Snippet (optional)
Checklist: