🐛 [Bug]: Cache stampede with caching middleware #2182
Comments
Thanks for opening your first issue here! 🎉 Be sure to follow the issue template! If you need help or want to chat with us, join us on Discord https://gofiber.io/discord
@demming can you support us with a pull request implementing what you consider the best solution to this problem?
Can I try this?
Of course
@ReneWerner87 Sure, I'll see what I can do. @sod-lol: I was going to suggest a PR this weekend. A trivial "fix" is to move the call; the spot I linked above is where the stampede occurs. I suggest we try adding a synchronized map of mutexes there. The synchronization might curb performance a bit, and its choreography might be a little difficult. On a side note, I've been looking into a different solution with proper thread-safe LFU and ARC policies for this middleware layer. There are a couple of popular Go implementations we could use underneath. But it's a deviation from the present implementation and would mean a return in spirit to the original implementation. I suggest we fix the stampede first and then discuss the rest. Quite some effort has been put into the current implementation, but I'm not so sure about the merits of maintaining a separate process-memory cache implementation.
Your comment hadn't appeared before I posted. I'll wait for @sod-lol's reply then.
Just to follow up: I've tried replacing the shared mutex with a synced mutex map, and it looks promising; performance is good.
Hi @demming.
Alright, I'll take a shot at it this weekend then. I had hoped to coordinate with the other contributors previously involved in it. I hope we still get to discuss it when I submit my PR so we can test it from different angles. Thank you too, I'll keep it in mind 😊
@demming any progress?
@ReneWerner87: apologies, I've been out of the loop for a while here; thanks for pinging me. Let me reassess the issue, happy to contribute back. Btw, I met a person just recently you might know.
Nice to hear that
Whom?
Bug Description
I've come across several instances of the so-called cache stampede in a variety of web app frameworks. Here are the issues I opened in the respective GitHub repos:
In those cases the cache stampede brings the GC to its knees. With Fiber the effects are less pronounced, but they still raise the typical 20 MB RSS to over 300 MB RSS within a couple of minutes for just 100 connections, which in relative terms is a lot; it even consumes up to 50% more memory than my Akka HTTP implementation running on OpenJDK 19 HotSpot (not even the low-profile Semeru OpenJ9).
It appears Fiber's caching middleware suffers from it too. It's a kind of cascading failure in distributed systems. Specifically, when I start several concurrent connections to a cached Fiber endpoint, multiple method invocations and (likely) cache populations take place, ranging from "for more than one connection" to "for each connection", which is incredibly inefficient.
There are three basic mitigation strategies, the most common of which introduces a mutex so that all but one of the (virtual) threads must wait.
For additional details, have a look at the issues I've linked above. Let me know if you need additional input and how I can help out here. I'd appreciate it if someone could confirm my findings by reproducing them.
How to Reproduce
One can set up a cached endpoint that does whatever one likes; I made it a bit less boring by having it sanitize remote HTML, which also lets me gauge the throughput and framework-induced latencies.
One can run
bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080"
or any other target URL. You can also run something less trivial, such as a chain of services, to assure data equivalence:
bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080/website?address=<remote_url>"
or even run a recursive request.
As you are certainly aware, this creates 100 concurrent connections and uses them for a short load-test burst over 10 seconds, with implicit connection reuse, and displays a histogram of observed latencies.
Expected Behavior
I expect only one cache population on the initial cache miss, not 100. Specifically,
getWebpageHandler
below should not be invoked more than once per cache cycle, regardless of the number of inbound concurrent connections.

Fiber Version
v2.39.0
Code Snippet (optional)
Checklist: