Response Caching [ASP.NET Core 6 & 7]: cache populated for each concurrent connection rather than once #44696

Open
demming opened this issue Oct 22, 2022 · 15 comments
Labels
area-networking, Perf

Comments

@demming

demming commented Oct 22, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Response caching goes bust when multiple concurrent inbound connections arrive.

It appears that cache population begins when the first concurrent inbound request hits the endpoint and does not finish until all 100 of them have arrived. So each of them invokes the service, and the GC can't cope with the load. The number 100 is chosen here almost arbitrarily. The server can't handle C10k requests with response caching---it just crashes.

If, however, I run a preliminary request against the cached endpoint as described below in "Reproduction," then the cache gets populated, all 100 concurrent requests are served from the cache, and only 1 service invocation occurs. If I instead let that cache expire, 101 service invocations take place.

When properly served from cache and the runtime VM is warmed up, my MacBook Air 2020 gives 3-4 GB/s throughput on that ASP.NET Core endpoint with connection reuse according to bombardier, whereas my Akka HTTP implementation with in-memory Caffeine cache gives 7-9 GB/s, all other things being equal, with up to tenfold smaller 99th percentile latency. I'm not sure where to look for the root cause of this performance mismatch.

More importantly, the runtime throws a System.OutOfMemoryException when the Duration is not long enough. I suppose multiple cache populations and evictions take place at the same time and overlap, and RSS goes beyond 1 GiB before shrinking again while requests are still served from the then-populated cache.

Expected Behavior

I expect it to populate the cache on the first incoming request, regardless of how many identical inbound requests arrive at the same time or prior to cache population. All but one request should be served from the cache. Currently they all trigger cache population. Perhaps one of them should simply be picked at random to populate the cache.
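
To illustrate the expected single-population behavior (this is only a minimal single-flight sketch over IMemoryCache with a per-key gate, not a claim about how the response caching middleware is implemented):

// Single-flight sketch: only one concurrent caller per key populates the cache;
// the other callers wait and are then served the cached value.
using System.Collections.Concurrent;
using Microsoft.Extensions.Caching.Memory;

public class SingleFlightCache
{
  private readonly IMemoryCache                                _cache;
  private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates = new();

  public SingleFlightCache (IMemoryCache cache) => _cache = cache;

  public async Task<T> GetOrCreateAsync<T> (string key, TimeSpan duration, Func<Task<T>> factory)
  {
    if (_cache.TryGetValue(key, out T? cached) && cached is not null)
      return cached;

    var gate = _gates.GetOrAdd(key, _ => new SemaphoreSlim(1, 1));
    await gate.WaitAsync();
    try
    {
      // Re-check after acquiring the gate: another request may have populated the entry meanwhile.
      if (_cache.TryGetValue(key, out T? raced) && raced is not null)
        return raced;

      var value = await factory();      // exactly one service invocation per key per cache duration
      _cache.Set(key, value, duration);
      return value;
    }
    finally
    {
      gate.Release();                   // note: gates are never pruned in this sketch
    }
  }
}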

Steps To Reproduce

I define a very simple controller,

[ApiController]
[ApiVersion("1.2")]
[Route("v{version:apiVersion}/{controller}")]
[Produces(MediaTypeNames.Text.Html)]
public class WebsiteController : Controller
{
  private readonly ILogger<WebsiteController> _logger;
  private readonly HtmlSanitizerService       _service; // must be registered, as a singleton
  private          uint                       _requestNumber = 0;

  public WebsiteController (ILogger <WebsiteController> logger, HtmlSanitizerService service)
  {
    _logger       = logger;
    _service = service;
  }

with just the following endpoint

  [HttpGet]
  [Route("")]
  [ProducesResponseType(StatusCodes.Status200OK)]
  [ResponseCache(
    Duration = 30
  , Location = ResponseCacheLocation.Any
  , NoStore = false
  , VaryByQueryKeys = new[] { "address" }
  )]
  public async Task<ActionResult <string>> GetSanitizedHtml ([FromQuery] string? address)
  { ... }

which calls a service. Nothing fancy.
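
(For completeness: server-side response caching is presumably enabled in Program.cs in the usual way; the report doesn't show that part, so the following is only a minimal sketch.)

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();
builder.Services.AddResponseCaching();   // server-side response caching middleware

var app = builder.Build();
app.UseResponseCaching();                // must run before the endpoints decorated with [ResponseCache]
app.MapControllers();
app.Run();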

Here's the service that gets invoked:

public class HtmlSanitizerService
{
  private readonly IHttpClientFactory             _httpClientFactory;
  private readonly HttpClient                     _httpClient;
  private readonly IHtmlSanitizer                 _sanitizer;
  private readonly ILogger <HtmlSanitizerService> _logger;

  private uint _requestNumber = 0;
  private uint _responseNumber = 0;
  public HtmlSanitizerService (IHttpClientFactory httpClientFactory, ILogger<HtmlSanitizerService> logger)
  {
    _logger            = logger;
    _logger.LogDebug("HtmlSanitizerService being initialized...");

    _httpClientFactory = httpClientFactory;
    _httpClient        = _httpClientFactory.CreateClient("name");

    _sanitizer         = new HtmlSanitizer();

    _logger.LogDebug("HtmlSanitizerService initialized...");
  }

  public async Task <string> SanitizeAsync (Uri uri)
  {
    _requestNumber++;

    _logger.LogDebug("> #{RequestNumber} request / HtmlSanitizerService.SanitizeAsync being executed...", _requestNumber);
    var html          = await _httpClient.GetStringAsync(uri);
    var htmlSanitized = _sanitizer.SanitizeDocument(html);

    _responseNumber++;
    _logger.LogDebug("< #{ResponseNumber} response / HtmlSanitizerService.SanitizeAsync complete...", _responseNumber);
    return htmlSanitized;
  }
}

Now if I run a load test on this endpoint on Kestrel,

bombardier -c 100 -d 10s "http://localhost:5104/website?address=localhost:8080"

I observe exactly 100 service invocations (each of which is an expensive operation). To prevent this from happening, all I need to do is run a single preliminary request against this endpoint, e.g., something like

curl -v "http://localhost:5104/website?address=localhost:8080" 1>/dev/null

which populates the cache, so that the same 100 concurrent connections immediately afterwards are served from the cache and no additional invocations take place.

Exceptions (if any)

System.OutOfMemoryException from Ganss.Xss.HtmlSanitizer (inside a singleton service) due to multiple concurrent invocations

.NET Version

6 and 7-rc2

Anything else?

ASP.NET Core 6.0 and 7.0-rc2

@DamianEdwards
Member

Looks like a cache stampede. Casting my mind back to when response caching was made, this might be a known design trait, but it would be good to have others confirm. The new Output Cache system in .NET 7 should not suffer from this, as I believe it implements protection against it, @sebastienros
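
(A rough sketch of what switching the endpoint above to the new .NET 7 output cache might look like, reusing the same duration and query-key variation; illustrative only, not taken from the issue.)

// Program.cs — enable output caching (ASP.NET Core 7)
builder.Services.AddOutputCache();
// ...
app.UseOutputCache();

// Controller action — [OutputCache] from Microsoft.AspNetCore.OutputCaching in place of [ResponseCache]
[HttpGet]
[ProducesResponseType(StatusCodes.Status200OK)]
[OutputCache(Duration = 30, VaryByQueryKeys = new[] { "address" })]
public async Task<ActionResult<string>> GetSanitizedHtml ([FromQuery] string? address)
{ ... }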

@demming
Author

demming commented Oct 23, 2022

@DamianEdwards Thanks for pointing it out. With OutputCache the major issue of this ticket doesn't reappear, which is great, albeit at the cost of an apparent performance degradation. Is this expected behavior, or can I perhaps reconfigure it somehow? Here are my observations, consistent across multiple runs on the same machine in random order, all other things being equal.

ResponseCache

Cold start after a single preliminary request,

Statistics        Avg      Stdev        Max
  Reqs/sec      1819.03     747.16    4768.04
  Latency       54.82ms   260.31ms      3.68s
  Latency Distribution
     50%    16.63ms
     75%    39.37ms
     90%    94.10ms
     95%   169.66ms
     99%   359.51ms
  HTTP codes:
    1xx - 0, 2xx - 18267, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     1.61GB/s

Warmed up VM

Statistics        Avg      Stdev        Max
  Reqs/sec      2403.11     760.91    6255.78
  Latency       41.59ms    38.91ms   392.12ms
  Latency Distribution
     50%    31.12ms
     75%    43.83ms
     90%    78.70ms
     95%   135.67ms
     99%   199.67ms
  HTTP codes:
    1xx - 0, 2xx - 24069, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.13GB/s

OutputCache

Cold start (no preliminary request, no concurrent cache population), 330 MB RSS at peak,

Statistics        Avg      Stdev        Max
  Reqs/sec      1245.57    1236.28   10094.09
  Latency       75.79ms   291.21ms      6.81s
  Latency Distribution
     50%    35.62ms
     75%    55.45ms
     90%    94.81ms
     95%   168.56ms
     99%   700.20ms
  HTTP codes:
    1xx - 0, 2xx - 12972, 3xx - 0, 4xx - 0, 5xx - 0
    others - 303
  Errors:
    the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection - 303
  Throughput:     1.13GB/s

Warmed up VM

Statistics        Avg      Stdev        Max
  Reqs/sec      1843.32     690.15    4483.41
  Latency       54.08ms    43.51ms   394.33ms
  Latency Distribution
     50%    41.19ms
     75%    65.43ms
     90%   114.77ms
     95%   154.14ms
     99%   211.52ms
  HTTP codes:
    1xx - 0, 2xx - 18531, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     1.64GB/s

Akka HTTP with Caffeine cache

OpenJDK 19.0.1 with only -Xmx64m, 240 MB RSS at peak.

Cold start

Statistics        Avg      Stdev        Max
  Reqs/sec      3865.37    1945.63    9622.19
  Latency       25.82ms    72.63ms      1.42s
  Latency Distribution
     50%    17.88ms
     75%    25.79ms
     90%    38.83ms
     95%    53.65ms
     99%   102.69ms
  HTTP codes:
    1xx - 0, 2xx - 38743, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.60GB/s

Warmed up VM

Statistics        Avg      Stdev        Max
  Reqs/sec      8110.44    1403.61   13254.54
  Latency       12.32ms    14.04ms   761.26ms
  Latency Distribution
     50%    11.14ms
     75%    14.62ms
     90%    18.74ms
     95%    21.85ms
     99%    33.78ms
  HTTP codes:
    1xx - 0, 2xx - 81168, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     5.45GB/s

On a side note, neither Spring WebFlux nor Quarkus reaches the caching performance of OutputCache above out of the box---if this is of any help.

@DamianEdwards
Member

@sebastienros

@DamianEdwards
Member

Looking at your numbers, it seems that latency was higher for Output Cache, but so was throughput. Given everything is local (and without a profile) it's hard to draw any real conclusions from that. @sebastienros do you have any thoughts? Certainly, server profiles from both runs would show what's going on here.

@demming
Author

demming commented Oct 24, 2022

@DamianEdwards The throughput figures with Output Cache are actually lower (1.64 < 2.13), which makes sense given the latency numbers. I'd be glad to provide more, and more reliable, data points. The local environment certainly doesn't provide a strict baseline profile. What server profiles are you referring to?

@DamianEdwards
Member

Ah, sorry, I was misreading the results (Max vs. Avg).

By "profile" I mean a performance profile using CPU sampling, via the VS Performance Profiler or dotnet-trace.

@ghost

ghost commented Oct 24, 2022

Thanks for contacting us.

We're moving this issue to the .NET 8 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue during our next planning meeting(s).
If we later determine that the issue has no community involvement, or that it's a very rare and low-impact issue, we will close it - so that the team can focus on more important and high-impact issues.
To learn more about what to expect next and how this issue will be handled, you can read more about our triage process here.

@demming
Author

demming commented Oct 28, 2022

Ah, sorry, I was misreading the results (Max vs. Avg).

By "profile" I mean a performance profile using CPU sampling, via the VS Performance Profiler or dotnet-trace.

Sure thing, will do. For the time being, it appears there's a memory leak in the output cache. I've just noticed the RSS of my dotnet process going through the roof under the same load and soak test as above, with RSS increasing stepwise roughly each time the cache should be evicted. It's almost a gigabyte now.

But when I hit the response-cached endpoint in the same dotnet process, RSS first steps a bit higher but then drops back to a more reasonable ~300 MB.

@demming
Author

demming commented Oct 28, 2022

Here's a memory profile (png) from:

  1. running a 30s burst against the output-cached endpoint with 100 concurrent connections,
  2. waiting just shy of a minute, and then
  3. running a single request against the response-cached endpoint to populate the cache and avoid a cache stampede, prior to the pretty long GC sweep, and finally
  4. replicating 1. against the response-cached endpoint for 10s, which provoked the longer GC sweep.

I can upload the project and the profile if it would help.

@ben-manes

FWIW, the IMemoryCache might be a bottleneck, judging from this comparison by a cache library written by an Azure engineer.

@sebastienros
Member

@ben-manes the benchmarks actually show that on their machine it supports ~40M transactions per second. I think that actually shows it won't be a bottleneck.

@sebastienros
Member

@demming I confirm that, by design, Response Caching doesn't handle cache stampedes the way Output Caching does. This is one reason we introduced Output Caching. Though in theory we could also implement it in Response Caching in a future release (maybe a contribution?).

Which version did you use for the benchmarks WRT Output Caching? I updated it to use binary serialization in RC2 in order to improve performance; it might help. However, with Response Caching there is no such serialization step (it always uses in-memory storage), which could explain the performance differences.

@DamianEdwards
Member

@demming I'm not seeing enough there to conclude there's an actual memory leak, vs. expected allocations and respective GC under pressure. If you have an app and profile you can share that'd be great.

@amcasey amcasey added the area-networking label and removed the area-runtime label Aug 25, 2023
@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun label Feb 6, 2024
@demming
Author

demming commented Feb 6, 2024

@sebastienros, @DamianEdwards: I've been out of the loop on this particular issue, appreciate your staying on track with it. Do you still need my input/assistance on it?

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun label Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@mgravell
Member

mgravell commented Feb 16, 2024

I'm going to cross-reference #53255 here, which is a proposed net9 extension to IDistributedCache (actually, a new tier) that gives cache stampede protection (among other things). If we're talking about output-cache, it should be noted that output-cache doesn't currently use IDistributedCache - it has a separate backend layer; however, two options present themselves:

  1. build a wrapper tier for IOutputCacheStore that allows it to use the new abstraction, gaining all the things for free, or
  2. re-use (copy) the stampede protection implementation between the two libraries

I'm more enthusiastic about 1; I'll try a prototype, to see how well it works; update: not hard, but completely untested (and we'd probably want some kind of key isolation/disambiguation):

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using Microsoft.AspNetCore.OutputCaching;

namespace Microsoft.Extensions.Caching.Distributed.Tests;

// context: experiment for https://github.com/dotnet/aspnetcore/issues/44696
internal class OutputCacheWrapper(ReadThroughCache underlying) : IOutputCacheStore
{
    ValueTask IOutputCacheStore.EvictByTagAsync(string tag, CancellationToken cancellationToken)
        => underlying.RemoveTagAsync(tag, cancellationToken);

    async ValueTask<byte[]?> IOutputCacheStore.GetAsync(string key, CancellationToken cancellationToken)
        => (await underlying.GetAsync<byte[]>(key, null, cancellationToken)).Value;

    ValueTask IOutputCacheStore.SetAsync(string key, byte[]? value, string[]? tags, TimeSpan validFor, CancellationToken cancellationToken)
        => value is null
            ? underlying.RemoveKeyAsync(key, cancellationToken)
            : underlying.SetAsync<byte[]>(key, value, new ReadThroughCacheEntryOptions(validFor), tags, cancellationToken);
}
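
(A hypothetical wiring for the prototype above; ReadThroughCache is the proposed new tier from #53255, not a shipped API.)

// Hypothetical DI registration that routes the output cache through the stampede-protected tier.
services.AddSingleton<IOutputCacheStore, OutputCacheWrapper>();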
