Skip to content

Almost complete rewrite of cache filter#37990

Closed
ravenblackx wants to merge 81 commits intoenvoyproxy:mainfrom
ravenblackx:cache_stream
Closed

Almost complete rewrite of cache filter#37990
ravenblackx wants to merge 81 commits intoenvoyproxy:mainfrom
ravenblackx:cache_stream

Conversation

@ravenblackx
Copy link
Contributor

@ravenblackx ravenblackx commented Jan 13, 2025

Commit Message: Almost complete rewrite of cache filter
Additional Description: This is necessarily a massive change because the old cache filter was almost non-functional. Important differences:

  1. Support for vary headers has been removed pending another (much smaller) PR - turns out the old implementation had bugged support that led to crashes. At this version the behavior in the presence of vary headers is to not cache, meaning results will be correct but the cache will not be useful. Postponing is in aid of getting this PR landed because it's blocking various other things - correct vary support requires a lot more test coverage.
  2. it's now resilient against "thundering herd" problem - only one request goes upstream if there are multiple simultaneous requests for the same resource.
  3. it now works with range requests, which, if there previously were only range requests, as can happen with Microsoft BITS as a client, would all just bypass the cache. The new behavior is a range request causes a full file request upstream, and then serves the range response once the requested part of the file is cached. This could be improved by also supporting partial cache entries, but it's reasonable for that to be a later PR.
  4. a lot of work that the cache filter previously offloaded to the cache implementation is now owned by the cache filter (e.g. vary headers, validation, thundering herd resistance). This makes cache implementation much less onerous, but adds some intermediate layers.
  5. as a compromise towards implementation simplicity, headers and trailers are cached in memory in the in-between layer (CacheSession). This will use a bit more memory than having everything offloaded to the cache implementation (for cache implementations that didn't keep anything in memory).
  6. cache requests to upstream now bypass any filters that are upstream of the cache filter and aren't on the cluster (including e.g. the router filter is where add_headers operations in the RouteConfiguration get applied). This would cause breakages and require config change for any configuration that relies on that not being the case. I tried many approaches to make it not break in this way, but apparently it's just not possible without large risky core changes.

Risk Level: Pretty bad, but the filter is still tagged WIP, so not critical risk.
Testing: There's quite a lot, but still needs a bit more coverage. Existing integration tests still function mostly as they were, and added integration tests for range and parallel requests which would have failed with the old cache.
Docs Changes: Added details of the new rather more awkward configuration requirements.
Release Notes: Yes.
Platform Specific Features: n/a

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #37990 was opened by ravenblackx.

see: more, trace.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Copy link
Contributor Author

/coverage

@repokitteh-read-only
Copy link

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/37990/coverage/index.html

The coverage results are (re-)rendered each time the CI envoy-presubmit (check linux_x64 coverage) job completes.

🐱

Caused by: a #37990 (comment) was created by @ravenblackx.

see: more, trace.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Copy link
Contributor Author

/retest

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@phlax
Copy link
Member

phlax commented Aug 5, 2025

i think this needs some offline resolution

/wait

@ravenblackx
Copy link
Contributor Author

This has run into an immoveable wall - the change to the API has been deemed unacceptable, but the implementation basically can't work usefully without that change, so the whole thing is now irredeemable.

The previous state of the cache filter is also bug-ridden and essentially nonfunctional for many use-cases, so this is a serious problem that seems unlikely to be resolved in any direction in the near future.

There are two semi-viable proposals for fixing it:

  1. make AsyncClient capable of reproducing the behavior of the HttpConnectionManager and partial filter chain from a filter. Currently AsyncClient can't do that, and accessing the necessary filter chain and HCM details from a filter (which would be a prerequisite of initializing such an AsyncClient if it had that capability) is also not possible. It's likely that any attempt to implement this would be met with "no you can't make that change because it's not supposed to do that".
  2. make HttpConnectionManager capable of accepting a disconnect from downstream without automatically closing the upstream request. This seems like the most viable option but there are already objections to this concept too, and there's nobody interested in trying to implement it or push it through those objections.

Both of these would also require significant rewrites of the cache filter, again, even if the prerequisites were not blocked.

@github-actions
Copy link

github-actions bot commented Sep 4, 2025

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 4, 2025
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Sep 11, 2025
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Copy link
Contributor Author

New plan is to make this HttpCacheFilterV2 (and the sub-extensions also V2), with WIP and alpha status so people shouldn't get too attached to it, while the old version fades into contrib/deprecated.

…d; ok

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 12, 2025
yanavlasov pushed a commit that referenced this pull request Sep 29, 2025
This is essentially just moving #37990 into a
new namespace. In addition to simple renames, some factory functions
were changed from using throw to returning StatusOr,
`allowed_vary_headers` was hidden as not implemented, and the whole API
was tagged WIP as the old version always should have been. There was
also a side-effect that the protobuf stableHashKey value for a given
cache entry changed because the proto message full-names changed; I
bumped the file system cache version so that pre-existing cache entries
will be invalidated.

From #37990 the improved features of this new cache implementation are:
* Thundering-herd resistance (shared streaming from the upstream for
cacheable responses)
* `vary` headers are unsupported temporarily, rather than
broken-supported as they were before.
* Range requests now provoke a full upstream request to populate the
cache, rather than just passing through and never caching. This may be
worse for use-cases that would benefit from partial caching, but neither
version supported partial caching.
* A lot of work that used to be required of the cache implementation is
now part of the cache filter, so cache implementations can be much
simpler - essentially just storage now, no need to understand and
process vary headers.
* Headers and trailers are temporarily cached in memory as well as in
the cache storage.
* More test coverage. Old cache implementation has an exception in the
allowable test coverage file.

And the big dubious feature is:
* Cache requests to upstream now bypass any filters that are upstream of
the cache filter and aren't on the cluster, including e.g. the router
filter where add_headers operations would be applied. Some awkward
configuration is required (and documented) to work around this
constraint. Unfortunately the constraint was unavoidable with how envoy
works at this time.

Risk Level: Very low since it doesn't change existing APIs and only adds
a WIP no-security-posture filter, don't use it!
Testing: Loads.
Docs Changes: Mostly-duplicate docs for the new version.
Release Notes: Follow shortly.
Platform Specific Features: No windows support for file system cache,
same as before.

---------

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Copy link
Contributor Author

Replaced by #41148

@ravenblackx ravenblackx deleted the cache_stream branch September 29, 2025 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants