Delete non-LRU cache in SPIRE Agent #5383

amoore877 · 2024-08-13T22:52:09Z

Pull Request check list

[X] Commit conforms to CONTRIBUTING.md?
[X] Proper tests/regressions included?
[X] Documentation updated?

Affected functionality

Remove the original caching implementation in SPIRE Agent, as well as the cache size configuration. Redo of #5184.

Description of change

Per the plan in #4224, perform the final update by making the LRU cache with size 1000 the default, non-configurable implementation.

Also per #4224, this is not for release until at least 1.11, based on #5150, which deprecates the options but does not delete them, and which went out in 1.10.

Which issue this PR fixes

Fixes #4224

amoore877 · 2024-08-13T23:10:24Z

hmm little confused how https://github.com/spiffe/spire/actions/runs/10378374004/job/28734731515?pr=5383#step:9:1301 is failing:

[2024-08-13T23:06:18Z] executing 05-fetch-x509-svids...
[2024-08-13T23:06:18Z] Expected 10 X.509-SVIDs and received 10 for uid 1001

we want 10 and got 10. so should be good?

amoore877 · 2024-08-13T23:13:45Z

hmm little confused how https://github.com/spiffe/spire/actions/runs/10378374004/job/28734731515?pr=5383#step:9:1301 is failing:
[2024-08-13T23:06:18Z] executing 05-fetch-x509-svids...
[2024-08-13T23:06:18Z] Expected 10 X.509-SVIDs and received 10 for uid 1001
we want 10 and got 10. so should be good?

ahh the test is setting cache size 8 and expecting a drop

amoore877 · 2024-08-14T15:19:11Z

[2024-08-14T14:39:55Z] executing 04-ban-agent...
[2024-08-14T14:39:55Z] banning agent...
Error: rpc error: code = NotFound desc = agent not found
[2024-08-14T14:39:55Z] step 04-ban-agent failed

hmm separate from this PR I think this integration test might be flaky; we may not be doing enough checking that the agent is fully up and in the SPIRE DB

sorindumitru · 2024-09-10T18:46:13Z

cmd/spire-agent/cli/run/run.go

@@ -120,10 +120,6 @@ type experimentalConfig struct {
 	UseSyncAuthorizedEntries bool   `hcl:"use_sync_authorized_entries"`

 	Flags fflag.RawConfig `hcl:"feature_flags"`
-
-	UnusedKeyPositions   map[string][]token.Pos `hcl:",unusedKeyPositions"`
-	X509SVIDCacheMaxSize int                    `hcl:"x509_svid_cache_max_size"`


I was wondering if we shouldn't keep this setting but promote it out of experimental. It has some implications for existing deployments, namely that the initial fetch of an SVID may become slower now if there are more than 1000 active workloads. It can become noticeable for the case where you run lots of short-lived workloads.

Or maybe keep it as experimental in case someone complains, and remove it after a few releases if nobody complains?

Or maybe keep it as experimental in case someone complains, and remove it after a few releases if nobody complains?

so this is actually what is being executed. 1.10 already includes merged PR #5150, which deprecates this feature (via warnings in logs). This PR is planned for 1.11 to finally remove it entirely.

I'm sure if complaints are raised about the deprecation and removal, SPIRE maintainers and the community would be happy to discuss. You can raise this in SPIFFE slack or in contributor sync.

It can become noticeable for the case where you run lots of short-lived workloads.

fwiw, and this of course would vary significantly based on operator environment, my organization's own load testing on the version where this feature was introduced showed at most 1-3s of FetchX509SVID latency experienced by consumers when the cache size was set to 1k and the agent was being cycled through 26k registrations.

I'd be curious to know how registrations are managed for short-lived workloads in your own environment: are they always statically registered? if not, are they aggressively or lazily culled? from a security perspective it's best for an agent to be assigned precisely the workload identities that are expected to be running with it at that exact moment, though of course there are windows where there would be excess, depending on strategy.

It's just something I wanted to raise as a potential issue, as I've run into a similar issue with other pieces of software. 1-3 seconds for something that is supposed to be started lots of times, for example batch jobs, can add up to a lot. With the probably more common use case of longer-lived services, I don't think a couple of seconds of startup delay is going to matter that much.

We don't mandate short-lived vs long-lived, but registration entries would usually be towards the longer-lived end, so I don't expect to have any issues. There are some plans for using it in a place with shorter-lived workloads, so I'll see at that point whether I actually have any issues.

I was wondering if we shouldn't keep this setting but promote it out of experimental

The LRU SVID cache has been enabled by default since v1.9.0, using a cache size of 1000. Since we haven't heard from users about a need to tune it to a different value, we went ahead with the plan to deprecate x509_svid_cache_max_size in v1.10.0 and remove it in v1.11.0.

I'm not sure if that catches some of the more subtle ways this changes deployments. For a while people have been used to thinking that the agent caches X509-SVIDs for all workloads. This means that if spire-server becomes unavailable for some time, you still have up to X509-SVID TTL/2 to fix it, since the agent can serve SVIDs from the cache.

This now changes it so that the agent caches up to 1000 X509-SVIDs, but no more until the workloads actually start and start requesting SVIDs. This means that the time to fix things can be shortened drastically. 1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

I don't want to hard block this, but I think it's worth having the maintainers consider this a bit more.
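To make the bounded-cache behavior discussed here concrete, below is a minimal sketch of a size-limited LRU cache in Go. It is not the agent's implementation in pkg/agent/manager/cache, which handles concerns this sketch ignores; the `lruCache` type, its methods, and the example SPIFFE IDs are hypothetical and exist only to show that entries beyond the limit are dropped, oldest first.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry pairs a key with its value so that eviction can also clear the map slot.
type entry struct {
	key   string
	value string
}

// lruCache is a hypothetical fixed-capacity cache, not SPIRE's actual type.
type lruCache struct {
	maxSize int
	order   *list.List               // front = most recently used
	items   map[string]*list.Element // key -> node in order
}

func newLRUCache(maxSize int) *lruCache {
	return &lruCache{
		maxSize: maxSize,
		order:   list.New(),
		items:   make(map[string]*list.Element),
	}
}

// Put inserts or refreshes a key and evicts the least recently used entry
// once the cache holds more than maxSize entries.
func (c *lruCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.maxSize {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Get returns the cached value and marks the key as recently used.
func (c *lruCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Len reports how many entries are currently cached.
func (c *lruCache) Len() int { return c.order.Len() }

func main() {
	// Ten entries offered to a cache limited to eight.
	c := newLRUCache(8)
	for i := 0; i < 10; i++ {
		c.Put(fmt.Sprintf("spiffe://example.org/workload-%d", i), "svid")
	}
	_, ok := c.Get("spiffe://example.org/workload-0")
	fmt.Println(c.Len(), ok) // 8 false: the two oldest entries were evicted
}
```

In this simplified model, anything past the limit has to be fetched again on demand, which is the source of both the extra first-fetch latency and the shorter outage window raised in this thread.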

@sorindumitru totally understand :) I definitely appreciate wanting to protect global operator experience and the benefit of open source is being able to have these sorts of conversations and concerns raised.

1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

iiuc, there is generally a desire from maintainers to limit just how many levers and knobs SPIRE has. It can already be difficult for some groups to adopt it, and each control is a potential point of confusion or breakage, as well as a new dimension and permutation of deployment that maintainers need to support and receive issue/feature requests about. so that is a key motivator for removing this configuration. there is also definitely a split in operator groups: one that takes the binaries / source code as-is, and one that internally forks it to make adjustments (such as on hard-coded values, as the most trivial example).

fwd: @amartinezfayo - if there's more to add not covered here or in #5383 (comment)

Thank you @sorindumitru and @amoore877 for all the feedback and thoughts on this.
I think that the last comment from @amoore877 reflects pretty well what the maintainers' group considers when a new setting is being introduced.
We discussed this again in our last maintainers' call, and after reevaluating it, we feel that it would be prudent to keep the x509_svid_cache_max_size setting. The concerns raised by @sorindumitru are valid. At this point, I personally think that being in a situation where you need to adjust the cache size and cannot is a lot more problematic than having one more setting that can be tweaked, especially since there will be a default value and users will not need to be aware of this setting unless they need to adjust it. Having proper documentation explaining when you may need to use this setting (e.g. when the agent handles more than 1000 active workloads) will help.
I think that we can promote it to a stable setting. @amoore877 Could you update the PR to reflect that?
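As a rough illustration of "a default value that users can ignore", the fallback could look like the Go sketch below. The helper `resolveCacheMaxSize` and the constant name are hypothetical, and treating 0 as "unset" and rejecting negative values are assumptions of this sketch, not confirmed agent behavior; only the setting name and the default of 1000 come from this thread.

```go
package main

import "fmt"

// defaultX509SVIDCacheMaxSize mirrors the default of 1000 mentioned in this thread.
const defaultX509SVIDCacheMaxSize = 1000

// resolveCacheMaxSize is a hypothetical helper: it returns the configured
// value when one is set and falls back to the default otherwise.
// Assumption: 0 means "not configured" and negative values are invalid.
func resolveCacheMaxSize(configured int) (int, error) {
	switch {
	case configured < 0:
		return 0, fmt.Errorf("x509_svid_cache_max_size must not be negative, got %d", configured)
	case configured == 0:
		return defaultX509SVIDCacheMaxSize, nil
	default:
		return configured, nil
	}
}

func main() {
	for _, configured := range []int{0, 5000} {
		size, err := resolveCacheMaxSize(configured)
		if err != nil {
			fmt.Println("invalid config:", err)
			continue
		}
		fmt.Printf("configured=%d effective=%d\n", configured, size)
	}
}
```

With this shape, operators who never touch the setting get 1000, and those running more active workloads than that can raise it explicitly.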

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

yet another PR (to reduce how much has to be reviewed) would then promote the setting out of experimental.

would we still be removing the non-LRU cache code?

would be good to update #4224 with the full goal state from maintainers

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

Since the x509_svid_cache_max_size setting will no longer belong to the experimental section, I think that the deprecation notice was OK, because the setting will not be recognized there anymore. In terms of closing this PR, I was thinking that it could just be updated to move the x509_svid_cache_max_size setting out of the experimental section, making it a stable configuration setting. But if you prefer to handle that in a separate PR, that would also work.

would we still be removing the non-LRU cache code?

Yes, no changes with that, we are removing the non-LRU cache code in v1.11.0.

would be good to update #4224 with the full goal state from maintainers

I've updated #4224 accordingly.

pkg/agent/manager/cache/lru_cache.go

pkg/agent/manager/cache/workload.go

amartinezfayo · 2024-09-11T13:23:57Z

cmd/spire-agent/cli/run/run.go

@@ -120,10 +120,6 @@ type experimentalConfig struct {
 	UseSyncAuthorizedEntries bool   `hcl:"use_sync_authorized_entries"`

 	Flags fflag.RawConfig `hcl:"feature_flags"`
-
-	UnusedKeyPositions   map[string][]token.Pos `hcl:",unusedKeyPositions"`
-	X509SVIDCacheMaxSize int                    `hcl:"x509_svid_cache_max_size"`


I was wondering if we shouldn't keep this setting but promote it out of experimental

The LRU SVID cache has been enabled by default since v1.9.0, using a cache size of 1000. Since we haven't heard from users about a need to tune it to a different value, we went ahead with the plan to deprecate x509_svid_cache_max_size in v1.10.0 and remove it in v1.11.0.

amoore877 · 2024-09-25T15:09:03Z

another PR will follow adjusted plan

amoore877 · 2024-09-27T17:35:45Z

adjusted plan for expediency on release from talking to @amartinezfayo :

this PR will be merged
maintainers will add config for cache size in a new PR

amartinezfayo

Thank you @amoore877 for this!
We will be opening a PR adding the x509_svid_cache_max_size setting.

amoore877 requested review from evan2645, amartinezfayo, azdagron, MarcosDY and rturner3 as code owners August 13, 2024 22:52

amoore877 mentioned this pull request Aug 14, 2024

reduce flakiness in evict-agent CI #5386

Merged

2 tasks

azdagron assigned amartinezfayo Aug 15, 2024

azdagron added this to the 1.11.0 milestone Aug 15, 2024

amoore877 added 8 commits August 19, 2024 12:50

compiles wheeeee

c57535d

Signed-off-by: amoore877 <[email protected]>

tests compile

4ef00d8

Signed-off-by: amoore877 <[email protected]>

local tests passing

b14ca20

Signed-off-by: amoore877 <[email protected]>

fix: lint (linux), lint (windows)

4cdf1a2

Signed-off-by: amoore877 <[email protected]>

fix integration test fetch-x509-svids

f838ce8

Signed-off-by: amoore877 <[email protected]>

one more removeable config ref

f7ac93a

Signed-off-by: amoore877 <[email protected]>

fix: lint (linux), lint (windows)

7a077c2

Signed-off-by: amoore877 <[email protected]>

helpful test comments

b148c8f

Signed-off-by: amoore877 <[email protected]>

amoore877 force-pushed the delete_non_lru_for_reals branch from c531e41 to b148c8f Compare August 19, 2024 19:51

amoore877 added 4 commits August 20, 2024 08:18

Merge branch 'main' into delete_non_lru_for_reals

81a6821

Merge branch 'main' into delete_non_lru_for_reals

a6653fa

Merge branch 'main' into delete_non_lru_for_reals

02fd57b

Merge branch 'main' into delete_non_lru_for_reals

d7330e6

sorindumitru reviewed Sep 10, 2024

amartinezfayo reviewed Sep 11, 2024

amoore877 and others added 4 commits September 11, 2024 11:04

Merge branch 'main' into delete_non_lru_for_reals

d11a640

struct doc

37ec2dd

Signed-off-by: amoore877 <[email protected]>

please stop the screaming

5cbf474

Signed-off-by: amoore877 <[email protected]>

struct doc 2

0dce619

Signed-off-by: amoore877 <[email protected]>

amoore877 and others added 4 commits September 11, 2024 11:30

simplify struct

42e33b3

Signed-off-by: amoore877 <[email protected]>

comment fix

7e6e85a

Signed-off-by: amoore877 <[email protected]>

Merge branch 'main' into delete_non_lru_for_reals

a1c5b9e

Merge branch 'main' into delete_non_lru_for_reals

686804f

amoore877 mentioned this pull request Sep 23, 2024

Enable SPIRE Agent LRU Cache #4224

Closed

amoore877 closed this Sep 25, 2024

amoore877 reopened this Sep 27, 2024

amoore877 and others added 2 commits September 27, 2024 10:35

Merge branch 'main' into delete_non_lru_for_reals

dd27054

Merge branch 'main' into delete_non_lru_for_reals

2c0fb70

amartinezfayo approved these changes Sep 27, 2024

amartinezfayo merged commit 182b594 into spiffe:main Sep 27, 2024
34 checks passed

amartinezfayo mentioned this pull request Oct 2, 2024

Have x509_svid_cache_max_size as agent config setting #5531

Merged
