Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Events based cache issues in SPIRE 1.11 #5771

Open
sorindumitru opened this issue Jan 13, 2025 · 6 comments
Open

Events based cache issues in SPIRE 1.11 #5771

sorindumitru opened this issue Jan 13, 2025 · 6 comments
Assignees
Labels
priority/urgent Issue is approved and is must be completed in the assigned milestone
Milestone

Comments

@sorindumitru
Copy link
Collaborator

We’ve seen some issues over the weekend with the events based cache. Instances that have restarted have taken a lot of time to populate the cache and become available:

{"file":"authorized_entryfetcher.go:40","func":"[github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache](https://github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache)","level":"info","msg":"Building event-based in-memory entry cache","subsystem_name":"endpoints","time":"2025-01-11T14:13:56Z"}
{"file":"authorized_entryfetcher.go:45","func":"[github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache](https://github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache)","level":"info","msg":"Completed building event-based in-memory entry cache","subsystem_name":"endpoints","time":"2025-01-11T14:41:02Z"}

From the metrics I can tell this it spent this time fetching registration entries. Hydrating the cache seems to now:

  1. Read all entries from the datastore
  2. Read all events from the datastore
  3. Read all entries associated with those events

In this case we had a large number of events, maybe ~500k, which meant the server was stuck here for ~30 minutes. Looking through the past logs I can see it taking multiple seconds or minutes before.

Additionally, we mark the fetched events as processed in

, but later on we might completely bail on processing the entries associated with them due to the datastore returning an error in . Since we then only fetch changed events from the last processed one, we might skip some entry updates.

The version of the events based cache we had in 1.10.4 doesn’t seem to have these issues. It would be worth considering reverting the latest changes to the cache since they only improve a small edge case (de-duplicating entries that might be created due to long running transactions started before the cache was hydrated; I can’t imagine there being more than a couple and usually 0 of these). #5509 is the change that would need to be reverted.

@amoore877, @stevend-uber tagging you here since I know you were also using the events based cache.

  • Version: >= 1.11.0
  • Platform: any
  • Subsystem: server, agent
@stevend-uber
Copy link
Contributor

@sorindumitru Unfortunately right now we are using the 1.10 version of SPIRE-Server and don't currently see that. We did very thorough testing of this exact scenario during the performance evaluation of the feature and confirm that this wasn't an issue in anything in prior to or including the 1.10 version of the events-based cache.

@sorindumitru
Copy link
Collaborator Author

@stevend-uber I believe this is only an issue starting with 1.11.0. That's where #5509 was included.

@faisal-memon
Copy link
Contributor

Please assign this to me, Ill take a look

@rturner3 rturner3 added the triage/in-progress Issue triage is in progress label Jan 14, 2025
@faisal-memon
Copy link
Contributor

faisal-memon commented Jan 28, 2025

Started looking into this. Prior to #5509 we had a workflow that did this at startup:

  1. Got the last event id
  2. Full read of all registration entries to load the cche
  3. Start polling every 5 seconds for all events after the last event id at startup

Seems like #5509 inadvertently removed that first step so it just steps though all the stale events at startup.

@sorindumitru
Copy link
Collaborator Author

sorindumitru commented Jan 28, 2025

Seems like #5509 inadvertently removed that first step so it just steps though all the stale events at startup.

Yes, I think if we bring back that initial read of skipped events and last event id it should be ok from that regard. We likely also wouldn't need the "updateCache" call during the initial cache hydration.

There's also the return in

which makes it possible to not apply some updates. I think we could fix that by not clearing the map of entries to fetch at the beginning of the refresh (
a.fetchEntries = make(map[string]struct{})
) and manually removing the entries that are succesfully fetched (which we already do in ). That would mean that failed entries will be retried on the next refresh.

@faisal-memon
Copy link
Contributor

Thanks @sorindumitru for confirming my suspicions. Ill get started on a fix for the initial cache hydration.

Will look into the issues with refresh next.

@amartinezfayo amartinezfayo added priority/urgent Issue is approved and is must be completed in the assigned milestone and removed triage/in-progress Issue triage is in progress labels Jan 28, 2025
@amartinezfayo amartinezfayo added this to the 1.11.2 milestone Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/urgent Issue is approved and is must be completed in the assigned milestone
Projects
None yet
Development

No branches or pull requests

5 participants