Events based cache issues in SPIRE 1.11 #5771

sorindumitru · 2025-01-13T13:10:05Z

We’ve seen some issues over the weekend with the events based cache. Instances that have restarted have taken a lot of time to populate the cache and become available:

{"file":"authorized_entryfetcher.go:40","func":"[github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache](https://github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache)","level":"info","msg":"Building event-based in-memory entry cache","subsystem_name":"endpoints","time":"2025-01-11T14:13:56Z"}
{"file":"authorized_entryfetcher.go:45","func":"[github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache](https://github.com/spiffe/spire/pkg/server/endpoints.NewAuthorizedEntryFetcherWithEventsBasedCache)","level":"info","msg":"Completed building event-based in-memory entry cache","subsystem_name":"endpoints","time":"2025-01-11T14:41:02Z"}

From the metrics I can tell this it spent this time fetching registration entries. Hydrating the cache seems to now:

Read all entries from the datastore
Read all events from the datastore
Read all entries associated with those events

In this case we had a large number of events, maybe ~500k, which meant the server was stuck here for ~30 minutes. Looking through the past logs I can see it taking multiple seconds or minutes before.

Additionally, we mark the fetched events as processed in

spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go

Line 138 in a49eaad

a.lastEvent = event.EventID

, but later on we might completely bail on processing the entries associated with them due to the datastore returning an error in

spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go

Line 230 in a49eaad

return err

. Since we then only fetch changed events from the last processed one, we might skip some entry updates.

The version of the events based cache we had in 1.10.4 doesn’t seem to have these issues. It would be worth considering reverting the latest changes to the cache since they only improve a small edge case (de-duplicating entries that might be created due to long running transactions started before the cache was hydrated; I can’t imagine there being more than a couple and usually 0 of these). #5509 is the change that would need to be reverted.

@amoore877, @stevend-uber tagging you here since I know you were also using the events based cache.

Version: >= 1.11.0
Platform: any
Subsystem: server, agent

The text was updated successfully, but these errors were encountered:

stevend-uber · 2025-01-13T20:20:43Z

@sorindumitru Unfortunately right now we are using the 1.10 version of SPIRE-Server and don't currently see that. We did very thorough testing of this exact scenario during the performance evaluation of the feature and confirm that this wasn't an issue in anything in prior to or including the 1.10 version of the events-based cache.

sorindumitru · 2025-01-13T21:28:40Z

@stevend-uber I believe this is only an issue starting with 1.11.0. That's where #5509 was included.

faisal-memon · 2025-01-13T23:46:03Z

Please assign this to me, Ill take a look

faisal-memon · 2025-01-28T08:31:40Z

Started looking into this. Prior to #5509 we had a workflow that did this at startup:

Got the last event id
Full read of all registration entries to load the cche
Start polling every 5 seconds for all events after the last event id at startup

Seems like #5509 inadvertently removed that first step so it just steps though all the stale events at startup.

sorindumitru · 2025-01-28T10:52:07Z

Seems like #5509 inadvertently removed that first step so it just steps though all the stale events at startup.

Yes, I think if we bring back that initial read of skipped events and last event id it should be ok from that regard. We likely also wouldn't need the "updateCache" call during the initial cache hydration.

There's also the return in

spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go

Line 230 in a49eaad

return err

which makes it possible to not apply some updates. I think we could fix that by not clearing the map of entries to fetch at the beginning of the refresh (

spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go

Line 44 in a49eaad

a.fetchEntries = make(map[string]struct{})

) and manually removing the entries that are succesfully fetched (which we already do in

spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go

Line 248 in a49eaad

delete(a.fetchEntries, entryId)

). That would mean that failed entries will be retried on the next refresh.

faisal-memon · 2025-01-28T17:36:47Z

Thanks @sorindumitru for confirming my suspicions. Ill get started on a fix for the initial cache hydration.

Will look into the issues with refresh next.

rturner3 assigned faisal-memon Jan 14, 2025

rturner3 added the triage/in-progress Issue triage is in progress label Jan 14, 2025

faisal-memon mentioned this issue Jan 28, 2025

Proposal for an alternative algorithm for events-based cache event tracking #5624

Open

amartinezfayo added priority/urgent Issue is approved and is must be completed in the assigned milestone and removed triage/in-progress Issue triage is in progress labels Jan 28, 2025

amartinezfayo added this to the 1.11.2 milestone Jan 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Events based cache issues in SPIRE 1.11 #5771

Events based cache issues in SPIRE 1.11 #5771

sorindumitru commented Jan 13, 2025

stevend-uber commented Jan 13, 2025

sorindumitru commented Jan 13, 2025

faisal-memon commented Jan 13, 2025

faisal-memon commented Jan 28, 2025 •

edited

Loading

sorindumitru commented Jan 28, 2025 •

edited

Loading

faisal-memon commented Jan 28, 2025

Events based cache issues in SPIRE 1.11 #5771

Events based cache issues in SPIRE 1.11 #5771

Comments

sorindumitru commented Jan 13, 2025

stevend-uber commented Jan 13, 2025

sorindumitru commented Jan 13, 2025

faisal-memon commented Jan 13, 2025

faisal-memon commented Jan 28, 2025 • edited Loading

sorindumitru commented Jan 28, 2025 • edited Loading

faisal-memon commented Jan 28, 2025

faisal-memon commented Jan 28, 2025 •

edited

Loading

sorindumitru commented Jan 28, 2025 •

edited

Loading