Events based cache issues in SPIRE 1.11 #5771
Comments
@sorindumitru Unfortunately, we are currently running SPIRE Server 1.10 and don't see this. We tested this exact scenario thoroughly during the performance evaluation of the feature and can confirm that it wasn't an issue in any version of the events-based cache up to and including 1.10.
@stevend-uber I believe this is only an issue starting with 1.11.0. That's where #5509 was included.
Please assign this to me, I'll take a look.
Started looking into this. Prior to #5509 we had a workflow that did this at startup:
It seems #5509 inadvertently removed that first step, so it now steps through all the stale events at startup.
Yes, I think if we bring back that initial read of skipped events and last event id it should be ok from that regard. We likely also wouldn't need the "updateCache" call during the initial cache hydration. There's also the return in
Thanks @sorindumitru for confirming my suspicions. I'll get started on a fix for the initial cache hydration, and will look into the issues with refresh next.
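To make the startup behavior discussed above concrete, here is a minimal sketch of hydration that records the last event ID before taking a full snapshot, so later polls only handle newer events. Everything in it (`Store`, `Event`, `Cache`, `Hydrate`, `Poll`) is a hypothetical stand-in and not the SPIRE datastore or cache API, and skipped-event tracking is left out for brevity.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a registration entry event row.
type Event struct {
	ID      uint
	EntryID string
}

// Store is a hypothetical in-memory stand-in for the datastore.
type Store struct {
	Events  []Event           // assumed to be ordered by ID
	Entries map[string]string // entry ID -> entry payload
	Fetches int               // number of per-entry fetches performed
}

// LastEventID returns the highest event ID, or 0 if there are no events.
func (s *Store) LastEventID() uint {
	if len(s.Events) == 0 {
		return 0
	}
	return s.Events[len(s.Events)-1].ID
}

// FetchEntry simulates one per-entry round trip to the datastore.
func (s *Store) FetchEntry(id string) (string, bool) {
	s.Fetches++
	v, ok := s.Entries[id]
	return v, ok
}

// ListEntries simulates a single bulk read of all entries.
func (s *Store) ListEntries() map[string]string {
	out := make(map[string]string, len(s.Entries))
	for k, v := range s.Entries {
		out[k] = v
	}
	return out
}

// Cache holds the entries plus the last event ID it has accounted for.
type Cache struct {
	entries     map[string]string
	lastEventID uint
}

// Hydrate reads the last event ID first, then loads a full snapshot.
// Events written in between are replayed by the next Poll, which is
// harmless because applying an event is idempotent in this sketch.
func Hydrate(s *Store) *Cache {
	last := s.LastEventID()
	return &Cache{entries: s.ListEntries(), lastEventID: last}
}

// Poll applies only events newer than lastEventID and returns how many
// events it applied.
func (c *Cache) Poll(s *Store) int {
	applied := 0
	for _, ev := range s.Events {
		if ev.ID <= c.lastEventID {
			continue // already covered by the snapshot
		}
		if v, ok := s.FetchEntry(ev.EntryID); ok {
			c.entries[ev.EntryID] = v
		} else {
			delete(c.entries, ev.EntryID) // entry was deleted
		}
		c.lastEventID = ev.ID
		applied++
	}
	return applied
}

func main() {
	s := &Store{Entries: map[string]string{"a": "v1"}}
	for i := uint(1); i <= 1000; i++ {
		s.Events = append(s.Events, Event{ID: i, EntryID: "a"})
	}
	c := Hydrate(s)
	fmt.Println("applied after hydrate:", c.Poll(s), "fetches:", s.Fetches)

	s.Entries["b"] = "v1"
	s.Events = append(s.Events, Event{ID: 1001, EntryID: "b"})
	fmt.Println("applied after new event:", c.Poll(s), "fetches:", s.Fetches)
}
```

In this sketch the stale backlog costs nothing at startup because the snapshot already reflects it. Only the event created after hydration triggers a per-entry fetch.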
We’ve seen some issues over the weekend with the events based cache. Instances that have restarted have taken a lot of time to populate the cache and become available:
From the metrics I can tell that it spent this time fetching registration entries. Hydrating the cache seems to now:
In this case we had a large number of events, maybe ~500k, which meant the server was stuck here for ~30 minutes. Looking through the past logs I can see it taking multiple seconds or minutes before.
Additionally, we mark the fetched events as processed in
spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go
Line 138 in a49eaad
spire/pkg/server/endpoints/authorized_entryfetcher_registration_entries.go
Line 230 in a49eaad
The version of the events based cache we had in 1.10.4 doesn’t seem to have these issues. It would be worth considering reverting the latest changes to the cache since they only improve a small edge case (de-duplicating entries that might be created due to long running transactions started before the cache was hydrated; I can’t imagine there being more than a couple and usually 0 of these). #5509 is the change that would need to be reverted.
@amoore877, @stevend-uber tagging you here since I know you were also using the events based cache.