Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TTL cache retries more frequently on failures #806

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yizzlez
Copy link
Collaborator

@yizzlez yizzlez commented Aug 7, 2024

Summary

This PR adds a new parameter to the TTL cache -- failureTTLMillis. This is a custom TTL for entries with type Failure. Currently failureTTLMillis == ttlMillis, which means this should have no behavior change. At Stripe, we will be changing some of the caches to have a significantly shorter failureTTLMillis.

Why / Goal

At Stripe, we ran into an incident involving this particular piece of caching code for groupByServing info.

Our internal KV store was returning a handful of timeout errors for some requests. The issue is that this Failure is stored in a TTL Cache with a timeout of 2 hours. This caused hosts to enter a bad state and error repeatedly, as it was fetching the previously stored error from the TTLCache instead of retrying against our KVStore.

With this code change, it's possible to configure this TTL cache to have a failureTTLMillis == 5s, in which case we will automatically retry after 5s if we encounter a KV store timeout error.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Checklist

  • Documentation update

Reviewers

* Test with current thread executor
@yizzlez yizzlez force-pushed the yizhao--ttl-failure-cache-changes branch from 8396cae to db2126d Compare August 7, 2024 20:34
refreshIntervalMillis: Long = 8 * 1000 // 8 seconds
refreshIntervalMillis: Long = 8 * 1000, // 8 seconds
// same as ttlMillis, so behavior is unchanged barring an override
failureTTLMillis: Long = 2 * 60 * 60 * 1000 // 2 hours
Copy link
Contributor

@nikhilsimha nikhilsimha Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should lower this to 30 seconds - I am no longer at airbnb, but I think it would benefit airbnb too. we have had incidents in the past, similar to yours where this could have helped.

cc: @pengyu-hou who is familiar with the Airbnb incident.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I feel 30 seconds may be too frequent as default. Some of these TTLCache is used to cache metadata, which doesn't get updated frequently anyway.

But I can see why for things like BatchIr cache, a more frequent failure refresh is desired.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nikhilsimha. Our incident was caused by a stale but valid metadata. To mitigate it, we would have to flush the TTL cache. This should be addressed with @yuli-han 's recent work that we will only fetch active configs.

I am curious what is the failureTTLMillis from Stripe side? @yizzlez

For failure cases, I agree that we should use a lower TTL.

Comment on lines +71 to +76
val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
val shouldUpdate = entry.value match {
// Encountered a failure, update according to failure TTL.
case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
val shouldUpdate = entry.value match {
// Encountered a failure, update according to failure TTL.
case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
}
val effectiveExpiry = entry.map(_ => intervalMillis).getOrElse(Math.min(intervalMillis, failureTTLMillis))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor simplification.

if (
(nowFunc() - entry.updatedAtMillis > intervalMillis) &&
shouldUpdate &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
shouldUpdate &&
(nowFunc() - entry.updatedAtMillis > effectiveExpiry) &&

@@ -0,0 +1,138 @@
package ai.chronon.online.test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for adding this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@nikhilsimha nikhilsimha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the change!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants