TTL cache retries more frequently on failures #806

yizzlez · 2024-08-07T20:30:15Z

Summary

This PR adds a new parameter to the TTL cache: failureTTLMillis, a custom TTL for entries of type Failure. By default failureTTLMillis == ttlMillis, so this change on its own should not alter behavior. At Stripe, we will configure some of the caches with a significantly shorter failureTTLMillis.

Why / Goal

At Stripe, we ran into an incident involving this particular piece of caching code for groupByServing info.

Our internal KV store was returning timeout errors for a handful of requests. Each resulting Failure was stored in a TTLCache with a TTL of 2 hours, so hosts entered a bad state and errored repeatedly: they kept fetching the previously stored error from the TTLCache instead of retrying against our KVStore.

With this code change, it's possible to configure this TTL cache to have a failureTTLMillis == 5s, in which case we will automatically retry after 5s if we encounter a KV store timeout error.
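
As a rough illustration of the behavior described above, here is a minimal, synchronous sketch. `SketchTTLCache`, `Entry`, and `load` are hypothetical names made up for this example and are not the actual TTLCache implementation, which may differ (for example by refreshing entries asynchronously). The sketch only shows the core idea: a cached Failure expires after failureTTLMillis, while a cached success expires after ttlMillis.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical, simplified cache entry; not the class used in TTLCache.scala.
final case class Entry[V](value: Try[V], updatedAtMillis: Long)

// Illustrative cache only: the class name and the `load` parameter are assumptions.
final class SketchTTLCache[K, V](
    load: K => V,
    nowFunc: () => Long,
    ttlMillis: Long,
    failureTTLMillis: Long
) {
  private val entries = scala.collection.mutable.Map.empty[K, Entry[V]]

  // A Failure expires after the shorter of the two TTLs; a Success after ttlMillis.
  private def isExpired(e: Entry[V]): Boolean = {
    val ttl = e.value match {
      case Failure(_) => math.min(ttlMillis, failureTTLMillis)
      case Success(_) => ttlMillis
    }
    nowFunc() - e.updatedAtMillis > ttl
  }

  // Returns the cached result, reloading (and re-caching) when the entry is missing or expired.
  def get(key: K): Try[V] = synchronized {
    entries.get(key) match {
      case Some(e) if !isExpired(e) => e.value
      case _ =>
        val fresh = Entry(Try(load(key)), nowFunc())
        entries.update(key, fresh)
        fresh.value
    }
  }
}
```

With ttlMillis set to 2 hours and failureTTLMillis set to 5000, a timeout cached at time 0 is served until just past the 5 second mark, after which the next `get` calls `load` again.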

Test Plan

Added Unit Tests
Covered by existing CI
Integration tested

Checklist

Documentation update

Reviewers

* Test with current thread executor

nikhilsimha · 2024-08-07T20:57:09Z

online/src/main/scala/ai/chronon/online/TTLCache.scala

-                     refreshIntervalMillis: Long = 8 * 1000 // 8 seconds
+                     refreshIntervalMillis: Long = 8 * 1000, // 8 seconds
+                     // same as ttlMillis, so behavior is unchanged barring an override
+                     failureTTLMillis: Long = 2 * 60 * 60 * 1000 // 2 hours


I think we should lower this to 30 seconds. I am no longer at Airbnb, but I think it would benefit Airbnb too. We have had incidents in the past, similar to yours, where this could have helped.

cc: @pengyu-hou who is familiar with the Airbnb incident.

hzding621 · 2024-08-07

Hmm, I feel 30 seconds may be too frequent as a default. Some of these TTLCaches are used to cache metadata, which doesn't get updated frequently anyway.

But I can see why, for things like the BatchIr cache, a more frequent failure refresh is desired.

pengyu-hou · 2024-08-09

Thanks @nikhilsimha. Our incident was caused by stale but valid metadata. To mitigate it, we would have to flush the TTL cache. This should be addressed by @yuli-han's recent work, with which we will only fetch active configs.

I am curious: what failureTTLMillis does Stripe plan to use? @yizzlez

For failure cases, I agree that we should use a lower TTL.
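
To put numbers on the trade-off discussed in this thread, here is a small back-of-the-envelope sketch. `RetryBudget` and `maxRetriesPerHour` are hypothetical names, and the calculation assumes at most one reload per key each time a cached Failure expires, and that the key is read often enough to trigger every reload; the real cache may retry less often.

```scala
object RetryBudget {
  private val MillisPerHour: Double = 60.0 * 60 * 1000

  // Upper bound on reload attempts per key per hour while the backing store keeps failing,
  // assuming exactly one reload each time a cached Failure expires.
  def maxRetriesPerHour(failureTTLMillis: Long): Double =
    MillisPerHour / failureTTLMillis
}
```

Under these assumptions, the 2 hour default gives at most 0.5 retries per key per hour, 30 seconds gives at most 120, and 5 seconds gives at most 720.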

nikhilsimha · 2024-08-07T21:06:36Z

online/src/main/scala/ai/chronon/online/TTLCache.scala

+      val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
+      val shouldUpdate = entry.value match {
+        // Encountered a failure, update according to failure TTL.
+        case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
+        case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
+      }


Suggested change

-      val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
-      val shouldUpdate = entry.value match {
-        // Encountered a failure, update according to failure TTL.
-        case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
-        case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
-      }
+      val effectiveExpiry = entry.map(_ => intervalMillis).getOrElse(Math.min(intervalMillis, failureTTLMillis))

minor simplification.
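
For readers comparing the two forms, the sketch below checks that they pick the same expiry. It assumes the `map` / `getOrElse` chain is applied to the entry's `Try` value (the suggestion writes `entry.map`, and the diff matches on `entry.value`); `ExpirySketch` and its method names are hypothetical.

```scala
import scala.util.{Failure, Success, Try}

object ExpirySketch {
  // Form used in the diff: pattern match on the cached Try.
  def expiryByMatch(value: Try[Any], intervalMillis: Long, failureTTLMillis: Long): Long =
    value match {
      case Failure(_) => math.min(intervalMillis, failureTTLMillis)
      case _          => intervalMillis
    }

  // Form along the lines of the suggestion, applied to the Try itself (an assumption):
  // a Success maps to intervalMillis, a Failure falls through to the getOrElse default.
  def expiryByMap(value: Try[Any], intervalMillis: Long, failureTTLMillis: Long): Long =
    value.map(_ => intervalMillis).getOrElse(math.min(intervalMillis, failureTTLMillis))
}
```

Both functions return intervalMillis for a Success and the smaller of intervalMillis and failureTTLMillis for a Failure, so the update check reduces to a single comparison against that value.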

nikhilsimha · 2024-08-07T21:07:11Z

online/src/main/scala/ai/chronon/online/TTLCache.scala

      if (
-        (nowFunc() - entry.updatedAtMillis > intervalMillis) &&
+        shouldUpdate &&


Suggested change

-        shouldUpdate &&
+        (nowFunc() - entry.updatedAtMillis > effectiveExpiry) &&

nikhilsimha · 2024-08-07T21:07:32Z

online/src/test/scala/ai/chronon/online/test/TTLCacheTest.scala

@@ -0,0 +1,138 @@
+package ai.chronon.online.test


Thanks a lot for adding this!
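
For context on how a test like this can stay deterministic, here is a minimal sketch of two helpers: an executor that runs tasks on the calling thread and a manually advanced clock to pass as a `nowFunc`. `CurrentThread` and `FakeClock` are hypothetical names for this example and are not taken from TTLCacheTest.scala.

```scala
import java.util.concurrent.Executor
import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.{ExecutionContext, Future}

object CurrentThread {
  // Runs every submitted task inline on the calling thread, so work has finished
  // by the time execute() returns and tests need no sleeps or awaits.
  val executor: Executor = (r: Runnable) => r.run()
  val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
}

// Manually advanced clock; `nowFunc` has the () => Long shape used for time lookups.
final class FakeClock(start: Long = 0L) {
  private val now = new AtomicLong(start)
  def advance(millis: Long): Unit = { now.addAndGet(millis); () }
  val nowFunc: () => Long = () => now.get()
}
```

A test can then advance the clock past a TTL and assert on the result immediately, since any refresh scheduled on `CurrentThread.ec` runs before control returns to the test.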

nikhilsimha

thanks for the change!

yizzlez force-pushed the yizhao--ttl-failure-cache-changes branch from 8396cae to db2126d, August 7, 2024 20:34

nikhilsimha reviewed Aug 7, 2024

nikhilsimha approved these changes Aug 7, 2024
