WIP new circuit breaker #74315
src/sentry/utils/circuit_breaker2.py
class CircuitBreakerState(Enum):
    CLOSED = "circuit_closed"
    BROKEN = "circuit_broken"
    RECOVERY = "recovery"
Naming nitpick:
I'd go with either the standard circuit breaker nomenclature (closed, open, half-open) or the color nomenclature (green, red, yellow).
I prefer the latter, because the standard circuit breaker nomenclature is kinda not great (the connection can be open when the circuit breaker is closed). Stoplight nomenclature is easily understood.
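For illustration, the stoplight naming could look like the sketch below. The member names and string values here are hypothetical, not taken from the PR; the comments map each color to the standard term and to the state name in the snippet under review.

```python
from enum import Enum


class CircuitBreakerState(Enum):
    # Hypothetical names and values, chosen only to illustrate the suggestion.
    GREEN = "green"    # standard: closed    (CLOSED in the PR, requests flow)
    YELLOW = "yellow"  # standard: half-open (RECOVERY in the PR, limited traffic)
    RED = "red"        # standard: open      (BROKEN in the PR, requests blocked)
```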
src/sentry/utils/circuit_breaker2.py
if cache.has_key(self.broken_state_key):
    return (CircuitBreakerState.BROKEN, cache.get(self.broken_state_key) - now)

if cache.has_key(self.recovery_state_key):
    return (CircuitBreakerState.RECOVERY, cache.get(self.recovery_state_key) - now)
These cache.get() calls need a very strict limit on how long they can take. I'd go with something like 20ms; for comparison, our average ping to Redis is 400μs and the max is 6ms.
AFAIK it's not possible to set connection and command timeouts using just the abstract Django cache; you'd need to use cache._cache, which is the Redis backend. Or you could just use Redis directly, since you need a Redis client for the SlidingWindow anyway.
If there's any failure (failed connection, timeout, etc.), it should fall back to returning CLOSED, as you don't want to make the CircuitBreaker a single point of failure (SPoF).
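A minimal sketch of the fail-open behavior described above, under some assumptions: `get_state_fail_open` is a hypothetical helper, `client` is any object with a Redis-like `get(key)`, the stored value is an expiry timestamp (as the `- now` in the snippet implies), and `None` is used as the remaining time for CLOSED. The timeouts themselves are assumed to be configured on the client when it is created (with redis-py that would typically be the `socket_connect_timeout` and `socket_timeout` constructor arguments, which is worth verifying against the client version in use).

```python
import time
from enum import Enum


class CircuitBreakerState(Enum):
    CLOSED = "circuit_closed"
    BROKEN = "circuit_broken"
    RECOVERY = "recovery"


def get_state_fail_open(client, broken_key, recovery_key, now=None):
    """Return (state, seconds_remaining), treating any cache failure as CLOSED.

    `client` is assumed to expose a Redis-like get(key) that returns the
    stored expiry timestamp or None, and to enforce its own short timeouts.
    """
    now = time.time() if now is None else now
    try:
        # One read per key instead of has_key() followed by get(), so a key
        # that expires between the two calls cannot yield `None - now`.
        for state, key in (
            (CircuitBreakerState.BROKEN, broken_key),
            (CircuitBreakerState.RECOVERY, recovery_key),
        ):
            expiry = client.get(key)
            if expiry is not None:
                return (state, float(expiry) - now)
    except Exception:
        # Timeout, connection error, or unparsable value: fail open so the
        # breaker itself never becomes the single point of failure.
        pass
    return (CircuitBreakerState.CLOSED, None)
```

Catching a broad `Exception` is deliberate here: the state lookup is advisory, so any problem reading it is treated the same as "no breaker state set".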
src/sentry/utils/circuit_breaker2.py
try:
    if breaker.should_allow_request():
        response = call_chase_simulation_service("/hall-of-fame", payload)
    else:
        logger.warning("Request blocked by circuit breaker!")
        return None
except TimeoutError:
    breaker.record_error()
    return None

if response.status == 500:
    breaker.record_error()
    return None
try:
    if breaker.should_allow_request():
        response = call_chase_simulation_service("/hall-of-fame", payload)
    else:
        logger.warning("Request blocked by circuit breaker!")
        return None
except (InvalidInputDataError, PEBKACError):
    # explicitly list errors that are ignored by the breaker
    # for example errors caused by user data not validating
    return None
except Exception:
    # any exception not listed above should be counted by the breaker
    breaker.record_error()
    return None

if response.status == 500:
    breaker.record_error()
    return None
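To try this pattern in isolation, here is a self-contained sketch that wraps it in a helper. `StubBreaker`, `call_with_breaker`, and the exception class defined here are hypothetical stand-ins; only the `should_allow_request` and `record_error` method names come from the PR, and the stub's blocking rule is a simplification, not the PR's logic.

```python
class InvalidInputDataError(Exception):
    """Hypothetical caller-side error that should not count against the service."""


class StubBreaker:
    """Stand-in breaker: counts errors and blocks once a limit is reached."""

    def __init__(self, error_limit):
        self.error_limit = error_limit
        self.errors = 0

    def should_allow_request(self):
        return self.errors < self.error_limit

    def record_error(self):
        self.errors += 1


def call_with_breaker(breaker, func, ignored=(InvalidInputDataError,)):
    """Run func() unless blocked; count every failure not listed in `ignored`."""
    if not breaker.should_allow_request():
        return None
    try:
        response = func()
    except ignored:
        # Caller-side problems: the service is healthy, so do not count them.
        return None
    except Exception:
        # Anything not explicitly ignored is treated as a service failure.
        breaker.record_error()
        return None
    if getattr(response, "status", None) == 500:
        breaker.record_error()
        return None
    return response
```

Listing the ignored exceptions explicitly and counting everything else means a new, unanticipated failure mode trips the breaker by default instead of slipping past it.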
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #74315    +/-   ##
========================================
  Coverage   78.13%   78.14%
========================================
  Files        6670     6671      +1
  Lines      298455   298578    +123
  Branches    51352    51374     +22
========================================
+ Hits       233208   233312    +104
- Misses      58990    59001     +11
- Partials     6257     6265      +8
Notes:
- I think the config is too complicated. Am working on setting more of the values automatically. (I realized the defaults for recovery error limit and window were wrong/didn't do what I'd want them to do, and if I can mess that up after thinking about this stuff for a week, it's too confusing.)
- should_allow_request and record_error pass a sanity check.
- There's obv still a bunch of cleanup I have to do.