Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP new circuit breaker #74315

Closed
wants to merge 16 commits into from

Conversation

lobsterkatie
Copy link
Member

@lobsterkatie lobsterkatie commented Jul 15, 2024

Notes:

  • I think the config is too complicated. Am working on setting more of the values automatically. (I realized the defaults for recovery error limit and window were wrong/didn't do what I'd want them to do, and if I can mess that up after thinking about this stuff for a week, it's too confusing.)
  • I included the tests in case you're curious but really I mostly just want to make sure the two main methods should_allow_request and record_error pass a sanity check.
  • There's obv still a bunch of cleanup I have to do.
  • Ugh, overloads.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jul 15, 2024
Comment on lines 19 to 33
class CircuitBreakerState(Enum):
CLOSED = "circuit_closed"
BROKEN = "circuit_broken"
RECOVERY = "recovery"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nitpick:

I'd either go with the standard circuit breaker nomenclature (closed, open, half-open), or the color nomenclature (green, red, yellow).

I prefer the latter, because the standard circuit breaker nomenclature is kinda not great (connection can be open, when the circuit breaker is closed). Stop light nomenclature is easily understood.

Comment on lines 319 to 339
if cache.has_key(self.broken_state_key):
return (CircuitBreakerState.BROKEN, cache.get(self.broken_state_key) - now)

if cache.has_key(self.recovery_state_key):
return (CircuitBreakerState.RECOVERY, cache.get(self.recovery_state_key) - now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These cache.get() calls need to have very strict limit on how long they can take (I'd go with something like 20ms, for comparison our average ping to Redis is 400μs, max is 6ms).

AFAIK that's not possible to add connection and command timeout with using just abstract Django cache, and you'd need to use cache._cache which is the Redis backend. Or you can as well just use Redis directly, you need Redis client for the SlidingWindow anyways.

If there's any failure to connect, timeout, etc. it should fall back to returning the CLOSED, as you don't want to make CircuitBreaker a SPoF.

Comment on lines 95 to 103
try:
if breaker.should_allow_request():
response = call_chase_simulation_service("/hall-of-fame", payload)
else:
logger.warning("Request blocked by circuit breaker!")
return None
except TimeoutError:
breaker.record_error()
return None

if response.status == 500:
breaker.record_error()
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
if breaker.should_allow_request():
response = call_chase_simulation_service("/hall-of-fame", payload)
else:
logger.warning("Request blocked by circuit breaker!")
return None
except TimeoutError:
breaker.record_error()
return None
if response.status == 500:
breaker.record_error()
return None
try:
if breaker.should_allow_request():
response = call_chase_simulation_service("/hall-of-fame", payload)
else:
logger.warning("Request blocked by circuit breaker!")
return None
except (InvalidInputDataError, PEBKACError):
# explicitly list errors that are ignored by the breaker
# for example errors caused by user data not validating
return None
except Exception:
# any exception not listed above should be counted by the breaker
breaker.record_error()
return None
if response.status == 500:
breaker.record_error()
return None

Copy link

codecov bot commented Jul 16, 2024

Codecov Report

Attention: Patch coverage is 93.13725% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.14%. Comparing base (7149891) to head (51b6350).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #74315    +/-   ##
========================================
  Coverage   78.13%   78.14%            
========================================
  Files        6670     6671     +1     
  Lines      298455   298578   +123     
  Branches    51352    51374    +22     
========================================
+ Hits       233208   233312   +104     
- Misses      58990    59001    +11     
- Partials     6257     6265     +8     
Files Coverage Δ
src/sentry/utils/circuit_breaker2.py 93.13% <93.13%> (ø)

... and 13 files with indirect coverage changes

@lobsterkatie
Copy link
Member Author

Broke this up into 3 PRs:
#74557
#74559
#74560

Closing this in favor of those.

@lobsterkatie lobsterkatie deleted the TEMP-7-15-circuit-breaker-rough-draft branch July 23, 2024 18:04
@github-actions github-actions bot locked and limited conversation to collaborators Aug 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants