Will the circuit breaker trip if enough minor transient errors happen? Should frequency of exceptions or interval between exceptions be a factor? #41
Comments
Agreed, I was thinking that you could specify another timespan that would reset the count if an exception hadn't occurred in that period of time. Thoughts?
Good idea; however, could you have some internal checks on the handled exception to get round it...? You could argue that, to keep it simple, you should implement it yourself in this layer instead of adding the override to the circuit breaker? Tbh, I would like to see the self-implementation-of-circuit-breaker fork merged instead, if that is in line with your design, as that sort of covers this - ie do it yourself in a custom implementation - and also allows us to handle circuit breaks in a central place across several out-of-process services... Though maybe my solution to Tim's request is the same for me... Haven't thought it through fully yet... 😏
I'm not sure, from @TimGebhardt's original question, whether he was assuming there were successes in between the failures across the three-hour period or not. My reading of the codebase was that the breaker state is reset by an intervening success. The question made me think, though, about a frequency-based circuit-breaker (which Michael Nygard also hints at in his book). Something like: "If the policy experiences N exceptions in any period T, break the circuit for a given timespan" (eg "Break the circuit if we receive 3 exceptions in 30 seconds"). I have just firmed up a sketch I put together a few days ago (when I first saw Tim's question) and committed it to https://github.com/reisenberger/Polly/tree/FrequencyBasedCircuitBreaker - is this an avenue worth pursuing at all? It essentially maintains a FIFO queue of the timestamps of the last N exceptions, and uses this quite simply to determine whether N have occurred within the given timespan.
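For concreteness, a minimal sketch of that FIFO-queue idea (names here are hypothetical illustrations; the branch linked above holds the actual sketch):

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch only: keep the timestamps of the last N exceptions; if the
// oldest of those N lies within the configured period, then N exceptions
// have occurred within that period and the circuit should break.
public class FrequencyFailureTracker
{
    private readonly int _exceptionsAllowedBeforeBreaking;
    private readonly TimeSpan _duration;
    private readonly Queue<DateTime> _failureTimestamps = new Queue<DateTime>();

    public FrequencyFailureTracker(int exceptionsAllowedBeforeBreaking, TimeSpan duration)
    {
        _exceptionsAllowedBeforeBreaking = exceptionsAllowedBeforeBreaking;
        _duration = duration;
    }

    // Returns true when the failure just registered is the Nth within the duration.
    public bool RegisterFailureAndCheckShouldBreak(DateTime utcNow)
    {
        _failureTimestamps.Enqueue(utcNow);
        if (_failureTimestamps.Count > _exceptionsAllowedBeforeBreaking)
        {
            _failureTimestamps.Dequeue(); // retain only the most recent N timestamps
        }

        return _failureTimestamps.Count == _exceptionsAllowedBeforeBreaking
            && (utcNow - _failureTimestamps.Peek()) <= _duration;
    }
}
```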
Or, for an alternative, a mathematical averaging approach? (This doesn't maintain a FIFO queue.) (I can see merits and demerits to all of the alternatives ...)
Regardless of the mathematics, I think your approach to the problem is fine, as it matches what you would want a circuit breaker to be in practice. However, to refer back to the other part of my note: all of these circuit breakers are great for the simple one-process approach. IMO, the type of things you are circuit-breaking (external services) are highly likely to be used across processes. Unless I've completely missed it, I don't think Polly handles this scenario. So, should Polly implement this? (I think not, as the backing store to hold this data, and the appropriate trigger logic for the breaker, is probably very system-specific.) But I'm not sure of the best way to go about implementing this outside of Polly and injecting the functionality into its fluent-based framework...
Hi @bunceg. If I understand you rightly, the scenario is that if services A, B and C (say) all use another service M, then if service A breaks its circuit to M, it could be useful for services B and C to know about this and break their calls to service M too? We went through a similar line of thinking, but concluded - is there a need? If service A cannot reach service M for a reason that would also prevent service B, won't service B just discover that of its own accord (and also circuit-break) in due course? Is the overhead of inter-process co-ordination (which might also fail) worth it? (And A could also fail to reach M for a reason which doesn't affect B ...) We did, however, consider that it could be useful to be able to inject state (as you say) into a CircuitBreaker, for manually breaking a circuit. Sam Newman's (highly recommended) book on microservices suggests one might manually break all circuits to a downstream service in order to upgrade that downstream service. I fretted for a while over the fact that we couldn't inject state into a Polly CircuitBreaker to achieve this. For now, however, we have gone with the same pragmatic logic as above - do we need to? If we build our processes robust enough to survive downstream unavailability (whether unexpected or 'expected', eg due to upgrade), then that should be enough. But @michael-wolfenden - could being able to manually break a circuit (inject state) perhaps be a useful addition sometime?
Hi, yes you do understand, and we are following the same principle, in that A, B or C will find out of their own accord eventually. I take on board your comment that the co-ordinator could fail, which is a real concern in our situation too, but IMO that is a decision for the implementing team, based on the services involved and the consumers of those services. However, your approach of injecting state may well solve the problem anyway - something else can check the shared state first and manually break the circuit - and I like that way of doing it. Our implementation uses a master switch to avoid calling the service anyway, but your option sounds a bit cleaner. PS: thanks for the book recommendation, I'll add it to my collection with "Release It!" :)
How about using a rolling time window? Also agree re not having a centralised way of notifying other services about failures. What if the connection from just Service A -> M goes down/degrades, but the rest are OK? Now you're telling services B & C that they can't talk to M even though they can. Dangerous, and lots of code/systems overhead. They'll figure it out.
Hi @reisenberger. Thanks to the team for developing Polly. It is a nice library. In regards to this issue, I concur it would be nice to have a feature that would allow the circuit to break if N exceptions occur over a period of T. One example where I would like a circuit to break is the case where a client is communicating with a server that is overloaded. The server would once in a while successfully handle a request, but most of the time return timeouts (as it will not be able to handle a request within the allotted time). Breaking the circuit for a little while might help the server get itself back up again. Another example is the same as @TimGebhardt talked about, where, during slow periods (maybe nighttime), a single request is being retried and failing every once in a while, which breaks the circuit and can keep the circuit open when good requests begin to filter in. I looked at both implementations you made and think that both would solve the two examples above. The first implementation is easier to read. The second implementation requires a bit more thought to convince oneself that there isn't an outlying case that could result in improper behaviour. Is this a feature that the team wants in Polly?
@kristianhald Thank you for your interest in Polly - and the offer of help! More sophisticated approaches to circuit-breaking are very much on the Polly roadmap (see towards the end of that doc), for the reasons TimGebhardt and you mention, and others. Right now AppvNext are reflecting further on the best forward path for the circuit-breaker: whether to go for the frequency-based and proportion-based circuit-breakers already mooted in the roadmap, and/or something more sophisticated along the lines of Hystrix's time-slicing approach (somewhat similar to the suggestion from @DarrellMozingo above). Community views? (May set out further views/options for feedback later in this thread.)
(and @kristianhald - we may yet take you up on your offer of help!)
Looked through the source for Hystrix regarding their Circuit Breaker implementation. As I read it, they use a bucket, where each item is a slice of the `_durationOfExceptionRelevance`. The precision is then defined by the size of the slice. I think that the 'Frequency-based' solution is the same that Hystrix has implemented (except that they look at percentages instead of the total number within the frequency given). The hubris in me would state that the first implementation that @reisenberger has made, with the queue of exception times, is time slicing where the window is very small and each window contains only a single piece of information: that an error occurred. 😃 I also like that the Hystrix implementation, before triggering the circuit breaker, checks if the number of requests received during the period is above a certain threshold (this is probably necessary, as otherwise a single error will have a large impact on the error percentage if the throughput is low):

```java
// check if we are past the statisticalWindowVolumeThreshold
if (health.getTotalRequests() < properties.circuitBreakerRequestVolumeThreshold().get()) {
    // we are not past the minimum volume threshold for the statisticalWindow
    // so we'll return false immediately and not calculate anything
    return false;
}
```

Personally, the few times I have used a circuit breaker, it has mostly been leaning towards the frequency-based. It is probably a personal preference, because I always like comparing stuff that happens with when it happened.
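Translated to C# terms, that volume-threshold guard might look something like this (a sketch only; the names are illustrative assumptions, not Polly's or Hystrix's API):

```csharp
// Sketch: require a minimum number of requests in the statistical window
// before evaluating the failure percentage, so that a single failure at
// low throughput cannot dominate the ratio and break the circuit.
static bool ShouldBreak(int successes, int failures, int minimumThroughput, double failureThreshold)
{
    int totalRequests = successes + failures;
    if (totalRequests < minimumThroughput)
    {
        return false; // not enough traffic in the window to judge reliably
    }

    return (double)failures / totalRequests >= failureThreshold;
}
```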
@kristianhald Many thanks for your input! The Polly circuit-breaker internals have evolved since last August, and I pushed some new prototypes in the last couple of days to https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 . This includes a new sketch closer to the Hystrix approach. @kristianhald / wider community: very happy for any feedback on these. (New variants are also optimised for performance over the previous ones, and factor out the metrics controlling the circuit more clearly.)
Commentary: An advantage of the ...
... which brings us to...
Commentary: Initially harder to understand. But the approach deals well with both quiet periods and high densities (combining as it does elements of both ratio and frequency measures). The approach may be better suited to higher-throughput systems because of the time-slicing, though the user has control over the duration of each slice. Community views on all these options welcome.
@kristianhald If you are interested in helping/contributing, I can set out what would be needed to bring one or more of the above variants to production in Polly (wiring up the configuration; unit tests), and you can state what you might be interested in taking on? ... The unit tests for any of the above time-based policies will be quite fun 😺, using Polly's abstracted system clock. (AppvNext efforts at the moment are focused on #14 (a major step to unlocking some of the other elements on the roadmap), and I will have extra time challenges from April.) As a first step also, the wider AppvNext team must decide how many of the above variants to bring to market. As ever, a balance is to be struck between power/options and simplicity. Community feedback welcome again: would some of the circuit-breaker variants in the previous post be more useful than others?
Further development effort has been made on the CircuitBreakerSpikeMarch16 branch.
Wow, nice progress!! A bit faster than I expected. Here I come and say that I want to help, and then I am nowhere to be seen. I apologise for that. Enough self-pity. I think the ... Also, I am a bit uncertain whether the rollover is necessary, and how hard it will be to understand. Without it, the worst case is that it will take up to 2x the timeslice duration ... A scenario could be where the ... I know I said it before, but I have more time on my hands now. If there is anything you need fixed before this can be merged, then please throw it to me.
@kristianhald Thanks for your interest and involvement with this feature! If you have time available, any contribution to add the rolling slices/buckets within the timeslice would be very welcome. Nice observation about ... Re:
... I have been thinking about exactly this - about the most appropriate way to configure this kind of circuit-breaker. I will add thoughts either here or in draft documentation (review will be welcome...) in a few days' time. Thanks!
@kristianhald Here are some specific thoughts around adding the rolling slices/buckets within the timeslice.

[1] The number of buckets: I suggest this should not be user-configurable. Regarding not user-configurable: simplicity of configuration is a Polly goal. My feeling is that the extra configurability of the number of internal buckets would not significantly add feature benefit versus the extra comprehension load. Our aim is just to improve breaker fidelity-to-configuration to acceptable tolerance; any sensible value that achieves this is good ... (though open to community feedback if anyone can make a case for configurable buckets ...) (Hystrix probably cares more about configuring this, because they emit stats at bucket frequencies.)

[2] Should we consider some minimum timesliceDuration? Rationale: at some point there will be diminishing returns in the trade-off between more fidelity / a more responsive circuit-breaker, versus higher computation. We could performance-measure to determine exactly that point ... it might even be quite small ... But keeping pragmatic: is there likely to be a need to fine-tune breaker responsiveness (which is what buckets affect) at sub-half-second or sub-quarter-second levels? (And if that fine-tuning is needed, reducing the overall timesliceDuration achieves it.)

(In summary ... usually I do not like to impose decisions from within a library on the user. But for this already-quite-complex feature, I propose starting with some sensible/pragmatic decisions to reduce configuration complexity, and awaiting community feedback (always welcome!) if refinement is needed ...) (EDIT: Configuration assumptions/limits/tolerances would also be documented.)
@reisenberger I can work on placing the health metrics in slices (buckets) to provide a rolling window for the circuit breaker. The other day I made a quick POC based on the code you did, which includes a single test. Would that be a way to go, or were you thinking of going in a different direction?
Any number higher than 1 would indeed improve responsiveness. My initial thought was to have a quick calculation (done only once, at the creation of the circuit breaker) which would give a larger bucket slice duration for larger windows, with a minimum of 0.x seconds per slice (and maybe also a maximum). Something like the following:
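One possible shape for such a calculation (illustrative only; the constants, the log base, and the names are assumptions, since the exact formula was left open above):

```csharp
using System;

internal static class BucketSizing
{
    // Sketch: derive a bucket (slice) duration that grows logarithmically with
    // the window duration, floored at a minimum slice duration. All constants
    // here are illustrative assumptions.
    private static readonly TimeSpan MinimumSliceDuration = TimeSpan.FromMilliseconds(250);

    public static TimeSpan SliceDurationFor(TimeSpan windowDuration)
    {
        double windowSeconds = Math.Max(1.0, windowDuration.TotalSeconds);

        // Longer windows are allowed proportionally coarser slices.
        double sliceSeconds = Math.Log(windowSeconds, 2.0) / 4.0;

        return TimeSpan.FromSeconds(Math.Max(MinimumSliceDuration.TotalSeconds, sliceSeconds));
    }
}
```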
The reason for choosing a calculation like the above is to say that if the window duration is long, then the responsiveness of the breaker is allowed to be lower. However, choosing a calculation that provides a good number for any input might be hard?
I agree with the difficulty of using the feature if the developer has to choose the bucket size or the bucket duration. I always like the library providing me with reasonable defaults, as I believe the developers of the library have better knowledge of the area than I do. Also, if the default is based on the window duration provided, then a developer wanting more responsiveness can lower the window duration. I do not believe it is necessary for a 1.0 version of this feature (because the feature goes a long way and the bucket duration can be controlled by the window duration), but it can be hard for a library developer to anticipate all usages of their library (having worked on a few software projects where a core library had to be built upon and extension was only possible at the hooks they provided), and therefore there might be cases where a developer will need to override the default.
I cannot imagine anyone needing a bucket duration of less than a quarter- or half-second (not that someone never would). We are talking about an application that needs to cut off as soon as enough errors are encountered, and cannot wait a quarter- to half-a-second before opening the breaker. In most applications I do not think this is an issue. We also have to remember that we are using DateTime, which only has a resolution of 1-16ms depending on hardware and OS settings. I believe bucket durations that low are an advanced topic. I still believe that in the first version the library should provide reasonable defaults, as this will be the setting mostly used. Also, adding this feature will require some thought about what should and should not be allowed.
Hi @kristianhald. Thanks for all this thorough and detailed thinking; all very useful!
Thanks - will look at this shortly! I made draft documentation (a wiki page) earlier today at https://gist.github.com/reisenberger/d0ed99101634b0388dd7e3b92fbfadac . @kristianhald Re the proposed logarithmic calculation of bucket size from stat window duration: see my comments within the draft doco about the possible downside of this kind of circuit-breaker with long statistical windows. The responsiveness of the circuit-breaker seems in the worst case (do you agree?) essentially proportional to the stat window duration, however finely you divide it into buckets, because of the 'long tail of successes' problem described. Is the logarithmic calculation of bucket size still worthwhile in light of this? I cannot imagine anyone really wants a circuit-breaker that doesn't react for 5 minutes, 10 minutes, 30 minutes, whatever, to a 100% failure problem. Given the relatively narrow range of times for which the timesliceDuration then probably makes sense (a few seconds to a minute or two), I wonder if it is adequate just to adopt a fixed number of buckets (unless that makes buckets too small, as previously discussed). (While I like the elegance of the logarithmic approach, we also have to consider that we would perhaps have to explain it in the doco, and that somebody has to maintain it in future, or seek to understand why it was done this way, etc.) @kristianhald, perhaps this is really the same as you are already saying: choosing a calculation that provides a good number for any input might be hard... 😃 @kristianhald You'll see in the doco that I have varied the terms slightly (eg timesliceDuration becomes samplingDuration).
@kristianhald To state more precisely part of my previous thinking: varying the size/number of buckets can increase the fidelity (fidelity-at-all-times) of the circuit-breaker to its configured thresholds, but cannot increase its responsiveness (overall speed of response) to the theoretical 'long tail of successes' problem stated (which, after all, might not be only a theoretical problem: some systems will behave exactly like this, 100%->0%! 😸). I will continue thinking about this, but would be interested in your reaction to this problem. Finally, regarding my comments about configuration values that are suitable / not suitable: this is at this stage to share thinking and explore the contours of the problem, not yet to suggest disallowing values in code. However, we should indeed consider (later) the configuration limits for each parameter. Again: very many thanks for the productive contributions!
Draft documentation (https://gist.github.com/reisenberger/d0ed99101634b0388dd7e3b92fbfadac) updated to be less prescriptive about which ... However, the relationship between ...
@kristianhald Lots more useful thoughts - thank you! I think we are shaping up what the important practical limits on each parameter are. I will summarise these shortly for review. Re the POC sketch, I see no problems with this (nice refactor). I can see possible points to check later around possible/minor micro-optimisations [to consider if necessary] and naming, but that is best done after all the major conceptual and boundary issues (in discussion now) are resolved, implemented and proved through testing. Thanks for all the great contribution!
@reisenberger Made a new implementation, which is a somewhat cleaned-up version of the POC. I did not try to optimise the implementation; as you said, we can do that once the fundamentals are done. I went from using 'Bucket' to using 'Window' instead, as I felt it was a better name. Do you see any missing test cases that I should add? I was thinking, in regards to the issue with a low sampling duration and windows: maybe, for simplicity, only use rolling windows if the sampling duration is set high enough. It could be documented by stating that if the sampling duration is set to x or higher, then rolling windows are used; otherwise a single window is used for the entire sampling duration. What do you think?
Super; I will review further in the next few days.
Hi @kristianhald. Re:
Yes, let's do this. I suggest herewith some decisions to keep us moving (tho comments certainly welcome if anyone sees other angles). Let us declare internally:

```csharp
internal const int DefaultNumberOfInternalBuckets = 10; // or 'windows' ... align to preferred terminology
// static readonly rather than const, as TimeSpan.FromMilliseconds(...).Ticks is not a compile-time constant
internal static readonly long ResolutionOfCircuitTimer = TimeSpan.FromMilliseconds(20).Ticks;
```

and run some decision code as you suggest, something like:

```csharp
if (timesliceDuration.Ticks < ResolutionOfCircuitTimer * DefaultNumberOfInternalBuckets)
{
    /* don't divide into windows */
}
else
{
    /* do divide into windows */
}
```

Can the operational code be done so that it doesn't look too messy, branching depending on whether it's using further buckets/windows or not?

Rationale: per the DateTime documentation, DateTime only has a resolution of the order of 1-16ms, so buckets much below ~20ms would be meaningless. Similarly, in ... for something like: ...

These approaches could be refined later (for example, between 200ms and 20ms we could adopt an algorithm which provided the maximum number of buckets which kept bucket size over 20ms). But let's start instead from the premise 'keep things as simple as possible until we know they need to be more complicated'. Regarding descending as far as ... Does this sound like a good way to proceed?

Tomorrow hopefully I will add my promised comments/responses on the practical limits on each parameter and how they may interact. Have one more observation to add, but otherwise believe this is fairly thoroughly thought through now. Thanks for the great collaboration! EDIT: Draft documentation updated for these suggestions.
@kristianhald I made a timing test harness also for the circuit breakers here: https://gist.github.com/reisenberger/92dc8d73b4df127b296ed8daee3ed93d The results on my current development machine are here: https://gist.github.com/reisenberger/a6fab34402731333a61600dc0f06d7b0 Later, we can use this for performance tuning, if/as necessary. At this stage, the intent was to determine the impact of using-versus-not-using buckets/windows. [Results based on my original spike of the ... ] If anyone can spot any thinking mistakes in the timing test harness, do shout! Thanks
This is awesome stuff, you two! Sorry I've been unable to engage in the conversation, due to very tight commitments at the moment. I have been watching from the sidelines, but don't want to interject my opinions without thoroughly looking through the code and thinking through the scenarios. However, I will take @reisenberger's tests for a spin and offer up the results, as well as any suggestions that could possibly be helpful. You guys rock!
Addendum: my performance stats test only the synchronous versions of the circuit-breaker. We should also test the async versions, as there'll be the extra overhead of the async machinery ...
@kristianhald Briefly re my previous comment:
(BTW, this was just a general thought - not based on any reading of the code). Thanks!
@kristianhald I now feel relatively clear on the effects of setting different parameters near boundaries, as follows.
Effect of buckets/windows versus not: I believe the effects are much as we have discussed, but - given the very low 200ms boundary, which few users are likely to work near - my instinct is to keep things simple (thinking of the majority of users): state the boundary but not document it elaborately. Most circuits governing downstream network calls will likely be working in timescales (eg seconds) clear away from the boundary. If higher-frequency requirements emerge, we can refine as-and-when. As a general comment: @kristianhald, as you say, we cannot predict all the ways that people will use the circuit-breaker. And the performance characteristics of the underlying actions in question (for example, whether failures arrive more like sssssssssssssssfffffffffffffffffffffffff or sssfsffsffsffsfsfssf) will also be a significant factor in the interaction between configuration and results. Users of a circuit of this sophistication (and with sufficient throughput to merit such a circuit) should expect to engage in some performance tuning in light of the real-life characteristics of their given system, and those characteristics are not something we can predict. However, we can warn users away from configurations which we know are likely to give unhelpful results. This is my summary of the configuration characteristics I see as worth documenting. Is this missing something major, some more complex interaction? EDIT: To take the documentation away from too-abstract discussion, I may add a 'suggested typical starting configuration' for downstream network calls.
I have updated my fork with some additional features. Below I will go through the changes; they all relate to comments you have provided, @reisenberger.
I have created the internal constants and updated the syntax to use the resolution of the circuit as the minimum allowed timesliceDuration (sampleDuration).
I have done roughly the same as you did with the ... There are currently two implementations (RollingHealthMetrics and SingleHealthMetrics). The selection is based on the timesliceDuration (per the decision code above). At the moment the decision on using one or the other strategy happens in the controller but, depending on your view, the choice could easily be made where the circuit controller is created, and then injected. I have a preference, but in this case I think it is better that you make that decision 😃
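A rough sketch of the rolling variant's shape (names and details here are assumptions for illustration; the fork code referenced above is authoritative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only: health counts kept in rolling buckets; buckets older than the
// full sampling window fall out of the aggregated total.
internal sealed class RollingHealthMetricsSketch
{
    private sealed class Bucket
    {
        public DateTime Start;
        public int Successes;
        public int Failures;
    }

    private readonly TimeSpan _bucketDuration;
    private readonly TimeSpan _samplingDuration;
    private readonly Queue<Bucket> _buckets = new Queue<Bucket>();

    public RollingHealthMetricsSketch(TimeSpan samplingDuration, int numberOfBuckets)
    {
        _samplingDuration = samplingDuration;
        _bucketDuration = TimeSpan.FromTicks(samplingDuration.Ticks / numberOfBuckets);
    }

    public void IncrementSuccess(DateTime now) => CurrentBucket(now).Successes++;
    public void IncrementFailure(DateTime now) => CurrentBucket(now).Failures++;

    public (int Successes, int Failures) GetHealthCount(DateTime now)
    {
        Expire(now);
        return (_buckets.Sum(b => b.Successes), _buckets.Sum(b => b.Failures));
    }

    private Bucket CurrentBucket(DateTime now)
    {
        Expire(now);
        if (_buckets.Count == 0 || now - _buckets.Last().Start >= _bucketDuration)
        {
            _buckets.Enqueue(new Bucket { Start = now }); // start a fresh bucket
        }
        return _buckets.Last();
    }

    private void Expire(DateTime now)
    {
        // Drop buckets whose data is older than the full sampling window.
        while (_buckets.Count > 0 && now - _buckets.Peek().Start >= _samplingDuration)
        {
            _buckets.Dequeue();
        }
    }
}
```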
I agree. Should not be hard to add. If we keep the current implementation, then it's just another implementation of the interface, plus a selection of when to use it (probably when the timeslice duration is between 20ms and 200ms). @reisenberger @joelhulen Did a run with the code from the fork I am working on, using my development host. First result is where RollingHealthMetrics is used without inheriting from an interface: ... Second result is still with RollingHealthMetrics, but it inherits an interface and the interface is used in the circuit controller (the strategy-based implementation): ... Using an interface decreases the performance, but only by around 10 ticks per iteration.
I agree with every variable comment.
I looked at the draft documentation again, and the part about configuration recommendations for the sampling duration requires some thought. I think that beginning the documentation with a suggested configuration, which allows the user to quickly be up and running, and then having more detailed information later, is a very good way to go 😸
@kristianhald Re:
Thank you for this extremely productive contribution! Hoping to review, and then we can pull this together for release very shortly (keen to get this feature out this week or next, due to other commitments; just need to find time to review 😄).
Completely agree! Will adjust ...
@kristianhald Merged your work on RollingHealthMetrics to my local fork. Thanks for the great contribution! @kristianhald @joelhulen Remaining to do (on my side) before this is ready to PR against master:
Hopefully in the next day or so ...
@kristianhald To offer some commentary on final decisions taken on previous points you raised:
Left this where you had it. Encapsulates the concern.
See the consts above.
I reviewed and agree with this. Breaking on a success seems counterintuitive. The circuit may have 100% recovered for now (sssssssssss...), in which case breaking would be counterproductive. If the circuit hasn't 100% recovered and is still intermitting (ssfsfsfsfsf), the circuit will receive a failure soon enough within the same period to break on. @kristianhald Please feel free to review the final changes to https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 if you have an interest. I intend to PR this to master later today.
Ahh, did not know that the ticks per millisecond for DateTime and TimeSpan are constant. Nice to know.
Did a quick look through the changes, and I think they are good.
Cool. Out of curiosity, what is the procedure for NuGet-packaging the master branch?
Merging to App-vNext/master (and earlier, creating the PR against it) automatically runs an AppVeyor verification build. We could push packages to NuGet automatically on merging any PR, but opt to push them to NuGet manually. (Manual gives us the ability to occasionally merge PRs (eg readme doco fixes) without pushing a package and bumping the rev number.) @joelhulen owns this process, and can correct me if any of that is wrong / out of date.
@kristianhald PR to master in process. Thanks for your awesome contributions to this feature in code and thought.
@kristianhald Yes, @reisenberger has it right. The AppVeyor build process generates release-level NuGet packages when we merge a PR. We control the version number via a file named ...
A further post to this closed issue, just to document an issue that was discussed between @kristianhald and @reisenberger during merging down onto @reisenberger's fork. This post copies parts of the discussion from there (reisenberger#1) in case that is ever lost. [The issue has no bearing on the operation of the v4.2 circuit-breaker.] @reisenberger wrote:
[this test was added and proves correct operation of the current statistics]
This essentially documents the issue, though there is further implementation discussion at ... Implementation would be as @kristianhald suggested, with: ... An alternative could be: ...
Noting against this closed issue, for future reference, possible optimisations that could be made in the ... Any optimisation decisions should be driven by need, and by running before-and-after performance stats, for example with the test harness here: https://gist.github.com/reisenberger/92dc8d73b4df127b296ed8daee3ed93d

Current performance: ... Performance analysis of the ...

Possible optimisations:

[a] Replacing the use of ...

[b] Avoid the generation (and subsequent garbage-collection once out of scope) of ...

[c] Maintain a running total of successes/failures across the timeslice, rather than re-summing it on each failure (to be checked by performance analysis whether this optimises or not; it's swings and roundabouts: the change would add a ++ to each success/fail, but avoids a larger summation on each failure). A sketch of [c] follows below.

EDIT: As stated, it would have to be measured whether these optimisations add value. However, they are all items which Hystrix optimises away for their high-throughput implementation.
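For illustration of [c], a sketch under assumed names (not the shipped code):

```csharp
// Sketch of optimisation [c]: keep running totals updated incrementally, so
// evaluating the failure ratio needs no re-summing of buckets on each failure.
internal sealed class RunningHealthTotals
{
    private int _successes;
    private int _failures;

    public void IncrementSuccess() => _successes++;
    public void IncrementFailure() => _failures++;

    // Called as a bucket rolls out of the sampling window: subtract its
    // contribution rather than re-summing every live bucket.
    public void OnBucketExpired(int bucketSuccesses, int bucketFailures)
    {
        _successes -= bucketSuccesses;
        _failures -= bucketFailures;
    }

    public int Total => _successes + _failures;
    public double FailureRatio => Total == 0 ? 0.0 : (double)_failures / Total;
}
```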
Hi,
The circuit breaker doesn't handle the scenario where we get a small number of errors over an extended period of time. The circuit breaker state keeps just a simple count, and breaks for a hard-set amount of time once it has seen the set number of exceptions.
Say we're using Polly to wrap a call to an external web service that errors out once an hour. After three hours we'll lock up hard, even though we probably don't want to break the circuit in this scenario.
Does it make sense to have some sort of expiration on exceptions, so that their effect dampens over time?