
Conversation

@DrClick commented Sep 12, 2017

This addresses the issue of a waiting state and cleans up the tests. Sorry for the last PR; IntelliJ moved code around automagically for me.

@diego-alvarez-hs (Contributor):

Hi @DrClick, sorry for the delay. I'm a little confused regarding this PR and #12.

Is this PR a new version of the previous one? If yes, wouldn't it be better to rebase these changes onto the previous PR so we can continue the conversation there?

@diego-alvarez-hs (Contributor):

@DrClick are the tests passing on your local machine?

I'm getting this consistent failure:

[info] - should be tripped when custom failures are detected *** FAILED *** (194 milliseconds)
[info]   A timeout occurred waiting for a future to complete. Queried 10 times, sleeping 10 milliseconds between each query. (CircuitBreakerTest.scala:666)
[info]   org.scalatest.concurrent.Futures$FutureConcept$$anon$1:

Thread.sleep(retryDelay)
}.onComplete(_ => {
cb.attemptResetBrokenState(this, 0)
})
Contributor:

This looks suspicious. What were you trying to do here, @DrClick? Blocking the thread and using the global implicit execution context are not good ideas (they might be acceptable just for testing purposes).


Contributor:

Oh, I thought you were trying to make the CB auto-reopen itself, but it seems this is just for the first attempt. Why is it needed?

Author:

Do you have time for a chat on this? Is the objection just the use of the global execution context (easily addressable), or is it sleeping the thread the Future is on (effectively a timer)?

val failLimit: Int,
val retryDelay: FiniteDuration,
val isExponentialBackoff: Boolean = false,
val exponentialRetryCap: Option[Int],
Contributor:

We can have just one variable, exponentialRetryCap: when it is defined we know the caller wants exponential backoff, which makes isExponentialBackoff unnecessary.
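As an illustration only (the class name and default value are assumptions; only failLimit, retryDelay, and exponentialRetryCap appear in the diff above), the collapsed signature could look like:

import scala.concurrent.duration.FiniteDuration

// Sketch: exponentialRetryCap doubles as the on/off switch, so the separate
// isExponentialBackoff flag is no longer needed.
class CircuitBreaker(
  val failLimit: Int,
  val retryDelay: FiniteDuration,
  val exponentialRetryCap: Option[Int] = None // Some(cap) => exponential backoff enabled
)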

Author:

OK, I can defer to your call on this; I found it a bit cryptic. I actually think it might make more sense to have an exponential circuit breaker class of its own, to prevent this argument sprawl.
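A rough sketch of that alternative shape (hypothetical names, building on the constructor sketched above):

// Sketch: the exponential settings live on a dedicated subclass instead of
// optional arguments on the base circuit breaker, avoiding argument sprawl.
class ExponentialCircuitBreaker(
  failLimit: Int,
  baseRetryDelay: FiniteDuration,
  retryCap: Int
) extends CircuitBreaker(failLimit, baseRetryDelay, Some(retryCap))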

val failLimit: Int,
val retryDelay: FiniteDuration,
val isExponentialBackoff: Boolean = false,
val exponentialRetryCap: Option[Int],
Contributor:

It'd be good to have some strategy involved here that we can pass as a parameter: one strategy could involve max attempts, another could be max delay, Fibonacci delay, etc., and another strategy could be a combination of different strategies. But we can improve this in following PRs.
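As a rough illustration of the strategy idea (none of these names exist in this repo; this only sketches the shape such a parameter could take):

import scala.concurrent.duration._

// Sketch: the backoff/stop policy becomes a value the caller passes in,
// and strategies can be combined by wrapping one another.
trait RetryStrategy {
  def nextDelay(attempt: Int): Option[FiniteDuration] // None means stop retrying
}

case class ExponentialBackoff(base: FiniteDuration, maxAttempts: Int) extends RetryStrategy {
  def nextDelay(attempt: Int): Option[FiniteDuration] =
    if (attempt >= maxAttempts) None else Some(base * (1L << attempt))
}

case class MaxDelayCap(underlying: RetryStrategy, cap: FiniteDuration) extends RetryStrategy {
  def nextDelay(attempt: Int): Option[FiniteDuration] =
    underlying.nextDelay(attempt).map(d => if (d > cap) cap else d)
}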

@diego-alvarez-hs (Contributor):

It seems the version in #12 is simpler with regard to adding an exponential retry backoff.

@DrClick (Author) commented Sep 15, 2017

Regarding the two PRs, this was meant to be an extension of the first one, adding the cap. I can merge them together. Let me explain what we are trying to do, so the context makes more sense: 1) exponential backoff; 2) after a certain number of retries, we want to shut down the Docker container and allow the Mesosphere infrastructure to do its thing.

It was my understanding of the code base that the broken state was really two states in one, broken/immediately-fail and broken/retry, so I am splitting it into two states. The idea with the timer was to actively transition the state from broken/fail to broken/retry, or more cleanly (flow, broken, retry).
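(A rough sketch of that split, with hypothetical names that are not from the PR:)

// Sketch: the single broken state is split so the transitions become
// flow -> broken (fail fast) -> retry (allow a probe call).
sealed trait CircuitBreakerState
case object FlowState extends CircuitBreakerState                   // calls pass through normally
case class BrokenState(brokenAt: Long) extends CircuitBreakerState  // fail immediately, no calls allowed
case class RetryState(attempt: Int) extends CircuitBreakerState     // allow a single probe call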

Re the tests failing, let me double check; I recall they were passing, but maybe something slipped through.

I don't think including Akka to do this timer function is the right idea. If you don't think the timer is a good idea, I can be convinced. I may be missing a basic understanding of the thread-blocking issue, and if so, I look forward to understanding it better. I thought it would only block the Future thread that it is on. (Is that a problem?)


@diego-alvarez-hs (Contributor):

Thanks for the additional context; it helped us understand your case better.

Let's try to find an alternative to the scheduled task for the first attempt of cb.attemptResetBrokenState(this, 0), so we can avoid adding more dependencies like Akka or Twitter futures. Since it's used just once, let's see what the alternatives are. The version in #12 didn't need this.
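One dependency-free possibility, purely as a sketch (ResetScheduler and scheduleOnce are made-up names; only retryDelay and cb.attemptResetBrokenState come from the PR), would be a single JDK scheduler thread instead of Thread.sleep on the global execution context:

import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import scala.concurrent.duration.FiniteDuration

// Sketch: one daemon scheduler thread replaces Future { Thread.sleep(retryDelay) }
// without adding Akka or Twitter futures.
object ResetScheduler {
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "cb-reset-scheduler")
      t.setDaemon(true)
      t
    }
  })

  // Run `task` once after `delay` without blocking any Future thread.
  def scheduleOnce(delay: FiniteDuration)(task: => Unit): Unit = {
    scheduler.schedule(new Runnable { def run(): Unit = task }, delay.toMillis, TimeUnit.MILLISECONDS)
    ()
  }
}

// Possible call site (identifiers taken from the snippet under review):
// ResetScheduler.scheduleOnce(retryDelay) { cb.attemptResetBrokenState(this, 0) }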

Additionally, please let me check with the team about splitting broken and retry into two states; it sounds like a good idea. I just want to get some feedback from the people who initially created the code or might have more context about the original design.

It looks like your scenario might have some similarities to something we're doing internally (CB + retry backoff): if the CB is short-circuiting calls, it's because something bad is happening and we haven't been able to recover from it, so the health checks will start alerting and the infrastructure can act according to the defined rules (restart instances, add/remove, etc.).

The main difference is that we've separated the CB from the retry-backoff utility: we have a Retrier utility that just keeps retrying an operation given some strategy, and on top of that we wrap the code with the CB. This gives us for free that the retry logic is applied to the single operation the CB lets pass, to check whether it can reset/recover from the bad state.

This might be one of the reasons why we haven't added the retry feature to the CB.

Our code looks like:

sealed trait RetrierWithCircuitBreaker {

  def retrier: Retrier
  def circuitBreaker: CircuitBreaker

  def apply[T: ClassTag](
    operationIdentifier: String
  )(operation: => Future[T])(implicit ec: ExecutionContext, scheduler: Scheduler): Future[T] =
    circuitBreaker.async() { // here we combine the CB + the retryBackoff
      retrier(operationIdentifier) { operation }
    }
...
}

This composition allows us to use it like this:

retrierWithCircuitBreaker: RetrierWithCircuitBreaker // injected into the class: a preconfigured CB + retry backoff

  override def getSomeData(
    profileId: Long
  )(implicit ec: ExecutionContext): Future[OurAdtErrorOr[SomeData]] = {
    val operationName = s"xxx.yyyy.getSomeData: $profileId"
    retrierWithCircuitBreaker(operationName) {
      unsafeTaskThatCanThrowOrReturnNonExpectedResults(profileId)
    }.recover {
      // represent failure as a value of the expected ADT
      case e: Throwable => Left(OurAdtErrorOr.UnexpectedException(e))
    }

  }

The Retrier and the CB are configured with something like:

service {
  xxxx {
    retry-backoff {
      backoff-strategy {
        type = "fibonaccibackoffstrategy"
        delay = 50 milliseconds
        max-delay = 5 seconds
      }
      stop-strategy {
        type = "maxretriesstrategy"
        retries = 12
        max-accumulated-delay = 2 minutes
      }
    }
    circuit-breaker {
      name = "cb-example"
      fail-limit = 10
      retry-delay = 500 milliseconds
    }
  }
}

There is a tradeoff: with this approach the CB might not detect permanent failures as quickly as before, because once the CB lets an operation pass, that operation will keep retrying until its attempts are exhausted. But it allows us to recover more quickly from transient states.

I'm not sure if this last bit of information helps you design your case, but it's good to share our experiences and think out loud.

@diego-alvarez-hs (Contributor):

Hi @DrClick,
It seems the newly added state AttemptResetState might be unnecessary, as it does the same thing as the "old" BrokenState; in this PR the BrokenState just moves directly to AttemptResetState.

What do you think about closing this PR and moving the effort to #12?

#12 just needs to add some kind of limit (time, retries, or both) and integrate the new arguments into an options argument (a case class?).
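For example, the options argument could be something as small as this (hypothetical names, just to illustrate the shape):

import scala.concurrent.duration.FiniteDuration

// Sketch of the suggested options case class: either limit can be left unset,
// and both can be combined.
case class RetryLimits(
  maxRetries: Option[Int] = None,
  maxAccumulatedDelay: Option[FiniteDuration] = None
)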

@CLAassistant commented Aug 25, 2020

CLA assistant check
All committers have signed the CLA.
