
RPC client circuit breaking logic improvements #5854

Open
friofry opened this issue Sep 19, 2024 · 4 comments

@friofry
Contributor

friofry commented Sep 19, 2024

tl;dr

Problems with the current implementation:

  • It's not clear which providers the client will use and in which order, and in fact some of them might be ignored even if the token is specified in the config.
  • RPS limiter logic could cause unnecessary throttling without actually preventing us from hitting the RPC limit.
  • Circuit breakers may not work as expected in some cases.
  • Clients do not have enough information to properly display provider errors.

Table of contents

Providers configuration

We transform WalletSecretsConfig (set on the client side) into params.Network entries via setRPCs() in default_networks.go:

func setRPCs(networks []params.Network, request *requests.WalletSecretsConfig) []params.Network {
	var networksWithRPC []params.Network
	for _, n := range networks {
		if request.InfuraToken != "" {
			if strings.Contains(n.RPCURL, "infura") {
				n.RPCURL += request.InfuraToken
			}
			if strings.Contains(n.FallbackURL, "infura") {
				n.FallbackURL += request.InfuraToken
			}
		}
		// ... the same logic for grove, ganache

Here are some of the issues:

  • We have more than 10 API tokens configurable in WalletSecretsConfig, and only 3 of them are actually respected by the setRPCs function. It would be great to have the other providers available for RPC calls as well, for example Alchemy.
  • setRPCs relies on substring matching of URLs, which is unreliable.
  • This logic doesn't allow different tokens for different chains.

Suggestion

  1. Provider URLs should be moved out of Network into a separate ProviderConfig, with a one-to-many relationship (Network → ProviderConfig).
  2. Allow different API tokens for different chain IDs.
  3. Each provider should have an explicit priority (if priorities are equal, pick one at random).
  4. We should use all available providers in the circuit breaker when making an RPC call.

This will make the logic more predictable and, as a bonus, make it easier to implement a user-facing settings UI for custom RPC providers (e.g. if our default providers are blocked by DNS in the user's country).
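
A rough sketch of what such a per-chain provider configuration could look like (all type and field names below are hypothetical, not existing status-go types):

// ProviderConfig describes a single RPC provider for one chain.
type ProviderConfig struct {
	Name      string `json:"name"`      // e.g. "proxy", "infura", "grove", "custom"
	URL       string `json:"url"`       // base RPC URL
	AuthToken string `json:"authToken"` // per-chain API token, may be empty
	Priority  int    `json:"priority"`  // lower value = tried earlier; equal priorities are shuffled
	Enabled   bool   `json:"enabled"`
}

// A network would then reference any number of providers instead of hard-coded URL fields.
type Network struct {
	ChainID   uint64           `json:"chainId"`
	Providers []ProviderConfig `json:"providers"`
}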

Network type

Network has 6 RPC URLs:

	DefaultRPCURL          string          `json:"defaultRpcUrl"`      // proxy rpc url
	DefaultFallbackURL     string          `json:"defaultFallbackURL"` // proxy fallback url
	RPCURL                 string          `json:"rpcUrl"`
	OriginalRPCURL         string          `json:"originalRpcUrl"`
	FallbackURL            string          `json:"fallbackURL"`
	OriginalFallbackURL    string          `json:"originalFallbackURL"`

Currently we use 4 of them, performing RPC attempts in the following order via ethclient.Client, wrapped in a circuit_breaker command with up to 4 functors:

  • proxy (DefaultRPCURL)
  • proxy-fallback (DefaultFallbackURL)
  • main (RPCURL)
  • fallback (FallbackURL)

Here are some issues:

  • It's not clear why we need OriginalRPCURL, OriginalFallbackURL in this datatype.
  • Provider URL and auth params are stored in separate types.
  • Provider priority is implicit and not flexible.

Suggestion:

  • As mentioned above, it should be possible to configure any number of providers for a particular chain
  • The provider config should have URLs, auth params, and a priority (see the sketch below)
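
Continuing the hypothetical ProviderConfig sketch above, the order in which providers are tried could then be made explicit (a sketch, not existing code):

import (
	"math/rand"
	"sort"
)

// orderedProviders returns the enabled providers for a chain sorted by priority;
// shuffling first means providers with equal priority end up in random order.
func orderedProviders(n Network) []ProviderConfig {
	providers := make([]ProviderConfig, 0, len(n.Providers))
	for _, p := range n.Providers {
		if p.Enabled {
			providers = append(providers, p)
		}
	}
	rand.Shuffle(len(providers), func(i, j int) {
		providers[i], providers[j] = providers[j], providers[i]
	})
	sort.SliceStable(providers, func(i, j int) bool {
		return providers[i].Priority < providers[j].Priority
	})
	return providers
}

Every provider returned here would get its own functor in the circuit breaker command, instead of only the 4 hard-coded ones.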

networks.go

  • There are a few read-only methods (Find, GetAll) in networks.go that have non-obvious side effects, like calling setDefaultRPCURL:
func (nm *Manager) Find(chainID uint64) *params.Network {
	networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
	if len(networks) != 1 || err != nil {
		return nil
	}
	setDefaultRPCURL(networks, nm.configuredNetworks)
	return networks[0]
}

Suggestion: remove setDefaultRPCURL from these methods
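
A minimal sketch of that suggestion, with the side effect moved into an explicitly named method (FindWithDefaultRPC is a hypothetical name):

// Find only queries the database and no longer mutates the result.
func (nm *Manager) Find(chainID uint64) *params.Network {
	networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
	if err != nil || len(networks) != 1 {
		return nil
	}
	return networks[0]
}

// Callers that actually need the proxy URL ask for it explicitly.
func (nm *Manager) FindWithDefaultRPC(chainID uint64) *params.Network {
	network := nm.Find(chainID)
	if network == nil {
		return nil
	}
	setDefaultRPCURL([]*params.Network{network}, nm.configuredNetworks)
	return network
}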

RPC client

Currently we use both an RPS limiter in chain/client.go, to make sure we are not hitting provider limits:

func (c *ClientWithFallback) makeCall(ctx context.Context, ethClients []*EthClient, f func(client *EthClient) (interface{}, error)) (interface{}, error) {
	if c.commonLimiter != nil {
		if allow, err := c.commonLimiter.Allow(c.tag); !allow {
			return nil, fmt.Errorf("tag=%s, %w", c.tag, err)
		}

		if allow, err := c.commonLimiter.Allow(c.groupTag); !allow {
			return nil, fmt.Errorf("groupTag=%s, %w", c.groupTag, err)
		}
	}

	c.LastCheckedAt = time.Now().Unix()

	cmd := circuitbreaker.NewCommand(ctx, nil)
	for _, provider := range ethClients {
		provider := provider
		cmd.Add(circuitbreaker.NewFunctor(func() ([]interface{}, error) {
			err := provider.limiter.WaitForRequestsAvailability(1)
			...

and a circuit breaker (circuit_breaker.go), to try multiple providers in turn:

func (cb *CircuitBreaker) Execute(cmd *Command) CommandResult {
	if cmd == nil || cmd.IsEmpty() {
		return CommandResult{err: fmt.Errorf("command is nil or empty")}
	}

	var result CommandResult
	ctx := cmd.ctx
	if ctx == nil {
		ctx = context.Background()
	}

	for i, f := range cmd.functors {
		if cmd.cancel {
			break
		}

		var err error
		// if last command, execute without circuit
		if i == len(cmd.functors)-1 {
			res, execErr := f.exec()
			err = execErr
			if err == nil {
				result = CommandResult{res: res}
			}
		} else {
			circuitName := f.circuitName
			if cb.circuitNameHandler != nil {
				circuitName = cb.circuitNameHandler(circuitName)
			}

			if hystrix.GetCircuitSettings()[circuitName] == nil {
				hystrix.ConfigureCommand(circuitName, hystrix.CommandConfig{
					Timeout:                cb.config.Timeout,
					MaxConcurrentRequests:  cb.config.MaxConcurrentRequests,
					RequestVolumeThreshold: cb.config.RequestVolumeThreshold,
					SleepWindow:            cb.config.SleepWindow,
					ErrorPercentThreshold:  cb.config.ErrorPercentThreshold,
				})
			}

			err = hystrix.DoC(ctx, circuitName, func(ctx context.Context) error {

So first we pass a functor to hystrix.DoC(), which checks provider.limiter.WaitForRequestsAvailability(1) and may return ErrRequestsOverLimit without even calling the RPC provider; this still counts towards the circuit breaker's error percentage.

hystrix will record the response status of each service call; after the error rate reaches the threshold, the breaker opens, and the fallback logic executes until the breaker status changes back to closed.

Here is the config we're using:

	cbConfig := circuitbreaker.Config{
		Timeout:               20000, // is how long to wait for command to complete, in milliseconds
		MaxConcurrentRequests: 100, // is how many commands of the same type can run at the same time
		SleepWindow:           300000, //  is how long, in milliseconds, to wait after a circuit opens before testing for recovery
		ErrorPercentThreshold: 25, // causes circuits to open once the rolling measure of errors exceeds this percent of requests
	}

Here are some things that can be improved:

  • Complex logic: using both the RPS limiter (provider.limiter.ReduceLimit()) and the circuit-breaker opening logic at the same time is confusing. The circuit breaker can open simply because the RPS limiter returned ErrRequestsOverLimit.
  • RPS limiter:
    • We have a common limit in ClientWithFallback for all providers.
    • There is also a per-provider provider.limiter, which does not protect us from hitting the provider's RPS limit, since we know neither the number of clients nor the remaining RPS quota on the service side.
  • Circuit breaker:
    • Currently we don't run the RPC call through hystrix.DoC if it is the last resort. It may make sense to let hystrix collect statistics anyway. Here is why the current behaviour is unpredictable: if circuitNameHandler is used to override the circuit name, the limits may change; and if the list of eth clients is updated, it is useful to keep the previous statistics for the provider.
    • It looks like the circuit breaker doesn't detect connection problems and still imposes a 5-minute sleep window (needs double-checking). We work around this with a hack in GethStatusBackend::ConnectionChange that resets it explicitly via hystrix.Flush().

Suggestions:

  • Remove the RPS limiter and simplify the logic.
  • It would be more reliable to introduce RPS limiting on the status proxy side.
  • Remove the hystrix.Flush() hack.
  • The last-resort provider should first try to run the command through hystrix, and only if that fails because the circuit is open, run it directly. This way the statistics are still collected correctly (see the sketch below).
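
A hedged sketch of the last point, assuming the afex/hystrix-go API already used by circuit_breaker.go (whether the open circuit surfaces exactly as hystrix.ErrCircuitOpen here is an assumption worth verifying):

import (
	"context"
	"errors"

	"github.com/afex/hystrix-go/hystrix"
)

// runLastResort runs fn through hystrix so its statistics are still recorded
// for the circuit; only if the circuit is already open does it call fn directly,
// because there is no other provider left to fall back to.
func runLastResort(ctx context.Context, circuitName string, fn func() error) error {
	err := hystrix.DoC(ctx, circuitName, func(ctx context.Context) error {
		return fn()
	}, nil)
	if errors.Is(err, hystrix.ErrCircuitOpen) {
		return fn()
	}
	return err
}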

Reporting provider state

func (c *ClientWithFallback) toggleConnectionState(err error) {
	connected := true
	if err != nil {
		if !isNotFoundError(err) && !isVMError(err) && !errors.Is(err, ErrRequestsOverLimit) && !errors.Is(err, context.Canceled) {
			log.Warn("Error not in chain call", "error", err, "chain", c.ChainID)
			connected = false
		} else {
			log.Warn("Error in chain call", "error", err)
		}
	}
	c.SetIsConnected(connected)
}
		feed.Send(walletevent.Event{
			Type:     EventBlockchainStatusChanged,
			Accounts: []common.Address{},
			Message:  string(encodedmessage),
			At:       time.Now().Unix(),
			ChainID:  chainID,
		})

When a provider is not "connected", the client gets an EventBlockchainStatusChanged event with a "down" message.
Mobile just logs the event (previously there was a toast).
Desktop has more complex logic and keeps track of the last success timestamp:

proc getStateValue(message: string): connection_status_backend.StateValue =
    if message == "down":
      return connection_status_backend.StateValue.Disconnected
    elif message == "up":
      return connection_status_backend.StateValue.Connected
    else:
      return connection_status_backend.StateValue.Unknown

  proc getChainStatusTable(message: string): ConnectionStatusNotification =
    result = initCustomStatusNotification()

    let chainStatusTable = parseJson(message)
    if chainStatusTable.kind != JNull:
      for k, v in chainStatusTable.pairs:
        result[k] = connection_status_backend.initConnectionState(
          value = getStateValue(v.getStr)
        )

  proc getChainIdsDown(self: Service, chainStatusTable: ConnectionStatusNotification): (bool, bool, seq[int], int) =
    var allKnown: bool = true
    var allDown: bool = true
    var chaindIdsDown: seq[int] = @[]
    var lastSuccessAt: int = connection_status_backend.INVALID_TIMESTAMP # latest successful connection between the down chains

    let allChainIds = self.networkService.getCurrentNetworksChainIds()
    for id in allChainIds:
      if chainStatusTable.hasKey($id) and chainStatusTable[$id].value != connection_status_backend.StateValue.Unknown:
        if chainStatusTable[$id].value == connection_status_backend.StateValue.Connected:
          allDown = false
        else:
          chaindIdsDown.add(id)
          lastSuccessAt = max(lastSuccessAt, chainStatusTable[$id].lastSuccessAt)
      else:
        allKnown = false

    return (allKnown, allDown, chaindIdsDown, lastSuccessAt)

Errors are concatenated:

func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
	// Accumulate errors
	if result.err != nil {
		result.err = fmt.Errorf("%w, %s.error: %w", result.err, circuitName, err)
	} else {
		result.err = fmt.Errorf("%s.error: %w", circuitName, err)
	}
	return result
}

Here are some issues:

  • There is no distinction between hitting the RPS limit, network connectivity problems, and partial unavailability of some providers. That is not enough information to display the error properly on the client side, which leads to annoying notifications.
  • The desktop client has some logic based on the lastSuccessAt timestamp which is not shared with the mobile client.
  • The circuit breaker accumulates error strings, and the error type is lost. So if 2 providers hit the RPS limit and 2 are unreachable, it's unclear whether we should tell the user that the RPC limit was reached or that it's a general error, and whether to recommend fixing the network connection, overriding the RPC credentials, or treating it as an internal status proxy error (see the sketch below).
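
One possible direction for the last point (a sketch, not existing code): keep the individual errors instead of flattening them into a string, e.g. with errors.Join (Go 1.20+), so the layer that emits the client event can still classify them. Which error maps to which category below is only an illustration:

import (
	"context"
	"errors"
	"fmt"
)

// Accumulate errors while preserving their types for later classification.
func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
	result.err = errors.Join(result.err, fmt.Errorf("%s: %w", circuitName, err))
	return result
}

// classifyError is a hypothetical helper the event emitter could use to pick
// a message for the client instead of a generic "down".
func classifyError(err error) string {
	switch {
	case errors.Is(err, ErrRequestsOverLimit): // the existing limiter error mentioned above
		return "rpc-limit-reached"
	case errors.Is(err, context.DeadlineExceeded):
		return "provider-unreachable"
	default:
		return "error"
	}
}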

Suggestions:

Links:

@alaibe
Collaborator

alaibe commented Sep 19, 2024

It's not clear why we need OriginalRPCURL, OriginalFallbackURL in this datatype.
The user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce RPS limiting logic on the status proxy side.
I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants).
The proxy should be the one handling the fallback, etc.

@friofry
Contributor Author

friofry commented Sep 19, 2024

@dlipicar what do you think of the proposed changes?

@friofry
Contributor Author

friofry commented Sep 19, 2024

It's not clear why we need OriginalRPCURL, OriginalFallbackURL in this datatype. The user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce RPS limiting logic on the status proxy side. I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants). The proxy should be the one handling the fallback, etc.

Removing both rps_limiter and circuit_breaker and moving the provider logic to the proxy sounds better. (Basically, that is what the proxy design pattern is about.)

But in that case we should probably still:

  • Return more explicit errors in the events sent to the client (see the sketch below)
  • Allow users to override the web3 client with their own URLs/credentials
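
For the first point, the event payload could carry an explicit error kind per chain instead of just "up"/"down" (a sketch; the struct and field names are made up, not an existing status-go type):

// BlockchainStatus is a hypothetical richer payload for EventBlockchainStatusChanged.
type BlockchainStatus struct {
	ChainID       uint64 `json:"chainId"`
	Status        string `json:"status"`        // "up" | "down"
	ErrorKind     string `json:"errorKind"`     // "rpc-limit" | "unreachable" | "proxy-error" | ""
	LastSuccessAt int64  `json:"lastSuccessAt"` // unix seconds, shared by desktop and mobile
}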

@alaibe
Collaborator

alaibe commented Sep 19, 2024

Agree with your 2 points. "Allow users to override the web3 client with their own URLs/credentials" is already the case on desktop. That would simplify the UI :)
