Stop using the remote HTTP cache when it became unreachable #24685
This PR implements the feature to stop using the remote HTTP cache when it becomes unreachable. Bazel handles remote cache errors and continues building with local execution when they occur. However, it keeps calling the remote cache for each action, which produces significant delays compared to local execution without the remote cache. The situation is especially bad for network timeout errors: the default TCP connect timeout is 21 sec on Windows and 127 sec on Linux/Mac. This time is lost before each action, on the attempt to retrieve data from the remote cache, and again after the action, on the attempt to store data in the cache. The issue can be reproduced by specifying a non-existent address for the cache, such as `--remote_cache=https://10.255.255.1/`.

The proposed implementation is based on the already existing `CircuitBreaker` logic. This logic was recently enabled for HTTP in PR #20831. However, it doesn't resolve the issue, because: `CircuitBreaker` […] to trigger on timeout errors.

This PR includes the following changes:
- `isSuccess` predicate. `HTTP_SUCCESS_CODES` defines success codes for HTTP, and `RemoteRetrier.GRPC_SUCCESS_CODES` defines success codes for gRPC.
- `FailureCircuitBreaker` defines the threshold required to calculate the failure rate: `minCallCountToComputeFailureRate=100`. With connection timeout errors, reaching this threshold takes a very long time, much longer than the default `experimental_remote_failure_window_interval=60s`. To solve the problem, we have added one more threshold: `DEFAULT_MIN_FAIL_COUNT_TO_COMPUTE_FAILURE_RATE=12`. This value is chosen to be slightly more than the number of failures within the window interval with the default `experimental_remote_failure_rate_threshold=10`.
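To illustrate why a second threshold helps, here is a minimal sketch of a sliding-window breaker with both a minimum call count and a minimum failure count. It is not the code from this PR: the class `SketchCircuitBreaker`, its method names, the explicit timestamp parameters, and the exact way the two thresholds are combined (the failure rate is evaluated once either one is reached) are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window breaker for illustration; not Bazel's FailureCircuitBreaker. */
final class SketchCircuitBreaker {
  private final int failureRateThresholdPercent;
  private final long windowMillis;
  private final int minCallCount;
  private final int minFailCount;
  // Timestamps of calls that are still inside the sliding window.
  private final Deque<Long> successes = new ArrayDeque<>();
  private final Deque<Long> failures = new ArrayDeque<>();
  private boolean open = false;

  SketchCircuitBreaker(
      int failureRateThresholdPercent, long windowMillis, int minCallCount, int minFailCount) {
    this.failureRateThresholdPercent = failureRateThresholdPercent;
    this.windowMillis = windowMillis;
    this.minCallCount = minCallCount;
    this.minFailCount = minFailCount;
  }

  void recordSuccess(long nowMillis) {
    successes.addLast(nowMillis);
  }

  void recordFailure(long nowMillis) {
    failures.addLast(nowMillis);
    evictOlderThanWindow(successes, nowMillis);
    evictOlderThanWindow(failures, nowMillis);
    int failCount = failures.size();
    int totalCount = failCount + successes.size();
    // Assumed combination rule: skip the rate check only while BOTH counts are
    // below their thresholds, so a slow trickle of timeouts can still trip it.
    if (totalCount < minCallCount && failCount < minFailCount) {
      return;
    }
    double failureRatePercent = failCount * 100.0 / totalCount;
    if (failureRatePercent > failureRateThresholdPercent) {
      open = true; // stop sending calls to the remote cache
    }
  }

  boolean isOpen() {
    return open;
  }

  private void evictOlderThanWindow(Deque<Long> events, long nowMillis) {
    while (!events.isEmpty() && nowMillis - events.peekFirst() >= windowMillis) {
      events.removeFirst();
    }
  }
}
```

With the values quoted above (rate threshold 10, window 60 s, minimum call count 100, minimum failure count 12), twelve failures inside one window are enough to open the breaker in this sketch, whereas with only the call-count threshold the same twelve failures would never be evaluated.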