Stop using the remote HTTP cache when it became unreachable #24685

AlexanderGolovlev · 2024-12-13T11:38:34Z

This PR implements the feature to stop using the remote HTTP cache when it became unreachable. Bazel handles the remote cache errors and continues building with local execution in case of errors. However, it continues making calls to remote cache for each action. This produces significant delays compared to local execution without the remote cache. The situation is especially bad in case of network timeout errors. On Windows the default TCP connect timeout is 21sec, on Linux/Mac - 127 sec. This time is lost before any action on attempt to retrieve the data from remote cache, and after the action on attempt to store the data in cache. The issue can be reproduced by specifying the non-existing address for the cache, such as --remote_cache=https://10.255.255.1/.

The proposed implementation is based on already existing CircuitBreaker logic. This logic was enabled recently for HTTP in PR #20831. However, it doesn't resolve the issue, because:

Errors like "Connection refused" or "Connection timed out" are not counted as failures
It takes a very long time for CircuitBreaker to trigger on timeout errors.

This PR includes the following changes:

All non-retriable errors were counted as successful API calls. However, they include not only the cache miss errors (which should be considered as a successful call), but also the different connectivity errors (which should be considered as failures). To distinguish these errors, we have added the isSuccess predicate. HTTP_SUCCESS_CODES defines success codes for HTTP, and RemoteRetrier.GRPC_SUCCESS_CODES defines success codes for gRPC.
Current implementation of FailureCircuitBreaker defines the threshold required to calculate the failure rate: minCallCountToComputeFailureRate=100. In case of connection timeout errors this requires a very large time interval to reach the threshold, much larger than the default experimental_remote_failure_window_interval=60s. To solve the problem, we have added one more threshold: DEFAULT_MIN_FAIL_COUNT_TO_COMPUTE_FAILURE_RATE=12. This value is chosen to be slightly more than number of failures within window interval with default experimental_remote_failure_rate_threshold=10.

AlexanderGolovlev added 2 commits December 13, 2024 13:46

Enabling the CircuitBreaker for remote HTTP cache

ca17479

Duplicate code removed

bcfe59f

AlexanderGolovlev marked this pull request as ready for review December 16, 2024 12:45

AlexanderGolovlev requested a review from a team as a code owner December 16, 2024 12:45

github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Dec 16, 2024

AlexanderGolovlev changed the title ~~Enabling the CircuitBreaker for remote HTTP cache~~ Stop using the remote HTTP cache when it became unreachable Dec 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop using the remote HTTP cache when it became unreachable #24685

Stop using the remote HTTP cache when it became unreachable #24685

AlexanderGolovlev commented Dec 13, 2024 •

edited

Loading

Stop using the remote HTTP cache when it became unreachable #24685

Are you sure you want to change the base?

Stop using the remote HTTP cache when it became unreachable #24685

Conversation

AlexanderGolovlev commented Dec 13, 2024 • edited Loading

AlexanderGolovlev commented Dec 13, 2024 •

edited

Loading