Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using the remote HTTP cache when it became unreachable #24685

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

AlexanderGolovlev
Copy link
Contributor

@AlexanderGolovlev AlexanderGolovlev commented Dec 13, 2024

This PR implements the feature to stop using the remote HTTP cache when it became unreachable. Bazel handles the remote cache errors and continues building with local execution in case of errors. However, it continues making calls to remote cache for each action. This produces significant delays compared to local execution without the remote cache. The situation is especially bad in case of network timeout errors. On Windows the default TCP connect timeout is 21sec, on Linux/Mac - 127 sec. This time is lost before any action on attempt to retrieve the data from remote cache, and after the action on attempt to store the data in cache. The issue can be reproduced by specifying the non-existing address for the cache, such as --remote_cache=https://10.255.255.1/.

The proposed implementation is based on already existing CircuitBreaker logic. This logic was enabled recently for HTTP in PR #20831. However, it doesn't resolve the issue, because:

  • Errors like "Connection refused" or "Connection timed out" are not counted as failures
  • It takes a very long time for CircuitBreaker to trigger on timeout errors.

This PR includes the following changes:

  • All non-retriable errors were counted as successful API calls. However, they include not only the cache miss errors (which should be considered as a successful call), but also the different connectivity errors (which should be considered as failures). To distinguish these errors, we have added the isSuccess predicate. HTTP_SUCCESS_CODES defines success codes for HTTP, and RemoteRetrier.GRPC_SUCCESS_CODES defines success codes for gRPC.
  • Current implementation of FailureCircuitBreaker defines the threshold required to calculate the failure rate: minCallCountToComputeFailureRate=100. In case of connection timeout errors this requires a very large time interval to reach the threshold, much larger than the default experimental_remote_failure_window_interval=60s. To solve the problem, we have added one more threshold: DEFAULT_MIN_FAIL_COUNT_TO_COMPUTE_FAILURE_RATE=12. This value is chosen to be slightly more than number of failures within window interval with default experimental_remote_failure_rate_threshold=10.

@AlexanderGolovlev AlexanderGolovlev marked this pull request as ready for review December 16, 2024 12:45
@AlexanderGolovlev AlexanderGolovlev requested a review from a team as a code owner December 16, 2024 12:45
@github-actions github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Dec 16, 2024
@AlexanderGolovlev AlexanderGolovlev changed the title Enabling the CircuitBreaker for remote HTTP cache Stop using the remote HTTP cache when it became unreachable Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant