REv2: RPCs failing with DEADLINE_EXCEEDED no longer return a useful error message #12898

Closed
EdSchouten opened this issue Jan 26, 2021 · 3 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@EdSchouten
Contributor

Description of the problem / feature request:

When running a build with Bazel 4.0.0 against Buildbarn, one might see the following cryptic error message, without any further details:

ERROR: /home/ed/some/directory/BUILD:7:22: Compiling some/filename.c failed: (Exit 34): clang failed: error executing command foo ... (remaining ... argument(s) skipped). Note: Remote connection/protocol failed with: execution failed

The java.log contains the following:

210126 00:08:44.738:WT 380 [com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run] Aborting evaluation while evaluating ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=[REDACTED], config=BuildConfigurationValue.Key[REDACTED]}, actionIndex=2}
com.google.devtools.build.lib.skyframe.ActionExecutionFunction$ActionExecutionFunctionException: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: (Exit 34): [REDACTED] ... (remaining 437 argument(s) skipped). Note: Remote connection/protocol failed with: execution failed
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:326)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:477)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: (Exit 34): [REDACTED] ... (remaining 437 argument(s) skipped). Note: Remote connection/protocol failed with: execution failed
        ... 6 more
Caused by: com.google.devtools.build.lib.exec.SpawnExecException: [REDACTED] ... (remaining 437 argument(s) skipped)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:195)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:102)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
        at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:306)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:855)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1016)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:975)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:472)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:834)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:307)
        ... 5 more 

Investigating the gRPC Prometheus metrics on the Buildbarn side, I can see that this is caused by one or more RPCs failing with DEADLINE_EXCEEDED. Reducing --remote_timeout to an extremely low value (e.g., 3 seconds) makes these errors more likely, while increasing it to an extremely high value (e.g., 3600 seconds) makes them go away.

Feature requests: what underlying problem are you trying to solve with this feature?

Make Bazel print a useful error when builds fail like this.
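
To illustrate the kind of message that would help here, below is a minimal sketch that walks an exception's cause chain so the innermost failure stays visible to the user. It is not Bazel's actual error-reporting code: `RemoteErrorMessages`, `RpcStatusException`, `describeWithHint`, and the sample status description are all hypothetical names invented for this example, and the sketch deliberately avoids depending on grpc-java.

```java
import java.util.ArrayList;
import java.util.List;

public class RemoteErrorMessages {
    /** Hypothetical stand-in for an RPC status failure; not a Bazel or grpc-java class. */
    static class RpcStatusException extends RuntimeException {
        final String code;

        RpcStatusException(String code, String description) {
            super(code + ": " + description);
            this.code = code;
        }
    }

    /**
     * Joins the messages of the whole cause chain and appends a hint about
     * --remote_timeout when any cause carries the DEADLINE_EXCEEDED code.
     */
    static String describeWithHint(Throwable failure) {
        List<String> parts = new ArrayList<>();
        boolean deadlineExceeded = false;
        Throwable cur = failure;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; cur != null && depth < 16; depth++) {
            String message = cur.getMessage();
            parts.add(message != null ? message : cur.getClass().getSimpleName());
            if (cur instanceof RpcStatusException
                    && "DEADLINE_EXCEEDED".equals(((RpcStatusException) cur).code)) {
                deadlineExceeded = true;
            }
            cur = cur.getCause();
        }
        String text = String.join(": ", parts);
        return deadlineExceeded ? text + " (consider raising --remote_timeout)" : text;
    }

    public static void main(String[] args) {
        // Mimics the situation in this report: a generic outer message
        // wrapping the status that actually explains the failure.
        Throwable failure = new RuntimeException(
                "execution failed",
                new RpcStatusException("DEADLINE_EXCEEDED", "deadline exceeded after 3s"));
        System.out.println(describeWithHint(failure));
    }
}
```

With something along these lines, the note would read "execution failed: DEADLINE_EXCEEDED: deadline exceeded after 3s (consider raising --remote_timeout)" rather than just "execution failed".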

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run a build against a REv2 cluster, while passing in --remote_timeout=${low}.

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

release 4.0.0

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

No

@gregestren gregestren added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged labels Jan 26, 2021
@philwo philwo added P1 I'll work on this now. (Assignee required) and removed untriaged labels Feb 9, 2021
@sventiffe
Contributor

This bug has been stalled for a few months but is marked as P1. Could you please either update the bug or adjust the priority?

@coeuvre
Member

coeuvre commented May 11, 2021

#12564 made some improvements to reporting remote execution errors but wasn't included in 4.0.0. Can you please try 4.1rc4 and check whether it solves this issue?

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels May 11, 2021
@coeuvre
Member

coeuvre commented Sep 13, 2021

Closing. Please re-open if you think there is still room for improvement.

@coeuvre coeuvre closed this as completed Sep 13, 2021

5 participants