Actions stuck in "active" and thus "[Prepa]" #13985

DavidANeil · 2021-09-13T18:02:46Z

Description of the problem:

The UiStateTracker sometimes reports actions in "Preparing" status (shortened to "[Prepa]") indefinitely.
Because this "stuck" action will be the longest running, it will often be the only displayed action (when --curses=no).
For example, here is what a sample CI run looks like:

[0 / 69] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[15,599 / 17,556] 5 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 13s ... (296 actions, 90 running)
[26,933 / 50,441] 34 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 29s ... (344 actions, 268 running)
[55,044 / 92,565] 89 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 46s ... (438 actions, 275 running)
[57,666 / 93,495] 101 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 66s ... (546 actions, 312 running)
[58,588 / 93,662] 111 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 89s ... (577 actions, 323 running)
[58,915 / 93,691] 111 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 116s ... (577 actions, 324 running)
[59,156 / 93,935] 121 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 147s ... (572 actions, 321 running)
[59,582 / 94,344] 149 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 182s ... (577 actions, 323 running)
[60,425 / 95,053] 168 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 223s ... (578 actions, 325 running)
[61,058 / 95,161] 203 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 270s ... (577 actions, 323 running)
[62,440 / 96,339] 351 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 324s ... (577 actions, 324 running)
[63,079 / 96,913] 420 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 386s ... (578 actions, 324 running)
[64,779 / 114,781] 485 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 457s ... (581 actions, 326 running)
[65,352 / 115,200] 543 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 539s ... (579 actions, 323 running)
[68,085 / 115,383] 589 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 633s ... (581 actions, 323 running)
[76,800 / 115,433] 652 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 741s ... (571 actions, 305 running)
[95,197 / 115,457] 654 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 866s ... (404 actions, 137 running)
[116,032 / 116,315] 828 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 1009s ... (162 actions, 102 running)

Which is just ... not useful.

Note that actual action execution is not affected by this. This is purely a visual bug.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This bug appears to be caused by a race condition, so is not easily reproducible.
I reckon it is a case where an action is added to the activeActions map, then removed after it completes, then re-added by a call to computeIfAbsent being called in a racy manner.

What operating system are you running Bazel on?

Linux 20.04

What's the output of `bazel info release`?

release 4.2.0

I have not been able to reproduce this issue on 4.1.0

The text was updated successfully, but these errors were encountered:

brentleyjones · 2021-09-13T18:07:58Z

This issue also exists in the latest rolling release.

bjacklyn · 2021-09-19T21:43:17Z

cc @coeuvre

Also seeing this after upgrading to 4.2.1 from 4.0.0. These are the changes that have gone in since the 4.1.0 release:

$ git log HEAD ^4.1.0 --oneline src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
bf5e2b8ec0 Remote: Async upload (Part 8)
b51b31dbe7 Remote: Fix a race when reporting action progresses.
0f812eb5e5 Remote: Display download progress when actions are downloading outputs from remote cache.
f0983dfddb Remote: Report checking cache status before the action is scheduled to run remotely.

It looks like there was an attempt at fixing the UI bug in b51b31dbe7.

By adding the below patch you can reproduce this issue:

id 'File:[[<execution_root>]bazel-out/k8-opt/.../foo' is already finished, and event Downloading foo, 5.2 KiB / 5.2 KiB is being processed
...

The problem seems to be like @DavidANeil mentioned -- actionProgress(ActionProgressEvent event) is probably called in ThreadA before actionCompletion(ActionCompletionEvent event) in ThreadB but the ThreadA goes to sleep and when it revives it calls getActionState() which calls activeActions.computeIfAbsent() which re-adds the actions that was just removed by ThreadB. The other possibility is events are being fired out-of-order.

I wonder if the right fix is for UiEventHandler to queue and only fire 1 in-progress event per actionId and wait for each event completion before firing the next. Presumably computeIfAbsent is being used because of this exact problem -- events can be processed out-of-order so actionStarted() cannot be assumed to have already ran & created an action in activeActions.

Patch:

$ git diff
diff --git a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
index eef7ea5c64..99ae5e36d2 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
@@ -361,6 +361,8 @@ final class UiStateTracker {
 
   private final Map<Artifact, ActionState> activeActions;
 
+  private final Map<Artifact, Boolean> finishedActions;
+
   // running downloads are identified by the original URL they were trying to access.
   private final Deque<String> runningDownloads;
   private final Map<String, Long> downloadNanoStartTimes;
@@ -396,6 +398,7 @@ final class UiStateTracker {
 
   UiStateTracker(Clock clock, int targetWidth) {
     this.activeActions = new ConcurrentHashMap<>();
+    this.finishedActions = new ConcurrentHashMap<>();
 
     this.actionsCompleted = new AtomicInteger();
     this.testActions = new ConcurrentHashMap<>();
@@ -582,6 +585,9 @@ final class UiStateTracker {
     ActionExecutionMetadata action = event.action();
     Artifact actionId = event.action().getPrimaryOutput();
     long now = clock.nanoTime();
+    if (finishedActions.containsKey(actionId)) {
+        System.out.println(String.format("id '%s' is already finished, and event %s is being processed", actionId, event.progress()));
+    }
     getActionState(action, actionId, now).onProgressEvent(event, now);
   }
 
@@ -589,6 +595,7 @@ final class UiStateTracker {
     Action action = event.getAction();
     Artifact actionId = action.getPrimaryOutput();
     checkNotNull(activeActions.remove(actionId), "%s not active after %s", actionId, event);
+    finishedActions.put(actionId, true);
 
     // As callers to the experimental state tracker assume we will fully report the new state once
     // informed of an action completion, we need to make sure the progress receiver is aware of the
@@ -604,6 +611,7 @@ final class UiStateTracker {
     Artifact actionId = action.getPrimaryOutput();
 
     checkNotNull(activeActions.remove(actionId), "%s not active after %s", actionId, event);
+    finishedActions.put(actionId, true);
 
     if (action.getOwner() != null) {
       Label owner = action.getOwner().getLabel();

bjacklyn · 2021-09-19T21:45:02Z

Also for anyone who OK with compiling bazel from source and applying a patch (patch based off 4.2.1), this fixes it for me:

$ git diff
diff --git a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
index eef7ea5c64..9e5570e8a7 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
@@ -526,6 +526,11 @@ final class UiStateTracker {
     return activeActions.computeIfAbsent(actionId, (key) -> new ActionState(action, nanoTimeNow));
   }
 
+  @Nullable
+  private ActionState getActionStateIfPresent(Artifact actionId) {
+    return activeActions.get(actionId);
+  }
+
   void actionStarted(ActionStartedEvent event) {
     Action action = event.getAction();
     Artifact actionId = action.getPrimaryOutput();
@@ -582,7 +587,10 @@ final class UiStateTracker {
     ActionExecutionMetadata action = event.action();
     Artifact actionId = event.action().getPrimaryOutput();
     long now = clock.nanoTime();
-    getActionState(action, actionId, now).onProgressEvent(event, now);
+    ActionState actionState = getActionStateIfPresent(actionId);
+    if (actionState != null) {
+        actionState.onProgressEvent(event, now);
+    }
   }
 
   void actionCompletion(ActionScanningCompletedEvent event) {

philwo · 2021-09-21T21:05:48Z

@bjacklyn Would you mind sending your patch as a pull request?

Fixed bazelbuild#13985 UiStateTracker can process ActionProgress events after an ActionCompletion event has been fired. This has the effect of recreating the action object in the activeActions Map because Map.computeIfAbsent() is being used to retrieve the action. Therefore, a new method getActionStateIfPresent is created which will return the action if it exists, otherwise return null, and actionProgress() uses this method so as to not inadverntantly recreate actions after they have already completed and been removed from the Map. Closes bazelbuild#14020. PiperOrigin-RevId: 398165993 (cherry picked from commit 639f89d)

Fixed #13985 UiStateTracker can process ActionProgress events after an ActionCompletion event has been fired. This has the effect of recreating the action object in the activeActions Map because Map.computeIfAbsent() is being used to retrieve the action. Therefore, a new method getActionStateIfPresent is created which will return the action if it exists, otherwise return null, and actionProgress() uses this method so as to not inadverntantly recreate actions after they have already completed and been removed from the Map. Closes #14020. PiperOrigin-RevId: 398165993 (cherry picked from commit 639f89d) Co-authored-by: Brandon Jacklyn <[email protected]>

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

philwo added P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug labels Sep 21, 2021

philwo assigned coeuvre Sep 21, 2021

bjacklyn mentioned this issue Sep 21, 2021

Fix [Prepa] actions stuck in active state #14020

Closed

bazel-io closed this as completed in 639f89d Sep 22, 2021

bjacklyn mentioned this issue Nov 24, 2021

Release 4.2 - June 2021 #13558

Closed

9 tasks

bjacklyn mentioned this issue Nov 24, 2021

Fix [Prepa] actions stuck in active state #14325

Merged

mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022

[bazel] Bump the bazel version to fix a UI bug

992048f

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

mcy mentioned this issue Apr 21, 2022

[bazel] Bump the bazel version to fix a UI bug lowRISC/opentitan#12263

Merged

mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022

[bazel] Bump the bazel version to fix a UI bug

3d0689c

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022

[bazel] Bump the bazel version to fix a UI bug

8ef55d8

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

timothytrippel pushed a commit to lowRISC/opentitan that referenced this issue Apr 23, 2022

[bazel] Bump the bazel version to fix a UI bug

4b01c7a

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

jeoongp pushed a commit to jeoongp/opentitan that referenced this issue May 7, 2022

[bazel] Bump the bazel version to fix a UI bug

4ed80e1

See bazelbuild/bazel#13985 Signed-off-by: Miguel Young de la Sota <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Actions stuck in "active" and thus "[Prepa]" #13985

Actions stuck in "active" and thus "[Prepa]" #13985

DavidANeil commented Sep 13, 2021

brentleyjones commented Sep 13, 2021

bjacklyn commented Sep 19, 2021

bjacklyn commented Sep 19, 2021

philwo commented Sep 21, 2021

Actions stuck in "active" and thus "[Prepa]" #13985

Actions stuck in "active" and thus "[Prepa]" #13985

Comments

DavidANeil commented Sep 13, 2021

Description of the problem:

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

What operating system are you running Bazel on?

What's the output of bazel info release?

brentleyjones commented Sep 13, 2021

bjacklyn commented Sep 19, 2021

bjacklyn commented Sep 19, 2021

philwo commented Sep 21, 2021

What's the output of `bazel info release`?