
Workspace dependency downloaded 4 times #10515

Closed
gunan opened this issue Jan 3, 2020 · 50 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@gunan

gunan commented Jan 3, 2020

#10071

Description of the problem / feature request:

When building TensorFlow, the JSON profile shows that the LLVM dependency is downloaded and extracted 4 times.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run the following commands:

git clone https://github.com/tensorflow/tensorflow

Unfortunately, there may be some dependencies to install at this point. Then:

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

What operating system are you running Bazel on?

Reproduced on Linux and Windows.

What's the output of bazel info release?

Downloaded and installed 1.2.1:

release 1.2.1

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

$ git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
https://github.com/tensorflow/tensorflow
6c902370140cb4b5c9ae69f440e16e5ace55b828
6c902370140cb4b5c9ae69f440e16e5ace55b828

Have you found anything relevant by searching the web?

No, but there is an internal discussion thread.
@laszlocsomor recommended that I create a GitHub issue and point him and @aehlig to it.

Any other information, logs, or outputs that you want to share?

I can attach profiler outputs on request.

@gunan
Author

gunan commented Jan 3, 2020

Here are the profiler outputs on Linux and Windows.

linux_simplest_case.profile.gz
windows_build_more_targets.profile.gz

@laszlocsomor laszlocsomor self-assigned this Jan 3, 2020
@laszlocsomor laszlocsomor added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged labels Jan 3, 2020
@laszlocsomor
Contributor

Successfully repro'd on Linux, at tensorflow/tensorflow@6c90237, with Bazel 1.2.1 and Python 3.7:

bazel --nohome_rc --nosystem_rc build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

Using --repository_cache doesn't help.

@gunan
Author

gunan commented Jan 3, 2020

I have just tried another thing: I created a new repository with only the LLVM workspace dependency.
Looks like this one only downloads LLVM once:
https://github.com/gunan/bazel_test

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm_tf.profile.gz --nobuild @llvm-project//:count

llvm.profile.gz

@gunan
Author

gunan commented Jan 3, 2020

I think I got it!
In my repository, I added just one "build_file" attribute to my workspace rule.
However, if I instead set it using our "additional_build_files" attribute, I get multiple LLVM extractions:
llvm_vacuum.profile.gz

Here is the implementation of it:
https://github.com/tensorflow/tensorflow/blob/master/third_party/repo.bzl#L121

However, I do not understand why Bazel does this multiple times. Is this an artifact of ctx.symlink?
Do you think this is a bug?

@laszlocsomor
Contributor

Interesting! What does the new WORKSPACE file (with additional_build_files) look like?

@laszlocsomor
Contributor

@meteorcloudy found the root cause:

It is caused by converting a label to a path: when the FileValue referenced by a label is not already loaded in Skyframe, the whole SkylarkRepositoryFunction will get restarted. See

if (fileValue == null) {
  throw RepositoryFunction.restart();
}

and

// A dependency is missing, cleanup and returns null
try {
  if (outputDirectory.exists()) {
    outputDirectory.deleteTree();
  }
} catch (IOException e1) {
  throw new RepositoryFunctionException(e1, Transience.TRANSIENT);
}
return null;

Therefore, the re-extraction is triggered again. ctx.symlink calls the getPathFromLabel function, so it triggers a restart for every new label Bazel hasn't seen before.
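To make the failure mode concrete, a minimal repository rule shaped roughly like TensorFlow's (the names and attributes below are illustrative, not the actual repo.bzl) runs into it like this:

    def _llvm_archive_impl(ctx):
        # Expensive step: re-runs on every restart of this function.
        ctx.download_and_extract(url = ctx.attr.urls, sha256 = ctx.attr.sha256)

        # The dict keys are plain strings, so Bazel cannot tell from the
        # attributes alone which labels the rule will use. The first time each
        # label is converted to a path, its FileValue may be missing from
        # Skyframe; the function then restarts, the output directory is
        # deleted, and the extraction above runs again.
        for build_file, target in ctx.attr.additional_build_files.items():
            ctx.symlink(Label(build_file), target)

    llvm_archive = repository_rule(
        implementation = _llvm_archive_impl,
        attrs = {
            "urls": attr.string_list(),
            "sha256": attr.string(),
            "additional_build_files": attr.string_dict(),
        },
    )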

@meteorcloudy
Member

The workaround is to manually force the label -> path conversion before the extraction:

    if ctx.attr.additional_build_files:
        for internal_src in ctx.attr.additional_build_files:
            # Resolving each label to a path up front loads its FileValue into
            # Skyframe, so any restarts happen here, before the expensive
            # download and extraction.
            _ = ctx.path(Label(internal_src))

The correct fix would require changes in both TF and Bazel:

First, in repo.bzl, we should use label_keyed_string_dict for additional_build_files (instead of string_dict) so that Bazel knows which labels will be used just by checking the attributes.

Second, we have to enforce the label -> path conversion for label_keyed_string_dict in Bazel, at enforceLabelAttributes.
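On the TF side, that first change would look roughly like this (a sketch only; the rule and attribute names are illustrative, not the actual repo.bzl):

    llvm_archive = repository_rule(
        implementation = _llvm_archive_impl,
        attrs = {
            "urls": attr.string_list(),
            "sha256": attr.string(),
            # Declaring the keys as labels (instead of plain strings) lets
            # Bazel see, from the attributes alone, which files the rule
            # depends on, so they can be resolved before the implementation
            # function runs.
            "additional_build_files": attr.label_keyed_string_dict(),
        },
    )

The Bazel-side change would then teach enforceLabelAttributes to pre-resolve this attribute type as well.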

@meteorcloudy meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Jan 3, 2020
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Jun 15, 2020
@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Nov 29, 2021
@meisterT meisterT added untriaged and removed P2 We'll consider working on this in future. (Assignee optional) labels Jun 23, 2022
@meisterT
Member

This behavior is back:
[image attachment]

Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

@meisterT meisterT reopened this Jun 23, 2022
@meteorcloudy
Member

The most recent change to fix this one is: tensorflow/tensorflow@8ba6136

@meteorcloudy
Member

@meteorcloudy
Member

@joker-eph Can you please fix this? Basically, to avoid re-extracting the archive, you'll have to resolve all labels before calling ctx.download_and_extract. See the description in tensorflow/tensorflow@8ba6136

@haxorz
Contributor

haxorz commented Jun 23, 2022

[@meisterT] Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

@meteorcloudy and the rest of team-ExternalDeps: Check out SkyFunction.Environment#getState. Can that be used to solve this problem in a principled manner? It was added recently (2022Q1), so it didn't exist back when this issue was first filed.

@Wyverald
Member

We discussed this new API briefly today and thought it might be hard because this restart happens in the middle of Starlark evaluation, so the state would need to include Starlark stackframes, program counter, etc.

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jun 23, 2022
@Wyverald
Member

Thanks for the report! Could you please upload the full file? It would be nice to see where the worker threads are stuck, or if they're even alive.

@lberki
Contributor

lberki commented Jun 27, 2023

Please do upload the whole file somewhere (for example here; this very text box says that you can attach a file by dragging and dropping it onto it). This one thread shows that it's hanging while trying to download a repository, but that's also apparent from the terminal output above.

@Wyverald
Member

So after a brief discussion with @fmeum, I think this is potentially caused by the skyframe evaluator being starved of threads. If we have a lot of repos being fetched (up to the size of the skyframe evaluator thread pool), and they all depend on another repo that hasn't started being fetched, we'd run into a deadlock.

I'm actually not sure how to resolve this, short of migrating skyframe to virtual threads.
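For intuition, the same shape of deadlock is easy to reproduce outside Bazel. Here is a tiny, purely illustrative Python sketch (nothing Bazel-specific): every worker in a bounded pool blocks on a result that only a still-queued task can produce, so nothing ever finishes:

    import concurrent.futures

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    dep = concurrent.futures.Future()  # stands in for the not-yet-fetched dependency

    def fetch_repo():
        # Each "repo fetch" blocks until the dependency is available.
        return dep.result()

    # Both workers are now occupied by blocked fetches.
    blocked = [pool.submit(fetch_repo) for _ in range(2)]

    # The task that would provide the dependency is queued behind them and
    # never gets a worker thread: this program hangs forever.
    pool.submit(lambda: dep.set_result("dep"))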

@Wyverald
Member

Hang on, I think I had a brain fart. The only reason we might be starved of Skyframe evaluator threads here is if the worker thread pool is exhausted (so new repo fetch requests are blocked on starting a worker thread at all). So just having the worker thread pool use virtual threads should be enough to untangle all this.

@lberki
Contributor

lberki commented Jun 28, 2023

Hah, I started writing a response to exactly that effect, but you were faster with the brain Febreeze :) This scenario is certainly plausible. I can think of two other, more immediate solutions:

  1. Make the size of the worker threadpool adjustable using a command line flag
  2. Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads). This should not be a problem because these threads are not usually runnable (so they don't put undue load on the scheduler) and in the brave new 64-bit world, we don't need to be worried about address space exhaustion, either. It's true that each thread requires some memory, but as long as their stack is not too deep, that shouldn't be a problem.

@Wyverald
Member

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

@lberki
Contributor

lberki commented Jun 28, 2023

I don't know off the bat, but if the thread pool grows on demand, how is a deadlock possible?

@Wyverald
Member

Because it's limited to 100 right now :) It grows on demand up to the limit of 100. What you said was simply to have it grow on demand up to a limit of 10K.

@lberki
Contributor

lberki commented Jun 28, 2023

Oh, I thought you meant that it's already supposed to grow above 100. If not, upping the limit is still viable.

@amishra-u
Contributor

Sorry, I got busy with something else. Also, the file size exceeds the GitHub limit, so here is the Drive link:
https://drive.google.com/file/d/1s5ZUm8Jtkz77eZBlFe7eaVt9ROcLA-Vp/view?usp=drive_link

@lberki
Contributor

lberki commented Jun 29, 2023

@amishra-u you successfully nerd sniped me, but anyway, the thread dumps make it obvious that it's as @Wyverald says: all the threads in the thread pool are taken, thus repo fetching can make no progress.

@meisterT
Member

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

How much of the work is CPU bound?

@lberki
Contributor

lberki commented Jun 29, 2023

Technically, it's fetching Starlark repositories so people can write any Starlark code, including mining Bitcoin, but I imagine it's mostly I/O (file system or network access).

I don't think there is a way to tell the JVM "keep these zillion threads but only ever execute N of them at the same time", is there? (I mean, other than Loom)

I could imagine shenanigans like having a semaphore which we acquire when we start a new repository thread and then release-and-acquire-again around blocking operations, which would have the desired effect, but I'm not sure if it's worth doing this just for the platform-thread implementation of repository workers because Loom is coming soon. It would be a fun exercise, though!
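A rough sketch of that semaphore idea, in Python for brevity (purely illustrative, not actual Bazel code): a permit is held only while a worker is runnable, and is handed back around blocking operations so that blocked workers don't count against the limit:

    import threading

    MAX_RUNNABLE = 8  # illustrative cap on concurrently runnable workers
    runnable = threading.Semaphore(MAX_RUNNABLE)

    def repository_worker(evaluate_starlark, blocking_fetch):
        runnable.acquire()            # held while executing Starlark
        try:
            evaluate_starlark()
            runnable.release()        # hand the permit back before blocking...
            try:
                blocking_fetch()      # ...on network or file-system I/O
            finally:
                runnable.acquire()    # ...and reclaim it before running Starlark again
            evaluate_starlark()
        finally:
            runnable.release()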

@fmeum
Collaborator

fmeum commented Jun 29, 2023

Isn't the number of "active" repo function worker threads bounded by the number of Skyframe threads? The only way a worker thread can become unattached from a Skyframe thread seems to be when it requires a dep that hasn't been evaluated yet. In that case it will block until another Skyframe thread picks it up after the dep has been evaluated.

Maybe we could get away with an effectively unbounded thread pool?

@lberki
Contributor

lberki commented Jun 29, 2023

Nope. Imagine the case of having one Skyframe thread but unlimited repo worker threads: in this case, the Skyframe thread would start fetching a repository and spawn a worker thread. Then, when a new Skyframe dependency needs to be fetched, the worker thread would be suspended, and the Skyframe thread would continue executing and start processing the dependency. That dependency can itself be another repository, in which case the single Skyframe thread would start fetching it, spawn another worker thread, etc...

@Wyverald
Member

@lberki, you might have missed the word "active" in @fmeum's comment. In the scenario you described, there's only ever one worker thread that's not asleep.

How much of the work is CPU bound?

Normally, very very little, as Lukács described. You could technically mine bitcoin using Starlark but I imagine you'd rather write the miner in another language and call repository_ctx.execute on it instead...

copybara-service bot pushed a commit that referenced this issue Aug 4, 2023
… fetching.

Previously, the first markerData map was passed into the worker thread, then, on a Skyframe restart, RepositoryFunction would create a new one which was left empty.

Work towards #10515 .

RELNOTES: None.
PiperOrigin-RevId: 553704893
Change-Id: I57fd03d478f727ec7d38beb09da8148150e71fa3
copybara-service bot pushed a commit that referenced this issue Aug 7, 2023
Not doing this caused test_interrupted_children_waited to fail with --experimental_worker_for_repo_fetching=platform because SkyFunction states were not closed when the SkyFunction was interrupted in flight, which caused processes spawned by repository fetching not to be interrupted in turn.

Work towards #10515 .

RELNOTES: None.
PiperOrigin-RevId: 554477869
Change-Id: Ie6d2289b5bfdc7f4a73fc9dd29a40a5fa79b6269
@Wyverald
Member

An update: we're likely to get greenlit to start using JDK 21 features (read: Loom) by the end of January, so this issue may be fixed once and for all starting from 7.1.0.

copybara-service bot pushed a commit that referenced this issue Jan 24, 2024
…rtual`

Bazel already ships JDK 21, but the codebase can't use JDK 21 features (language OR runtime) because Google hasn't migrated to JDK 21 yet (soon TM). But we can cheat a bit using reflection!

Work towards #10515

PiperOrigin-RevId: 601188027
Change-Id: I65c1712da0347bef40fc4af94bad4c211687a8b7
@Wyverald
Member

Update to the update: JDK 21 language/library features may come to Bazel later than expected, anyway (the unexpected is always to be expected, as they say). But! 112d3b6 just landed on HEAD, which allows us to use Loom to eliminate restarts (--experimental_worker_for_repo_fetching=virtual). Feel free to try it out (especially to see if it solves the deadlock issue seen by @amishra-u). If all goes well, we'll turn this on by default in 7.1.0.
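For anyone who wants to try it on the original reproduction, an invocation along these lines should exercise the new code path (the flag is the one mentioned above; the target and --nobuild come from the original report):

bazel build --experimental_worker_for_repo_fetching=virtual --nobuild @llvm-project//llvm:FileCheck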

@Wyverald
Member

@bazel-io fork 7.1.0

Wyverald added a commit that referenced this issue Jan 26, 2024
…rtual`

Bazel already ships JDK 21, but the codebase can't use JDK 21 features (language OR runtime) because Google hasn't migrated to JDK 21 yet (soon TM). But we can cheat a bit using reflection!

Work towards #10515

PiperOrigin-RevId: 601188027
Change-Id: I65c1712da0347bef40fc4af94bad4c211687a8b7
Wyverald added a commit that referenced this issue Jan 26, 2024
We use virtual threads if available (JDK 21+); if not, we fall back to not using worker threads at all. This approach is slightly safer than straight up setting the default to `virtual` as that would break certain obscure platforms without an available JDK 21. The plan is to cherry-pick this into 7.1.0.

Also removed test_reexecute in bazel_workspaces_test since, well, that's the whole point!

Fixes #10515

RELNOTES: The flag `--experimental_worker_for_repo_fetching` now defaults to `auto`, which uses virtual threads from JDK 21 if it's available. This eliminates restarts during repo fetching.
PiperOrigin-RevId: 601810629
Change-Id: I9d2ec59688586356d90604fdd328cc985e75d1d4
@iancha1992
Member

A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!
