
Workspace dependency downloaded 4 times #10515

Closed
gunan opened this issue Jan 3, 2020 · 50 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@gunan

gunan commented Jan 3, 2020

#10071

Description of the problem / feature request:

When building TensorFlow, the JSON profile shows that the LLVM dependency is downloaded and extracted 4 times.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run the following commands:

git clone https://github.com/tensorflow/tensorflow

Unfortunately, there may be some dependencies to install at this point. Then:

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

What operating system are you running Bazel on?

Reproduced on Linux and Windows.

What's the output of bazel info release?

Downloaded and installed 1.2.1:

release 1.2.1

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

$ git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
https://github.com/tensorflow/tensorflow
6c902370140cb4b5c9ae69f440e16e5ace55b828
6c902370140cb4b5c9ae69f440e16e5ace55b828

Have you found anything relevant by searching the web?

No, but there is an internal discussion thread.
@laszlocsomor recommended that I create a GitHub issue and point him and @aehlig to it.

Any other information, logs, or outputs that you want to share?

I can attach profiler outputs on request.

@gunan
Author

gunan commented Jan 3, 2020

Here are the profiler outputs on Linux and Windows.

linux_simplest_case.profile.gz
windows_build_more_targets.profile.gz

@laszlocsomor laszlocsomor self-assigned this Jan 3, 2020
@laszlocsomor laszlocsomor added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged labels Jan 3, 2020
@laszlocsomor
Contributor

Successfully repro'd on Linux, at tensorflow/tensorflow@6c90237, with Bazel 1.2.1 and Python 3.7:

bazel --nohome_rc --nosystem_rc build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

Using --repository_cache doesn't help.

@gunan
Author

gunan commented Jan 3, 2020

I have just tried another thing: I created a new repository with only the LLVM workspace dependency.
Looks like this one only downloads LLVM once:
https://github.com/gunan/bazel_test

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm_tf.profile.gz --nobuild @llvm-project//:count

llvm.profile.gz

@gunan
Author

gunan commented Jan 3, 2020

I think I got it!
In my repository, I added just one "build_file" attribute to my workspace rule.
However, if I instead set it using our "additional_build_files" attribute, I get multiple LLVM extractions:
llvm_vacuum.profile.gz

Here is the implementation of it:
https://github.com/tensorflow/tensorflow/blob/master/third_party/repo.bzl#L121

However, I do not understand why Bazel does this multiple times. Is this an artifact of ctx.symlink?
Do you think this is a bug?

@laszlocsomor
Contributor

Interesting! What does the new WORKSPACE file (with additional_build_files) look like?

@laszlocsomor
Contributor

@meteorcloudy found the root cause:

It is caused by converting a label to a path: when the FileValue referenced by a label is not already loaded in Skyframe, the whole SkylarkRepositoryFunction will get restarted. See

if (fileValue == null) {
  throw RepositoryFunction.restart();
}

and

// A dependency is missing, cleanup and returns null
try {
  if (outputDirectory.exists()) {
    outputDirectory.deleteTree();
  }
} catch (IOException e1) {
  throw new RepositoryFunctionException(e1, Transience.TRANSIENT);
}
return null;

Therefore, the re-extraction is triggered again. ctx.symlink calls the getPathFromLabel function, so it triggers a restart for every new label Bazel hasn't seen before.
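To make the failure mode concrete, a minimal repository rule shaped roughly like TensorFlow's (the names and attributes below are illustrative, not the actual repo.bzl) runs into it like this:

    def _llvm_archive_impl(ctx):
        # Expensive step: re-runs on every restart of this function.
        ctx.download_and_extract(url = ctx.attr.urls, sha256 = ctx.attr.sha256)

        # The dict keys are plain strings, so Bazel cannot tell from the
        # attributes alone which labels the rule will use. The first time each
        # label is converted to a path, its FileValue may be missing from
        # Skyframe; the function then restarts, the output directory is
        # deleted, and the extraction above runs again.
        for build_file, target in ctx.attr.additional_build_files.items():
            ctx.symlink(Label(build_file), target)

    llvm_archive = repository_rule(
        implementation = _llvm_archive_impl,
        attrs = {
            "urls": attr.string_list(),
            "sha256": attr.string(),
            "additional_build_files": attr.string_dict(),
        },
    )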

@meteorcloudy
Member

The workaround is to manually force the label -> path conversion before the extraction:

    if ctx.attr.additional_build_files:
        for internal_src in ctx.attr.additional_build_files:
            # Resolving each label to a path up front loads its FileValue into
            # Skyframe, so any restarts happen here, before the expensive
            # download and extraction.
            _ = ctx.path(Label(internal_src))

The correct fix would require changes in both TF and Bazel:

First, in repo.bzl, we should use label_keyed_string_dict for additional_build_files (instead of string_dict) so that Bazel knows which labels will be used just by checking the attributes.

Second, we have to enforce the label -> path conversion for label_keyed_string_dict in Bazel, at enforceLabelAttributes.
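On the TF side, that first change would look roughly like this (a sketch only; the rule and attribute names are illustrative, not the actual repo.bzl):

    llvm_archive = repository_rule(
        implementation = _llvm_archive_impl,
        attrs = {
            "urls": attr.string_list(),
            "sha256": attr.string(),
            # Declaring the keys as labels (instead of plain strings) lets
            # Bazel see, from the attributes alone, which files the rule
            # depends on, so they can be resolved before the implementation
            # function runs.
            "additional_build_files": attr.label_keyed_string_dict(),
        },
    )

The Bazel-side change would then teach enforceLabelAttributes to pre-resolve this attribute type as well.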

@meteorcloudy meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Jan 3, 2020
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Jun 15, 2020
@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Nov 29, 2021
@meisterT meisterT added untriaged and removed P2 We'll consider working on this in future. (Assignee optional) labels Jun 23, 2022
@meisterT
Member

This behavior is back:
[image attachment]

Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

@meisterT meisterT reopened this Jun 23, 2022
@meteorcloudy
Member

The most recent change to fix this one is: tensorflow/tensorflow@8ba6136

@meteorcloudy
Member

@meteorcloudy
Member

@joker-eph Can you please fix this? Basically, to avoid re-extracting the archive, you'll have to resolve all labels before calling ctx.download_and_extract. See the description in tensorflow/tensorflow@8ba6136

@haxorz
Contributor

haxorz commented Jun 23, 2022

[@meisterT] Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

@meteorcloudy and the rest of team-ExternalDeps: Check out SkyFunction.Environment#getState. Can that be used to solve this problem in a principled manner? It was added recently (2022Q1), so it didn't exist back when this issue was first filed.

@Wyverald
Member

We discussed this new API briefly today and thought it might be hard because this restart happens in the middle of Starlark evaluation, so the state would need to include Starlark stackframes, program counter, etc.

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jun 23, 2022
@Wyverald
Member

Thanks for the report! Could you please upload the full file? It would be nice to see where the worker threads are stuck, or if they're even alive.

@lberki
Contributor

lberki commented Jun 27, 2023

Please do upload the whole file somewhere (for example here; this very text box says that you can attach a file by dragging and dropping it onto it). This one thread shows that it's hanging while trying to download a repository, but that's also apparent from the terminal output above.

@Wyverald
Member

So after a brief discussion with @fmeum, I think this is potentially caused by the skyframe evaluator being starved of threads. If we have a lot of repos being fetched (up to the size of the skyframe evaluator thread pool), and they all depend on another repo that hasn't started being fetched, we'd run into a deadlock.

I'm actually not sure how to resolve this, short of migrating skyframe to virtual threads.
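For intuition, the same shape of deadlock is easy to reproduce outside Bazel. Here is a tiny, purely illustrative Python sketch (nothing Bazel-specific): every worker in a bounded pool blocks on a result that only a still-queued task can produce, so nothing ever finishes:

    import concurrent.futures

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    dep = concurrent.futures.Future()  # stands in for the not-yet-fetched dependency

    def fetch_repo():
        # Each "repo fetch" blocks until the dependency is available.
        return dep.result()

    # Both workers are now occupied by blocked fetches.
    blocked = [pool.submit(fetch_repo) for _ in range(2)]

    # The task that would provide the dependency is queued behind them and
    # never gets a worker thread: this program hangs forever.
    pool.submit(lambda: dep.set_result("dep"))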

@Wyverald
Member

Hang on, I think I had a brain fart. The only reason we might be starved of Skyframe evaluator threads here is if the worker thread pool is exhausted (so new repo fetch requests are blocked on starting a worker thread at all). So just having the worker thread pool use virtual threads should be enough to untangle all this.

@lberki
Contributor

lberki commented Jun 28, 2023

Hah, I started writing a response to exactly that effect, but you were faster with the brain Febreeze :) This scenario is certainly plausible. I can think of two other, more immediate solutions:

  1. Make the size of the worker threadpool adjustable using a command line flag
  2. Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads). This should not be a problem because these threads are not usually runnable (so they don't put undue load on the scheduler) and in the brave new 64-bit world, we don't need to be worried about address space exhaustion, either. It's true that each thread requires some memory, but as long as their stack is not too deep, that shouldn't be a problem.

@Wyverald
Member

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

@lberki
Contributor

lberki commented Jun 28, 2023

I don't know off the bat, but if the thread pool grows on demand, how is a deadlock possible?

@Wyverald
Member

Because it's limited to 100 right now :) It grows on demand up to the limit of 100. What you said was simply to have it grow on demand up to a limit of 10K.

@lberki
Contributor

lberki commented Jun 28, 2023

Oh, I thought you meant that it's already supposed to grow above 100. If not, upping the limit is still viable.

@amishra-u
Contributor

Sorry, I got busy with something else. Also, the file size exceeds the GitHub limit, so here is the Drive link:
https://drive.google.com/file/d/1s5ZUm8Jtkz77eZBlFe7eaVt9ROcLA-Vp/view?usp=drive_link

@lberki
Contributor

lberki commented Jun 29, 2023

@amishra-u you successfully nerd sniped me, but anyway, the thread dumps make it obvious that it's as @Wyverald says: all the threads in the thread pool are taken, thus repo fetching can make no progress.

@meisterT
Member

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

How much of the work is CPU bound?

@lberki
Contributor

lberki commented Jun 29, 2023

Technically, it's fetching Starlark repositories so people can write any Starlark code, including mining Bitcoin, but I imagine it's mostly I/O (file system or network access).

I don't think there is a way to tell the JVM "keep these zillion threads but only ever execute N of them at the same time", is there? (I mean, other than Loom)

I could imagine shenanigans like having a semaphore which we acquire when we start a new repository thread and then release-and-acquire-again around blocking operations, which would have the desired effect, but I'm not sure if it's worth doing this just for the platform-thread implementation of repository workers because Loom is coming soon. It would be a fun exercise, though!
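A rough sketch of that semaphore idea, in Python for brevity (purely illustrative, not actual Bazel code): a permit is held only while a worker is runnable, and is handed back around blocking operations so that blocked workers don't count against the limit:

    import threading

    MAX_RUNNABLE = 8  # illustrative cap on concurrently runnable workers
    runnable = threading.Semaphore(MAX_RUNNABLE)

    def repository_worker(evaluate_starlark, blocking_fetch):
        runnable.acquire()            # held while executing Starlark
        try:
            evaluate_starlark()
            runnable.release()        # hand the permit back before blocking...
            try:
                blocking_fetch()      # ...on network or file-system I/O
            finally:
                runnable.acquire()    # ...and reclaim it before running Starlark again
            evaluate_starlark()
        finally:
            runnable.release()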

@fmeum
Collaborator

fmeum commented Jun 29, 2023

Isn't the number of "active" repo function worker threads bounded by the number of Skyframe threads? The only way a worker thread can become unattached from a Skyframe thread seems to be when it requires a dep that hasn't been evaluated yet. In that case it will block until another Skyframe thread picks it up after the dep has been evaluated.

Maybe we could get away with an effectively unbounded thread pool?

@lberki
Contributor

lberki commented Jun 29, 2023

Nope. Imagine the case of having one Skyframe thread but unlimited repo worker threads: in this case, the Skyframe thread would start fetching a repository and spawn a worker thread. Then, when a new Skyframe dependency needs to be fetched, the worker thread would be suspended, and the Skyframe thread would continue executing and start processing the dependency. That dependency can itself be another repository, in which case the single Skyframe thread would start fetching it, spawn another worker thread, etc...

@Wyverald
Member

@lberki, you might have missed the word "active" in @fmeum's comment. In the scenario you described, there's only ever one worker thread that's not asleep.

How much of the work is CPU bound?

Normally, very very little, as Lukács described. You could technically mine bitcoin using Starlark but I imagine you'd rather write the miner in another language and call repository_ctx.execute on it instead...

copybara-service bot pushed a commit that referenced this issue Aug 4, 2023
… fetching.

Previously, the first markerData map was passed into the worker thread, then, on a Skyframe restart, RepositoryFunction would create a new one which was left empty.

Work towards #10515 .

RELNOTES: None.
PiperOrigin-RevId: 553704893
Change-Id: I57fd03d478f727ec7d38beb09da8148150e71fa3
copybara-service bot pushed a commit that referenced this issue Aug 7, 2023
Not doing this caused test_interrupted_children_waited to fail with --experimental_worker_for_repo_fetching=platform because SkyFunction states were not closed when the SkyFunction was interrupted in flight, which caused processes spawned by repository fetching not to be interrupted in turn.

Work towards #10515 .

RELNOTES: None.
PiperOrigin-RevId: 554477869
Change-Id: Ie6d2289b5bfdc7f4a73fc9dd29a40a5fa79b6269
@Wyverald
Member

An update: we're likely to get greenlit to start using JDK 21 features (read: Loom) by the end of January, so this issue may be fixed once and for all starting from 7.1.0.

copybara-service bot pushed a commit that referenced this issue Jan 24, 2024
…rtual`

Bazel already ships JDK 21, but the codebase can't use JDK 21 features (language OR runtime) because Google hasn't migrated to JDK 21 yet (soon TM). But we can cheat a bit using reflection!

Work towards #10515

PiperOrigin-RevId: 601188027
Change-Id: I65c1712da0347bef40fc4af94bad4c211687a8b7
@Wyverald
Member

Update to the update: JDK 21 language/library features may come to Bazel later than expected, anyway (the unexpected is always to be expected, as they say). But! 112d3b6 just landed on HEAD, which allows us to use Loom to eliminate restarts (--experimental_worker_for_repo_fetching=virtual). Feel free to try it out (especially to see if it solves the deadlock issue seen by @amishra-u). If all goes well, we'll turn this on by default in 7.1.0.
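For anyone who wants to try it on the original reproduction, an invocation along these lines should exercise the new code path (the flag is the one mentioned above; the target and --nobuild come from the original report):

bazel build --experimental_worker_for_repo_fetching=virtual --nobuild @llvm-project//llvm:FileCheck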

@Wyverald
Member

@bazel-io fork 7.1.0

Wyverald added a commit that referenced this issue Jan 26, 2024
…rtual`

Bazel already ships JDK 21, but the codebase can't use JDK 21 features (language OR runtime) because Google hasn't migrated to JDK 21 yet (soon TM). But we can cheat a bit using reflection!

Work towards #10515

PiperOrigin-RevId: 601188027
Change-Id: I65c1712da0347bef40fc4af94bad4c211687a8b7
Wyverald added a commit that referenced this issue Jan 26, 2024
We use virtual threads if available (JDK 21+); if not, we fall back to not using worker threads at all. This approach is slightly safer than straight up setting the default to `virtual` as that would break certain obscure platforms without an available JDK 21. The plan is to cherry-pick this into 7.1.0.

Also removed test_reexecute in bazel_workspaces_test since, well, that's the whole point!

Fixes #10515

RELNOTES: The flag `--experimental_worker_for_repo_fetching` now defaults to `auto`, which uses virtual threads from JDK 21 if it's available. This eliminates restarts during repo fetching.
PiperOrigin-RevId: 601810629
Change-Id: I9d2ec59688586356d90604fdd328cc985e75d1d4
@iancha1992
Member

A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!
