Workspace dependency downloaded 4 times #10515
Here are the profiler outputs on Linux and Windows: linux_simplest_case.profile.gz
Successfully repro'd on Linux, at tensorflow/tensorflow@6c90237, with Bazel 1.2.1 and Python 3.7:
Using
I have just tried another thing.
I think I got it! Here is the implementation of it: However, I do not get why Bazel does this multiple times. Is this an artifact of
Interesting! What does the new WORKSPACE file (with `additional_build_files`) look like?
@meteorcloudy found the root cause: it is caused by converting a label to a path. When the FileValue referenced by a label is not already loaded in Skyframe, the whole SkylarkRepositoryFunction gets restarted. See Lines 1020 to 1022 and Lines 215 to 223 in 6b5ea9d.
Therefore, the re-extraction was triggered again.
The workaround is to manually enforce the label -> path conversion before the extraction.
The correct fix would require changes in both TF and Bazel: First, in Second, we have to enforce the label -> path conversion for
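For readers unfamiliar with Skyframe restarts, here is a minimal, self-contained Java sketch of the pattern being described. The class and method names are invented for illustration and are not Bazel's actual SkyFunction API; the point is only to show why doing the extraction before all labels are resolved repeats the expensive work on every restart, while resolving the labels first avoids it.

```java
import java.nio.file.Path;

/** Hypothetical stand-in for the Skyframe environment; not Bazel's real API. */
interface FakeEnvironment {
  /** Returns null if the value behind this label has not been computed yet. */
  Path getPathOrNull(String label);
}

final class FakeRepoFetchFunction {
  /**
   * Skyframe-style contract: returning null means "a dependency was missing,
   * re-run this function from the top once it is available".
   */
  Object computeBroken(FakeEnvironment env) {
    downloadAndExtractArchive();              // expensive work happens first...
    Path buildFile = env.getPathOrNull("//:llvm.BUILD");
    if (buildFile == null) {
      return null;                            // ...and is wasted: the restart extracts again
    }
    return setUpRepository(buildFile);
  }

  Object computeFixed(FakeEnvironment env) {
    // Resolve every label up front, so any restart happens while the function
    // is still cheap; the extraction then runs exactly once.
    Path buildFile = env.getPathOrNull("//:llvm.BUILD");
    if (buildFile == null) {
      return null;                            // cheap restart, nothing wasted
    }
    downloadAndExtractArchive();              // reached only once all deps are loaded
    return setUpRepository(buildFile);
  }

  private void downloadAndExtractArchive() { /* placeholder for download + extract */ }

  private Object setUpRepository(Path buildFile) {
    return buildFile;                         // placeholder for writing the repo
  }
}
```

The workaround described above amounts to moving from the first shape to the second.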
Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz
The most recent change to fix this one is: tensorflow/tensorflow@8ba6136
My suspicion is that the regression comes from: tensorflow/tensorflow@2d34c7e#diff-b04a6573f1e0d5f36e3d499b124bf500b03c3887ef54f41f29d6ae454324cfb7
@joker-eph Can you please fix this? Basically, to avoid re-extracting the archive, you'll have to resolve all labels before the
@meteorcloudy and the rest of
We discussed this new API briefly today and thought it might be hard because this restart happens in the middle of Starlark evaluation, so the state would need to include Starlark stack frames, program counter, etc.
…ing each times. bazelbuild/bazel#10515 PiperOrigin-RevId: 456870895
Thanks for the report! Could you please upload the full file? It would be nice to see where the worker threads are stuck, or if they're even alive.
Please do upload the whole file somewhere (for example here: this very text box says that you can attach a file by dragging and dropping it onto it). This one thread shows that it's hanging while trying to download a repository, but that's also apparent from the terminal output above.
So after a brief discussion with @fmeum, I think this is potentially caused by the Skyframe evaluator being starved of threads. If we have a lot of repos being fetched (up to the size of the Skyframe evaluator thread pool), and they all depend on another repo that hasn't started being fetched, we'd run into a deadlock. I'm actually not sure how to resolve this, short of migrating Skyframe to virtual threads.
Hang on, I think I had a brain fart. The only reason we might be starved of Skyframe evaluator threads here is if the worker thread pool is exhausted (so new repo fetch requests are blocked on starting a worker thread at all). So just having the worker thread pool use virtual threads should be enough to untangle all this.
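For concreteness, here is a minimal sketch, assuming JDK 21, of the kind of virtual-thread-per-task worker pool being proposed; the repository names and the task body are placeholders, not Bazel code.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualWorkerPoolSketch {
  public static void main(String[] args) {
    // Each submitted repo fetch gets its own virtual thread, so a fetch that
    // blocks waiting for another repo cannot exhaust a fixed-size pool.
    try (ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor()) {
      for (String repo : List.of("llvm", "abseil", "protobuf")) {
        workers.submit(() -> fetchRepo(repo));
      }
    } // close() waits for the submitted tasks to finish

    System.out.println("all fetches done");
  }

  private static void fetchRepo(String name) {
    // Placeholder for the blocking download/extract work.
    System.out.println("fetching " + name + " on " + Thread.currentThread());
  }
}
```

Because a blocked fetch only parks a cheap virtual thread, waiting on another repository no longer ties up one of a fixed number of platform threads.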
Hah, I started writing a response to exactly that effect, but you were faster with the brain Febreeze :) This scenario is certainly plausible. I can think of two other, more immediate solutions:
Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand. @meisterT, thoughts on upping the size of the worker thread pool to such numbers?
I don't know off the bat, but if the thread pool grows on demand, how is a deadlock possible?
Because it's limited to 100 right now :) It grows on demand up to the limit of 100. What you said was simply to have it grow on demand up to the limit of 10K.
Oh, I thought you meant that it's already supposed to grow above 100. If not, upping the limit is still viable.
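To illustrate "grows on demand up to a limit" with standard java.util.concurrent behavior (this is not necessarily the exact configuration Bazel's worker pool uses): a ThreadPoolExecutor with a SynchronousQueue only creates threads as tasks arrive, and never more than its maximum at once, which is why 100 simultaneously blocked workers can stall all further repo fetching.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GrowOnDemandPoolSketch {
  public static void main(String[] args) {
    // Core size 0, max 100: threads are created lazily as tasks show up, and
    // at most 100 ever exist. Raising the maximum (e.g. to 10_000) is the
    // "just up the limit" option discussed above. Once all 100 threads are
    // busy, further submissions are rejected rather than queued.
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            /* corePoolSize= */ 0,
            /* maximumPoolSize= */ 100,
            /* keepAliveTime= */ 60, TimeUnit.SECONDS,
            new SynchronousQueue<>());

    for (int i = 0; i < 10; i++) {
      pool.execute(() -> doBlockingFetch());
    }
    System.out.println("threads created so far: " + pool.getPoolSize());
    pool.shutdown();
  }

  private static void doBlockingFetch() {
    try {
      Thread.sleep(100); // stand-in for blocking I/O
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```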
Sorry, I got busy with something else. Also, the file size exceeds the GitHub limit, so here is the drive link.
@amishra-u you successfully nerd-sniped me, but anyway, the thread dumps make it obvious that it's as @Wyverald says: all the threads in the thread pool are taken, thus repo fetching can make no progress.
How much of the work is CPU bound?
Technically, it's fetching Starlark repositories, so people can write any Starlark code, including mining Bitcoin, but I imagine it's mostly I/O (file system or network access). I don't think there is a way to tell the JVM "keep these zillion threads but only ever execute N of them at the same time", is there? (I mean, other than Loom.) I could imagine shenanigans like having a semaphore which we acquire when we start a new repository thread, then release-and-acquire-again around blocking operations, which would have the desired effect, but I'm not sure it's worth doing this just for the platform-thread implementation of repository workers because Loom is coming soon. It would be a fun exercise, though!
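A minimal sketch of that semaphore idea (purely illustrative; none of this exists in Bazel, and the method parameters are placeholders): a permit is held while a repository worker runs on-CPU and is handed back around blocking calls, so many threads can exist but only N execute at a time.

```java
import java.util.concurrent.Semaphore;

public class BoundedExecutionSketch {
  // Only this many threads may execute (non-blocking) work concurrently.
  private static final Semaphore CPU_PERMITS = new Semaphore(8);

  /** Body of one repository worker thread. */
  static void repoWorker(Runnable blockingDownload, Runnable starlarkStep)
      throws InterruptedException {
    CPU_PERMITS.acquire();            // taken when the worker thread starts
    try {
      starlarkStep.run();             // on-CPU work, permit held
    } finally {
      CPU_PERMITS.release();          // released around the blocking call...
    }

    blockingDownload.run();           // ...so others can run while we wait on I/O

    CPU_PERMITS.acquire();            // ...and re-acquired afterwards
    try {
      starlarkStep.run();             // more on-CPU work
    } finally {
      CPU_PERMITS.release();
    }
  }
}
```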
Isn't the number of "active" repo function worker threads bounded by the number of Skyframe threads? The only way a worker thread can become unattached from a Skyframe thread seems to be when it requires a dep that hasn't been evaluated yet. In that case it will block until another Skyframe thread picks it up after the dep has been evaluated. Maybe we could get away with an effectively unbounded thread pool?
Nope. Imagine the case of having one Skyframe thread but unlimited repo worker threads: in this case, the Skyframe thread would start fetching a repository and spawn a worker thread. Then, when a new Skyframe dependency needs to be fetched, the worker thread would be suspended, the Skyframe thread would continue executing and start processing the dependency. That dependency can itself be another repository, in which case the single Skyframe thread would start fetching it, spawn another worker thread, etc.
@lberki, you might have missed the word "active" in @fmeum's comment. In the scenario you described, there's only ever one worker thread that's not asleep.
Normally, very very little, as Lukács described. You could technically mine Bitcoin using Starlark, but I imagine you'd rather write the miner in another language and call
… fetching. Previously, the first markerData map was passed into the worker thread, then, on a Skyframe restart, RepositoryFunction would create a new one which was left empty. Work towards #10515 . RELNOTES: None. PiperOrigin-RevId: 553704893 Change-Id: I57fd03d478f727ec7d38beb09da8148150e71fa3
Not doing this caused test_interrupted_children_waited to fail with --experimental_worker_for_repo_fetching=platform because SkyFunction states were not closed when the SkyFunction was interrupted in flight, which caused processes spawned by repository fetching not to be interrupted in turn. Work towards #10515 . RELNOTES: None. PiperOrigin-RevId: 554477869 Change-Id: Ie6d2289b5bfdc7f4a73fc9dd29a40a5fa79b6269
An update: we're likely to get greenlit to start using JDK 21 features (read: Loom) by the end of January, so this issue may be fixed once and for all starting from 7.1.0.
…rtual` Bazel already ships JDK 21, but the codebase can't use JDK 21 features (language OR runtime) because Google hasn't migrated to JDK 21 yet (soon TM). But we can cheat a bit using reflection! Work towards #10515 PiperOrigin-RevId: 601188027 Change-Id: I65c1712da0347bef40fc4af94bad4c211687a8b7
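Roughly, the reflection trick looks like the sketch below. The helper is hypothetical and is not the code that actually landed, but `Executors.newVirtualThreadPerTaskExecutor` is the real JDK 21 factory: resolving it at runtime keeps the code compiling on older JDKs, and the lookup failing cleanly provides the fallback path.

```java
import java.lang.reflect.Method;
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class VirtualThreadsViaReflection {
  /**
   * Returns a virtual-thread-per-task executor if the running JDK has one
   * (JDK 21+), or Optional.empty() so the caller can fall back to the old
   * restart-based code path.
   */
  public static Optional<ExecutorService> tryVirtualExecutor() {
    try {
      // Resolved at runtime, so this class still compiles on older JDKs.
      Method factory = Executors.class.getMethod("newVirtualThreadPerTaskExecutor");
      return Optional.of((ExecutorService) factory.invoke(null));
    } catch (ReflectiveOperationException e) {
      return Optional.empty(); // no virtual threads available on this JVM
    }
  }
}
```

The same runtime check is what allows an `auto` default to pick virtual threads only when the running JDK actually provides them.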
Update to the update: JDK 21 language/library features may come to Bazel later than expected anyway (the unexpected is always to be expected, as they say). But! 112d3b6 just landed on HEAD, which allows us to use Loom to eliminate restarts (
@bazel-io fork 7.1.0
We use virtual threads if available (JDK 21+); if not, we fall back to not using worker threads at all. This approach is slightly safer than straight up setting the default to `virtual` as that would break certain obscure platforms without an available JDK 21. The plan is to cherry-pick this into 7.1.0. Also removed test_reexecute in bazel_workspaces_test since, well, that's the whole point! Fixes #10515 RELNOTES: The flag `--experimental_worker_for_repo_fetching` now defaults to `auto`, which uses virtual threads from JDK 21 if it's available. This eliminates restarts during repo fetching. PiperOrigin-RevId: 601810629 Change-Id: I9d2ec59688586356d90604fdd328cc985e75d1d4
A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!
#10071
Description of the problem / feature request:
When building TensorFlow and looking at the JSON profile, it looks like the LLVM dependency is downloaded and extracted 4 times.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Run the following commands:
(Unfortunately, there may be some dependencies to be installed here.)
then
What operating system are you running Bazel on?
Reproduced on Linux and Windows.
What's the output of bazel info release?
Downloaded and installed 1.2.1:
release 1.2.1
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
Have you found anything relevant by searching the web?
No, but there is an internal discussion thread.
I was recommended by @laszlocsomor to create a GitHub issue and point him and @aehlig to it.
Any other information, logs, or outputs that you want to share?
I can attach profiler outputs on request.