Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch RepoSpecs in parallel #19294

Closed
wants to merge 1 commit into from
Closed

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Aug 22, 2023

Computing RepoSpecs for all selected Bazel modules is on the critical path for the computation of the main repository mapping and thus benefits from parallelized downloads.

On my machine, this change has the following effect on bazel build //src:bazel-dev --enable_bzlmod --nobuild:

before
compute main repo mapping: 8s 127ms

after
compute main repo mapping: 4s 226ms

@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. labels Aug 22, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 22, 2023

I briefly looked into whether the repo spec computation could be moved out of the compute main repo mapping's path, but couldn't figure it out.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 22, 2023

The flaky test may be related to this PR.

entry.getKey(),
moduleFromInterimModule(
entry.getValue(), overrides.get(entry.getKey().getName()), eventHandler));
ExecutorService executorService = Executors.newWorkStealingPool(8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all I'll admit I'm very unfamiliar with the Java parallel stream stuff, so apologies if I make any rookie mistakes here :)

From my read of things, it looks like we don't even need to have this executor service at all? We could just directly call collect on a parallel stream and Java would use the global fork join pool (ForkJoinPool.commonPool()) underneath. That's probably better than using a dedicated pool (as the global fork join pool is also used to carry virtual threads).

But I'm not sure what happens when an exception is thrown in that case. In sequential streams, I think the exception just gets propagated out, so we just need to "tunnel" it. Do you know what happens in parallel streams?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with using the common pool is that it has NUM_CPUS - 1 threads by default, which can interact badly with other parallel operations relying on it, especially since we are blocking on IO.

Exceptions in parallel streams are also thrown by the terminal operation. I think that the first throw "wins", but I'm not sure about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I did some more reading on parallel streams and I increasingly feel like they're not a great fit for our use case (maybe I just don't understand them well enough). My main gripe is the fact that it tries to do all the MapReduce-esque splitting and merging which is basically completely unnecessary here where we're just doing a map and basically no reduce. Plus the concurrency primitives are completely hidden and it works "magically" depending on whether you're already in a ForkJoinPool or not; and this doesn't seem to be documented anywhere in the official JDK -- I had to believe StackOverflow. (For example, I'm not confident at all whether the introduction of Loom would automatically cause parallel stream processing to use virtual threads.)

On the other hand, I'll admit that the magic isn't all bad -- just slapping .parallel() on the stream and having it work is pretty neat.

I'd probably have preferred to use Skyframe for this; that is, create a RepoSpecFunction/Value and just give everything to Skyframe. To avoid the restart, we could use a SkyKeyComputeState to store everything right before the repo spec computation step. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @lberki

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the lack of official documentation on parallel streams' internals really hurts their usefulness. I can only hope that this explained by the desire to switch them over to a virtual thread executor by default once available - one can dream.

I don't quite see the problem with us not using the full feature set of a MapReduce. Isn't turning a list of values into a LinkedHashMap still an (associative) reduction? But we do need to believe StackOverflow that streams are just relying on ForkJoinPool#fork and thus just depens on the ForkJoinPool implementation that we started with (#commonPool or a custom one).

I am fine with switching to Skyframe here though, especially as it avoids anyone reading the code from having to reason about another concurrency framework.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a way to extract this work into a SkyFunction, please take a look. I am a bit unsure about how to implement Registry's equality correctly, which made me worried about using it in a SkyKey.

Copy link
Contributor

@lberki lberki Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After spending like 10 seconds looking at this change, I don't think Registry is appropriate to be included in a SkyKey: SkyKeys are supposed to be small and serializable value objects, and IndexRegistry keeps references to a DownloadManager and a Gson instance, which means that they are not value objects at all.

From a quick glance, it looks like each Registry instance comes from a URI in ModuleFileFunction.REGISTRIES so maybe the URI could be used in the SkyKey instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trimmed this down to a ModuleKey and a String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that sounds much nicer. I'll leave the review to @Wyverald since he is much more familiar with the area (unless he indicates that he could use my brain)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite see the problem with us not using the full feature set of a MapReduce. Isn't turning a list of values into a LinkedHashMap still an (associative) reduction?

That was poorly worded on my part -- what I wanted to say was that the extra considerations about splitting the stream into appropriately sized substreams so that it could be sent to child ForkJoinTasks and then merging the results together are kind of wasteful. Yes technically a "map" operation is a "reduce" operation itself too, but we really don't need a tree structure for this. Just linearly going over the keys and sending each one off for a download will suffice.

So I'm very happy with the new Skyframe-based implementation :)

@fmeum fmeum force-pushed the bzlmod-futures branch 2 times, most recently from d1dfd64 to df70c95 Compare August 23, 2023 10:18
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 23, 2023

@Wyverald I stacked this onto the profiling PR for the moment.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 23, 2023

@lberki Do you happen to know how the value of HttpDownloader#MAX_PARALLEL_DOWNLOADS was determined? Increasing it to 16 saves another 0.5s (out of 4.5s) for me.

@lberki
Copy link
Contributor

lberki commented Aug 23, 2023

Nope, sorry. I checked the change history but there doesn't seem to be any indication as to how the author of the change that added the parallelism arrived at that number. Naively, 8 seems to be both too small and too unconfigurable, but I'm saying this with literally zero research, so that it with a grain of salt.

@fmeum fmeum requested a review from Wyverald August 23, 2023 11:54
@fmeum fmeum requested a review from a team as a code owner August 23, 2023 12:15
@fmeum fmeum requested review from aranguyen and removed request for a team August 23, 2023 12:15
@Wyverald
Copy link
Member

Also please rebase; the profiling PR is merged :)

@@ -158,6 +158,7 @@ public final class SkyFunctions {
SkyFunctionName.createHermetic("BAZEL_DEP_GRAPH");
public static final SkyFunctionName BAZEL_LOCK_FILE =
SkyFunctionName.createHermetic("BAZEL_LOCK_FILE");
public static final SkyFunctionName REPO_SPEC = SkyFunctionName.createNonHermetic("REPO_SPEC");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation: I noticed that BAZEL_MODULE_RESOLUTION was marked hermetic even though it probably wasn't before this PR (due to the registry fetches).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that does look like an oversight. However in practice this shouldn't matter much since we assume existing registry contents are immutable anyway.

@fmeum fmeum requested a review from Wyverald August 23, 2023 19:40
Computing `RepoSpec`s for all selected Bazel modules is on the critical
path for the computation of the main repository mapping and thus
benefits from parallelized downloads.

On my machine, this change has the following effect on
`bazel build //src:bazel-dev --enable_bzlmod --nobuild`:

```
before
compute main repo mapping: 8s 127ms

after
compute main repo mapping: 4s 226ms
```
Copy link
Member

@Wyverald Wyverald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Aug 23, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 23, 2023

Thanks for pushing me towards the Skyframe solution. A side effect of that is that trivial changes to the root module file now no longer result in any network I/O (assuming the Skyframe cache is retained during invocations).

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Aug 24, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 24, 2023

@bazel-io flag

@fmeum fmeum deleted the bzlmod-futures branch August 24, 2023 18:46
@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 24, 2023
@Wyverald
Copy link
Member

@bazel-io fork 6.4.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 24, 2023
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Aug 24, 2023
Computing `RepoSpec`s for all selected Bazel modules is on the critical path for the computation of the main repository mapping and thus benefits from parallelized downloads.

On my machine, this change has the following effect on `bazel build //src:bazel-dev --enable_bzlmod --nobuild`:

```
before
compute main repo mapping: 8s 127ms

after
compute main repo mapping: 4s 226ms
```

Closes bazelbuild#19294.

PiperOrigin-RevId: 559819452
Change-Id: Ieef957fcfe402c909d2863ba4a4ca3540781a56d
fmeum added a commit to fmeum/bazel that referenced this pull request Aug 28, 2023
Computing `RepoSpec`s for all selected Bazel modules is on the critical path for the computation of the main repository mapping and thus benefits from parallelized downloads.

On my machine, this change has the following effect on `bazel build //src:bazel-dev --enable_bzlmod --nobuild`:

```
before
compute main repo mapping: 8s 127ms

after
compute main repo mapping: 4s 226ms
```

Closes bazelbuild#19294.

PiperOrigin-RevId: 559819452
Change-Id: Ieef957fcfe402c909d2863ba4a4ca3540781a56d
iancha1992 pushed a commit that referenced this pull request Aug 29, 2023
Computing `RepoSpec`s for all selected Bazel modules is on the critical
path for the computation of the main repository mapping and thus
benefits from parallelized downloads.

On my machine, this change has the following effect on `bazel build
//src:bazel-dev --enable_bzlmod --nobuild`:

```
before
compute main repo mapping: 8s 127ms

after
compute main repo mapping: 4s 226ms
```

Closes #19294.

Commit
8a68310

PiperOrigin-RevId: 559819452
Change-Id: Ieef957fcfe402c909d2863ba4a4ca3540781a56d
@iancha1992
Copy link
Member

The changes in this PR have been included in Bazel 6.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=last_rc.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants