Remote concurrency limited by gRPC connections #11801

Closed
EricBurnett opened this issue Jul 17, 2020 · 9 comments
Labels: team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: feature request, untriaged

@EricBurnett

Description of the problem / feature request:

For remote RPCs, bazel only opens one connection per server (resolved IP). This constrains the maximum parallelism of a build - gRPC limits the number of in-flight requests per connection to MAX_CONCURRENT_STREAMS, which is a per-server HTTP2 setting, often set to 100 or 128.

In short, this means that when talking to a server resolved to a single IP, bazel can only have ~100 remote executions in flight, regardless of the setting of --jobs. If the server resolves to more IPs, this may be linearly increased, but rarely above 500. Most remote builds can exceed 100x parallelism, and many (esp. builds with long-duration actions) could exceed 500.
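
To make the arithmetic concrete, here is a back-of-the-envelope sketch (the values are assumed typical defaults, not measurements from any particular server):

```java
/** Back-of-the-envelope sketch of the parallelism ceiling described above. */
final class ParallelismCeiling {
  public static void main(String[] args) {
    int maxConcurrentStreams = 100; // common HTTP/2 MAX_CONCURRENT_STREAMS setting
    int resolvedIps = 1;            // bazel opens one connection per resolved IP
    int jobs = 1000;                // what the user asked for with --jobs=1000

    // Regardless of --jobs, in-flight remote actions are capped by streams * connections.
    int effectiveParallelism = Math.min(jobs, maxConcurrentStreams * resolvedIps);
    System.out.println(effectiveParallelism); // 100
  }
}
```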

Note that because bazel displays full parallelism even when actions are actually queued waiting for gRPC connections, it's hard to know just how many users this affects - more than realize it, I'm quite sure. But at this point I've had at least 3 independent projects root-cause their limited parallelism to this issue, so it seems to be fairly frequent in reality (at least amongst RBE users).

The necessary fix in bazel is to replace the netty 'round_robin' load balancer with one that can open more than 1 connection (subchannel) per resolved address. This could be done by dynamically adding connections when bottlenecked, but simply feeding in a target number of connections such as max(2, jobs/50) would probably do just fine.
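
As a rough illustration of that direction, here is a hypothetical sketch using the public grpc-java API (class and field names are made up; this is not Bazel's actual remote module code):

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical fixed-size channel pool sized by the max(2, jobs/50) heuristic. */
final class RemoteChannelPool {
  private final List<ManagedChannel> channels = new ArrayList<>();
  private final AtomicLong counter = new AtomicLong();

  RemoteChannelPool(String target, int jobs) {
    // Roughly one connection per 50 concurrent jobs, and never fewer than 2.
    int poolSize = Math.max(2, jobs / 50);
    for (int i = 0; i < poolSize; i++) {
      channels.add(NettyChannelBuilder.forTarget(target).usePlaintext().build());
    }
  }

  /** Round-robins across the pool so no single HTTP/2 connection is the bottleneck. */
  ManagedChannel pick() {
    return channels.get((int) (counter.getAndIncrement() % channels.size()));
  }
}
```

With --jobs=1000 this would hold 20 connections, each allowing ~100 concurrent streams against a typical server, so roughly 2000 in-flight RPCs of headroom.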

Feature requests: what underlying problem are you trying to solve with this feature?

Enable higher build parallelism for remote builds

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run a remote build with --jobs=1000 and observe bazel showing '1000 remote'. Then check from the server's perspective and see that bazel never had more than a few hundred RPCs in flight.

What operating system are you running Bazel on?

Linux (but applies to all)

What's the output of bazel info release?

3.4.1

Have you found anything relevant by searching the web?

cc @buchgr @bergsieker @gkousik @ulfjack


gkousik commented Jul 18, 2020

Thanks for filing the bug, @EricBurnett!

A sample repo through which you can reproduce the issue from the client side: https://github.com/gkousik/bazel-parallelism-test

The above Bazel repo contains 500 genrules, each with a random value in --action_env so that it does NOT get cache hits. Each action sleeps for 10 seconds and then generates an output file. When building that repo against a remote-execution pool with 500 bots, I would expect the build to complete in roughly 10 seconds, since all rules are parallelizable and all 500 actions should be sent to RBE at once.

What happens instead is that the build proceeds in batches of roughly 200 actions and takes 30+ seconds to complete, no matter how many times I run it.

```
➜  bazel-parallelism-test git:(master) ✗ time ./runbuild.sh
INFO: Invocation ID: 803973fb-817c-4814-b27e-053086c56174
INFO: Build option --action_env has changed, discarding analysis cache.
DEBUG: /usr/local/google/home/kousikk/.cache/bazel/_bazel_kousikk/07c64333bc485db1757df85f8da29b91/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.2.0 is used in rbe_default.
INFO: Analyzed 500 targets (6 packages loaded, 511 targets configured).
INFO: Found 500 targets...
INFO: Elapsed time: 35.933s, Critical Path: 35.29s
INFO: 500 processes: 500 remote.
INFO: Build completed successfully, 501 total actions
./runbuild.sh  0.02s user 0.04s system 0% cpu 35.982 total
```

oquenchil added the team-Remote-Exec, untriaged, and type: feature request labels on Jul 22, 2020

vmrautio commented Aug 5, 2020

Hitting the same issue. We have a backend that can scale to O(1000s) of concurrent actions, but this limits the parallelism to O(100).

coeuvre self-assigned this on Aug 7, 2020
ulfjack (Contributor) commented Aug 11, 2020

This depends on the server-side configuration. For a server written using grpc-java, the documentation of the NettyServerBuilder says:

> public NettyServerBuilder maxConcurrentCallsPerConnection(int maxCalls)
> The maximum number of concurrent calls permitted for each incoming connection. Defaults to no limit.

See here: https://grpc.github.io/grpc-java/javadoc/io/grpc/netty/NettyServerBuilder.html#maxConcurrentCallsPerConnection-int-

I am not sure Bazel should be circumventing the server-side configuration by creating multiple connections in this case.
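
For reference, this is roughly how a grpc-java server would set that limit explicitly (a minimal sketch; the port and service are placeholders, and by default the builder imposes no limit):

```java
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.io.IOException;

/** Minimal sketch of a grpc-java server that caps concurrent calls per connection. */
final class CappedServer {
  static Server start(BindableService executionService) throws IOException {
    return NettyServerBuilder.forPort(8980)     // placeholder port
        .maxConcurrentCallsPerConnection(100)   // without this call, the default is "no limit"
        .addService(executionService)
        .build()
        .start();
  }
}
```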

EricBurnett (Author) commented:

@ulfjack I don't think REAPI servers are (or should be) using a per-connection limit to indicate to clients the max parallelism they want to support. If the goal is to tell clients an overall limit they want honoured (per IP or endpoint-wide), that should go into Capabilities so that the client can apply it. If it's per-connection, it should not also be interpreted as a limit on the number of connections.

Per @werkt, the default is unlimited and bazel happily works with servers allowing many more RPCs. The most common case I'm aware of for MAX_CONCURRENT_STREAMS to be set is from load balancers, where multiple connections are a legitimate and appropriate way to spread load across multiple LBs and backends. Are you aware of anyone using it to mean something else?

ulfjack (Contributor) commented Aug 12, 2020

Good point. I think one of the goals of HTTP2 was to have only one connection from the client to the server rather than multiple, but that may not apply to gRPC. If you have an LB with a single IP, should the client create multiple connections? The intention of the load balancer is to balance the load to the actual servers, so it isn't immediately obvious to me that you'd need that - the load balancers can terminate the HTTP2 connection and distribute the individual requests to different servers (i.e., an HTTP2 proxy). Do we need to load balance to the load balancers? If so, why would the service not announce multiple IPs?

I appreciate that it may not be possible in some cases to control MAX_CONCURRENT_STREAMS on the server-side when using pre-existing HTTP proxies, so maybe we need something in the protocol to tell the client that they can / should open multiple connections. It's a bit odd that there isn't such a thing in HTTP2 - that would seem the obvious place to put it. Maybe they intentionally didn't want to add such a mechanism because it's hard to enforce on the server-side (in a distributed system)?

Maybe we should ask the gRPC folks what they think?


gkousik commented Aug 12, 2020

@ulfjack In grpc/grpc#21386 (comment), the recommendation from the gRPC folks was to create additional connections, leaning towards "create a pool of connections and round-robin across it" (as opposed to reactively creating additional connections when a certain threshold of concurrent RPCs is reached).

ulfjack (Contributor) commented Aug 12, 2020

Ah, ok. Sorry, I forgot about that in my previous post. What's the right behavior, then? Inspect MAX_CONCURRENT_STREAMS and decide based on that and --jobs how many channels to open? Do we need to add something to the protocol?

EricBurnett (Author) commented Aug 12, 2020

Ulf:

> Do we need to load balance to the load balancers?

I think so - e.g. consider a 64-core, well-networked builder doing a byte-heavy build. My understanding is this will either be throughput-constrained (possibly because most LBs use at most one core per connection, possibly from bottlenecks in HTTP2 itself), or will hotspot the LB (remembering that the bandwidth required of the LB is 2x the bandwidth of the client, since it has to proxy it to the backend as well). I haven't actually sat down and measured it yet though - I know it's hard to exceed ~2Gbps on a single stream, but I haven't yet evaluated our effective per-connection bandwidth limit.

> If so, why would the service not announce multiple IPs?

Not an expert in this area, but my understanding: Anycast and DNS. With DNS caching layers, IP-per-proxy can lead to roving hotspots and generally scales poorly, and with Anycast it's straightforward to put multiple servers behind one IP anyways.

So you'll often see multiple IPs, but a small number like "2" that are functioning more as independent failure domains than having any logical connection to the servers backing them. See e.g. https://aws.amazon.com/blogs/networking-and-content-delivery/using-static-ip-addresses-for-application-load-balancers/ for an example of this pattern.

To be clear, this is not universal: in some setups load balancing with one-connection-per-ip is probably just fine. But even in these scenarios, using a few extra connections (e.g. 2 or 3 connections per IP) should not cause any harm - bazel connections will be comparatively few and heavy vs what you might see from "average" clients on the network (web browsers, apps, etc), and should in no way strain anything on "number of connections" grounds.

> What's the right behavior, then? Inspect MAX_CONCURRENT_STREAMS and decide based on that and --jobs how many channels to open? Do we need to add something to the protocol?

From what we've seen from java gRPC, and heard from the team when we file bugs there, inspecting MAX_CONCURRENT_STREAMS will be hard as it's learned fairly late in the process (after connections have started to be opened). But I've looked around and I've also not seen any evidence of anyone defaulting to particularly low values there, with 100+ being normal. So my suggestion is to pick a target number of connections as --jobs/50*, learn the number of IPs available, and open ceil(target_connections/ips) connections per IP. (If the number of IPs is also not knowable at that time, I guess we could just assume 2?)

I think this should work fairly well for all REAPI environments - at the small end, someone trying to do a --jobs=2000 build against a single server would see 40 connections opened, which shouldn't be a problem at all for that server (servers can generally handle 10s of thousands of open connections); at the large end that same build against an LB setup will spread the load over 40 different proxy servers and should see higher effective throughput as a result.

*Why 50 and not 100? gRPC behaviour again: RPCs are bound to connections early and then queue against them. Uniformly distributing random-length RPCs to connections sees them fill non-uniformly, so some connections will have queueing while others have spare capacity. Leaving 2x headroom should make that very unlikely.
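
A small sketch of that sizing rule with the per-IP split written out (a hypothetical helper for illustration, not actual Bazel code):

```java
/**
 * Hypothetical sizing heuristic from this comment; not what Bazel actually ships.
 * Total target connections = max(2, ceil(jobs / 50)), spread evenly over the resolved IPs.
 */
final class ConnectionBudget {
  static int connectionsPerIp(int jobs, int resolvedIps) {
    int targetConnections = Math.max(2, (jobs + 49) / 50);       // ~50 in-flight RPCs per connection, 2x headroom
    return (targetConnections + resolvedIps - 1) / resolvedIps;  // ceil(target / ips)
  }

  public static void main(String[] args) {
    // --jobs=2000 against a single resolved IP: 40 connections total.
    System.out.println(connectionsPerIp(2000, 1)); // 40
    // --jobs=2000 against 4 LB IPs: 10 connections per IP.
    System.out.println(connectionsPerIp(2000, 4)); // 10
  }
}
```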

bazel-io pushed a commit that referenced this issue Aug 14, 2020
From #11801

> due to bazel displaying full parallelism even if actions are actually queued waiting for gRPC connections

This PR makes Bazel count an action as running only if it is actually executing remotely, as a first step toward fixing the parallelism limitation.

With `--jobs=500`, the progress bar is displayed as:
```
[1 / 501] 500 actions running
```

However, the number of actions actually running on remote workers is not 500, since some actions are pending because of the limitation. With this change, the progress bar now becomes:
```
[1 / 501] 500 actions, 200 running
```

Closes #11922.

PiperOrigin-RevId: 326604891
coeuvre (Member) commented Mar 16, 2021

FYI: the fixed-size channel pool has been replaced with a dynamic channel pool which opens new connections on demand. de8f69d
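
For readers following along, the idea behind a dynamic pool is roughly the following (a simplified sketch for illustration only; see commit de8f69d for the real implementation, and the per-channel limit here is an assumed value):

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch: grow the pool when every existing channel is saturated. */
final class DynamicChannelPool {
  private static final int MAX_CONCURRENCY_PER_CHANNEL = 100; // assumed MAX_CONCURRENT_STREAMS

  private final String target;
  private final List<ManagedChannel> channels = new ArrayList<>();
  private final List<Integer> activeRpcs = new ArrayList<>();

  DynamicChannelPool(String target) {
    this.target = target;
  }

  /** Returns a channel with spare capacity, opening a new connection if all are saturated. */
  synchronized ManagedChannel acquire() {
    for (int i = 0; i < channels.size(); i++) {
      if (activeRpcs.get(i) < MAX_CONCURRENCY_PER_CHANNEL) {
        activeRpcs.set(i, activeRpcs.get(i) + 1);
        return channels.get(i);
      }
    }
    ManagedChannel channel = NettyChannelBuilder.forTarget(target).usePlaintext().build();
    channels.add(channel);
    activeRpcs.add(1);
    return channel;
  }

  /** Callers release the channel when their RPC completes. */
  synchronized void release(ManagedChannel channel) {
    int i = channels.indexOf(channel);
    if (i >= 0) {
      activeRpcs.set(i, activeRpcs.get(i) - 1);
    }
  }
}
```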
