
Conversation

@findepi findepi commented Jun 5, 2024

No description provided.

@cla-bot cla-bot bot added the cla-signed label Jun 5, 2024
@github-actions github-actions bot added the bigquery BigQuery connector label Jun 5, 2024
@findepi findepi force-pushed the findepi/bq-auto branch 3 times, most recently from 06b922a to 0437630 Compare June 5, 2024 15:55
@findepi findepi requested a review from davidrabinowitz June 5, 2024 16:03
Comment on lines 169 to 168
@hashhar hashhar Jun 6, 2024


@findepi Did you mean to clamp this to 100 at most? The current code is min(worker * 3, 100), which can be less than 100.
Did you mean max(100, min(worker * 3, 100))?

Or change the comment to "Limit to 100 for very large clusters".

EDIT: Never mind, this number is now fed to a different parameter in the client, which sets the minimum count of streams, not the actual requested count.
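A minimal sketch of the two clamping expressions under discussion (class and method names are illustrative, not from the Trino codebase): the reviewed code caps the value at 100 but can yield far less on small clusters, while the suggested max(100, min(...)) form would always evaluate to exactly 100.

```java
class StreamCountSketch
{
    // The expression under review: at most 100, but e.g. 2 workers -> 6.
    static int clampedStreamCount(int workerCount)
    {
        return Math.min(workerCount * 3, 100);
    }

    // The suggested alternative; note that it always evaluates to 100,
    // regardless of workerCount, since the inner min() never exceeds 100.
    static int suggestedClamp(int workerCount)
    {
        return Math.max(100, Math.min(workerCount * 3, 100));
    }
}
```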

Contributor


Notice that the ReadSession creation takes two parameters: a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats as a recommendation only, not a binding limit. This logic is similar to what the Spark BigQuery connector does.

By the way, why not use the bigquery-connector-common library used by both the Spark and Hive connectors?
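As a hypothetical illustration of the semantics described above (this is a sketch of the described behavior, not the real API implementation; all names here are made up): the server derives its own stream-count estimate, treats the preferred minimum as a hint it tries to honor, and enforces the maximum as a hard cap.

```java
class ReadSessionSketch
{
    // Hypothetical model of the server-side choice: try to honor the
    // preferred-min hint, but never exceed the hard maximum.
    static int chooseStreamCount(int serverEstimate, int preferredMin, int maxStreams)
    {
        int count = Math.max(serverEstimate, preferredMin); // hint: lift low estimates
        return Math.min(count, maxStreams);                 // binding cap
    }
}
```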

Member Author


@davidrabinowitz thank you for your input!

By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

I think this is a very important question, but I'm definitely not in a position to answer it. Please allow me to gloss over it.

Notice that the ReadSession creation takes two parameters - a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats it as a recommendation only, not binding limit.

Indeed. And this PR switches us from setting the max to setting the min.
The idea is that we actually don't want a static max. For example, if the data set is very large, we want a very large number of "reasonably sized" splits.
However, I don't know whether this is how it works.

@davidrabinowitz do we need to set both the 'recommended min' and 'strict max' parameters, or can we just set the 'recommended min'?


@hashhar hashhar Jun 7, 2024


The Spark BigQuery connector seems to set both (https://github.com/search?q=repo%3AGoogleCloudDataproc%2Fspark-bigquery-connector+setPreferredMinStreamCount&type=code). They request at least 3, and at most 20k by default (which is odd since IIRC 1k is the actual limit, as documented at https://cloud.google.com/php/docs/reference/cloud-bigquery-storage/latest/V1.CreateReadSessionRequest#_Google_Cloud_BigQuery_Storage_V1_CreateReadSessionRequest__setPreferredMinStreamCount__).

Setting max might be useful for example if we want to avoid having too many splits.

Member Author


Sure, but what should the max value be? 10k?
If the actual limit is 1k, we're fine.

Member


By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

@davidrabinowitz @anoopj I sent a DM in the Trino community Slack. Let's continue the discussion there.


wendigo commented Jun 6, 2024

Please rebase @findepi


findepi commented Jun 6, 2024

Please rebase @findepi

Happy to, provided we have directional agreement about the change.
That's not clear to me yet.


hashhar commented Jun 6, 2024

Let's go ahead with the change. It makes sense to me.

If we ever see BigQuery return errors for the requested stream count in high-concurrency workloads, we can introduce some config to adjust the "multiplier value" so that we don't run into quotas/limits. For now though, no changes requested.


wendigo commented Jun 6, 2024

@findepi I'm in favor of having less configuration in connectors so 👍🏼 from me


findepi commented Jun 7, 2024

I don't see any requests for changes except the editorial #22279 (comment).
Can I get more review comments, or maybe approval?
Of course, I am also hoping for a clarification in #22279 (comment).


hashhar commented Jun 7, 2024

pls fix conflicts

@findepi findepi force-pushed the findepi/bq-auto branch from 0437630 to bd87040 Compare June 7, 2024 13:21

findepi commented Jun 7, 2024

just rebased

@findepi findepi force-pushed the findepi/bq-auto branch from bd87040 to c85986d Compare June 7, 2024 13:24
@findepi findepi force-pushed the findepi/bq-auto branch from c85986d to 608cf1d Compare June 7, 2024 15:12
@github-actions github-actions bot added the docs label Jun 7, 2024
8 participants