
Conversation

@findepi findepi commented Jun 5, 2024

No description provided.

@cla-bot cla-bot bot added the cla-signed label Jun 5, 2024
@github-actions github-actions bot added the bigquery BigQuery connector label Jun 5, 2024
@findepi findepi force-pushed the findepi/bq-auto branch 3 times, most recently from 06b922a to 0437630 Compare June 5, 2024 15:55
@findepi findepi requested a review from davidrabinowitz June 5, 2024 16:03
Comment on lines 169 to 168
@hashhar hashhar Jun 6, 2024


@findepi Did you mean to clamp this to 100 at most? The current code is min(worker * 3, 100), which can be less than 100.
Did you mean max(100, min(worker * 3, 100))?

Or change the comment to "Limit to 100 for very large clusters".

EDIT: Never mind, this number is now fed to a different parameter in the client, which sets the minimum count of streams, not the actual requested count.
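A minimal sketch of the two clamping expressions under discussion (class and method names are illustrative, not from the Trino codebase): the reviewed code caps the value at 100 but can yield far less on small clusters, while the suggested max(100, min(...)) form would always evaluate to exactly 100.

```java
class StreamCountSketch
{
    // The expression under review: at most 100, but e.g. 2 workers -> 6.
    static int clampedStreamCount(int workerCount)
    {
        return Math.min(workerCount * 3, 100);
    }

    // The suggested alternative; note that it always evaluates to 100,
    // regardless of workerCount, since the inner min() never exceeds 100.
    static int suggestedClamp(int workerCount)
    {
        return Math.max(100, Math.min(workerCount * 3, 100));
    }
}
```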

Contributor


Notice that the ReadSession creation takes two parameters: a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats as a recommendation only, not a binding limit. This logic is similar to what the Spark BigQuery connector does.

By the way, why not use the bigquery-connector-common library used by both the Spark and Hive connectors?
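As a hypothetical illustration of the semantics described above (this is a sketch of the described behavior, not the real API implementation; all names here are made up): the server derives its own stream-count estimate, treats the preferred minimum as a hint it tries to honor, and enforces the maximum as a hard cap.

```java
class ReadSessionSketch
{
    // Hypothetical model of the server-side choice: try to honor the
    // preferred-min hint, but never exceed the hard maximum.
    static int chooseStreamCount(int serverEstimate, int preferredMin, int maxStreams)
    {
        int count = Math.max(serverEstimate, preferredMin); // hint: lift low estimates
        return Math.min(count, maxStreams);                 // binding cap
    }
}
```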

Member Author


@davidrabinowitz thank you for your input!

By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

I think this is a very important question, but I'm definitely not in a position to answer it. Please allow me to gloss over it.

Notice that the ReadSession creation takes two parameters - a maximal number of streams, and a preferred minimal number of streams, which the API tries to accommodate but treats it as a recommendation only, not binding limit.

Indeed. And this PR switches us from setting the max to setting the min.
The idea is that we actually don't want a static max. For example, if the data set is very large, we want a very large number of "reasonably sized" splits.
However, I don't know whether this is how it works.

@davidrabinowitz do we need to set both the 'recommended min' and 'strict max' parameters, or can we just set the 'recommended min'?


@hashhar hashhar Jun 7, 2024


The Spark BigQuery connector seems to set both (https://github.com/search?q=repo%3AGoogleCloudDataproc%2Fspark-bigquery-connector+setPreferredMinStreamCount&type=code). They request at least 3, and at most 20k by default (which is odd since IIRC 1k is the actual limit, as documented at https://cloud.google.com/php/docs/reference/cloud-bigquery-storage/latest/V1.CreateReadSessionRequest#_Google_Cloud_BigQuery_Storage_V1_CreateReadSessionRequest__setPreferredMinStreamCount__).

Setting max might be useful for example if we want to avoid having too many splits.

Member Author


Sure, but what should the max value be? 10k?
If the actual limit is 1k, we're fine.

Member


By the way, why not use the bigquery-connector-common library used by both Spark and Hive connectors?

@davidrabinowitz @anoopj I sent a DM in the Trino community Slack. Let's continue the discussion there.


wendigo commented Jun 6, 2024

Please rebase @findepi


findepi commented Jun 6, 2024

Please rebase @findepi

Happy to, provided we have directional agreement about the change.
That's not clear to me yet.


hashhar commented Jun 6, 2024

Let's go ahead with the change. It makes sense to me.

If we ever see BigQuery return errors for the requested stream count in high-concurrency workloads, we can introduce some config to adjust the "multiplier value" so that we don't run into quotas/limits. For now though, no changes requested.


wendigo commented Jun 6, 2024

@findepi I'm in favor of having less configuration in connectors so 👍🏼 from me


findepi commented Jun 7, 2024

I don't see any requests for changes except the editorial #22279 (comment).
Can I get more review comments, or maybe approval?
Of course, I am also hoping for a clarification in #22279 (comment).


hashhar commented Jun 7, 2024

pls fix conflicts

@findepi findepi force-pushed the findepi/bq-auto branch from 0437630 to bd87040 Compare June 7, 2024 13:21

findepi commented Jun 7, 2024

just rebased

@findepi findepi force-pushed the findepi/bq-auto branch from bd87040 to c85986d Compare June 7, 2024 13:24
@findepi findepi force-pushed the findepi/bq-auto branch from c85986d to 608cf1d Compare June 7, 2024 15:12
@github-actions github-actions bot added the docs label Jun 7, 2024
8 participants