Use BigQuery storage read API when reading external BigLake tables #21017
anoopj wants to merge 2 commits into trinodb:master from
Conversation
/test-with-secrets sha=9d7bd1dcad92de70856b928a99e443ec3d8b4619
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/8261925064
You won't be able to create an external table with a non-BigLake connection, such as Cloud SQL.
Can you share the source? For instance, Create object tables contains an example with a connection. That is a different page from BigLake tables.
The storage APIs support BigLake tables - the key use case is that they enforce security policies (row/column level). https://cloud.google.com/bigquery/docs/biglake-intro#connectors is the relevant documentation.
I'm not asking if storage APIs support BigLake tables or not. You said "You won't be able to create an external table with a non-BigLake connections". However, https://cloud.google.com/bigquery/docs/object-tables indicates that connection-id can be used for other tables.
Thank you for the update! It's released: https://github.com/googleapis/java-bigquery/releases/tag/v2.39.0
Just a heads-up. I'm busy with other projects and won't be able to come back to this in the next few weeks.
@anoopj could you please share your plans regarding this PR?
@ssheikin Sorry for the delay. I am not planning to work on this in the near term. Did you need this feature?
Thank you! Yeah, we were interested in it. Do you know if there is someone who is willing to take over this PR?
Please remove the following dot. https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages Also, I would change it to:
…ables

The storage APIs support reading BigLake external tables (i.e. external tables with a connection). But the current implementation uses views, which can be expensive because it requires a query. This PR adds support to read BigLake tables directly using the storage API. There are no behavior changes for external tables and BQ native tables - they use the view and storage APIs respectively. Added a new test for BigLake tables.
Done.

@ebyhr Do you have any more feedback or can this be merged?

See #21017 (comment), I think it's an important question.
      return ImmutableList.of(BigQuerySplit.forViewStream(columns, filter));
  }
- if (type == MATERIALIZED_VIEW || type == EXTERNAL) {
+ if (type == MATERIALIZED_VIEW || (type == EXTERNAL && !hasConnection)) {
Parentheses are unnecessary.
Also please consider splitting this condition into two ifs: one for MATERIALIZED_VIEW and a second one for EXTERNAL.
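The suggested split could look roughly like this — a minimal sketch only, assuming a `TableType` enum and a `hasConnection` flag as in the diff above; the class and method names here are illustrative, not the actual Trino code:

```java
// Sketch of the reviewer's suggestion: replace the combined condition with
// two separate ifs. TableType values and usesViewMaterialization are
// assumptions for illustration, not the real connector API.
public class SplitConditionSketch
{
    enum TableType { TABLE, MATERIALIZED_VIEW, EXTERNAL }

    static boolean usesViewMaterialization(TableType type, boolean hasConnection)
    {
        // First if: materialized views always go through the view path.
        if (type == TableType.MATERIALIZED_VIEW) {
            return true;
        }
        // Second if: external tables without a BigLake connection also need it.
        if (type == TableType.EXTERNAL && !hasConnection) {
            return true;
        }
        // Native tables and BigLake tables (EXTERNAL with a connection)
        // can read directly through the storage API.
        return false;
    }

    public static void main(String[] args)
    {
        System.out.println(usesViewMaterialization(TableType.EXTERNAL, true));
    }
}
```

Splitting the condition keeps each branch's intent readable on its own and makes the BigLake carve-out easier to comment.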
  {
      @Language("SQL")
      String jobCountForTableQuery = """
              SELECT * FROM region-us.INFORMATION_SCHEMA.JOBS WHERE EXISTS(
      // Use assertEventually in case there are delays in the BQ information schema.
      assertEventually(() -> assertThat(bigQuerySqlExecutor.executeQuery(jobCountForTableQuery).getTotalRows())
              .isEqualTo(expectedJobCount));
When expectedJobCount=0 and jobCountForTableQuery returns 0 only because the BQ information schema is lagging, this is a false positive: assertEventually will not retry once the assertion passes, even though jobCountForTableQuery may eventually return a different value.
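One hedged way to address this — a sketch, not the project's actual test utility — is to invert the check for the zero case: require the count to stay at the expected value across several probes, so a late-arriving job row still fails the check instead of slipping past an immediately-passing assertion:

```java
import java.util.function.LongSupplier;

public class StableCountCheck
{
    // Hypothetical helper: verifies the observed count remains at `expected`
    // for the whole grace period (probes x sleepMillis). Unlike a single
    // assertion that passes on the first read, this catches rows that only
    // show up after a delay in INFORMATION_SCHEMA.JOBS.
    static void assertStays(LongSupplier actualCount, long expected, int probes, long sleepMillis)
            throws InterruptedException
    {
        for (int i = 0; i < probes; i++) {
            long actual = actualCount.getAsLong();
            if (actual != expected) {
                throw new AssertionError("expected " + expected + " but saw " + actual);
            }
            Thread.sleep(sleepMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        // A constant supplier stays stable, so this completes without error.
        assertStays(() -> 0, 0, 3, 10);
    }
}
```

The trade-off is a fixed wait even on success, so such a check is usually reserved for the expectedJobCount=0 case.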
      return testSessionBuilder()
              .setCatalog("bigquery")
-             .setSchema(TPCH_SCHEMA)
+             .setSchema(TEST_SCHEMA)
  public boolean hasConnection()
  {
      return connectionId.isPresent();
There is no need to keep Optional&lt;String&gt; as a field if we are only interested in a boolean.
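A minimal sketch of that refactoring, with a hypothetical class name — derive the boolean once in the constructor and drop the Optional field:

```java
import java.util.Optional;

// Illustrative only: TableInfo and its constructor signature are assumptions,
// not the actual Trino class. The point is that the Optional is consumed at
// construction time and only the derived boolean is stored.
public class TableInfo
{
    private final boolean hasConnection;

    public TableInfo(Optional<String> connectionId)
    {
        this.hasConnection = connectionId.isPresent();
    }

    public boolean hasConnection()
    {
        return hasConnection;
    }
}
```

This keeps the field's memory footprint and equals/hashCode surface down when no caller ever needs the connection id itself.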
@@ -59,12 +60,17 @@ public abstract class BaseBigQueryConnectorTest
      protected BigQuerySqlExecutor bigQuerySqlExecutor;
      private String gcpStorageBucket;

This empty line is not needed.
  `-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate`.
* Set the VM options `bigquery.credentials-key`, `testing.gcp-storage-bucket`, `testing.alternate-bq-project-id` and `testing.bigquery-connection-id` in the IntelliJ "Run Configuration"
  (or on the CLI if using Maven directly). It should look something like
  `-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate -Dtesting.bigquery-connection-id=my_project.us.connection-id`.
  this.bigQueryConnectionId = requireNonNull(
          System.getProperty("testing.bigquery-connection-id"),
          "testing.bigquery-connection-id is not set");
In Trino we usually write such code on one line.
Also, how is it populated on CI? I see you've updated the documentation with testing.bigquery-connection-id=my_project.us.connection-id, but there are no corresponding changes in ./.github/workflows/ci.yml, as there are for bigquery.credentials-key, testing.gcp-storage-bucket, and testing.alternate-bq-project-id.
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

Can we get this merged?

@anoopj Do you plan to continue this? Or should someone else pick this up and drive it to completion? I see that the newer client is released already.

@hashhar I am not planning to work on this.

@mosabua I could take up the work on this PR.

Sounds good @Praveen2112. Since @anoopj is not going to continue, you can continue on this PR or start a new one with his work. Just link to this PR if you create a new one.
Continuation of this PR - #22974 |

Description
BigQuery storage APIs support reading BigLake external tables (i.e. external tables with a connection). But the current implementation uses views, which can be expensive because it requires Trino to issue a SQL query against BigQuery. This PR adds support for reading BigLake tables directly using the storage API.
There are no behavior changes for external tables and BQ native tables - they use the view and storage APIs, respectively. Added a new test for BigLake tables.
Additional context and related issues
Fixes #21016
https://cloud.google.com/bigquery/docs/biglake-intro
Release notes
(x) Release notes are required, with the following suggested text: