Use BigQuery storage read API when reading external BigLake tables #21017

Closed

anoopj wants to merge 2 commits into trinodb:master from anoopj:master

Conversation

@anoopj
Member

@anoopj anoopj commented Mar 11, 2024

Description

BigQuery storage APIs support reading BigLake external tables (i.e., external tables with a connection). But the current implementation uses views, which can be expensive because it requires Trino to issue a SQL query against BigQuery. This PR adds support for reading BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables - they use the view and storage APIs respectively. Added a new test for BigLake tables.
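The dispatch described above can be sketched as follows. All names here are illustrative stand-ins, not Trino's actual connector API; the sketch only captures the decision rule the PR describes (BigLake external tables go to the storage API, plain external tables and materialized views stay on the view/query API):

```java
// Illustrative sketch of the read-path selection described in this PR.
// Enum and method names are hypothetical, not Trino's real classes.
public class ReadPathSketch
{
    enum TableType { TABLE, MATERIALIZED_VIEW, EXTERNAL }

    // Returns true when the table can be scanned directly via the storage read API.
    static boolean useStorageApi(TableType type, boolean hasConnection)
    {
        if (type == TableType.MATERIALIZED_VIEW) {
            return false; // always materialized through a query (view API)
        }
        if (type == TableType.EXTERNAL) {
            return hasConnection; // BigLake external tables carry a connection
        }
        return true; // native BigQuery tables already use the storage API
    }

    public static void main(String[] args)
    {
        System.out.println(useStorageApi(TableType.EXTERNAL, true));  // BigLake -> storage API
        System.out.println(useStorageApi(TableType.EXTERNAL, false)); // plain external -> view API
        System.out.println(useStorageApi(TableType.TABLE, false));    // native table -> storage API
    }
}
```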

Additional context and related issues

Fixes #21016
https://cloud.google.com/bigquery/docs/biglake-intro

Release notes

(x) Release notes are required, with the following suggested text:

# BigQuery
* Improve performance when reading external BigLake tables. ({issue}`21016`)

Comment thread testing/trino-server-dev/etc/config.properties Outdated
@github-actions github-actions bot added the bigquery BigQuery connector label Mar 11, 2024
Member

@hashhar hashhar left a comment

looks good to me

Comment thread plugin/trino-bigquery/README.md Outdated
@hashhar
Member

hashhar commented Mar 13, 2024

/test-with-secrets sha=9d7bd1dcad92de70856b928a99e443ec3d8b4619

@github-actions

github-actions bot commented Mar 13, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/8261925064

Comment thread plugin/trino-bigquery/README.md Outdated
Member

What happens if the connection uses non-BigLake as the source? e.g. MySQL
[Screenshot: 2024-03-14 at 6:45:33]

Member Author

You won't be able to create an external table with a non-BigLake connection such as Cloud SQL.

Member

Can you share the source? For instance, Create object tables contains an example with a connection, and that is a different page from the BigLake tables one.

Member Author

The storage APIs support BigLake tables - the key use case is that they enforce security policies (row/column level). https://cloud.google.com/bigquery/docs/biglake-intro#connectors is the relevant documentation.

Member

I'm not asking whether storage APIs support BigLake tables or not. You said "You won't be able to create an external table with a non-BigLake connection". However, https://cloud.google.com/bigquery/docs/object-tables indicates that connection-id can be used for other tables.

Member Author

Just a heads-up. I'm busy with other projects and won't be able to come back to this in the next few weeks.

Contributor

@anoopj could you please share your plans regarding this PR?

Member Author

@ssheikin Sorry for the delay. I am not planning to work on this in the near term. Did you need this feature?

Contributor

Thank you! Yeah, we were interested in it. Do you know if there is someone who is willing to take over this PR?

@ebyhr ebyhr changed the title from "Support for reading BigLake tables using BigQuery storage read API" to "Use BigQuery storage read API when reading external BigLake tables" on Mar 14, 2024
@ebyhr
Member

ebyhr commented Mar 14, 2024

Support for reading BigLake tables using BigQuery storage read API.

Please remove the trailing dot: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

Also, I would change it to Use BigQuery storage read API when reading external BigLake tables because the current title looks a little misleading. Reading BigLake tables has been supported via the query API.

…ables

The storage APIs support reading BigLake external tables (ie external
tables with a connection). But the current implementation uses views
which can be expensive, because it requires a query. This PR adds
support to read BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables -
they use the view and storage APIs respectively.

Added a new test for BigLake tables.
@anoopj
Member Author

anoopj commented Mar 14, 2024

Support for reading BigLake tables using BigQuery storage read API.

Please remove the trailing dot: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

Also, I would change it to Use BigQuery storage read API when reading external BigLake tables because the current title looks a little misleading. Reading BigLake tables has been supported via the query API.

Done.

@anoopj
Member Author

anoopj commented Mar 19, 2024

@ebyhr Do you have any more feedback or can this be merged?

@anoopj
Member Author

anoopj commented Mar 22, 2024

@ebyhr @hashhar Friendly ping here. We have a GCP customer who is waiting for this PR to be merged.

@hashhar
Member

hashhar commented Mar 26, 2024

see #21017 (comment), I think it's an important question.

return ImmutableList.of(BigQuerySplit.forViewStream(columns, filter));
}
if (type == MATERIALIZED_VIEW || type == EXTERNAL) {
if (type == MATERIALIZED_VIEW || (type == EXTERNAL && !hasConnection)) {
Contributor

Parentheses are unnecessary.

Contributor

Also, please consider splitting this condition into two ifs: one for MATERIALIZED_VIEW and a second one for EXTERNAL.
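
The suggested refactor can be checked in isolation: the compound condition and the two separate ifs always choose the same branch. The enum and flag names below mirror the quoted diff, but the class is a standalone sketch, not the PR's actual code:

```java
// Standalone sketch verifying the reviewer's proposed split is behavior-preserving.
public class SplitConditionSketch
{
    enum TableType { TABLE, MATERIALIZED_VIEW, EXTERNAL }

    // The condition as written in the diff, in one compound expression.
    static boolean compound(TableType type, boolean hasConnection)
    {
        return type == TableType.MATERIALIZED_VIEW || (type == TableType.EXTERNAL && !hasConnection);
    }

    // The same logic split into two ifs, one per table type.
    static boolean split(TableType type, boolean hasConnection)
    {
        if (type == TableType.MATERIALIZED_VIEW) {
            return true;
        }
        if (type == TableType.EXTERNAL && !hasConnection) {
            return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        // Exhaustively compare both forms over every (type, hasConnection) pair.
        for (TableType type : TableType.values()) {
            for (boolean hasConnection : new boolean[] {false, true}) {
                if (compound(type, hasConnection) != split(type, hasConnection)) {
                    throw new AssertionError("branching mismatch for " + type);
                }
            }
        }
        System.out.println("equivalent");
    }
}
```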

{
@Language("SQL")
String jobCountForTableQuery = """
SELECT * FROM region-us.INFORMATION_SCHEMA.JOBS WHERE EXISTS(
Contributor

`*` -> `count(*)`

Comment on lines +730 to +732
// Use assertEventually in case there are delays in the BQ information schema.
assertEventually(() -> assertThat(bigQuerySqlExecutor.executeQuery(jobCountForTableQuery).getTotalRows())
.isEqualTo(expectedJobCount));
Contributor

When expectedJobCount=0 and jobCountForTableQuery returns 0 only because the BQ information schema is lagging, the assertion passes immediately, so assertEventually never retries and the result is a false positive. After a while, jobCountForTableQuery may return a different value.
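
The pitfall can be demonstrated with a simplified stand-in for assertEventually (not Trino's actual implementation): the helper retries only while the assertion fails, so an expected count of 0 is satisfied on the very first attempt even when the information schema is merely stale:

```java
import java.util.Iterator;
import java.util.List;

public class EventuallyPitfall
{
    // Simplified assertEventually: retry until the assertion stops throwing.
    static void assertEventually(Runnable assertion)
    {
        for (int attempt = 0; ; attempt++) {
            try {
                assertion.run();
                return; // success on any attempt ends the loop immediately
            }
            catch (AssertionError e) {
                if (attempt >= 9) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args)
    {
        // A lagging information schema: reports 0 jobs now, 1 job on later polls.
        Iterator<Long> lagging = List.of(0L, 1L, 1L).iterator();

        // Expecting 0 jobs: the stale first poll already returns 0, so the
        // assertion passes on attempt 0 and no retry ever observes the real 1.
        assertEventually(() -> {
            if (lagging.next() != 0L) {
                throw new AssertionError("job count was not 0");
            }
        });
        System.out.println("false positive: passed against a stale count of 0");
    }
}
```

In other words, assertEventually can only guard against a count that starts too low and catches up; it cannot detect a count that is transiently correct.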

return testSessionBuilder()
.setCatalog("bigquery")
.setSchema(TPCH_SCHEMA)
.setSchema(TEST_SCHEMA)
Contributor

unrelated change?

Comment on lines +85 to +87
public boolean hasConnection()
{
return connectionId.isPresent();
Contributor

There is no need to keep Optional&lt;String&gt; as a field if we are interested only in the boolean.
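
The reviewer's point, sketched with illustrative names (not the PR's actual class): when callers only ever ask whether a connection exists, collapse the Optional at construction time instead of retaining it as a field:

```java
import java.util.Optional;

// Hypothetical sketch: store the derived boolean, not the Optional<String>.
public class TableInfoSketch
{
    private final boolean hasConnection;

    public TableInfoSketch(Optional<String> connectionId)
    {
        // Collapse the Optional here; the id itself is never needed again.
        this.hasConnection = connectionId.isPresent();
    }

    public boolean hasConnection()
    {
        return hasConnection;
    }

    public static void main(String[] args)
    {
        System.out.println(new TableInfoSketch(Optional.of("my_project.us.conn")).hasConnection());
        System.out.println(new TableInfoSketch(Optional.empty()).hasConnection());
    }
}
```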

@@ -59,12 +60,17 @@ public abstract class BaseBigQueryConnectorTest
protected BigQuerySqlExecutor bigQuerySqlExecutor;
private String gcpStorageBucket;

Contributor

This empty line is not needed.

`-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate`.
* Set the VM options `bigquery.credentials-key`, `testing.gcp-storage-bucket`, `testing.alternate-bq-project-id` and `testing.bigquery-connection-id` in the IntelliJ "Run Configuration"
(or on the CLI if using Maven directly). It should look something like
`-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate -testing.bigquery-connection-id=my_project.us.connection-id`.
Contributor

`-` -> `-D`

Comment on lines +71 to +73
this.bigQueryConnectionId = requireNonNull(
System.getProperty("testing.bigquery-connection-id"),
"testing.bigquery-connection-id is not set");
Contributor

Usually we have such code in Trino on one line.
Also, how is it populated on CI?
I see you've updated the documentation with testing.bigquery-connection-id=my_project.us.connection-id, but there are no corresponding changes in ./.github/workflows/ci.yml, as there are for bigquery.credentials-key, testing.gcp-storage-bucket, and testing.alternate-bq-project-id.

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 23, 2024
@github-actions

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 13, 2024
@ssheikin ssheikin reopened this Jun 14, 2024
@github-actions github-actions bot removed the stale label Jun 14, 2024
@velascoluis

Can we get this merged?

@hashhar
Member

hashhar commented Jun 26, 2024

@anoopj Do you plan to continue this? Or should someone else pick this up and drive to completion? I see that the newer client is released already.

@anoopj
Member Author

anoopj commented Jul 3, 2024

@hashhar I am not planning to work on this.

@mosabua
Member

mosabua commented Jul 4, 2024

@ssheikin and @hashhar .. are you taking this over here or in a new PR? Should we close this one?

@k-haley1

@ssheikin and @hashhar .. are you taking this over here or in a new PR? Should we close this one?

Please leave it open for now. We are discussing.

@mosabua mosabua added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label Jul 16, 2024
@Praveen2112
Member

@mosabua I could take up the work on this PR

@mosabua
Member

mosabua commented Jul 16, 2024

Sounds good @Praveen2112 .. since @anoopj is not going to continue you can continue on this PR or start a new one with his work. Just link to this PR if you create a new one.

@marcinsbd
Contributor

Continuation of this PR - #22974

@hashhar hashhar closed this Sep 11, 2024

Labels

bigquery (BigQuery connector), cla-signed, stale-ignore (Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.)

Development

Successfully merging this pull request may close these issues.

Support for direct read of BigLake tables using BigQuery storage API

10 participants