Use BigQuery storage read API when reading external BigLake tables #21017

Closed

anoopj wants to merge 2 commits into trinodb:master from anoopj:master

Conversation

@anoopj
Member

@anoopj anoopj commented Mar 11, 2024

Description

BigQuery storage APIs support reading BigLake external tables (i.e., external tables with a connection). But the current implementation uses views, which can be expensive because it requires Trino to issue a SQL query against BigQuery. This PR adds support for reading BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables - they use the view and storage APIs respectively. Added a new test for BigLake tables.
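The dispatch described above can be sketched as follows. All names here are illustrative stand-ins, not Trino's actual connector API; the sketch only captures the decision rule the PR describes (BigLake external tables go to the storage API, plain external tables and materialized views stay on the view/query API):

```java
// Illustrative sketch of the read-path selection described in this PR.
// Enum and method names are hypothetical, not Trino's real classes.
public class ReadPathSketch
{
    enum TableType { TABLE, MATERIALIZED_VIEW, EXTERNAL }

    // Returns true when the table can be scanned directly via the storage read API.
    static boolean useStorageApi(TableType type, boolean hasConnection)
    {
        if (type == TableType.MATERIALIZED_VIEW) {
            return false; // always materialized through a query (view API)
        }
        if (type == TableType.EXTERNAL) {
            return hasConnection; // BigLake external tables carry a connection
        }
        return true; // native BigQuery tables already use the storage API
    }

    public static void main(String[] args)
    {
        System.out.println(useStorageApi(TableType.EXTERNAL, true));  // BigLake -> storage API
        System.out.println(useStorageApi(TableType.EXTERNAL, false)); // plain external -> view API
        System.out.println(useStorageApi(TableType.TABLE, false));    // native table -> storage API
    }
}
```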

Additional context and related issues

Fixes #21016
https://cloud.google.com/bigquery/docs/biglake-intro

Release notes

(x) Release notes are required, with the following suggested text:

# BigQuery
* Improve performance when reading external BigLake tables. ({issue}`21016`)

Comment thread testing/trino-server-dev/etc/config.properties Outdated
@github-actions github-actions bot added the bigquery BigQuery connector label Mar 11, 2024
Member

@hashhar hashhar left a comment

looks good to me

Comment thread plugin/trino-bigquery/README.md Outdated
@hashhar
Member

hashhar commented Mar 13, 2024

/test-with-secrets sha=9d7bd1dcad92de70856b928a99e443ec3d8b4619

@github-actions

github-actions bot commented Mar 13, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/8261925064

Comment thread plugin/trino-bigquery/README.md Outdated
Member

What happens if the connection uses non-BigLake as the source? e.g. MySQL
[Screenshot: 2024-03-14 at 6:45:33]

Member Author

You won't be able to create an external table with a non-BigLake connection such as Cloud SQL.

Member

Can you share the source? For instance, Create object tables contains an example with a connection, and that is a different page from the BigLake tables one.

Member Author

The storage APIs support BigLake tables - the key use case is that they enforce security policies (row/column level). https://cloud.google.com/bigquery/docs/biglake-intro#connectors is the relevant documentation.

Member

I'm not asking whether storage APIs support BigLake tables or not. You said "You won't be able to create an external table with a non-BigLake connection". However, https://cloud.google.com/bigquery/docs/object-tables indicates that connection-id can be used for other tables.

Member Author

Just a heads-up. I'm busy with other projects and won't be able to come back to this in the next few weeks.

Contributor

@anoopj could you please share your plans regarding this PR?

Member Author

@ssheikin Sorry for the delay. I am not planning to work on this in the near term. Did you need this feature?

Contributor

Thank you! Yeah, we were interested in it. Do you know if there is someone who is willing to take over this PR?

@ebyhr ebyhr changed the title from "Support for reading BigLake tables using BigQuery storage read API" to "Use BigQuery storage read API when reading external BigLake tables" on Mar 14, 2024
@ebyhr
Member

ebyhr commented Mar 14, 2024

Support for reading BigLake tables using BigQuery storage read API.

Please remove the trailing dot: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

Also, I would change it to Use BigQuery storage read API when reading external BigLake tables because the current title looks a little misleading. Reading BigLake tables has been supported via the query API.

…ables

The storage APIs support reading BigLake external tables (ie external
tables with a connection). But the current implementation uses views
which can be expensive, because it requires a query. This PR adds
support to read BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables -
they use the view and storage APIs respectively.

Added a new test for BigLake tables.
@anoopj
Member Author

anoopj commented Mar 14, 2024

Support for reading BigLake tables using BigQuery storage read API.

Please remove the trailing dot: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

Also, I would change it to Use BigQuery storage read API when reading external BigLake tables because the current title looks a little misleading. Reading BigLake tables has been supported via the query API.

Done.

@anoopj
Member Author

anoopj commented Mar 19, 2024

@ebyhr Do you have any more feedback or can this be merged?

@anoopj
Member Author

anoopj commented Mar 22, 2024

@ebyhr @hashhar Friendly ping here. We have a GCP customer who is waiting for this PR to be merged.

@hashhar
Member

hashhar commented Mar 26, 2024

see #21017 (comment), I think it's an important question.

return ImmutableList.of(BigQuerySplit.forViewStream(columns, filter));
}
if (type == MATERIALIZED_VIEW || type == EXTERNAL) {
if (type == MATERIALIZED_VIEW || (type == EXTERNAL && !hasConnection)) {
Contributor

Parentheses are unnecessary.

Contributor

Also, please consider splitting this condition into two ifs: one for MATERIALIZED_VIEW and a second one for EXTERNAL.
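
The suggested refactor can be checked in isolation: the compound condition and the two separate ifs always choose the same branch. The enum and flag names below mirror the quoted diff, but the class is a standalone sketch, not the PR's actual code:

```java
// Standalone sketch verifying the reviewer's proposed split is behavior-preserving.
public class SplitConditionSketch
{
    enum TableType { TABLE, MATERIALIZED_VIEW, EXTERNAL }

    // The condition as written in the diff, in one compound expression.
    static boolean compound(TableType type, boolean hasConnection)
    {
        return type == TableType.MATERIALIZED_VIEW || (type == TableType.EXTERNAL && !hasConnection);
    }

    // The same logic split into two ifs, one per table type.
    static boolean split(TableType type, boolean hasConnection)
    {
        if (type == TableType.MATERIALIZED_VIEW) {
            return true;
        }
        if (type == TableType.EXTERNAL && !hasConnection) {
            return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        // Exhaustively compare both forms over every (type, hasConnection) pair.
        for (TableType type : TableType.values()) {
            for (boolean hasConnection : new boolean[] {false, true}) {
                if (compound(type, hasConnection) != split(type, hasConnection)) {
                    throw new AssertionError("branching mismatch for " + type);
                }
            }
        }
        System.out.println("equivalent");
    }
}
```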

{
@Language("SQL")
String jobCountForTableQuery = """
SELECT * FROM region-us.INFORMATION_SCHEMA.JOBS WHERE EXISTS(
Contributor

`*` -> `count(*)`

Comment on lines +730 to +732
// Use assertEventually in case there are delays in the BQ information schema.
assertEventually(() -> assertThat(bigQuerySqlExecutor.executeQuery(jobCountForTableQuery).getTotalRows())
.isEqualTo(expectedJobCount));
Contributor

When expectedJobCount=0 and jobCountForTableQuery returns 0 only because the BQ information schema is lagging, the assertion passes immediately, so assertEventually never retries and the result is a false positive. After a while, jobCountForTableQuery may return a different value.
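
The pitfall can be demonstrated with a simplified stand-in for assertEventually (not Trino's actual implementation): the helper retries only while the assertion fails, so an expected count of 0 is satisfied on the very first attempt even when the information schema is merely stale:

```java
import java.util.Iterator;
import java.util.List;

public class EventuallyPitfall
{
    // Simplified assertEventually: retry until the assertion stops throwing.
    static void assertEventually(Runnable assertion)
    {
        for (int attempt = 0; ; attempt++) {
            try {
                assertion.run();
                return; // success on any attempt ends the loop immediately
            }
            catch (AssertionError e) {
                if (attempt >= 9) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args)
    {
        // A lagging information schema: reports 0 jobs now, 1 job on later polls.
        Iterator<Long> lagging = List.of(0L, 1L, 1L).iterator();

        // Expecting 0 jobs: the stale first poll already returns 0, so the
        // assertion passes on attempt 0 and no retry ever observes the real 1.
        assertEventually(() -> {
            if (lagging.next() != 0L) {
                throw new AssertionError("job count was not 0");
            }
        });
        System.out.println("false positive: passed against a stale count of 0");
    }
}
```

In other words, assertEventually can only guard against a count that starts too low and catches up; it cannot detect a count that is transiently correct.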

return testSessionBuilder()
.setCatalog("bigquery")
.setSchema(TPCH_SCHEMA)
.setSchema(TEST_SCHEMA)
Contributor

unrelated change?

Comment on lines +85 to +87
public boolean hasConnection()
{
return connectionId.isPresent();
Contributor

There is no need to keep Optional&lt;String&gt; as a field if we are interested only in the boolean.
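
The reviewer's point, sketched with illustrative names (not the PR's actual class): when callers only ever ask whether a connection exists, collapse the Optional at construction time instead of retaining it as a field:

```java
import java.util.Optional;

// Hypothetical sketch: store the derived boolean, not the Optional<String>.
public class TableInfoSketch
{
    private final boolean hasConnection;

    public TableInfoSketch(Optional<String> connectionId)
    {
        // Collapse the Optional here; the id itself is never needed again.
        this.hasConnection = connectionId.isPresent();
    }

    public boolean hasConnection()
    {
        return hasConnection;
    }

    public static void main(String[] args)
    {
        System.out.println(new TableInfoSketch(Optional.of("my_project.us.conn")).hasConnection());
        System.out.println(new TableInfoSketch(Optional.empty()).hasConnection());
    }
}
```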

@@ -59,12 +60,17 @@ public abstract class BaseBigQueryConnectorTest
protected BigQuerySqlExecutor bigQuerySqlExecutor;
private String gcpStorageBucket;

Contributor

This empty line is not needed.

`-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate`.
* Set the VM options `bigquery.credentials-key`, `testing.gcp-storage-bucket`, `testing.alternate-bq-project-id` and `testing.bigquery-connection-id` in the IntelliJ "Run Configuration"
(or on the CLI if using Maven directly). It should look something like
`-Dbigquery.credentials-key=base64-text -Dtesting.gcp-storage-bucket=DESTINATION_BUCKET_NAME -Dtesting.alternate-bq-project-id=bigquery-cicd-alternate -testing.bigquery-connection-id=my_project.us.connection-id`.
Contributor

`-` -> `-D`

Comment on lines +71 to +73
this.bigQueryConnectionId = requireNonNull(
System.getProperty("testing.bigquery-connection-id"),
"testing.bigquery-connection-id is not set");
Contributor

Usually we have such code in Trino on one line.
Also, how is it populated on CI?
I see you've updated the documentation with testing.bigquery-connection-id=my_project.us.connection-id, but there are no corresponding changes in ./.github/workflows/ci.yml, as there are for bigquery.credentials-key, testing.gcp-storage-bucket, and testing.alternate-bq-project-id.

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 23, 2024
@github-actions

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 13, 2024
@ssheikin ssheikin reopened this Jun 14, 2024
@github-actions github-actions bot removed the stale label Jun 14, 2024
@velascoluis

Can we get this merged?

@hashhar
Member

hashhar commented Jun 26, 2024

@anoopj Do you plan to continue this? Or should someone else pick this up and drive to completion? I see that the newer client is released already.

@anoopj
Member Author

anoopj commented Jul 3, 2024

@hashhar I am not planning to work on this.

@mosabua
Member

mosabua commented Jul 4, 2024

@ssheikin and @hashhar .. are you taking this over here or in a new PR? Should we close this one?

@k-haley1

@ssheikin and @hashhar .. are you taking this over here or in a new PR? Should we close this one?

Please leave it open for now. We are discussing.

@mosabua mosabua added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label Jul 16, 2024
@Praveen2112
Member

@mosabua I could take up the work on this PR

@mosabua
Member

mosabua commented Jul 16, 2024

Sounds good @Praveen2112 .. since @anoopj is not going to continue you can continue on this PR or start a new one with his work. Just link to this PR if you create a new one.

@marcinsbd
Contributor

Continuation of this PR - #22974

@hashhar hashhar closed this Sep 11, 2024

Labels

bigquery (BigQuery connector), cla-signed, stale-ignore (Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.)

Development

Successfully merging this pull request may close these issues.

Support for direct read of BigLake tables using BigQuery storage API

10 participants