Skip to content

Support CREATE TABLE AS SELECT and INSERT in BigQuery#13094

Merged
ebyhr merged 4 commits intomasterfrom
ebi/bigquery-write
Aug 23, 2022
Merged

Support CREATE TABLE AS SELECT and INSERT in BigQuery#13094
ebyhr merged 4 commits intomasterfrom
ebi/bigquery-write

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Jul 6, 2022

Description

Support CREATE TABLE AS SELECT and INSERT in BigQuery
Fixes #6868
Fixes #6869

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# BigQuery
* Support `CREATE TABLE AS SELECT` and `INSERT` statements. ({issue}`6868`, {issue}`6869`)

@cla-bot cla-bot bot added the cla-signed label Jul 6, 2022
@github-actions github-actions bot added the docs label Jul 6, 2022
@ebyhr ebyhr marked this pull request as ready for review July 6, 2022 02:56
@ebyhr ebyhr requested review from hashhar and wendigo July 6, 2022 02:57
@wendigo
Copy link
Copy Markdown
Contributor

wendigo commented Jul 7, 2022

I'm concerned about tabledata.insertAll performance.

As stated in the docs: Streams data into BigQuery one record at a time without needing to run a load job.

@ebyhr ebyhr marked this pull request as draft August 3, 2022 05:36
public ConnectorOutputTableHandle beginCreateTable(ConnectorSession session, ConnectorTableMetadata tableMetadata, Optional<ConnectorTableLayout> layout, RetryMode retryMode)
{
if (tableMetadata.getComment().isPresent()) {
throw new TrinoException(NOT_SUPPORTED, "This connector does not support creating tables with table comment");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it bigquery limitation or we just don't support it yet?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just connector's limitation.

if (isWildcardTable(TableDefinition.Type.valueOf(table.getType()), table.getRemoteTableName().getTableName())) {
throw new TrinoException(BIGQUERY_UNSUPPORTED_OPERATION, "This connector does not support inserting into wildcard tables");
}
List<String> columnNames = columns.stream().map(column -> ((BigQueryColumnHandle) column).getName()).collect(toImmutableList());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDK 17 has .toList() on a Stream<> which according to Javadoc returns an unmodifiable list. Should we use it @martint ?

For reference: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/stream/Stream.html#toList()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unmodifiable list and immutablelist semantics are different. null hostility being the big one and unmodifiable isn't really unmodifiable if someone else holds a reference.

return StandardSQLTypeName.TIMESTAMP;
}
if (type instanceof CharType || type instanceof VarcharType) {
if (type instanceof VarcharType) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was required to fix incorrect predicates between char vs varchar if I remember correctly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is CharType mapped to now? I don't see explicit mapping so does it mean it's unsupported type now?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mapped to BigQuery STRING type now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably being stupid but I can't find code which handles that (mapping Trino char to BigQuery STRING). Can you point me to it?

I'm probably confused because of removal of char from createTableSupportedTypes in TestBigQueryConnectorTests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misunderstood "now" as "before this change". I wanted to say:

  • CHAR was mapped to STRING before this PR
  • CHAR is unsupported after this PR

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for causing confusion. 😄 Do we want to mention this in release notes? The only possible impact I see is some CTAS statements from other catalogs might fail but easy to preserve old "incorrect" behaviour by casting CHAR columns to VARCHAR instead.

Copy link
Copy Markdown
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall design looks good. We need to improve on type coverage and error handling.

@ebyhr ebyhr marked this pull request as ready for review August 17, 2022 09:09
@ebyhr ebyhr force-pushed the ebi/bigquery-write branch from bf0ea16 to 2ba1e4d Compare August 17, 2022 09:18
Copy link
Copy Markdown
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :)

@ebyhr ebyhr force-pushed the ebi/bigquery-write branch 2 times, most recently from 23fff3c to d01df23 Compare August 18, 2022 01:57
@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Aug 18, 2022

@hashhar Could you please review this PR when you have time?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be SchemaTableName or the remote mapped names?

Copy link
Copy Markdown
Member

@hashhar hashhar Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly confused now. I see that you didn't have to add any additional code to convert name to remote names. Is it because the remote name logic is contained within BigQueryClient?

If that is the case then your older naming was more correct since the handle will include "Trino" names instead of "remote" names and the BigQueryClient will handle the conversions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "remote", but a logic to lookup a remote schema name was missing. Could you take a look at https://github.com/trinodb/trino/compare/28038ec2b874963631ae476ca1b9cdcef36e0462..cac3a485decbaebdada4a356c0b50bab68ca34b7?

Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to look at type mapping tests.

Also we should add tests for inserts into TestBigQueryCaseInsensitiveMapping (and also for JDBC connectors but that's separate PR).

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Aug 18, 2022

@hashhar Addressed comments.

Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % some questions.

Comment on lines 421 to 424
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this. Was this caught by some existing test?

return StandardSQLTypeName.TIMESTAMP;
}
if (type instanceof CharType || type instanceof VarcharType) {
if (type instanceof VarcharType) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably being stupid but I can't find code which handles that (mapping Trino char to BigQuery STRING). Can you point me to it?

I'm probably confused because of removal of char from createTableSupportedTypes in TestBigQueryConnectorTests.

@ebyhr ebyhr force-pushed the ebi/bigquery-write branch from 28038ec to cac3a48 Compare August 19, 2022 09:08
Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks.

I'll revisit type mapping and approve.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should also call this in createSchema to avoid creating schemas in BigQuery which differ only in case. Separate and pre-existing issue however.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this is not needed/the wrong thing to do. When creating schemas we already list them and the listing will include the lowercase name so Trino will see a schema already exists - I've added a test to verify this similar to what you added for DROP SCHEMA in #13812

ebyhr added 3 commits August 20, 2022 07:35
They are different types, so we shouldn't treat as same type.
* Rename the test method to testInt64 from testInteger
* Fix data provider and reorder the entries
@ebyhr ebyhr force-pushed the ebi/bigquery-write branch from cac3a48 to 8d17f02 Compare August 19, 2022 22:41
@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Aug 19, 2022

Rebased on upstream to resolve conflicts.

Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - a question for my understanding.

"SELECT 1234567890, 1.23",
"SELECT count(*) + 1 FROM customer");

// TODO: BigQuery throws table not found at BigQueryClient.insert if we reuse the same table name
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a bit more to me? Does this mean that when the DROP TABLE above runs and then we try to CREATE TABLE with same name again here it fails?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that when the DROP TABLE above runs and then we try to CREATE TABLE with same name again here it fails?

Yes. Also, the issue doesn't happen in case of empty tables. It seems there's cache or something for BigQuery.insertAll. I couldn't find relevant options in the library at glance. I will look into the detail later.

@ebyhr ebyhr force-pushed the ebi/bigquery-write branch from 8d17f02 to 6633525 Compare August 23, 2022 00:38
Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for working on this.

@ebyhr ebyhr merged commit 80d6b6c into master Aug 23, 2022
@ebyhr ebyhr deleted the ebi/bigquery-write branch August 23, 2022 06:29
@ebyhr ebyhr mentioned this pull request Aug 23, 2022
@github-actions github-actions bot added this to the 394 milestone Aug 23, 2022
@Override
public ConnectorOutputTableHandle beginCreateTable(ConnectorSession session, ConnectorTableMetadata tableMetadata, Optional<ConnectorTableLayout> layout, RetryMode retryMode)
{
return createTable(session, tableMetadata);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also need a defensive check to verify query retries are not enabled?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Thanks for catching that. Do you want to send the PR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-----------

The connector provides read and write access to data and metadata in the
BigQuery database, though write access is limited. In addition to the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Add CREATE TABLE AS (CTAS) support to BigQuery connector Add INSERT support to BigQuery connector

4 participants