Skip to content

Improve speed when listing columns in BigQuery#21920

Closed
nineinchnick wants to merge 4 commits intotrinodb:masterfrom
nineinchnick:bigquery-bulk-columns
Closed

Improve speed when listing columns in BigQuery#21920
nineinchnick wants to merge 4 commits intotrinodb:masterfrom
nineinchnick:bigquery-bulk-columns

Conversation

@nineinchnick
Copy link
Copy Markdown
Member

Description

A continuation of #21830. Test coverage of the type parser is at 99% of lines and 95% of branches. Original description below.

Using mini parser because BigQuery Java SDK doesn't support translating string to BigQuery type as far as I asked Google engineers.

This PR improves listing columns. Test with 169 tables is improved from 22s to 1.5s.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# BigQuery connector
* Improve speed of fetching columns metadata. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label May 10, 2024
@github-actions github-actions bot added the bigquery BigQuery connector label May 10, 2024
@nineinchnick nineinchnick requested review from ebyhr and wendigo May 10, 2024 13:56
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these methods are not needed, remove toString, equals and hashCode. toString is only used in test but that's redundant - we only need a single roundtrip: String -> TypeInfo

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toString is used in a few exceptions. I'll remove others.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's the exceptions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the IllegalArgumentException will be suppressed in convertToTrinoType, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I can either remove it, or we could add some logging, to make it easier to debug why some columns are not available in Trino. LMK what's better.

@nineinchnick nineinchnick force-pushed the bigquery-bulk-columns branch 3 times, most recently from d3fba84 to f43ccd6 Compare May 12, 2024 08:24
@nineinchnick nineinchnick requested a review from wendigo May 12, 2024 09:28
@nineinchnick nineinchnick force-pushed the bigquery-bulk-columns branch from f43ccd6 to 6b34692 Compare May 14, 2024 10:50
@nineinchnick nineinchnick requested a review from ebyhr May 14, 2024 10:54
@nineinchnick nineinchnick force-pushed the bigquery-bulk-columns branch 2 times, most recently from a2bbea8 to 3654c2e Compare May 16, 2024 10:16
@nineinchnick nineinchnick requested a review from ebyhr May 16, 2024 10:18
@nineinchnick
Copy link
Copy Markdown
Member Author

@ebyhr PTAL

Copy link
Copy Markdown
Member

@ebyhr ebyhr May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to build DatasetId and call getDataset method (when is it different from remoteSchemaName)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it for consistency. This connector has a local to remote names mapping.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #19860

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it's for extension. Such change should go to the internal repository.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we'd have to revert #19860. It's out of the scope of this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to revert the PR. Just removing this line is sufficient.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, this is required for consistency. We either allow this local-to-remote name mapping, or not.

Copy link
Copy Markdown
Member

@ebyhr ebyhr May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at listRelationCommentMetadata() method. Also, can you write a test that doesn't pass without this line if you still want to keep it? If not, please remove it.

this local-to-remote name mapping

Please explain when schemaName gets a different value from remoteSchemaName in this repository:

    String schemaName = client.toSchemaName(DatasetId.of(projectId, remoteSchemaName));
...
    protected String toSchemaName(DatasetId datasetId)
    {
        return datasetId.getDataset();
    }

@nineinchnick nineinchnick force-pushed the bigquery-bulk-columns branch 2 times, most recently from cc6a799 to bf0cf78 Compare May 21, 2024 10:25
@findinpath findinpath requested a review from pajaks May 21, 2024 21:06
@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented May 24, 2024

/test-with-secrets sha=bf0cf785188ed0cca97f6b18f4f750df035ebe75

@github-actions
Copy link
Copy Markdown

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9218809344

Copy link
Copy Markdown
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost good to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it's for extension. Such change should go to the internal repository.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the IllegalArgumentException will be suppressed in convertToTrinoType, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name looks little confusing because STRUCT type is supported except for parsing. How about renaming to ParsingException or something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not supported in the type package. ParsingException seems to be too broad, or I'd have to refactor the exception to use it instead of IllegalArgumentException and make the typeName parameter optional. I don't think it's worth doing.

Copy link
Copy Markdown
Member

@ebyhr ebyhr May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not supported in the type package

I know. However, the class is used from outside of the package (BigQueryTypeManager) either. The package is unclear when reading the code in the class.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can I make it more clear, using a fully qualified name, or with a comment?

Comment on lines 421 to 423
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to continue iteration even when table is empty?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need the table to map STRUCT types. We should continue iteration to convert other types.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose table.isEmpty() means the table doesn't exist. No need to return columns in my opinion.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't mean it doesn't exit, just it wasn't provided.

Copy link
Copy Markdown
Member

@ebyhr ebyhr May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the tableSupplier is called with getTable method. It returns an empty when the table doesn't exist:

public Optional<TableInfo> getTable(TableId remoteTableId)
{
try {
return Optional.ofNullable(bigQuery.getTable(remoteTableId));
}
catch (BigQueryException e) {
// getTable method throws an exception in some situations, e.g. wild card tables
return Optional.empty();
}
}

@nineinchnick nineinchnick force-pushed the bigquery-bulk-columns branch from bf0cf78 to 8c16327 Compare May 24, 2024 11:43
return RowType.from(fields);
}

public List<ColumnMetadata> convertToTrinoType(List<String> names, List<String> types)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used only from tests. Remove.

// ignore unsupported types
continue;
}
typeSignature = createRowType(table.get().getDefinition().getSchema().getFields().get(name)).getTypeSignature();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we try to get typeSingature in case of unsupported type?
If such STRUCT type is valid maybe parseTypeString should be adapted to accept it?

public final class BigDecimalTypeInfo
extends PrimitiveTypeInfo
{
private static final int MAX_PRECISION_MINUS_SCALE = 38;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that max precision is 76?
Why not use just MAX_PRECISION constant?

public final class DecimalTypeInfo
extends PrimitiveTypeInfo
{
private static final int MAX_PRECISION_MINUS_SCALE = 29;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not MAX_PRECISION constant?


private static class TypeInfoParser
{
public record Token(int position, String text, boolean type)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public record Token(int position, String text, boolean type)
public record Token(int position, String text, boolean charType)

int end = 1;
while (end <= typeInfoString.length()) {
if (end == typeInfoString.length() ||
!isTypeChar(typeInfoString.charAt(end - 1)) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we check every char what is the condition that end - 1 will not be cached in previous loop iteration?

@github-actions

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Jun 24, 2024
@ebyhr ebyhr added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label Jun 25, 2024
@oskar-szwajkowski
Copy link
Copy Markdown
Contributor

Hey @nineinchnick, will this be worked on in near future?

@nineinchnick
Copy link
Copy Markdown
Member Author

@oskar-szwajkowski no, this has very low priority. I'll actually close this, to avoid further confusion.

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Sep 1, 2024

@oskar-szwajkowski Can you open a new PR?

@oskar-szwajkowski
Copy link
Copy Markdown
Contributor

@oskar-szwajkowski Can you open a new PR?

what would be motivation for that?

I currently haven't spotted any downsides of getting column list in big query catalogs, that could impact workload that I looked at.

Is this getting more priority and should be opened again?

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Sep 3, 2024

If you open a new PR based on this PR but from your end @oskar-szwajkowski, you can continue to drive updates yourself. @nineinchnick is not planning to work on this so if you are interested you can take over that way.

At least I think that was the idea from @ebyhr ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bigquery BigQuery connector cla-signed stale stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.

Development

Successfully merging this pull request may close these issues.

6 participants