Improve speed when listing columns in BigQuery by nineinchnick · Pull Request #21920 · trinodb/trino

nineinchnick · 2024-05-10T13:35:45Z

Description

A continuation of #21830. Test coverage of the type parser is at 99% of lines and 95% of branches. Original description below.

This uses a mini parser because, as far as I could tell after asking Google engineers, the BigQuery Java SDK doesn't support translating a string to a BigQuery type.

This PR speeds up listing columns. A test with 169 tables improved from 22s to 1.5s.
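To illustrate what a "mini parser" for type strings can look like, here is a minimal sketch. It is not the code from this PR: `ParsedType` and `MiniTypeParser` are hypothetical names, and the sketch assumes the input is a type string such as `NUMERIC(10, 2)` or `ARRAY<INT64>`. It deliberately leaves out STRUCT and other nested forms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: a type name, optional numeric parameters, optional array element type.
record ParsedType(String name, List<Integer> parameters, ParsedType elementType) {}

// Hypothetical recursive-descent parser; not the PR's TypeInfoUtils.
final class MiniTypeParser
{
    private final String input;
    private int position;

    private MiniTypeParser(String input)
    {
        this.input = input;
    }

    static ParsedType parse(String typeString)
    {
        MiniTypeParser parser = new MiniTypeParser(typeString.trim());
        ParsedType type = parser.parseType();
        if (parser.position != parser.input.length()) {
            throw new IllegalArgumentException("Unexpected trailing input at position " + parser.position + ": " + typeString);
        }
        return type;
    }

    private ParsedType parseType()
    {
        // Read the type name: letters, digits and underscores.
        int start = position;
        while (position < input.length()
                && (Character.isLetterOrDigit(input.charAt(position)) || input.charAt(position) == '_')) {
            position++;
        }
        String name = input.substring(start, position);
        if (name.isEmpty()) {
            throw new IllegalArgumentException("Expected a type name at position " + start);
        }
        if (name.equals("ARRAY")) {
            expect('<');
            ParsedType element = parseType();
            expect('>');
            return new ParsedType(name, List.of(), element);
        }
        // Optional parameter list, e.g. "(10, 2)".
        List<Integer> parameters = new ArrayList<>();
        if (position < input.length() && input.charAt(position) == '(') {
            int close = input.indexOf(')', position);
            if (close < 0) {
                throw new IllegalArgumentException("Missing ')' in: " + input);
            }
            for (String part : input.substring(position + 1, close).split(",")) {
                parameters.add(Integer.parseInt(part.trim()));
            }
            position = close + 1;
        }
        return new ParsedType(name, List.copyOf(parameters), null);
    }

    private void expect(char expected)
    {
        if (position >= input.length() || input.charAt(position) != expected) {
            throw new IllegalArgumentException("Expected '" + expected + "' at position " + position);
        }
        position++;
    }
}
```

For example, `MiniTypeParser.parse("ARRAY<INT64>")` yields a `ParsedType` named `ARRAY` whose `elementType()` is named `INT64`.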

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# BigQuery connector
* Improve speed of fetching columns metadata. ({issue}`issuenumber`)

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/DecimalTypeInfo.java

wendigo · 2024-05-10T13:58:19Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfo.java

These methods are not needed; remove toString, equals and hashCode. toString is only used in tests, and that's redundant: we only need a single round trip, String -> TypeInfo.

toString is used in a few exceptions. I'll remove the others.

Where are the exceptions?

Like here: https://github.com/nineinchnick/trino/blob/bigquery-bulk-columns/plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java#L455

However, the IllegalArgumentException will be suppressed in convertToTrinoType, right?

Yes. I can either remove it, or we could add some logging, to make it easier to debug why some columns are not available in Trino. LMK what's better.
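The logging option mentioned above could look roughly like the following sketch. It is an illustration only: `LenientTypeConversion` and `convertOrSkip` are hypothetical names, and `java.util.logging` stands in for whatever logger the connector actually uses.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: converts a type string, logging and skipping the column on failure.
final class LenientTypeConversion
{
    private static final Logger log = Logger.getLogger(LenientTypeConversion.class.getName());

    private LenientTypeConversion() {}

    static <T> Optional<T> convertOrSkip(String columnName, String typeString, Function<String, T> converter)
    {
        try {
            return Optional.of(converter.apply(typeString));
        }
        catch (IllegalArgumentException e) {
            // The exception is still suppressed, but its message is kept for debugging.
            log.log(Level.FINE, e, () -> "Skipping column " + columnName + " with unsupported type " + typeString);
            return Optional.empty();
        }
    }
}
```

With this shape the exception message, including any `toString` output embedded in it, remains visible in the logs even though the column is dropped from the result.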

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/type/TestTypeInfoUtils.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/TestBigQueryType.java

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

nineinchnick · 2024-05-17T12:41:14Z

@ebyhr PTAL

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/TestBigQueryType.java

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java

ebyhr · 2024-05-20T06:48:20Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java

Why do we need to build a DatasetId and call the getDataset method (when is it different from remoteSchemaName)?

I added it for consistency. This connector has a local to remote names mapping.

I understand it's for extension. Such a change should go to the internal repository.

Then we'd have to revert #19860. It's out of the scope of this PR.

No need to revert the PR. Just removing this line is sufficient.

I don't understand, this is required for consistency. We either allow this local-to-remote name mapping, or not.

Please take a look at the listRelationCommentMetadata() method. Also, if you still want to keep this line, can you write a test that doesn't pass without it? If not, please remove it.

this local-to-remote name mapping

Please explain when schemaName gets a different value from remoteSchemaName in this repository:

    String schemaName = client.toSchemaName(DatasetId.of(projectId, remoteSchemaName));
    ...
    protected String toSchemaName(DatasetId datasetId)
    {
        return datasetId.getDataset();
    }
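The point of the question can be shown with a small stand-alone sketch. All names here are hypothetical stand-ins (`DatasetRef` for the SDK's `DatasetId`, `ClientSketch` for the client class); the subclass is an invented example of an extension, not something that exists in this repository.

```java
// Hypothetical stand-in for a dataset identifier with a project and a dataset name.
record DatasetRef(String project, String dataset) {}

// Mirrors the default mapping quoted above: the schema name is just the dataset name.
class ClientSketch
{
    protected String toSchemaName(DatasetRef datasetId)
    {
        return datasetId.dataset();
    }
}

// Invented extension that would make the local name differ from the remote one.
class PrefixingClientSketch
        extends ClientSketch
{
    @Override
    protected String toSchemaName(DatasetRef datasetId)
    {
        return datasetId.project() + "_" + datasetId.dataset();
    }
}
```

With the default implementation the round trip is an identity mapping, so the result always equals the remote schema name; only an overriding subclass changes that.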

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryMetadata.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/PrimitiveTypeInfo.java

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

ebyhr · 2024-05-24T04:50:23Z

/test-with-secrets sha=bf0cf785188ed0cca97f6b18f4f750df035ebe75

github-actions · 2024-05-24T04:51:26Z

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9218809344

ebyhr

Almost good to me.

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java

ebyhr · 2024-05-24T04:38:37Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java


plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

ebyhr · 2024-05-24T04:59:38Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfo.java


ebyhr · 2024-05-24T05:10:35Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/UnsupportedTypeException.java

This name looks a little confusing because the STRUCT type is supported except for parsing. How about renaming it to ParsingException or something?

It's not supported in the type package. ParsingException seems to be too broad, or I'd have to refactor the exception to use it instead of IllegalArgumentException and make the typeName parameter optional. I don't think it's worth doing.

It's not supported in the type package

I know. However, the class is also used from outside of the package (BigQueryTypeManager). The package is unclear when reading the code in the class.

How can I make it more clear, using a fully qualified name, or with a comment?

ebyhr · 2024-05-24T05:27:43Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

Why do we want to continue iteration even when the table is empty?

We only need the table to map STRUCT types. We should continue iteration to convert other types.

I suppose table.isEmpty() means the table doesn't exist. No need to return columns in my opinion.

It doesn't mean it doesn't exist, just that it wasn't provided.

No, the tableSupplier is called with the getTable method. It returns an empty Optional when the table doesn't exist:

trino/plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java

Lines 208 to 217 in 569b045

    public Optional<TableInfo> getTable(TableId remoteTableId)
    {
        try {
            return Optional.ofNullable(bigQuery.getTable(remoteTableId));
        }
        catch (BigQueryException e) {
            // getTable method throws an exception in some situations, e.g. wild card tables
            return Optional.empty();
        }
    }
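The behaviour under discussion (skip only STRUCT columns when no table is available, but still convert the rest) can be sketched as follows. This is a simplified illustration, not the PR's code: `StructFallbackSketch` is a hypothetical name, columns are plain name-to-type-string maps, and the "table" is reduced to an optional map of already resolved STRUCT types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of "continue iteration when the table is empty".
final class StructFallbackSketch
{
    private StructFallbackSketch() {}

    static List<String> convert(Map<String, String> columnTypes, Optional<Map<String, String>> table)
    {
        List<String> converted = new ArrayList<>();
        for (Map.Entry<String, String> column : columnTypes.entrySet()) {
            String type = column.getValue();
            if (type.startsWith("STRUCT")) {
                if (table.isEmpty()) {
                    // No table metadata: only this STRUCT column is skipped.
                    continue;
                }
                type = table.get().get(column.getKey());
                if (type == null) {
                    continue;
                }
            }
            converted.add(column.getKey() + " " + type);
        }
        return converted;
    }
}
```

Under the other reading in this thread, where an empty result means the table does not exist, the method would instead return an empty list as soon as `table.isEmpty()` is observed.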

plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/type/TestTypeInfoUtils.java

ebyhr · 2024-05-27T22:07:33Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

+        return RowType.from(fields);
+    }
+
+    public List<ColumnMetadata> convertToTrinoType(List<String> names, List<String> types)


This method is used only from tests. Remove.

pajaks · 2024-06-03T10:48:01Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryTypeManager.java

+                    // ignore unsupported types
+                    continue;
+                }
+                typeSignature = createRowType(table.get().getDefinition().getSchema().getFields().get(name)).getTypeSignature();


Why do we try to get the typeSignature in the case of an unsupported type?
If such a STRUCT type is valid, maybe parseTypeString should be adapted to accept it?

pajaks · 2024-06-03T10:58:02Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/BigDecimalTypeInfo.java

+public final class BigDecimalTypeInfo
+        extends PrimitiveTypeInfo
+{
+    private static final int MAX_PRECISION_MINUS_SCALE = 38;


Does it mean that the max precision is 76?
Why not just use a MAX_PRECISION constant?

pajaks · 2024-06-03T10:58:32Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/DecimalTypeInfo.java

+public final class DecimalTypeInfo
+        extends PrimitiveTypeInfo
+{
+    private static final int MAX_PRECISION_MINUS_SCALE = 29;


Why not a MAX_PRECISION constant?
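For context on the two constants, BigQuery documents parameterized decimals with a bound on precision minus scale: for NUMERIC(P, S), 0 ≤ S ≤ 9 and max(1, S) ≤ P ≤ S + 29; for BIGNUMERIC(P, S), 0 ≤ S ≤ 38 and max(1, S) ≤ P ≤ S + 38. The sketch below encodes those rules. `DecimalBoundsSketch` and its method names are hypothetical, and the maximum-scale values are taken from the BigQuery documentation rather than from this PR.

```java
// Hypothetical validation sketch for BigQuery's parameterized decimal types.
final class DecimalBoundsSketch
{
    private DecimalBoundsSketch() {}

    static boolean isValid(int precision, int scale, int maxScale, int maxPrecisionMinusScale)
    {
        return scale >= 0
                && scale <= maxScale
                && precision >= Math.max(1, scale)
                && precision - scale <= maxPrecisionMinusScale;
    }

    // NUMERIC(P, S): 0 <= S <= 9, max(1, S) <= P <= S + 29
    static boolean isValidNumeric(int precision, int scale)
    {
        return isValid(precision, scale, 9, 29);
    }

    // BIGNUMERIC(P, S): 0 <= S <= 38, max(1, S) <= P <= S + 38
    static boolean isValidBigNumeric(int precision, int scale)
    {
        return isValid(precision, scale, 38, 38);
    }
}
```

Under these rules BIGNUMERIC(76, 38) is valid while BIGNUMERIC(76, 0) is not, so the upper bound on precision depends on the scale and a single maximum-precision constant would not express it.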

pajaks · 2024-06-03T11:18:31Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

+
+    private static class TypeInfoParser
+    {
+        public record Token(int position, String text, boolean type)


Suggested change

-        public record Token(int position, String text, boolean type)
+        public record Token(int position, String text, boolean charType)

pajaks · 2024-06-03T11:29:17Z

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/TypeInfoUtils.java

+            int end = 1;
+            while (end <= typeInfoString.length()) {
+                if (end == typeInfoString.length() ||
+                        !isTypeChar(typeInfoString.charAt(end - 1)) ||


If we check every char, under what condition would end - 1 not have been checked in the previous loop iteration?
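A stand-alone sketch of this style of tokenizer may help with the question. It is not the PR's code: the quoted hunk is cut off after the second condition, so the third condition below (`!isTypeChar(charAt(end))`) and the definition of `isTypeChar` are assumptions, and `TokenizerSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer sketch: splits a type string into runs of type characters
// and single-character delimiter tokens.
final class TokenizerSketch
{
    record Token(int position, String text, boolean type) {}

    private TokenizerSketch() {}

    // Assumption: letters, digits and '_' form type names; everything else is a delimiter.
    static boolean isTypeChar(char c)
    {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    static List<Token> tokenize(String typeInfoString)
    {
        List<Token> tokens = new ArrayList<>();
        int begin = 0;
        int end = 1;
        while (end <= typeInfoString.length()) {
            // Close the current token at the end of input, right after a delimiter
            // (end - 1), or right before a delimiter (end).
            if (end == typeInfoString.length()
                    || !isTypeChar(typeInfoString.charAt(end - 1))
                    || !isTypeChar(typeInfoString.charAt(end))) {
                tokens.add(new Token(begin, typeInfoString.substring(begin, end), isTypeChar(typeInfoString.charAt(begin))));
                begin = end;
            }
            end++;
        }
        return tokens;
    }
}
```

In this sketch each character is inspected twice, but in different roles. The check on `end` closes a name when the next character is a delimiter, while the check on `end - 1` closes a delimiter token when the next character starts a name. Without the `end - 1` check, an input such as `<INT64` would stay a single token.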

oskar-szwajkowski · 2024-07-26T04:30:15Z

Hey @nineinchnick, will this be worked on in near future?

nineinchnick · 2024-07-29T09:33:16Z

@oskar-szwajkowski no, this has very low priority. I'll actually close this, to avoid further confusion.

ebyhr · 2024-09-01T03:11:50Z

@oskar-szwajkowski Can you open a new PR?

oskar-szwajkowski · 2024-09-03T09:48:32Z

@oskar-szwajkowski Can you open a new PR?

What would be the motivation for that?

So far I haven't spotted any downsides to how the column list is fetched in BigQuery catalogs that could impact the workloads I looked at.

Is this getting higher priority, and should it be reopened?

mosabua · 2024-09-03T17:40:06Z

If you open a new PR based on this PR but from your end @oskar-szwajkowski, you can continue to drive updates yourself. @nineinchnick is not planning to work on this so if you are interested you can take over that way.

At least I think that was the idea from @ebyhr ...

cla-bot bot added the cla-signed label May 10, 2024

github-actions bot added the bigquery label May 10, 2024

nineinchnick requested review from ebyhr and wendigo May 10, 2024 13:56

wendigo reviewed May 10, 2024

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/type/DecimalTypeInfo.java

wendigo reviewed May 10, 2024

nineinchnick force-pushed the bigquery-bulk-columns branch 3 times, most recently from d3fba84 to f43ccd6 Compare May 12, 2024 08:24

nineinchnick requested a review from wendigo May 12, 2024 09:28

ebyhr reviewed May 13, 2024

nineinchnick force-pushed the bigquery-bulk-columns branch from f43ccd6 to 6b34692 Compare May 14, 2024 10:50

nineinchnick requested a review from ebyhr May 14, 2024 10:54

ebyhr reviewed May 15, 2024

nineinchnick force-pushed the bigquery-bulk-columns branch 2 times, most recently from a2bbea8 to 3654c2e Compare May 16, 2024 10:16

nineinchnick requested a review from ebyhr May 16, 2024 10:18

ebyhr reviewed May 20, 2024

nineinchnick force-pushed the bigquery-bulk-columns branch 2 times, most recently from cc6a799 to bf0cf78 Compare May 21, 2024 10:25

findinpath requested a review from pajaks May 21, 2024 21:06

ebyhr reviewed May 24, 2024

Jan Waś and others added 4 commits May 24, 2024 13:14

Format the BaseBigQueryConnectorTest class

debd3b6

Prefer toImmutableList over toList

0181486

Prefer static imports in BigQueryTypeManager

b92cf5b

Improve speed when listing columns in BigQuery

8c16327

nineinchnick force-pushed the bigquery-bulk-columns branch from bf0cf78 to 8c16327 Compare May 24, 2024 11:43

ebyhr reviewed May 27, 2024

pajaks reviewed Jun 3, 2024

github-actions bot added the stale label Jun 24, 2024

ebyhr added the stale-ignore label Jun 25, 2024

nineinchnick closed this Jul 29, 2024
