Skip to content

Add support for AVRO in Iceberg#12125

Merged
electrum merged 2 commits intotrinodb:masterfrom
ebyhr:ebi/iceberg-avro
Jul 25, 2022
Merged

Add support for AVRO in Iceberg#12125
electrum merged 2 commits intotrinodb:masterfrom
ebyhr:ebi/iceberg-avro

Conversation

@ebyhr
Copy link
Member

@ebyhr ebyhr commented Apr 25, 2022

Description

Add support for AVRO in Iceberg
Supersedes #4776

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Add support for Avro file format. ({issue}`12125`)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always be string correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more general question: Is there a big difference between the iceberg libraries for Avro / the Avro Libraries themselves? I know Ryan works on both so I know this works Im just curious if this is done to keep things in sync or ergonomics or some other reason?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the Iceberg Avro Library do away with the Utf8 class 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is neat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question: Why is this needed to be sent as part of the split if the schema is part of the file?

@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch 4 times, most recently from 7343096 to a5c6550 Compare April 28, 2022 03:08
@ebyhr ebyhr marked this pull request as ready for review April 28, 2022 03:22
@ebyhr ebyhr requested review from electrum and phd3 April 28, 2022 06:49
Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an initial skim. Looks like Avro table stat's aren't working. Is that because Avro doesn't have them, or because the read path just isn't implemented for them yet?

Comment on lines 921 to 931
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these need to be in the else block?

Suggested change
rowIndexChannels.add(false);
columnNames.add(column.getName());
columnTypes.add(column.getType());
if (field == null) {
constantPopulatingPageSourceBuilder.addConstantColumn(nativeValueToBlock(column.getType(), null));
}
else {
constantPopulatingPageSourceBuilder.addDelegateColumn(avroSourceChannel);
}
avroSourceChannel++;
if (field == null) {
constantPopulatingPageSourceBuilder.addConstantColumn(nativeValueToBlock(column.getType(), null));
}
else {
rowIndexChannels.add(false);
columnNames.add(column.getName());
columnTypes.add(column.getType());
constantPopulatingPageSourceBuilder.addDelegateColumn(avroSourceChannel);
avroSourceChannel++;
}

@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch from a5c6550 to b5cba79 Compare April 30, 2022 08:12
@ebyhr
Copy link
Member Author

ebyhr commented Apr 30, 2022

Is that because Avro doesn't have them, or because the read path just isn't implemented for them yet?

Avro file/table doesn't support statistics as far as I know.

@ebyhr ebyhr mentioned this pull request May 11, 2022
93 tasks
@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch from b5cba79 to 617b1dd Compare May 16, 2022 08:59
Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long review time. Let's get this in since it's needed for #13219

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need multiple row ID columns? Or could this be a single index?

Since we're using the Iceberg Avro reader, I'm thinking we might get the row ID automatically by including the ROW_POSITION column in the schema.

@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch 2 times, most recently from cf3c508 to 53317f6 Compare July 21, 2022 12:34
@ebyhr
Copy link
Member Author

ebyhr commented Jul 21, 2022

Just rebased on upstream to resolve a logical conflict by 308e717

@electrum
Copy link
Member

Looks like a real test failure:

2022-07-22 06:58:45 INFO: FAILURE     /    io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoCompressedData [AVRO, ZSTD] (Groups: profile_specific_tests, iceberg) took 0.6 seconds
2022-07-22 06:58:45 SEVERE: Failure cause:
java.lang.AssertionError: Expected callback to throw QueryExecutionException
	at io.trino.tempto.assertions.QueryAssert.assertQueryFailure(QueryAssert.java:107)
	at io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testSparkReadingTrinoCompressedData(TestIcebergSparkCompatibility.java:1238)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch from 53317f6 to c982721 Compare July 22, 2022 05:46
ebyhr and others added 2 commits July 25, 2022 09:15
Co-Authored-By: Xingyuan Lin <xingyuan_lin@apple.com>
@ebyhr ebyhr force-pushed the ebi/iceberg-avro branch from cd89eb8 to 2a6639f Compare July 25, 2022 00:15
@electrum electrum merged commit 5438d7d into trinodb:master Jul 25, 2022
@github-actions github-actions bot added this to the 392 milestone Jul 25, 2022
@ebyhr ebyhr deleted the ebi/iceberg-avro branch July 25, 2022 21:53
@ebyhr ebyhr mentioned this pull request Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants