Add Avro support to Iceberg Connector #4776
249f37b to b3230e3
phd3
left a comment
Added some comments (mostly minor), still reviewing.
OrcFileWriter --> IcebergAvroPageSource
}

private IcebergFileWriter createAvroWriter(
        String schemaName,
the requirement of providing a "tableName" in the Avro.WriteBuilder#named() API feels a bit strange to me. However, I was also wondering if we could just pass hdfsContext as an argument here and use the table name from there.
Agreed. We already pass HdfsContext to createParquetWriter()
@Test
public void testHourTransform()
{
Are the changes in this file from a different commit?
import static io.prestosql.plugin.iceberg.util.IcebergAvroDataConversion.serializeToPrestoObject;
import static java.util.Objects.requireNonNull;

public class IcebergAvroPageSource
Is there a reason to not use RecordPageSource with a cursor implementation? The page building mechanism there is pretty similar.
IIRC using RecordPageSource with an extended record cursor had some performance advantage over using a ConnectorPageSource that is internally row-oriented, but don't remember the details. @dain is that still the case? If so, is the performance difference considerable?
That's correct, RecordPageSource has special handling in the engine. It has the advantage of not materializing entire column pages if there is filtering. Consider this:
SELECT x WHERE y > 5
With ConnectorPageSource, we materialize entire pages of x even for rows where the y predicate is false. Since we're not using lazy pages, we actually materialize x even if the predicate is false for the entire page (and thus we don't need x at all).
Note that I'm not saying we need to do it this way -- just something to consider.
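A self-contained sketch of the trade-off described above, using plain Java with simulated columns rather than Presto's actual page source APIs (all names here are illustrative): a page-oriented source copies every x value before the y > 5 filter runs, while a cursor-oriented source only touches x for rows that pass the filter.

```java
import java.util.ArrayList;
import java.util.List;

public class MaterializationDemo {
    // One simulated page of two columns; no row satisfies y > 5
    static final int[] X = {10, 20, 30, 40};
    static final int[] Y = {1, 2, 3, 4};

    // Page-oriented read: every x value is copied into an output page
    // before the filter runs, so 4 values are materialized even though
    // the predicate matches nothing.
    static int pageOrientedMaterializations() {
        int materialized = 0;
        int[] xPage = new int[X.length];
        for (int i = 0; i < X.length; i++) {
            xPage[i] = X[i]; // eager materialization
            materialized++;
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < Y.length; i++) {
            if (Y[i] > 5) {
                result.add(xPage[i]);
            }
        }
        return materialized;
    }

    // Cursor-oriented read: x is fetched only for rows that pass the
    // filter, so nothing is materialized here.
    static int cursorOrientedMaterializations() {
        int materialized = 0;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < Y.length; i++) {
            if (Y[i] > 5) {
                result.add(X[i]); // lazy materialization
                materialized++;
            }
        }
        return materialized;
    }

    public static void main(String[] args) {
        System.out.println(pageOrientedMaterializations());   // 4
        System.out.println(cursorOrientedMaterializations()); // 0
    }
}
```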
        .build();
}
catch (IOException e) {
    throw new PrestoException(ICEBERG_WRITER_OPEN_ERROR, "Error creating Avro file", e);
nit: add file path in the error message for ease of debugging?
else {
    unscaledValue = Decimals.decodeUnscaledValue(decimalType.getSlice(block, position));
}
return new BigDecimal(unscaledValue, decimalType.getScale());
should we use new BigDecimal(unscaledValue, decimalType.getScale(), type.getPrecision())? Or maybe just Decimals#readBigDecimal?
Decimals.readBigDecimal() is the best way
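For illustration, a stdlib-only sketch of what reassembling the decimal amounts to; the real Decimals.readBigDecimal() is expected to encapsulate this plus picking the short (long-encoded) or long (Slice-encoded) unscaled value based on the DecimalType, so this helper is a hypothetical stand-in.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ReadBigDecimalDemo {
    // Rebuild a decimal value from its unscaled representation, which is
    // what the snippet above does by hand for the long-decimal branch.
    static BigDecimal fromUnscaled(BigInteger unscaled, int scale) {
        return new BigDecimal(unscaled, scale);
    }

    public static void main(String[] args) {
        // DECIMAL(5, 2) value 123.45 is stored as unscaled value 12345
        BigDecimal value = fromUnscaled(BigInteger.valueOf(12345), 2);
        System.out.println(value); // 123.45
    }
}
```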
    return type.getSlice(block, position).toStringUtf8();
}
if (type.equals(VARBINARY)) {
    if (icebergType.typeId().equals(FIXED)) {
could you elaborate on what is the reason for this special case?
if (type instanceof MapType) {
    Type keyType = type.getTypeParameters().get(0);
    Type valueType = type.getTypeParameters().get(1);
    org.apache.iceberg.types.Type keyIcebergtype = icebergType.asMapType().keyType();
nit: camel case in keyIcebergType and valueIcebergType
List<Types.NestedField> icebergFields = icebergType.asStructType().fields();
BlockBuilder currentBuilder = builder.beginBlockEntry();
for (int i = 0; i < typeParameters.size(); i++) {
    serializeToPrestoObject(typeParameters.get(i), icebergFields.get(1).type(), currentBuilder, record.get(i), timeZoneKey);

Map<?, ?> map = (Map<?, ?>) object;
Type keyType = ((MapType) type).getKeyType();
Type valueType = ((MapType) type).getValueType();
org.apache.iceberg.types.Type keyIcebergtype = icebergType.asMapType().keyType();
phd3
left a comment
finished reviewing, it looks great, only a couple more comments.
Block rowBlock = block.getObject(position, Block.class);

List<Type> fieldTypes = type.getTypeParameters();
checkCondition(fieldTypes.size() == rowBlock.getPositionCount(), GENERIC_INTERNAL_ERROR, "Expected row value field count does not match type field count");
nit: instead of relying on a Hive module class, maybe throw PrestoException here directly?
}
return new BigDecimal(unscaledValue, decimalType.getScale());
}
if (type.equals(VARCHAR)) {
does this cause bounded varchar types to throw an exception? we might want to use instanceof VarcharType right?
Yep, need to use instanceof VarcharType here
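A stdlib sketch of why equals falls short here. This VarcharType is a minimal stand-in for Presto's class, not the real one: its equality compares the length bound, so a bounded VARCHAR(10) never equals the unbounded VARCHAR singleton, while instanceof accepts every varchar.

```java
public class VarcharCheckDemo {
    // Minimal stand-in for Presto's VarcharType: equality includes the
    // length bound, mimicking why type.equals(VARCHAR) rejects VARCHAR(10).
    static final class VarcharType {
        static final VarcharType VARCHAR = new VarcharType(Integer.MAX_VALUE);
        final int length;

        VarcharType(int length) {
            this.length = length;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof VarcharType && ((VarcharType) o).length == length;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(length);
        }
    }

    public static void main(String[] args) {
        Object bounded = new VarcharType(10);
        System.out.println(bounded.equals(VarcharType.VARCHAR)); // false: equals misses bounded varchars
        System.out.println(bounded instanceof VarcharType);      // true: instanceof matches all varchars
    }
}
```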
}
return;
}
if (type.equals(VARCHAR)) {
same comment about supporting bounded varchar types
@lxynov what's pending? Is there something that we can help with?
"(DATE '2015-05-15', 2, NULL, NULL, 4, 5), " +
"(DATE '2020-02-21', 2, NULL, NULL, 6, 7)";
}
if (!columnStatisticsCollected) {
I think these tests will become simpler if we make it similar to how we handle ORC. For example, instead of defining columnStatisticsCollected, adding new methods, and subclassing, could we not just do if (format == AVRO) and test appropriately?
@electrum @rdsr @phd3 Hey, I'm thinking of dividing this PR into 3 parts:

Please let me know if you have comments.
@lxynov sounds good to me!
if (closed) {
    return;
}
closed = true;
Should this line be moved after recordIterator.close()?
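A small stdlib sketch of the ordering being suggested (the names here stand in for the page source's fields; closeIterator() is a hypothetical stand-in for recordIterator.close()): setting the flag only after the iterator is released means a throwing close() can be retried, whereas setting it first guarantees the iterator is never closed twice even on failure.

```java
public class CloseOrderingDemo {
    static boolean closed;
    static int iteratorCloseCalls;

    // Illustrative stand-in for recordIterator.close()
    static void closeIterator() {
        iteratorCloseCalls++;
    }

    static void close() {
        if (closed) {
            return; // already closed: no-op
        }
        closeIterator(); // release the underlying iterator first
        closed = true;   // mark closed only once resources are released
    }

    public static void main(String[] args) {
        close();
        close(); // second call is a no-op
        System.out.println(iteratorCloseCalls); // 1
    }
}
```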
import static org.apache.iceberg.util.DateTimeUtil.timestampFromMicros;
import static org.apache.iceberg.util.DateTimeUtil.timestamptzFromMicros;

public final class IcebergAvroDataConversion
Are there tests that cover this class?
electrum
left a comment
I started reviewing this a while back and had a bunch of pending comments. I'll submit them now -- not sure if they are still relevant after the more recent changes.
ICEBERG_MISSING_DATA(5, EXTERNAL),
ICEBERG_CANNOT_OPEN_SPLIT(6, EXTERNAL),
ICEBERG_WRITER_OPEN_ERROR(7, EXTERNAL),
ICEBERG_FILESYSTEM_ERROR(8, EXTERNAL),
Did you mean to change the existing error codes?
{
    ORC,
    PARQUET,
    AVRO

@Test(groups = {ICEBERG, PROFILE_SPECIFIC_TESTS})
public void testPrestoReadingSparkData()
@DataProvider(name = "storage_formats")
You can leave off name and have it default to the method name
String baseTableName = "test_spark_reads_presto_partitioned_table";
String prestoTableName = prestoTableName(baseTableName);
onPresto().executeQuery(format("CREATE TABLE %s (_string VARCHAR, _bigint BIGINT) WITH (partitioning = ARRAY['_string'])", prestoTableName));
onPresto().executeQuery(format("CREATE TABLE %s (_string VARCHAR, _bigint BIGINT) WITH (partitioning = ARRAY['_string'], format = '" + storageFormat + "')", prestoTableName));
Use the existing string formatting instead of concatenation
String baseTableName = "test_spark_reads_presto_partitioned_table";
String sparkTableName = sparkTableName(baseTableName);
onSpark().executeQuery(format("CREATE TABLE %s (_string STRING, _bigint BIGINT) USING ICEBERG PARTITIONED BY (_string)", sparkTableName));
onSpark().executeQuery(format("CREATE TABLE %s (_string STRING, _bigint BIGINT) USING ICEBERG PARTITIONED BY (_string)" +
Nit: missing space after last )
private List<Block> columnBlocks;
private List<Type> types;
private List<org.apache.iceberg.types.Type> icebergTypes;
private Schema icebergSchema;

public long getSystemMemoryUsage()
{
    //TODO: try to add memory used by recordIterator
    return INSTANCE_SIZE + pageBuilder.getRetainedSizeInBytes();
We could reset the PageBuilder at the end, then we don't need to calculate retained size for it.
@Override
public Page getNextPage()
{
    if (closed) {
We could simplify this by removing the closed flag:

if (!recordIterator.hasNext()) {
    return null;
}

The engine won't call getNextPage() after closing the page source.
This also allows removing the explicit close() below.
else {
    unscaledValue = Decimals.decodeUnscaledValue(decimalType.getSlice(block, position));
}
return new BigDecimal(unscaledValue, decimalType.getScale());
Decimals.readBigDecimal() is the best way
}
return new BigDecimal(unscaledValue, decimalType.getScale());
}
if (type.equals(VARCHAR)) {
Yep, need to use instanceof VarcharType here
Any progress on this? @lxynov thanks
@lxynov is this patch now split into 3 parts? Or is it safe to use this existing patch? I wanted to backport this to our internal Trino repo.
@caneGuy I don't think @lxynov is continuing work on this anymore. Feel free to pick it up if you'd like to. @rdsr FWIW, w.r.t. part 2 in #4776 (comment), we've upgraded to 0.11.0.
@jackye1995 is it correct that you have picked this up?
Spec
https://iceberg.apache.org/spec/#avro
Implementation
This PR's implementation utilizes the Iceberg Avro reader/writer to read/write Iceberg Avro files. Involved Iceberg classes include DataReader, DataWriter, Avro, etc. The class IcebergAvroDataConversion was implemented to convert between the Presto data representation and the Iceberg Avro data representation. The Presto data representation refers to data as represented in Blocks; the Iceberg Avro data representation refers to the one used when reading/writing through DataReader/DataWriter. The table below illustrates the conversion map.

LongDecimalType: unscaled value is encoded and stored in a Slice
FIXED(L): byte[]
Tests and TODOs
This PR doesn't implement additional tests, but applies AbstractTestIcebergSmoke and TestSparkCompatibility to the Avro format. There are two test failures.

1. AbstractTestIcebergSmoke.testCreateNestedPartitionedTable. This test depends on "Avro: Fix pruning columns when a logical-map array's value type is nested" (apache/iceberg#1321), which is not in apache-iceberg-0.9.1 but is in apache-iceberg-0.10.0-rc0.
2. TestSparkCompatibility.testPrestoReadingSparkData. This test doesn't pass due to the handling of TIMESTAMP objects. I wasn't able to get it to pass, and I feel Presto's spec on TIMESTAMP is not really identical to Iceberg's.

Presto's spec on TIMESTAMP: Instant in time that includes the date and time of day without a time zone, with P digits of precision for the fraction of seconds. A precision of up to 12 (picoseconds) is supported. Values of this type are parsed and rendered in the session time zone.

Iceberg's spec on TIMESTAMP: All time and timestamp values are stored with microsecond precision. Timestamps with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC, and these values are considered identical). Timestamps without time zone represent a date and time of day regardless of zone: the time value is independent of zone adjustments (2017-11-16 17:10:34 is always retrieved as 2017-11-16 17:10:34). Timestamp values are stored as a long that encodes microseconds from the unix epoch.

I feel this PR's implementation is correct and perhaps there's something wrong on the Spark side.
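To make the Iceberg semantics concrete, a stdlib sketch of the round trip the spec describes, using the spec's own example values (the zone name America/Los_Angeles is an assumption standing in for PST):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampMicrosDemo {
    // Iceberg stores timestamp values as microseconds from the unix epoch
    static long toMicros(Instant instant) {
        return instant.getEpochSecond() * 1_000_000L + instant.getNano() / 1_000L;
    }

    static Instant fromMicros(long micros) {
        return Instant.ofEpochSecond(Math.floorDiv(micros, 1_000_000L),
                Math.floorMod(micros, 1_000_000L) * 1_000L);
    }

    public static void main(String[] args) {
        // 2017-11-16 17:10:34 in America/Los_Angeles (PST, UTC-8 on that date)
        Instant written = LocalDateTime.of(2017, 11, 16, 17, 10, 34)
                .atZone(ZoneId.of("America/Los_Angeles"))
                .toInstant();

        // The source zone is not retained; reading back renders the UTC value
        long stored = toMicros(written);
        System.out.println(LocalDateTime.ofInstant(fromMicros(stored), ZoneOffset.UTC)); // 2017-11-17T01:10:34
    }
}
```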
Closes #2298
Part of #1324