Parquet: Add readers and writers for the internal object model #11904
Force-pushed 772f5c2 to 233a00b.
```java
@Override
public UUID read(UUID reuse) {
  return UUIDUtil.convert(column.nextBinary().toByteBuffer());
}
```

This looks fine to me.
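As background on the conversion above: a Parquet UUID is stored as a 16-byte big-endian FIXED_LEN_BYTE_ARRAY (most-significant half first), and UUIDUtil.convert turns that buffer into a java.util.UUID. A minimal self-contained sketch of that conversion; the class and method names here are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidSketch {
  // Read a 16-byte big-endian buffer (most-significant half first) into a UUID.
  static UUID uuidFromBytes(ByteBuffer buf) {
    long high = buf.getLong();
    long low = buf.getLong();
    return new UUID(high, low);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(0x0123456789abcdefL);
    buf.putLong(0xfedcba9876543210L);
    buf.flip();
    System.out.println(uuidFromBytes(buf)); // 01234567-89ab-cdef-fedc-ba9876543210
  }
}
```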
```java
    return new ParquetValueReaders.UnboxedReader<>(desc);
  }

  private static class ParquetStructReader extends StructReader<StructLike, StructLike> {
```

Here also, there's not much value in using Parquet in the class name. Since this will produce GenericRecord instances, how about RecordReader?

When checking that name (RecordReader) for consistency, I noticed that there's already a RecordReader in GenericParquetReaders. You can reuse that class.

Cannot reuse the class from GenericParquetReaders as it is based on the Record interface; we need a class based on the StructLike interface. I will rename it to StructLikeReader, matching the StructLikeWriter in the InternalWriter class.
```java
@Override
protected ParquetValueReaders.PrimitiveReader<?> int96Reader(ColumnDescriptor desc) {
  // normal handling as int96
  return new ParquetValueReaders.UnboxedReader<>(desc);
```

This isn't correct. The unboxed reader will return a Binary for int96 columns. Instead, this needs to use the same logic as the Spark reader (which also uses the internal representation):

```java
private static class TimestampInt96Reader extends UnboxedReader<Long> {
  TimestampInt96Reader(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public Long read(Long ignored) {
    return readLong();
  }

  @Override
  public long readLong() {
    final ByteBuffer byteBuffer =
        column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
    return ParquetUtil.extractTimestampInt96(byteBuffer);
  }
}
```

You can move that class into the parquet package to share it.
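For background on the shared reader above: a Parquet INT96 timestamp packs 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day, little-endian. The sketch below decodes that layout to epoch microseconds; the class and method names are hypothetical stand-ins for what an extractTimestampInt96-style helper does:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Demo {
  // Julian day number of the Unix epoch (1970-01-01)
  private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;
  private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

  // Decode a little-endian INT96 value (8 bytes nanos-of-day, then
  // 4 bytes Julian day) into microseconds since the Unix epoch.
  static long extractTimestampInt96(ByteBuffer buf) {
    long nanosOfDay = buf.getLong();
    int julianDay = buf.getInt();
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(0L);       // midnight, nanos-of-day
    buf.putInt(2_440_588); // Julian day of 1970-01-01
    buf.flip();
    System.out.println(extractTimestampInt96(buf)); // prints 0
  }
}
```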
```diff
  ColumnDescriptor desc = type.getColumnDescription(currentPath());

- if (primitive.getOriginalType() != null) {
+ if (primitive.getLogicalTypeAnnotation() != null) {
```

I agree with this change, but please point these kinds of changes out for reviewers.

The old version worked because all of the supported logical type annotations had an equivalent ConvertedType (which is what OriginalType is called in Parquet format and the logical type docs).
```java
protected abstract ParquetValueReader<T> createStructReader(
    List<Type> types, List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);

protected abstract LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<ParquetValueReader<?>>
```

I don't think it makes sense to have the subclasses provide this visitor.
```java
private static final OffsetDateTime EPOCH = Instant.ofEpochSecond(0).atOffset(ZoneOffset.UTC);
private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();

private static class DateReader extends ParquetValueReaders.PrimitiveReader<LocalDate> {
```

I agree with moving the date/time reader classes here.
```java
@Override
public Optional<ParquetValueReader<?>> visit(
    LogicalTypeAnnotation.TimestampLogicalTypeAnnotation timestampLogicalType) {
  return Optional.of(new ParquetValueReaders.UnboxedReader<>(desc));
```

This isn't correct. The unit of the incoming timestamp value still needs to be handled, even if the in-memory representation of the value is the same (a long).

Looks like the Spark implementations for this should work well, just like the int96 cases.
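To illustrate the unit handling the reviewer is asking for: the raw long must be normalized by the declared timestamp unit before it can serve as the internal microsecond representation, which is exactly what a bare UnboxedReader skips. A hedged sketch, with hypothetical names and a plain enum standing in for Parquet's timestamp unit:

```java
public class TimestampUnitSketch {
  // Hypothetical stand-in for the Parquet-declared timestamp unit.
  enum Unit { MILLIS, MICROS, NANOS }

  // Normalize a raw stored value to microseconds based on its unit.
  static long toMicros(long value, Unit unit) {
    switch (unit) {
      case MILLIS:
        return Math.multiplyExact(value, 1_000L);
      case MICROS:
        return value;
      case NANOS:
        return Math.floorDiv(value, 1_000L);
      default:
        throw new IllegalArgumentException("Unknown unit: " + unit);
    }
  }

  public static void main(String[] args) {
    System.out.println(toMicros(1_700_000_000_123L, Unit.MILLIS)); // 1700000000123000
  }
}
```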
```java
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestInternalWriter {
```

As with the Avro tests, I think this should extend DataTest. It is probably easier to do the Avro work first and then reuse it here.
.palantir/revapi.yml

```yaml
    \ org.apache.iceberg.data.parquet.BaseParquetReaders<T>::logicalTypeReaderVisitor(org.apache.parquet.column.ColumnDescriptor,\
    \ org.apache.iceberg.types.Type.PrimitiveType, org.apache.parquet.schema.PrimitiveType)"
  justification: "{Refactor Parquet reader and writer}"
- code: "java.method.abstractMethodAdded"
```

This PR should not introduce revapi failures. Instead, the new methods should have default implementations that match the previous behavior (returning the generic representations).

The new methods are abstract, and an abstract method cannot have a default implementation. So I think we have to handle the revapi failures.

Oh, I think what you mean is: don't add them as abstract methods; add them as methods with default implementations. I got it. I will update it today.
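The compatibility approach agreed on above can be sketched as follows. Instead of a new abstract method (which breaks existing subclasses and trips revapi's abstractMethodAdded check), the base class adds a concrete method whose default body preserves the old generic behavior; class and method names here are hypothetical stand-ins for the real reader hierarchy:

```java
public class RevapiSketch {
  abstract static class BaseReadersSketch {
    // New hook with a default implementation matching the previous behavior,
    // so existing subclasses stay source- and binary-compatible.
    protected String dateReaderName() {
      return "GenericDateReader";
    }
  }

  static class InternalReadersSketch extends BaseReadersSketch {
    @Override
    protected String dateReaderName() {
      return "InternalDateReader"; // internal object model overrides the default
    }
  }

  static class LegacyReadersSketch extends BaseReadersSketch {
    // Existing subclass: compiles unchanged, inherits the generic default.
  }

  public static void main(String[] args) {
    System.out.println(new InternalReadersSketch().dateReaderName()); // InternalDateReader
    System.out.println(new LegacyReadersSketch().dateReaderName());   // GenericDateReader
  }
}
```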
```java
  }
}

private static class RecordReader<T extends StructLike> extends StructReader<T, T> {
```

This returns Record so I don't think it needed to be modified. It doesn't return any other subclass of StructLike.
```java
protected ParquetValueWriter<?> uuidWriter(ColumnDescriptor desc) {
  // Use primitive-type writer (as FIXED_LEN_BYTE_ARRAY); no special writer needed.
  return null;
```

I think I commented on this in the last round of reviews. This isn't correct. Incoming values are of type UUID, so this needs a writer that can convert UUID into a byte array. This should return ParquetValueWriters.uuids(desc). There's also no need to add a method for this, because it is the same between the generic and internal object models.

Sorry, the existing test cases were the reason for the confusion, as I mentioned in #11904 (comment). I will update the existing Arrow test cases in this PR too.
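To illustrate what a UUID writer has to do (the inverse of the reader-side conversion): serialize a java.util.UUID into the 16-byte big-endian FIXED_LEN_BYTE_ARRAY that Parquet expects. A minimal sketch with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidWriteSketch {
  // Convert a UUID to the 16-byte big-endian layout Parquet expects
  // for the UUID logical type (most-significant half first).
  static byte[] uuidToBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] bytes = uuidToBytes(UUID.fromString("01234567-89ab-cdef-fedc-ba9876543210"));
    System.out.println(bytes.length); // 16
  }
}
```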
```java
  return new ParquetValueReaders.TimestampMillisReader(desc);
}

public static <T extends StructLike> StructReader<T, T> recordReader(
```

Should be ParquetValueReader<Record>.
@rdblue: Thanks for giving additional context for the unresolved comments. I think I understood all of the comments this time. The PR is ready; it also fixes base-code issues and test cases.
```java
@Override
public long readLong() {
  return 1000L * column.nextInteger();
```

This is valid for time but not for timestamp. I may have mixed up the timestamp reader and time reader in an earlier comment. This needs to be nextLong.

I think the confusion was from this comment: #11904 (comment). I was talking about the time type, but the code I pasted had the wrong class name: TimestampMillisReader should have been TimeMillisReader. Timestamps (millis) should use nextLong and time (millis) should use nextInteger.

Fixed in ajantha-bhat#74.

The lack of test coverage in the base code for millisecond time, millisecond timestamp, and int96 timestamps is the reason for the back and forth. Tests would have caught this. I will try to add them in a follow-up.
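A small sketch of the distinction settled above (names hypothetical): time(MILLIS) is an INT32 millis-of-day value read with nextInteger, while timestamp(MILLIS) is an INT64 epoch-millis value read with nextLong; both are widened to microseconds for the internal representation:

```java
public class MillisSketch {
  // time(MILLIS): INT32 millis-of-day, widened to micros-of-day.
  static long timeMillisToMicros(int millisOfDay) {
    return 1_000L * millisOfDay;
  }

  // timestamp(MILLIS): INT64 epoch millis, widened to epoch micros.
  static long timestampMillisToMicros(long epochMillis) {
    return 1_000L * epochMillis;
  }

  public static void main(String[] args) {
    System.out.println(timeMillisToMicros(86_399_999));              // 86399999000
    System.out.println(timestampMillisToMicros(1_700_000_000_000L)); // 1700000000000000
  }
}
```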
Rebasing the PR as Flink hit a flaky test: #11833 (comment)
Force-pushed f3d9245 to 20f7c26.
Thanks, @ajantha-bhat! Good to get this in.
- Refactored BaseParquetWriter and BaseParquetReaders to reuse for internal writers and readers.
- Added InternalWriter and InternalReader classes for Parquet that consume and produce the Iceberg in-memory data model.