-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Parquet, Core: Allows Internal Parquet Readers to use Custom Types #14040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet, Core: Allows Internal Parquet Readers to use Custom Types #14040
Conversation
parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
Show resolved
Hide resolved
| import org.junit.jupiter.api.io.TempDir; | ||
|
|
||
| @ExtendWith(ParameterizedTestExtension.class) | ||
| public class TestInternalData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we have some tests for these code paths in TestInternalAvro and TestInternalParquet but those suites mostly function by piping everything through a "writeAndValidate" method which is not conducive to adding new functionality. It currently is used to test reuse containers and projections but I think we may want to consider separating those tests out as well into tests that run in core if possible.
| import org.apache.parquet.column.ColumnDescriptor; | ||
| import org.apache.parquet.schema.MessageType; | ||
|
|
||
| public class InternalReader<T extends StructLike> extends BaseParquetReaders<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this into a stateful reader so that Parquet.java can set custom classes for the reader to use. I'm definitely open to other ways around this. The previous approach was for Parquet to statically used the create method to generate a reader for "readerFunc", this had the problem that each call to set custom types would happen after the reader func was set.
|
I'm interested if anyone has an alternate way of wiring this together, I'm not extremely happy with the current setup where Parquet.java has so many different ways of attaching functions or mappings. |
|
|
||
| private static Parquet.ReadBuilder readInternal(InputFile inputFile) { | ||
| return Parquet.read(inputFile).createReaderFunc(InternalReader::create); | ||
| return Parquet.read(inputFile).useInternalReader(new InternalReader<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot create a reader function at this point because we don't know whether or not we will have custom classes to apply by future calls to returned object from this class.
|
@RussellSpitzer is this still the uber PR for the Parquet manifest support? If yes, it will be great t0 rebase that PR on top of this one to help understand the integration. |
stevenzwu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some early comments from a quick glance. will look deeper
parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
Outdated
Show resolved
Hide resolved
This is a small part of the PR for manifest support. This is required to load the fields into specific structLike classes. The key thing is implementing two interfaces which previously returned unsupported operation exception |
parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Outdated
Show resolved
Hide resolved
| public static CustomRow of(Object... values) { | ||
| return new CustomRow(values); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructors so we can use Custom Row as a CustomType in internal Data
| } | ||
| } | ||
|
|
||
| private static class UnaryReaderFunction implements ReaderFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an instance of the original readerFunc which takes only messageType as an arg.
stevenzwu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some nit comments. This looks pretty good to me
| this.readerFuncWithSchema == null, | ||
| "Cannot set reader function: 2-argument reader function already set"); | ||
| this.readerFunc = newReaderFunction; | ||
| this.readerFunction == null, "Cannot set reader function: reader function already set"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a slight behavior change here. Previously, it is ok to set the unary reader function multiple times. With the combining unary and binary reader function, this is not allowed anymore. But I am ok with the slight behavior change, as I don't think it is necessary to allow set the same reader function type multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it is a behavior change. I could allow the old behavior and allow resetting if the class of the ReaderFunction matches the previous class but failing if it's different.
Ie:
If this.readerFunction.class == readerFunction.class
this.readerFunction = readerFunction
So it wouldn't be a huge deal, but I think it's a weird behavior but it is an established one... Maybe we fix it in 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe with the File Format API we can remove the possibility of setting the reader function from the external API.
That would naturally fix this.
parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Outdated
Show resolved
Hide resolved
stevenzwu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. great work on solving the reader function setter problem
pvary
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RussellSpitzer!
Nice solution!
.palantir/revapi.yml
Outdated
| \ java.util.List<org.apache.iceberg.PositionDeletesScanTask>)" | ||
| justification: "Removing deprecations for 1.10.0" | ||
| org.apache.iceberg:iceberg-parquet: | ||
| - code: "java.method.parameterTypeChanged" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any of these RevAPI changes are actually needed anymore. I've removed them locally and things passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They still fail on my machine, let me try rebasing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebasing fixed this!
parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
Show resolved
Hide resolved
| @Deprecated | ||
| protected ParquetValueReader<T> createStructReader( | ||
| List<ParquetValueReader<?>> fieldReaders, Types.StructType structType) { | ||
| throw new UnsupportedOperationException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we can actually throw here, isn't that a behavioral change for when people upgrade their Iceberg version and they were in some place using createStructReader(fieldReaders, structType)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the method was "abstract" before so they would have had to have implemented this method. I made it concrete here so I could remove the implementations from the two child classes we currently have rather than marking it deprecated there as well.
If you check the 3 arg version, it calls the subclasses 2 arg version if not defined so existing behavior should be the same.
| } | ||
|
|
||
| @Override | ||
| public Parquet.ReadBuilder.ReaderFunction withCustomTypes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like this naming you have here better, but just wanted to point out that in the Avro reader we use setCustomTypes / setRootType / setSchema
EDIT: nvm, there isn't any direct relation between the Avro and Parquet internal readers, so having different naming here should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 Yeah I'm not sure it's worth standardizing since this is only called by Parquet.java anyway
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code changes and tests LGTM but I believe we should be able to remove all of the newly added RevAPI exceptions. Also I added a question around backward comp.
| protected ParquetValueReader<T> createStructReader( | ||
| List<ParquetValueReader<?>> fieldReaders, Types.StructType structType, Integer fieldId) { | ||
| // Fallback to the signature without fieldId if not overridden | ||
| return createStructReader(fieldReaders, structType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra ^ See here (this is what keeps subclasses written by outside users from breaking)
# This is the 1st commit message: Parquet, Core: Allows Internal Parquet Readers to use Custom Types # This is the commit message #2: Remove some test code that is no longer in use # This is the commit message #3: Add Preconditions for double setting Parquet # This is the commit message #4: Reviewer Comments # This is the commit message #5: Reviewer Comments
a2a5231 to
2564724
Compare
|
|
||
| abstract class BaseParquetReaders<T> { | ||
| // Root ID is used for the top-level struct in the Parquet Schema | ||
| protected static final int ROOT_ID = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic number could use documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the line above this not sufficent?
| List<ParquetValueReader<?>> readers, Types.StructType struct, Class<T> structLikeClass) { | ||
| super(readers); | ||
| this.struct = struct; | ||
| this.ctor = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DynConstructors.build() returns a constructor wrapper even if no matching constructors are found. The failure only occurs later when ctor.newInstance(struct) is called in newStructData() method (line 1167), potentially causing confusing runtime errors instead of clear validation errors during reader setup.
Can we do?
try {
this.ctor = DynConstructors.builder(StructLike.class)
.hiddenImpl(structLikeClass, Types.StructType.class)
.hiddenImpl(structLikeClass)
.build();
// Validate constructor works by testing instantiation
ctor.newInstance(struct);
} catch (Exception e) {
throw new IllegalArgumentException(
"Custom type " + structLikeClass.getName() +
" must have a constructor accepting Types.StructType or a no-args constructor", e);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may be misreading the code for DynConstructors, if it fails to find a constructor it will fail see
@SuppressWarnings("unchecked")
public <C> Ctor<C> build() {
if (ctor != null) {
return (Ctor<C>) ctor;
}
throw buildRuntimeException(baseClass, problems);
}
I also think this is pretty deep in the code and if we do have an exception here a user probably won't be able to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see
|
Merged! Thanks everyone for your thoughtful reviews! |
…pache#14040) Parquet, Core: Allows Internal Parquet Readers to use Custom Types
|
Hey @RussellSpitzer , |
|
Hey @stevenzwu , |
|
@gaborkaszab Assuming you are talking about porting this PR for custom types to Avro and Orc. This PR is a prep work to support writing manifest file in Parquet format. Avro already supports custom types for writing manifest file. There is no plan to support writing manifest in Orc. |
|
Thanks for your answer, @stevenzwu ! The motivation here is slightly different: @pvary found that having the same enhancement for all the readers/writers could help him simplify his file format API PR. |
|
I don't think there's value in adding custom type support to ORC if we're not going to use it anyway? |
The suggestion is not to add custom type support. The suggestion is to merge together the different readerFunctions, writerFunctions. Currently we have 2-3 versions of them for every file format. We also have a different builder method setting every one of them. Russell's change created a single, common reader function which handles all of the cases, and enables us to add new inputs to the functions easily. We could follow the same pattern for every FileFormat. The File Format API will need to add new parameters to these methods. This is way easier if we use similar ReadBuilder interfaces like the one introduced by Russell in this PR. |
…pache#14040) Parquet, Core: Allows Internal Parquet Readers to use Custom Types
Adds support to InternalReader Parquet for Support custom types.
New StructLike Reader added to ParquetValueReaders which can optionally be
used by BaseParquetReaders and InternalReader to produce any struct like object
with a constructor.
Tests for this are added to a new suite in Core because the existing tests are currently
split between TestInternalAvro and TestInternalParquet but I can move those if necessary.