Skip to content

Conversation

@RussellSpitzer
Copy link
Member

Adds support to InternalReader Parquet for Support custom types.

New StructLike Reader added to ParquetValueReaders which can optionally be
used by BaseParquetReaders and InternalReader to produce any struct like object
with a constructor.

Tests for this are added to a new suite in Core because the existing tests are currently
split between TestInternalAvro and TestInternalParquet but I can move those if necessary.

import org.junit.jupiter.api.io.TempDir;

@ExtendWith(ParameterizedTestExtension.class)
public class TestInternalData {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we have some tests for these code paths in TestInternalAvro and TestInternalParquet but those suites mostly function by piping everything through a "writeAndValidate" method which is not conducive to adding new functionality. It currently is used to test reuse containers and projections but I think we may want to consider separating those tests out as well into tests that run in core if possible.

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;

public class InternalReader<T extends StructLike> extends BaseParquetReaders<T> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this into a stateful reader so that Parquet.java can set custom classes for the reader to use. I'm definitely open to other ways around this. The previous approach was for Parquet to statically used the create method to generate a reader for "readerFunc", this had the problem that each call to set custom types would happen after the reader func was set.

@RussellSpitzer
Copy link
Member Author

I'm interested if anyone has an alternate way of wiring this together, I'm not extremely happy with the current setup where Parquet.java has so many different ways of attaching functions or mappings.


private static Parquet.ReadBuilder readInternal(InputFile inputFile) {
return Parquet.read(inputFile).createReaderFunc(InternalReader::create);
return Parquet.read(inputFile).useInternalReader(new InternalReader<>());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot create a reader function at this point because we don't know whether or not we will have custom classes to apply by future calls to returned object from this class.

@stevenzwu stevenzwu requested a review from pvary September 9, 2025 22:39
@stevenzwu
Copy link
Contributor

@RussellSpitzer is this still the uber PR for the Parquet manifest support?
#13769

If yes, it will be great t0 rebase that PR on top of this one to help understand the integration.

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some early comments from a quick glance. will look deeper

@RussellSpitzer
Copy link
Member Author

RussellSpitzer commented Sep 10, 2025

@RussellSpitzer is this still the uber PR for the Parquet manifest support? #13769

If yes, it will be great t0 rebase that PR on top of this one to help understand the integration.

This is a small part of the PR for manifest support. This is required to load the fields into specific structLike classes.

The key thing is implementing two interfaces which previously returned unsupported operation exception

@github-actions github-actions bot added the API label Sep 17, 2025
public static CustomRow of(Object... values) {
return new CustomRow(values);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructors so we can use Custom Row as a CustomType in internal Data

}
}

private static class UnaryReaderFunction implements ReaderFunction {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an instance of the original readerFunc which takes only messageType as an arg.

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nit comments. This looks pretty good to me

this.readerFuncWithSchema == null,
"Cannot set reader function: 2-argument reader function already set");
this.readerFunc = newReaderFunction;
this.readerFunction == null, "Cannot set reader function: reader function already set");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a slight behavior change here. Previously, it is ok to set the unary reader function multiple times. With the combining unary and binary reader function, this is not allowed anymore. But I am ok with the slight behavior change, as I don't think it is necessary to allow set the same reader function type multiple times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it is a behavior change. I could allow the old behavior and allow resetting if the class of the ReaderFunction matches the previous class but failing if it's different.

Ie:

If this.readerFunction.class == readerFunction.class
this.readerFunction = readerFunction

So it wouldn't be a huge deal, but I think it's a weird behavior but it is an established one... Maybe we fix it in 2.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe with the File Format API we can remove the possibility of setting the reader function from the external API.
That would naturally fix this.

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. great work on solving the reader function setter problem

Copy link
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer!
Nice solution!

\ java.util.List<org.apache.iceberg.PositionDeletesScanTask>)"
justification: "Removing deprecations for 1.10.0"
org.apache.iceberg:iceberg-parquet:
- code: "java.method.parameterTypeChanged"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any of these RevAPI changes are actually needed anymore. I've removed them locally and things passed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They still fail on my machine, let me try rebasing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebasing fixed this!

@Deprecated
protected ParquetValueReader<T> createStructReader(
List<ParquetValueReader<?>> fieldReaders, Types.StructType structType) {
throw new UnsupportedOperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can actually throw here, isn't that a behavioral change for when people upgrade their Iceberg version and they were in some place using createStructReader(fieldReaders, structType)?

Copy link
Member Author

@RussellSpitzer RussellSpitzer Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the method was "abstract" before so they would have had to have implemented this method. I made it concrete here so I could remove the implementations from the two child classes we currently have rather than marking it deprecated there as well.

If you check the 3 arg version, it calls the subclasses 2 arg version if not defined so existing behavior should be the same.

}

@Override
public Parquet.ReadBuilder.ReaderFunction withCustomTypes(
Copy link
Contributor

@nastra nastra Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like this naming you have here better, but just wanted to point out that in the Avro reader we use setCustomTypes / setRootType / setSchema

EDIT: nvm, there isn't any direct relation between the Avro and Parquet internal readers, so having different naming here should be fine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 Yeah I'm not sure it's worth standardizing since this is only called by Parquet.java anyway

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changes and tests LGTM but I believe we should be able to remove all of the newly added RevAPI exceptions. Also I added a question around backward comp.

protected ParquetValueReader<T> createStructReader(
List<ParquetValueReader<?>> fieldReaders, Types.StructType structType, Integer fieldId) {
// Fallback to the signature without fieldId if not overridden
return createStructReader(fieldReaders, structType);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra ^ See here (this is what keeps subclasses written by outside users from breaking)

# This is the 1st commit message:

Parquet, Core: Allows Internal Parquet Readers to use Custom Types

# This is the commit message #2:

Remove some test code that is no longer in use

# This is the commit message #3:

Add Preconditions for double setting Parquet

# This is the commit message #4:

Reviewer Comments

# This is the commit message #5:

Reviewer Comments

abstract class BaseParquetReaders<T> {
// Root ID is used for the top-level struct in the Parquet Schema
protected static final int ROOT_ID = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number could use documentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the line above this not sufficent?

List<ParquetValueReader<?>> readers, Types.StructType struct, Class<T> structLikeClass) {
super(readers);
this.struct = struct;
this.ctor =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DynConstructors.build() returns a constructor wrapper even if no matching constructors are found. The failure only occurs later when ctor.newInstance(struct) is called in newStructData() method (line 1167), potentially causing confusing runtime errors instead of clear validation errors during reader setup.

Can we do?
try {
this.ctor = DynConstructors.builder(StructLike.class)
.hiddenImpl(structLikeClass, Types.StructType.class)
.hiddenImpl(structLikeClass)
.build();
// Validate constructor works by testing instantiation
ctor.newInstance(struct);
} catch (Exception e) {
throw new IllegalArgumentException(
"Custom type " + structLikeClass.getName() +
" must have a constructor accepting Types.StructType or a no-args constructor", e);
}

Copy link
Member Author

@RussellSpitzer RussellSpitzer Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may be misreading the code for DynConstructors, if it fails to find a constructor it will fail see

@SuppressWarnings("unchecked")
    public <C> Ctor<C> build() {
      if (ctor != null) {
        return (Ctor<C>) ctor;
      }
      throw buildRuntimeException(baseClass, problems);
    }

I also think this is pretty deep in the code and if we do have an exception here a user probably won't be able to fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see

@RussellSpitzer RussellSpitzer merged commit f9a4168 into apache:main Sep 30, 2025
43 checks passed
@RussellSpitzer
Copy link
Member Author

Merged! Thanks everyone for your thoughtful reviews!

gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025
…pache#14040)

Parquet, Core: Allows Internal Parquet Readers to use Custom Types
@gaborkaszab
Copy link
Collaborator

Hey @RussellSpitzer ,
We discussed with @pvary the other day that the same enhancements for the Parquet writers, and also for Avro and Orc readers/writers would be useful for the file format API change that he has been working on. I just wanted to throw the idea in to see if there are any obvious reasons not to make those changes or in case you have any objections against. If you think it makes sense, I can take a deeper look. Thanks!

@gaborkaszab
Copy link
Collaborator

Hey @stevenzwu ,
Let me cc you as well on this. See my above comment. I'd appreciate any early feedback, and whether you think this makes sense. Thank you!

@stevenzwu
Copy link
Contributor

@gaborkaszab Assuming you are talking about porting this PR for custom types to Avro and Orc. This PR is a prep work to support writing manifest file in Parquet format. Avro already supports custom types for writing manifest file. There is no plan to support writing manifest in Orc.

@gaborkaszab
Copy link
Collaborator

Thanks for your answer, @stevenzwu ! The motivation here is slightly different: @pvary found that having the same enhancement for all the readers/writers could help him simplify his file format API PR.

@nastra
Copy link
Contributor

nastra commented Oct 10, 2025

I don't think there's value in adding custom type support to ORC if we're not going to use it anyway?

@pvary
Copy link
Contributor

pvary commented Oct 10, 2025

I don't think there's value in adding custom type support to ORC if we're not going to use it anyway?

The suggestion is not to add custom type support. The suggestion is to merge together the different readerFunctions, writerFunctions. Currently we have 2-3 versions of them for every file format. We also have a different builder method setting every one of them. Russell's change created a single, common reader function which handles all of the cases, and enables us to add new inputs to the functions easily. We could follow the same pattern for every FileFormat. The File Format API will need to add new parameters to these methods. This is way easier if we use similar ReadBuilder interfaces like the one introduced by Russell in this PR.

adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025
…pache#14040)

Parquet, Core: Allows Internal Parquet Readers to use Custom Types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants