Conversation

@stubz151 (Contributor) commented Sep 5, 2025

What am I doing

Adding a read-vectored implementation to the range readable interface. To do this I'm adding methods to check whether it's enabled and providing an interface which one can implement.
#13254

Changes

  • Added readVectored to the RangeReadable interface (see the sketch after this list)
  • Added config values for it and passed them down to the Hadoop Parquet library
  • Added a mapper in ParquetIO that can turn our range-readable + seekable streams into a Parquet seekable stream
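
For context, a minimal sketch of what the readVectored addition to RangeReadable could look like. This is an illustration based on this PR's description only; the exact method names, the FileRange shape, and the allocator type in the merged code may differ.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.function.IntFunction;

public interface RangeReadable {
  // existing readFully/readTail style methods omitted

  /** Whether this stream supports vectored reads; defaults to false. */
  default boolean readVectoredAvailable() {
    return false;
  }

  /**
   * Read a set of ranges asynchronously, completing each range's future once its
   * buffer has been filled. FileRange is the Iceberg range type discussed further down.
   */
  default void readVectored(List<FileRange> ranges, IntFunction<ByteBuffer> allocate)
      throws IOException {
    throw new UnsupportedOperationException("Vectored reads are not supported");
  }
}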

Testing

Tested with the AAL implementation; the flag is passed through correctly with:
--conf "spark.sql.iceberg.read.vector.enabled=true" \

Notes

Kept the AAL changes separate to avoid bloating this PR, but I can include them if we want to see a functional implementation of this.


private static List<ParquetObjectRange> convertRanges(List<ParquetFileRange> ranges) {
  return ranges.stream()
      .map(
Contributor Author (@stubz151):
This just maps between the internal Parquet Hadoop range and the new Iceberg one.
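
For readers without the diff open, a plausible completion of that truncated mapping. The ParquetObjectRange constructor arguments are an assumption here; getOffset()/getLength() are parquet-java's ParquetFileRange accessors, and the future wiring is shown further down in the FileRange discussion.

import java.util.List;
import java.util.stream.Collectors;
import org.apache.parquet.io.ParquetFileRange;

final class RangeConversionSketch {
  // Hedged sketch: the real ParquetObjectRange constructor may take different arguments.
  static List<ParquetObjectRange> convertRanges(List<ParquetFileRange> ranges) {
    return ranges.stream()
        .map(range -> new ParquetObjectRange(range.getOffset(), range.getLength()))
        .collect(Collectors.toList());
  }
}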

@fuatbasik left a comment:
Thanks a lot @stubz151. The interface and configuration flag look good to me. I just left two small comments.

@stubz151 marked this pull request as ready for review September 5, 2025 14:39
@stubz151 force-pushed the vector_impl branch 2 times, most recently from 48d647d to fa97848 on September 8, 2025 13:42
* This class was written by @mukundthakur and taken from
* /hadoop-common/src/main/java/org/apache/hadoop/fs/VectoredReadUtils.java (thank you!).
*/
public final class VectoredReadUtils {
Contributor:
I don't feel like we need this class. There are three things it does, but it should probably be just one. The validateRangeRequest logic should just be handled in the constructor of the FileRange (we currently don't have any validation there). The sortRangeList is a subset of validateAndSortRanges, which seems duplicative.

I'd suggest moving validateAndSort to the RangeReadable interface as a static utility that implementors can use, rather than creating this util class.
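
A rough sketch of that suggestion, i.e. a single static validateAndSortRanges helper living on the interface rather than in a separate util class. The accessor names (offset()/length()) and the exact checks are assumptions for illustration; FileRange is the Iceberg range type from this PR.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

interface RangeReadableValidationSketch {
  // Validate every range, then return them sorted by offset so implementations
  // can coalesce or issue reads in order.
  static List<FileRange> validateAndSortRanges(List<FileRange> ranges) {
    for (FileRange range : ranges) {
      if (range.offset() < 0 || range.length() < 0) {
        throw new IllegalArgumentException("Invalid range: " + range);
      }
    }
    return ranges.stream()
        .sorted(Comparator.comparingLong(FileRange::offset))
        .collect(Collectors.toList());
  }
}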

private final long offset;
private final int length;

public FileRange(CompletableFuture<ByteBuffer> byteBuffer, long offset, int length) {
Contributor:
Looking at the parquet implementation, I don't think you can pass the byteBuffer future in like this. I believe this is intended to be set by the implementation so that it can be returned to the invoker.

Contributor Author (@stubz151):
We're not passing the ByteBuffer in here; we're passing a future that completes with a ByteBuffer. We need a way to map the futures in Iceberg to the futures we are setting in Parquet.
So when we call parquetFileRange.setDataReadFuture(future); we need a way of tracking that future in Iceberg, and that's what this gives us.
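
To make that concrete, a hedged sketch of the future sharing between the two sides. Only parquetFileRange.setDataReadFuture(...) and the FileRange constructor shown above come from the source; the surrounding class and method names are illustrative.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.apache.parquet.io.ParquetFileRange;

final class FutureLinkingSketch {
  // One future per range is shared between Parquet and Iceberg: Parquet waits on it via
  // setDataReadFuture, and the Iceberg FileRange carries the same instance, so whichever
  // stream implementation services the vectored read completes it directly.
  static List<FileRange> link(List<ParquetFileRange> parquetRanges) {
    List<FileRange> icebergRanges = new ArrayList<>();
    for (ParquetFileRange parquetRange : parquetRanges) {
      CompletableFuture<ByteBuffer> future = new CompletableFuture<>();
      parquetRange.setDataReadFuture(future);
      icebergRanges.add(new FileRange(future, parquetRange.getOffset(), parquetRange.getLength()));
    }
    return icebergRanges;
  }
}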

@github-actions bot added the AWS label Sep 12, 2025
}

@Override
public void readVectored(List<ParquetFileRange> ranges, ByteBufferAllocator allocate)
Contributor:
Can we add some tests at the ParquetIO level to validate this? I know we're adding some in S3FileIO, but it would be good to have this interface tested (even if there's a mock implementation)

Contributor Author (@stubz151):
I added testRangeReadableAdapterReadVectored, which does something similar to the tests in S3FileIO but focuses a bit more on checking that the buffers/ranges are being used correctly. I skipped the other operations but can add them in if we want. Let me know.
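
For reference, the rough shape of such a check. RecordingRangeReadable, adapterFor, and requestedOffsets below are hypothetical names used only to illustrate what the test asserts; they are not the merged test code.

import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import org.apache.parquet.bytes.HeapByteBufferAllocator;
import org.apache.parquet.io.ParquetFileRange;
import org.junit.jupiter.api.Test;

class TestReadVectoredAdapterSketch {
  @Test
  void readVectoredPassesRangesThrough() throws Exception {
    // two arbitrary ranges; ParquetFileRange(offset, length) is the parquet-java type
    List<ParquetFileRange> ranges =
        List.of(new ParquetFileRange(0, 100), new ParquetFileRange(200, 50));

    // hypothetical in-memory RangeReadable that records the ranges it is asked to read
    RecordingRangeReadable stream = new RecordingRangeReadable();

    // hypothetical factory wrapping the stream as the Parquet-facing adapter under test
    adapterFor(stream).readVectored(ranges, new HeapByteBufferAllocator());

    // each Parquet range should reach the stream unchanged and have its data future set
    assertThat(stream.requestedOffsets()).containsExactly(0L, 200L);
    assertThat(ranges).allMatch(range -> range.getDataReadFuture() != null);
  }
}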

optionsBuilder.withDecryption(fileDecryptionProperties);
}

optionsBuilder.withUseHadoopVectoredIo(true);
Contributor:
There were some efforts to allow Iceberg to work without Hadoop on the classpath.
I'm not sure how far these efforts went, and I'm also not sure how this change will affect them.

Could you please help me understand the consequences of always using withUseHadoopVectoredIo?

Thanks,
Peter

Contributor Author (@stubz151), Sep 23, 2025:
For part 1, about the effort to reduce the dependencies on Hadoop: I don't think that was ever completed; I do see a TODO comment about wanting to do it. I am probably making that effort more complicated, as I am adding 2 new imports from Hadoop, but I don't think that is a big risk.

For part 2, withUseHadoopVectoredIo is used in the file reader in conjunction with readVectoredAvailable(), so always enabling it doesn't change anything unless the stream also supports readVectored.
https://github.com/apache/parquet-java/blob/f50dd6cb4b526cf4b585993c1b69a838cd8151f3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1303
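
In other words, the gate looks roughly like this (a paraphrase of the linked logic, not the exact parquet-java code):

// Paraphrased shape of the check in ParquetFileReader: vectored IO is only used
// when the option is enabled AND the underlying stream advertises support for it.
if (options.useHadoopVectoredIo() && stream.readVectoredAvailable()) {
  // issue a single vectored read covering the consecutive parts
} else {
  // fall back to the existing sequential range reads
}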

Contributor:
I think the naming of this option is a little misleading. withUseHadoopVectoredIo doesn't necessarily depend on Hadoop, as @stubz151 mentions, but rather enables the vectored IO behavior in Parquet.

@stubz151 force-pushed the vector_impl branch 3 times, most recently from 861cdf3 to c6e04ea on September 26, 2025 11:23
@stubz151 force-pushed the vector_impl branch 4 times, most recently from d35250f to 278a0fb on September 26, 2025 13:54
@pvary (Contributor) left a comment:
It's ok from my side, but I would like to ask someone else to take a look as well.

@danielcweeks (Contributor) left a comment:
Thanks @stubz151 !

@danielcweeks merged commit a2ac141 into apache:main Sep 30, 2025
43 checks passed
gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025: Core: Adding read vector to range readable interface and adding mapper to parquet stream (apache#13997)
adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025: Core: Adding read vector to range readable interface and adding mapper to parquet stream (apache#13997)