Skip to content

Conversation

@parthchandra
Copy link
Contributor

Which issue does this PR close?

When running iceberg unit and integration tests with Comet, we still encounter some issues due to Parquet shading. This PR addresses those issues.
The main changes are: For all classes in Comet that are used in Iceberg integration, make all constructors and methods package private if they have a Parquet class in the signature. Provide equivalent methods that use Parquet encapsulation classes.
Refactor the FileReader API to use an InputStream (instead of a file path) to allow implementations of Iceberg's InputFile. The refactoring introduces some more encapsulation classes that use reflection to call back into Iceberg (to avoid adding a circular dependency on Iceberg)
This PR may break the complementary (draft) PR in Iceberg: apache/iceberg#13378

@codecov-commenter
Copy link

codecov-commenter commented Aug 6, 2025

Codecov Report

❌ Patch coverage is 0% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.35%. Comparing base (f09f8af) to head (0ae3c34).
⚠️ Report is 355 commits behind head on main.

Files with missing lines Patch % Lines
.../src/main/java/org/apache/comet/parquet/Utils.java 0.00% 35 Missing ⚠️
...ava/org/apache/comet/parquet/WrappedInputFile.java 0.00% 15 Missing ⚠️
...ache/comet/parquet/WrappedSeekableInputStream.java 0.00% 15 Missing ⚠️
...org/apache/comet/parquet/ConstantColumnReader.java 0.00% 3 Missing ⚠️
...org/apache/comet/parquet/MetadataColumnReader.java 0.00% 3 Missing ⚠️
...main/java/org/apache/comet/parquet/FileReader.java 0.00% 2 Missing ⚠️
...va/org/apache/comet/parquet/NativeBatchReader.java 0.00% 1 Missing ⚠️
...c/main/java/org/apache/comet/parquet/TypeUtil.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2078      +/-   ##
============================================
+ Coverage     56.12%   58.35%   +2.22%     
- Complexity      976     1252     +276     
============================================
  Files           119      140      +21     
  Lines         11743    13322    +1579     
  Branches       2251     2370     +119     
============================================
+ Hits           6591     7774    +1183     
- Misses         4012     4317     +305     
- Partials       1140     1231      +91     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@parthchandra
Copy link
Contributor Author

Reverted some of the changes to allow ci to pass.
#2079 to track the methods that need to be deprecated once everything is in sync

@huaxingao
Copy link
Contributor

Refactor the FileReader API to use an InputStream (instead of a file path)... — Does using a file path still have shading issues?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @parthchandra


/**
* Wraps an InputStream that possibly implements the methods of a Parquet SeekableInputStream (but
* is not a Parquet SeekableInputStream). Such an InputStream exists, fir instance, in Iceberg's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) fir -> for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


/**
* A Parquet {@link InputFile} implementation that's similar to {@link
* org.apache.parquet.hadoop.util.HadoopInputFile}, but with optimizations introduced in Hadoop 3.x,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Hadoop 3.x part still valid b/c this class is not a CometInputFile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Updated the comment.

Copy link
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@parthchandra
Copy link
Contributor Author

Refactor the FileReader API to use an InputStream (instead of a file path)... — Does using a file path still have shading issues?

No, but the input to CometVectorizedParquetReader.newCometReader is an InputFile that need not be a HadoopInputFile and need not be backed by an actual file. In that case, the file path is invalid and the Comet FileReader fails. A bunch of Iceberg unit tests failed because of that.

The WrappedInputFile implementation exists to work around that. The underlying object in WrappedInputFile is an org.apache.iceberg.io.InputFile which is accessed via reflection to avoid creating a dependency on Iceberg. Similarly the org.apache.iceberg.io.SeekableInputStream that is returned by org.apache.iceberg.io.InputFile is wrapped in WrappedSeekableInputStream to avoid dependencies on Iceberg.

}

return new ParquetColumnSpec(
1, // ToDo: pass in the correct id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnDescriptor doesn't expose fieldId, but we need a placeholder here. I guess put -1 for unknown id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently one can but id may not be set. I've changed this to set the id from the descriptor else to -1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huaxingao if this change looks good to you, I will merge this PR. Please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@parthchandra parthchandra merged commit bdd5c43 into apache:main Aug 8, 2025
92 checks passed
@parthchandra
Copy link
Contributor Author

Merged. Thank you for the review @andygrove @huaxingao @hsiang-c

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
* fix: [iceberg] more fixes for Iceberg intergation APIs.
@parthchandra parthchandra deleted the iceberg-shading branch January 14, 2026 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants