Conversation

@parthchandra commented Aug 11, 2025

Fixes issues in #13378. This makes all the remaining changes to Iceberg to address issues with Comet/Parquet shading. The corresponding changes to Comet are already merged into Comet 0.10.1.
Additionally, the iceberg-parquet module's dependency on comet-spark has been removed because it created an iceberg-flink dependency on comet-spark. The code now uses a reflection-based CometBridge class to access the Comet (Parquet) FileReader.
This PR also removes the commit that enables Comet execution (moving it to a follow-up PR: cc @hsiang-c).

This PR makes the changes for Spark 3.4, 3.5, and 4.0.

@parthchandra (Author)

@huaxingao, @anuragmantri please take a look.

@parthchandra parthchandra marked this pull request as ready for review September 16, 2025 23:48
@parthchandra parthchandra marked this pull request as draft September 17, 2025 00:03
@parthchandra (Author)

@shangxinli thank you for taking a look. I found that more changes are needed to get around Parquet shading; let me address your comments along with those changes.

@parthchandra parthchandra marked this pull request as ready for review October 8, 2025 00:07
@parthchandra (Author)

@huaxingao, @shangxinli, @hsiang-c this is ready for final review.

@parthchandra (Author)

cc @anuragmantri

@parthchandra (Author)

@huaxingao any comments you would like me to address?

@huaxingao (Contributor)

@parthchandra Apologies for the delay. I’ve skimmed the changes and would like more time for a deeper review. Because we’re adding several classes to the Parquet module, Comet code is no longer confined to the Spark module. Given the broader scope, it would be ideal if we can get one more committer to review and sign off.

@parthchandra (Author)

Thanks @huaxingao. Who would you recommend we ask for a second review?
Note that the Comet classes are necessary because Comet is used to provide Parquet file reading; if we used the Iceberg Parquet file reader, we wouldn't need to include any Comet classes in the parquet module. Also, to avoid creating a dependency on Comet, the Comet classes in the parquet module use reflection, introducing the CometBridge class among others.
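
For reviewers unfamiliar with the pattern, a minimal sketch of this kind of reflection bridge (the Comet class name and constructor signature below are assumptions for illustration; the actual CometBridge in this PR differs):

import java.lang.reflect.Constructor;

// Sketch of a reflection bridge: no compile-time dependency on Comet. The Comet
// class is named as a string literal and resolved at runtime, so iceberg-parquet
// compiles and loads even when Comet is not on the classpath.
public final class CometBridge {
  // Hypothetical Comet class name, for illustration only
  private static final String FILE_READER_CLASS = "org.apache.comet.parquet.FileReader";

  private CometBridge() {}

  // Returns true if Comet's Parquet FileReader is available at runtime.
  public static boolean isCometAvailable() {
    try {
      Class.forName(FILE_READER_CLASS);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  // Instantiates the Comet FileReader reflectively (the single-Object-argument
  // constructor is an assumption for this sketch).
  public static Object newFileReader(Object inputFile) throws ReflectiveOperationException {
    Class<?> readerClass = Class.forName(FILE_READER_CLASS);
    Constructor<?> ctor = readerClass.getDeclaredConstructor(Object.class);
    return ctor.newInstance(inputFile);
  }
}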

@huaxingao (Contributor)

cc @aokolnychyi @flyrain @szehon-ho @stevenzwu Appreciate it if you could please take a look too. Thanks a lot!

@pvary (Contributor) commented Nov 14, 2025

What about something like this: pvary@4b6f7a5
This builds on top of the proposed File Format API: #12774, #12298

// If no explicit factory is set and reader type is COMET, use the default Comet factory
if (factoryClassName == null && readerType == ParquetReaderType.COMET) {
  factoryClassName =
      org.apache.iceberg.spark.SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS;
Member

Why not import SparkSQLProperties?

Author

Fails checkstyle:

> Task :iceberg-spark:iceberg-spark-4.0_2.13:checkstyleMain FAILED
[ant:checkstyle] [ERROR] /iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java:21:58: Using a static member import should be avoided - org.apache.iceberg.spark.SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS. [AvoidStaticImport]

Member

But you can import SparkSQLProperties and reference it as SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS.
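
For illustration, the two import styles being contrasted (surrounding code elided):

// Static member import, rejected by the AvoidStaticImport checkstyle rule:
// import static org.apache.iceberg.spark.SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS;

// Plain class import passes checkstyle, and the use site stays readable:
import org.apache.iceberg.spark.SparkSQLProperties;

// ... at the use site:
// factoryClassName = SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS;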

@parthchandra (Author)

> What about something like this: pvary@4b6f7a5
> This builds on top of the proposed File Format API: #12774, #12298

@pvary I'm not sure what you're asking here. Would you like this PR to be changed to build on top of the proposed API?

@parthchandra (Author)

@pvary @huaxingao I've rebased on the latest and resolved the conflicts. PTAL.
If there are no more changes to be made, can we approve and merge this?

}

/** Convenience method to enable comet */
public ReadBuilder enableComet(boolean enableComet) {
@manuzhang (Member) Dec 5, 2025

Where is this method used? Can we just use vectorizedReaderFactory?

if (enableComet) {
  this.properties.put(
      VECTORIZED_READER_FACTORY,
      "org.apache.iceberg.spark.parquet.CometVectorizedParquetReaderFactory");
Member

How do we keep this consistent with the value in SparkSQLProperties? Can we put it in a common module?
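
One possible shape for that, as a sketch (the class name and module placement here are assumptions, not something this PR does):

// Hypothetical constants holder, placed in a module that both the parquet and
// spark modules can depend on, so the factory class name lives in exactly one place.
public final class CometReaderConstants {
  private CometReaderConstants() {}

  public static final String COMET_VECTORIZED_READER_FACTORY_CLASS =
      "org.apache.iceberg.spark.parquet.CometVectorizedParquetReaderFactory";
}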


public class Parquet {
  private static final Logger LOG = LoggerFactory.getLogger(Parquet.class);
  private static final String VECTORIZED_READER_FACTORY = "read.parquet.vectorized-reader.factory";
Member

Do we have a unit test for changes in this class?

Author

I've added a bunch of tests for this. For the Comet-specific code, there is a plan to add a diff file that can be applied to switch the reader for all the existing tests, and we can add a CI pipeline that verifies nothing is broken.


if (batchedReaderFunc != null) {
  // Try to load custom vectorized reader factory from properties
  String readerName = properties.get(VECTORIZED_READER_FACTORY);
Member

Isn't this factoryName?

Class<?> factoryClass = Class.forName(className);
if (!VectorizedParquetReaderFactory.class.isAssignableFrom(factoryClass)) {
  LOG.warn("Class {} does not implement VectorizedParquetReaderFactory interface", className);
  return null;
Member

Rather than return null for multiple failure cases, how about

VectorizedParquetReaderFactory factory = null;
try {
  Class<?> factoryClass = Class.forName(className);
  if (VectorizedParquetReaderFactory.class.isAssignableFrom(factoryClass)) {
    factory = (VectorizedParquetReaderFactory) factoryClass.getDeclaredConstructor().newInstance();
  } else {
    // log warning
  }
} catch (...) {
  // log warning
}
return factory;

Author

We lose a little bit of helpful text in the error, but I made the change.
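
For context, the resulting helper might look roughly like this (the method name and log messages are assumptions; the PR's actual code may differ):

// Post-review shape: a single return, with a warning logged for both the
// "wrong type" and "cannot instantiate" failure cases.
private static VectorizedParquetReaderFactory loadReaderFactory(String className) {
  VectorizedParquetReaderFactory factory = null;
  try {
    Class<?> factoryClass = Class.forName(className);
    if (VectorizedParquetReaderFactory.class.isAssignableFrom(factoryClass)) {
      factory =
          (VectorizedParquetReaderFactory) factoryClass.getDeclaredConstructor().newInstance();
    } else {
      LOG.warn("Class {} does not implement VectorizedParquetReaderFactory", className);
    }
  } catch (ReflectiveOperationException e) {
    LOG.warn("Cannot instantiate vectorized reader factory {}", className, e);
  }
  return factory;
}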

String factoryClassName = readConf.parquetVectorizedReaderFactory();

// If no explicit factory is set and reader type is COMET, use the default Comet factory
if (factoryClassName == null && readerType == ParquetReaderType.COMET) {
Member

This can be handled inside readConf.parquetVectorizedReaderFactory() and moved to right before the if (factoryClassName != null) check.
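
A sketch of what that refactor could look like (the explicit-lookup helper is hypothetical):

// Hypothetical: push the COMET defaulting into SparkReadConf so callers see a
// single code path for resolving the factory class name.
public String parquetVectorizedReaderFactory() {
  String factoryClassName = explicitFactoryClassName(); // assumed existing conf lookup
  if (factoryClassName == null && parquetReaderType() == ParquetReaderType.COMET) {
    return SparkSQLProperties.COMET_VECTORIZED_READER_FACTORY_CLASS;
  }
  return factoryClassName;
}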

    reuseContainers,
    caseSensitive,
    batchSize);
this.conf = readConf.copy();
Member

Why do we need to immediately copy the readConf?

Author

TBH, I don't know myself. I based it on what is done in VectorizedParquetReader.

private ReadConf conf = null;

private ReadConf init() {
  if (conf == null) {
Member

Is this thread-safe?

Author

I don't think it needs to be. This is pretty much what is done in VectorizedParquetReader.
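
To make the trade-off concrete, a sketch of the lazy-init pattern under discussion (the builder helper is hypothetical), assuming each reader instance is driven by a single task thread as in VectorizedParquetReader:

// Lazily build the ReadConf on first use. Not thread-safe: two threads racing
// here could each build a ReadConf. That is acceptable when a reader instance
// is only ever iterated from one thread.
private ReadConf conf = null;

private ReadConf init() {
  if (conf == null) {
    this.conf = buildReadConf(); // hypothetical helper constructing the ReadConf
  }
  return conf;
}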

@manuzhang (Member)

@parthchandra Can we target Spark 4.0 first to make review and revision easier? Meanwhile, can you add more tests for the changes?

@parthchandra (Author)

> @parthchandra Can we target Spark 4.0 first to make review and revision easier? Meanwhile, can you add more tests for the changes?

@manuzhang appreciate your feedback and time spent on this PR. We started with just one Spark version in #13378, but that got no attention!
One reason for keeping 3.4 is that we need to get one release of Iceberg out with Spark 3.4 support, because Comet still supports Spark 3.4. The proposal to remove 3.4 has already been approved, so we are racing against the clock here.

If it helps, the Spark 3.4 and Spark 3.5 implementations are (nearly) identical to Spark 4.0. If the review focuses only on Spark 4.0, I can make sure the same changes are made for the other Spark version branches.

@manuzhang (Member) commented Dec 7, 2025

> The proposal to remove 3.4 has already been approved, so we are racing against the clock here.

@parthchandra Targeting multiple Spark versions in one PR doesn't make it faster. Usually, once a PR targeting Spark 4.0 is merged, the backport PRs can be reviewed and merged quickly if the changes are identical. On the other hand, reviewers can get distracted when a PR targets multiple Spark versions.

@kevinjqliu (Contributor)

> The proposal to remove 3.4 has already been approved, so we are racing against the clock here.

I raised the thread to remove Spark 3.4 support for the upcoming 1.11 release. We could delay the removal until 1.12 based on feedback. I would like to remove it if possible and have just one Spark 3.x version.

@kevinjqliu (Contributor)

> If it helps, the Spark 3.4 and Spark 3.5 implementations are (nearly) identical to Spark 4.0. If the review focuses only on Spark 4.0, I can make sure the same changes are made for the other Spark version branches.

Generally I find it easier to review if one PR focuses on a single Spark version, such as Spark 4.0. Once that PR is merged, we can add follow-up PRs that just copy/paste (hopefully) the same code to the other Spark versions. If you look through the repo, this is what we often do as "backport" PRs.

Focusing on just the Spark 4 integration might cut down the number of files to review by 1/3 :)
