Conversation

@hililiwei (Contributor) commented Mar 28, 2022

This PR consists of the following parts:

  • Port the ORC rolling writer and its unit tests to Spark 3.1 / 3.0 / 2.4 and Flink 1.13 / 1.12, refer: #3784
  • Remove the Formats that are no longer needed.
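The "estimated length for an unclosed file" idea behind this PR can be sketched as follows. This is an illustrative toy, not Iceberg's or ORC's actual API: the class name `EstimatingLengthWriter`, its fields, and the running-mean heuristic are all invented for this example. The point it demonstrates is that a writer buffering rows in memory cannot report an exact file length until it is closed, so a rolling writer that rolls on target file size needs an estimate covering the still-buffered portion.

```java
// Hypothetical sketch, not Iceberg's writer API. A writer buffers rows
// before flushing them, so while the file is open its on-storage size lags
// behind the data written; length() returns an estimate until close().
public class EstimatingLengthWriter {
  private long flushedBytes = 0;  // bytes already flushed to storage
  private long bufferedRows = 0;  // rows accepted but not yet flushed
  private long avgRowBytes = 0;   // running mean of encoded row size
  private long totalRows = 0;
  private boolean closed = false;

  public void write(long encodedRowBytes) {
    totalRows++;
    bufferedRows++;
    // incremental running mean of the encoded row size
    avgRowBytes += (encodedRowBytes - avgRowBytes) / totalRows;
  }

  public void flush() {
    flushedBytes += bufferedRows * avgRowBytes;
    bufferedRows = 0;
  }

  /** Exact length once closed; an estimate while the file is still open. */
  public long length() {
    if (closed) {
      return flushedBytes;
    }
    return flushedBytes + bufferedRows * avgRowBytes;
  }

  public void close() {
    flush();
    closed = true;
  }
}
```

A rolling writer built on this would check `length()` after each write and roll to a new file once the estimate crosses the target file size, instead of waiting for a close to learn the real size.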

@github-actions github-actions bot added the spark label Mar 28, 2022
@openinx (Member) commented Mar 29, 2022

As this is a minor change for a specific engine version, could you please create a single PR that includes these changes for all engine versions?

@hililiwei hililiwei changed the title Spark 3.1: ORC support estimated length for unclosed file. Spark: ORC support estimated length for unclosed file. Mar 29, 2022
@hililiwei (Contributor, Author) replied:

> As this is a minor change for a specific engine version, could you please create a single PR that includes these changes for all engine versions?

Should the Flink changes also be submitted here, or in a separate PR?

@openinx (Member) commented Mar 29, 2022

I think it's fair to put all the engines' changes into a single PR.

@hililiwei hililiwei changed the title Spark: ORC support estimated length for unclosed file. Spark/Flink: ORC support estimated length for unclosed file. Mar 29, 2022
@github-actions github-actions bot added the flink label Mar 29, 2022
@hililiwei (Contributor, Author) replied:

> I think it's fair to put all the engines' changes into a single PR.

Updated the code and the PR description.

@openinx (Member) commented Mar 29, 2022

I just checked all the ORC rolling file writer unit tests; they all seem to be enabled after this PR:

➜  iceberg git:(4419) find . -type f -name '*.java'  | xargs grep -i 'Assume.*ORC'
./mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java:    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
./hive3/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java:    // reader will assume that the ORC file ends at the task's start + length, and might fail reading the tail..
./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
➜  iceberg git:(4419) find . -type f -name '*.scala'  | xargs grep -i 'TODO.*orc' 

@openinx (Member) left a comment


Looks pretty great to me!

@openinx openinx merged commit f6e1114 into apache:master Mar 29, 2022
@hililiwei hililiwei deleted the orc-writer-spark3.1 branch March 29, 2022 08:59
hililiwei added 7 commits to hililiwei/iceberg that referenced this pull request (Aug 9 – Aug 11, 2022).