Conversation

@hililiwei (Contributor) commented Mar 28, 2022

This PR consists of the following parts:

  • Port the ORC rolling writer and its unit tests to Spark 3.1 / 3.0 / 2.4 and Flink 1.13 / 1.12, refer: #3784
  • Remove the Formats that are no longer needed.
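The "estimated length for an unclosed file" idea behind this PR can be sketched as follows. This is an illustrative toy, not Iceberg's or ORC's actual API: the class name `EstimatingLengthWriter`, its fields, and the running-mean heuristic are all invented for this example. The point it demonstrates is that a writer buffering rows in memory cannot report an exact file length until it is closed, so a rolling writer that rolls on target file size needs an estimate covering the still-buffered portion.

```java
// Hypothetical sketch, not Iceberg's writer API. A writer buffers rows
// before flushing them, so while the file is open its on-storage size lags
// behind the data written; length() returns an estimate until close().
public class EstimatingLengthWriter {
  private long flushedBytes = 0;  // bytes already flushed to storage
  private long bufferedRows = 0;  // rows accepted but not yet flushed
  private long avgRowBytes = 0;   // running mean of encoded row size
  private long totalRows = 0;
  private boolean closed = false;

  public void write(long encodedRowBytes) {
    totalRows++;
    bufferedRows++;
    // incremental running mean of the encoded row size
    avgRowBytes += (encodedRowBytes - avgRowBytes) / totalRows;
  }

  public void flush() {
    flushedBytes += bufferedRows * avgRowBytes;
    bufferedRows = 0;
  }

  /** Exact length once closed; an estimate while the file is still open. */
  public long length() {
    if (closed) {
      return flushedBytes;
    }
    return flushedBytes + bufferedRows * avgRowBytes;
  }

  public void close() {
    flush();
    closed = true;
  }
}
```

A rolling writer built on this would check `length()` after each write and roll to a new file once the estimate crosses the target file size, instead of waiting for a close to learn the real size.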

@github-actions github-actions bot added the spark label Mar 28, 2022
@openinx (Member) commented Mar 29, 2022

As this is a minor change for a specific engine version, could you please create a single PR that includes these changes for all engine versions?

@hililiwei hililiwei changed the title Spark 3.1: ORC support estimated length for unclosed file. Spark: ORC support estimated length for unclosed file. Mar 29, 2022
@hililiwei (Contributor, Author) replied:

> As this is a minor change for a specific engine version, could you please create a single PR that includes these changes for all engine versions?

Should the Flink changes also be submitted here, or in a separate PR?

@openinx (Member) commented Mar 29, 2022

I think it's fair to put all the engines' changes into a single PR.

@hililiwei hililiwei changed the title Spark: ORC support estimated length for unclosed file. Spark/Flink: ORC support estimated length for unclosed file. Mar 29, 2022
@github-actions github-actions bot added the flink label Mar 29, 2022
@hililiwei (Contributor, Author) replied:

> I think it's fair to put all the engines' changes into a single PR.

Updated the code and the PR description.

@openinx (Member) commented Mar 29, 2022

I just checked all the ORC rolling file writer unit tests; they all seem to be enabled after this PR:

➜  iceberg git:(4419) find . -type f -name '*.java'  | xargs grep -i 'Assume.*ORC'
./mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java:    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
./hive3/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java:    // reader will assume that the ORC file ends at the task's start + length, and might fail reading the tail..
./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
➜  iceberg git:(4419) find . -type f -name '*.scala'  | xargs grep -i 'TODO.*orc' 

@openinx (Member) left a comment


Looks pretty great to me!

@openinx openinx merged commit f6e1114 into apache:master Mar 29, 2022
@hililiwei hililiwei deleted the orc-writer-spark3.1 branch March 29, 2022 08:59
hililiwei added 7 commits to hililiwei/iceberg that referenced this pull request (Aug 9 – Aug 11, 2022).