API,Core,Spark: Add rewritten bytes to rewrite data files procedure results #6801
Conversation
Force-pushed 315b1e4 to 8e7d6c8
Fokko left a comment
LGTM
private final FileGroupInfo info;

/**
 * @deprecated Will be removed in 1.3.0; use {@link
Should we create an issue for this, and add it to the 1.3.0 milestone?
I'll usually clean things like this up right after a release, so I'll open an issue for that shortly.
Awesome, just to make sure that's in the collective memory of the community :)
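For context, the snippet under discussion follows the usual deprecate-then-remove cycle: keep the old member for one more release, point its Javadoc at the replacement, and drop it in the next minor version (1.3.0 here). Below is a minimal sketch of that pattern; the class and method names are hypothetical and are not the actual members touched by this PR.

```java
// Hypothetical names used purely to illustrate the deprecate-then-remove pattern
// discussed above; these are not the classes or methods changed in this PR.
public class FileGroupMetrics {

  private final long rewrittenBytes;

  public FileGroupMetrics(long rewrittenBytes) {
    this.rewrittenBytes = rewrittenBytes;
  }

  /**
   * @deprecated will be removed in 1.3.0; use {@link #rewrittenBytes()} instead.
   */
  @Deprecated
  public long bytesRewritten() {
    return rewrittenBytes();
  }

  /** Replacement accessor that callers should migrate to before 1.3.0. */
  public long rewrittenBytes() {
    return rewrittenBytes;
  }
}
```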
...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java (conversation resolved)
RussellSpitzer left a comment
I'm on board for this adjustment. The only issue is that the test suite becomes a little brittle with the change. I left a note on how I think we can keep it from requiring a lot of patching in the future.
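To make the brittleness concern concrete: tests that assert the full output row of the procedure have to be patched every time a new column (such as rewritten bytes) is added. One way to avoid that is to assert only on the columns under test, looked up by name. This is a sketch of that idea, not the helper actually proposed in the review; the column names are assumptions based on the procedure's documented output.

```java
import static org.junit.Assert.assertEquals;

import java.util.List;
import org.apache.spark.sql.Row;

// Sketch: compare only the named columns a test cares about, so that adding a new
// output column (e.g. rewritten bytes) does not invalidate existing assertions.
public class RewriteResultAssertions {

  static void assertRewriteCounts(List<Row> output, long expectedRewritten, long expectedAdded) {
    Row first = output.get(0);
    assertEquals(expectedRewritten, ((Number) first.getAs("rewritten_data_files_count")).longValue());
    assertEquals(expectedAdded, ((Number) first.getAs("added_data_files_count")).longValue());
  }
}
```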
Force-pushed 8e7d6c8 to 43465c7
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java (outdated; conversation resolved)
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/SparkTestHelperBase.java (conversation resolved)
...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java (conversation resolved)
Force-pushed 43465c7 to f7aa973
insertData(tableName(QUOTED_SPECIAL_CHARS_TABLE_NAME), 10);
insertData(tblName, 10);
// TODO: metadata table access currently fails with special chars in the table name
// long dataSizeBefore = testDataSize(tblName);
This seems like a bug to me: the table with special characters can't be found when running SELECT sum(file_size_in_bytes) FROM %s.files. It fails with:
Caused by: java.io.FileNotFoundException: File file:/tmp/warehouse2890706410427132468.tmp/default/table:with.special:chars/metadata/2ae7cad2-3cff-4c92-9536-9bf9652f119d-m1.avro does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:183)
Turns out this happened because of caching in the Hadoop catalog. Adding cache-enabled=false fixes this issue.
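For reference, this is the kind of catalog configuration the fix refers to. cache-enabled is a standard Iceberg Spark catalog property; the catalog name and warehouse path below are placeholders. A minimal sketch:

```java
import org.apache.spark.sql.SparkSession;

public class HadoopCatalogNoCacheExample {
  public static void main(String[] args) {
    // Catalog name and warehouse location are placeholders; cache-enabled=false is the
    // property mentioned above that avoids the stale-metadata FileNotFoundException.
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("iceberg-no-cache-example")
        .config("spark.sql.catalog.testhadoop", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.testhadoop.type", "hadoop")
        .config("spark.sql.catalog.testhadoop.warehouse", "file:/tmp/iceberg-warehouse")
        .config("spark.sql.catalog.testhadoop.cache-enabled", "false")
        .getOrCreate();

    // With caching disabled, metadata-table queries such as
    //   SELECT sum(file_size_in_bytes) FROM testhadoop.db.tbl.files
    // read the table's current metadata instead of a cached, possibly stale, snapshot.
    spark.stop();
  }
}
```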
Force-pushed e78ef43 to a017a34
...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java (outdated; conversation resolved)
Force-pushed a017a34 to a00a9bd
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java (conversation resolved)
…#6117) Yesterday this PR got merged (apache/iceberg#6801), which introduces one more output value, so the strict check fails. This PR unblocks `query-engine-integration-tests`. Part of fixing #6114.
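To illustrate why a strict output check downstream breaks: the procedure now returns an additional value for rewritten bytes, so any test pinned to the exact column list fails. Below is a sketch of inspecting the output; the catalog and table names are placeholders, and the column names in the comments are assumptions based on this PR's title rather than verified against the released docs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RewriteProcedureOutputCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.active();

    // Iceberg's rewrite_data_files procedure; catalog and table names are placeholders.
    Dataset<Row> result =
        spark.sql("CALL mycatalog.system.rewrite_data_files(table => 'db.tbl')");

    // Before this PR the result exposed counts such as rewritten_data_files_count and
    // added_data_files_count; after it there is one more value for rewritten bytes,
    // which is what tripped the strict column check in the integration tests.
    result.printSchema();
    result.show(false);
  }
}
```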
…esults (apache#6801) Co-authored-by: Alex Reid <[email protected]>
No description provided.