Spark - Accept an output-spec-id that allows writing to a desired partition spec
#7120
Conversation
Should this be in Spark 2.4 as well?

@edgarRd I looked into it, but I'm not sure whether it is worthwhile to add it to Spark 2.4: for our use case, adding it only to Spark 3 should be enough.
szehon-ho
left a comment
I think @RussellSpitzer was interested in this as well (for rewrite to different spec id).
szehon-ho
left a comment
This seems reasonable to me; I left a few minor comments.
I would also ask @aokolnychyi to check whether I am missing any potential problems with this approach.
(Force-pushed from 62592b2 to 684f1cf)
szehon-ho
left a comment
Looks good to me, just a style nit
gustavoatt
left a comment
Thank you for the review @szehon-ho!
szehon-ho
left a comment
Thanks for the code changes, those look good now. Missed some questions/comments on the test.
Thanks @gustavoatt for all the changes. Let me wait a little to see if there are any other concerns; I will merge if not.
```java
this.extraSnapshotMetadata = writeConf.extraSnapshotMetadata();
this.partitionedFanoutEnabled = writeConf.fanoutWriterEnabled();

if (writeConf.outputSpecId() == null) {
```
Why not have this inside SparkWriteConf and make outputSpecId() return int?
Done, I moved this logic to SparkWriteConf. The main reason I initially did not do it there was that I did not want to store the specs and the current spec there, but I think that should be OK.
We could keep a reference to Table, just like we do in SparkReadConf.
Follow up PR #7348
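The fallback being discussed can be sketched as a standalone model. This is a hypothetical sketch, not the actual Iceberg code: the class name, method signature, and error message are assumptions. The idea is that when no output-spec-id option was supplied the writer falls back to the table's current spec id, and an explicitly requested id is validated against the table's known specs.

```java
import java.util.Map;

// Hypothetical sketch of resolving an optional output-spec-id write option
// against a table's known partition specs (not the actual Iceberg code).
class OutputSpecResolver {
  // requestedSpecId: the output-spec-id option value, or null if the option was not set.
  // specsById: the table's partition specs keyed by spec id.
  // currentSpecId: the table's current (default) spec id.
  static int resolve(Integer requestedSpecId, Map<Integer, ?> specsById, int currentSpecId) {
    if (requestedSpecId == null) {
      // No option given: write with the table's current spec.
      return currentSpecId;
    }
    if (!specsById.containsKey(requestedSpecId)) {
      throw new IllegalArgumentException(
          "Output spec id " + requestedSpecId + " is not a valid spec id for the table");
    }
    return requestedSpecId;
  }
}
```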
```java
  return sessionConf.get("spark.wap.id", null);
}

public Integer outputSpecId() {
```
Initially we were not sure whether to make spec IDs public but I also don't see a good alternative.
I am OK with the overall idea of exposing this.
The change looks good to me, I had a few minor comments.

Thanks for the review @aokolnychyi and @szehon-ho. There is a check failure on this PR, but it looks unrelated to my changes:

Yeah, I think it has appeared a few times; it is tracked by #7321. Let's just retrigger (close and re-open, for example).
```java
}

public int outputSpecId() {
  final int outputSpecId =
```
Nit: would you mind removing 'final' here? I know from @aokolnychyi (when he reviewed my change) that he prefers not to have extra finals except on class fields, as modern compilers usually add it anyway. Ex: #4293 (comment)
Done. I usually keep the final as a way of having something like const, to avoid accidentally modifying something I did not intend to. But I think it is not necessary in this case, and I would prefer to remove it to keep consistency in the repo.
Merged, thanks @gustavoatt, and also @aokolnychyi and @edgarRd for the additional review.

Thanks for the review and for merging, @szehon-ho @aokolnychyi! I appreciate the effort spent reviewing!
Summary
Allow passing `output-spec-id` to the Spark writer, so that we can choose which partition spec to write to. This is useful when a table has more than one active spec (e.g. a daily and an hourly partition spec).

Fixes #6932
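For context, an illustrative fragment of how the option might be passed from Spark. This is a sketch, not verbatim from the PR: it assumes a live Spark session with an Iceberg catalog configured, an existing DataFrame `df`, and a made-up table name.

```java
// Illustrative only: append to an Iceberg table using partition spec id 1
// instead of the table's current default spec.
df.writeTo("db.events")
    .option("output-spec-id", "1")
    .append();
```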
Testing