Spark 3.3: support write to WAP branch #7050
Conversation
```java
import org.junit.Before;
import org.junit.Test;

public class TestPartitionedWritesToWAPBranch extends PartitionedWritesTestBase {
```
I did not add TestUnpartitionedWritesToWAPBranch; it seems unnecessary to add more tests because it would not exercise any additional case in the code path.
I'll take a look on Monday.
```java
}

// commit target in WAP is just the table name
// should use table + branch name instead for read
```
I don't agree with this. I think it has to read the branch if it exists.
For example, a DELETE command is going to modify the state of a branch if it exists, and it could exist because branch WAP supports multiple writes. That means the reads for both the delete itself and any dynamic pruning must read the branch if it exists and main if it doesn't.
Yeah, we need to use the branch during planning of the write to handle the dynamic pruning case; that's what came up in https://github.com/apache/iceberg/pull/6651/files. If we pass the branch through the SparkTable, that should take care of this. Whether it's the WAP branch or not doesn't matter; it just needs to read the branch.
@rdblue @amogh-jahagirdar thanks for the suggestions. I have addressed all the nit comments, fixed the scan issue, and updated the tests to verify the behavior of multiple writes. Could you take another look?
I took another pass, @jackye1995. Just a nit, but this looks great to me overall!
```java
    branch == null,
    "Cannot write to both branch and WAP branch, but got branch [%s] and WAP branch [%s]",
    branch,
    wapBranch);
```
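To make the semantics of the check above concrete, here is a minimal, self-contained sketch. The class and method names are illustrative only, not the actual SparkWriteConf code, which performs this validation with Guava's `Preconditions.checkArgument`:

```java
// Illustrative sketch of the branch-vs-WAP-branch mutual-exclusion check.
// Names (WapBranchCheck, resolveWriteBranch) are hypothetical.
public class WapBranchCheck {

  // Returns the branch a write should target: the WAP branch when one is
  // set in the session, otherwise the explicitly requested branch.
  // Setting both is rejected, mirroring the error message under review.
  static String resolveWriteBranch(String branch, String wapBranch) {
    if (wapBranch != null && branch != null) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot write to both branch and WAP branch, but got branch [%s] and WAP branch [%s]",
              branch, wapBranch));
    }
    return wapBranch != null ? wapBranch : branch;
  }

  public static void main(String[] args) {
    System.out.println(resolveWriteBranch(null, "audit")); // WAP branch wins when no explicit branch
    System.out.println(resolveWriteBranch("dev", null));   // explicit branch used when no WAP branch
    try {
      resolveWriteBranch("dev", "audit");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```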
I don't think this behavior is a blocker because it is strict, but I would expect to be able to write to another branch with the WAP branch set. I'm curious what other people think the long-term behavior should be.
I think this behavior does help ensure that there are no side-effects, which is good if you want people to trust the pattern. But that's undermined by enabling/disabling WAP on a per-table basis.
Thank you! I filed issue #7103, and we can discuss it there with the relevant people.
It seems like a reasonable starting point to me.
Thanks @jackye1995! Looks great.
```java
  return inputBranch;
}

boolean wapEnabled =
```
nit: I'd prefer a separate method called wapEnabled() like we have in SparkWriteConf. Then we could use the constant for the default value and it would simplify this method.
```java
public boolean wapEnabled() {
  return confParser
      .booleanConf()
      .tableProperty(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED)
      .defaultValue(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED_DEFAULT)
      .parse();
}
```
```java
if (wapEnabled()) {
  String wapId = wapId();
  String wapBranch =
      confParser.stringConf().sessionConf(SparkSQLProperties.WAP_BRANCH).parseOptional();
```
nit: What about a separate method like we have for wapId()?
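Following the existing wapId() pattern, such an accessor might look like the fragment below. This is a sketch only, written against the same confParser field used in the surrounding class; it is not runnable on its own, and the method name wapBranch() is a suggestion rather than the committed code:

```java
// Sketch: a dedicated accessor mirroring wapId(), assuming the same confParser field.
private String wapBranch() {
  return confParser.stringConf().sessionConf(SparkSQLProperties.WAP_BRANCH).parseOptional();
}
```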
aokolnychyi left a comment:
Late +1 from me too.
Fixes #6774
Based on #6965; this is a separate PR for writing to the WAP branch. Will rebase once that is merged.
@rdblue @aokolnychyi @amogh-jahagirdar @namrathamyske