Skip to content

Conversation

@jackye1995
Copy link
Contributor

@jackye1995 jackye1995 commented Mar 8, 2023

Fixes #6774

based on #6965 , separated PR for writing to WAP branch. Will rebase once that is merged.

@rdblue @aokolnychyi @amogh-jahagirdar @namrathamyske

@jackye1995 jackye1995 changed the title Branch wap Spark 3.3: support write to WAP branch Mar 8, 2023
@jackye1995 jackye1995 closed this Mar 9, 2023
@jackye1995 jackye1995 reopened this Mar 9, 2023
import org.junit.Before;
import org.junit.Test;

public class TestPartitionedWritesToWAPBranch extends PartitionedWritesTestBase {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add TestUnpartitionedWritesToWAPBranch, seems unnecessary to add more tests because it does not test additional case in the code path.

@aokolnychyi
Copy link
Contributor

I'll take a look on Monday.

}

// commit target in WAP is just the table name
// should use table + branch name instead for read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this. I think it has to read the branch if it exists.

For example, a DELETE command is going to modify the state of a branch if it exists, and it could exist because branch WAP supports multiple writes. That means the reads for both the delete itself and any dynamic pruning must read the branch if it exists and main if it doesn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we need to use the branch during planning of the write to handle the dynamic pruning case, that's what came up in https://github.com/apache/iceberg/pull/6651/files. If we pass the branch through the SparkTable that should take care of this? Whether it's WAP branch or not doesn't matter, just needs to read the branch

}

// commit target in WAP is just the table name
// should use table + branch name instead for read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we need to use the branch during planning of the write to handle the dynamic pruning case, that's what came up in https://github.com/apache/iceberg/pull/6651/files. If we pass the branch through the SparkTable that should take care of this? Whether it's WAP branch or not doesn't matter, just needs to read the branch

@jackye1995
Copy link
Contributor Author

@rdblue @amogh-jahagirdar thanks for the suggestions, I have addressed all nit comments, and also fixed the scan issue and updated the tests to verify behavior of multiple writes. Could you take another look?

@jackye1995 jackye1995 added this to the Iceberg 1.2.0 milestone Mar 13, 2023
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass @jackye1995 just a nit but this looks great to me overall!

branch == null,
"Cannot write to both branch and WAP branch, but got branch [%s] and WAP branch [%s]",
branch,
wapBranch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this behavior is a blocker because it is strict, but I would expect to be able to write to another branch with the WAP branch set. I'm curious what other people think the long-term behavior should be.

I think this behavior does help ensure that there are no side-effects, which is good if you want people to trust the pattern. But that's undermined by enabling/disabling WAP on a per-table basis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I added issue #7103 and we can discuss there with related people.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a reasonable starting point to me.

@rdblue rdblue merged commit e7d9ec1 into apache:master Mar 13, 2023
@rdblue
Copy link
Contributor

rdblue commented Mar 13, 2023

Thanks @jackye1995! Looks great.

return inputBranch;
}

boolean wapEnabled =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer a separate method called wapEnabled() like we have in SparkWriteConf. Then we could use the constant for the default value and it would simplify this method.

public boolean wapEnabled() {
  return confParser
      .booleanConf()
      .tableProperty(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED)
      .defaultValue(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED_DEFAULT)
      .parse();
}

if (wapEnabled()) {
String wapId = wapId();
String wapBranch =
confParser.stringConf().sessionConf(SparkSQLProperties.WAP_BRANCH).parseOptional();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: What about a separate method like we have for wapId()?

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late +1 from me too.

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support SQLConf for branch write similar to WAP

4 participants