-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark: Spark SQL Extensions for create branch #6617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Spark SQL Extensions for create branch #6617
Conversation
44b8082 to
d8728d0
Compare
...sions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
Show resolved
Hide resolved
...tensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateBranchExec.scala
Outdated
Show resolved
Hide resolved
...3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotRefSQL.java
Outdated
Show resolved
Hide resolved
...sions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
Outdated
Show resolved
Hide resolved
| snapshotRetentionClause | ||
| : WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS | ||
| | WITH SNAPSHOT RETENTION snapshotRetain snapshotRetainTimeUnit | ||
| | WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify these 3 cases to be WITH SNAPSHOT RETENTION (numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This skips illegal statements : WITH SNAPSHOT RETENTION
or we can use
WITH SNAPSHOT RETENTION ((numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)? | numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit )?
But it's not intuitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I will leave it here to see if anyone has better suggestions. I am not an Antlr expert 😝
...3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotRefSQL.java
Outdated
Show resolved
Hide resolved
...sions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
Outdated
Show resolved
Hide resolved
...ark-extensions/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/CreateBranch.scala
Outdated
Show resolved
Hide resolved
| case table => | ||
| throw new UnsupportedOperationException(s"Cannot add branch to non-Iceberg table: $table") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this case will be common for all the reference based operations. We may want to see about extracting to a common parent. Not needed at this point, but we may revisit in later DDL implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an existing pattern for all extensions, so I think it is probably fine to leave it like this.
d8728d0 to
33375b0
Compare
Co-authored-by: Amogh Jahagirdar <[email protected]> Co-authored-by: chidayong <[email protected]>
33375b0 to
5d815f6
Compare
5d815f6 to
7e72948
Compare
| : MONTHS | ||
| | DAYS | ||
| | HOURS | ||
| | MINUTES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing SECONDS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we support it? I prefer at least the minute-level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I am fine with minute level. We can always add more if needed.
...n/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSqlExtensionsAstBuilder.scala
Outdated
Show resolved
Hide resolved
|
Ping some people for thoughts around the syntax: @rdblue @RussellSpitzer @nastra |
f7dc678 to
a29615b
Compare
...n/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSqlExtensionsAstBuilder.scala
Outdated
Show resolved
Hide resolved
...n/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSqlExtensionsAstBuilder.scala
Outdated
Show resolved
Hide resolved
0149e01 to
e29c1a1
Compare
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me! Waiting for some feedback from other committers
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another pass, LGTM! thanks for contributing this
|
I think this PR is mostly ready to go. I see there is a comment in design doc from @flyrain:
Requesting one more review from him |
|
Looks like there are quite a few duplicates for #6637 without merging this one. In that case given there are 2 committer votes, I will first merge this one to unblock that PR, and will request @flyrain 's review on the next PR. Thanks for the work @hililiwei , and thanks for all the reviews @yyanyy and @amogh-jahagirdar ! |
| : number | ||
| ; | ||
|
|
||
| snapshotRefRetain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there so many aliases for number? Are these rules useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 asked the same question.
I originally added snapshotRefRetain and snapshotRetain to make the statement parsing code more readable. Removing it is technically feasible. In the new version, I have removed (including create branch).
ref: #6637 (comment)
|
|
||
| public class TestCreateBranch extends SparkExtensionsTestBase { | ||
|
|
||
| @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this testing with multiple catalogs? This is a table-level operation that shouldn't be affected by the catalog, so it should test just one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we just test SparkCatalogConfig.SPARK catalog.
@Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
public static Object[][] parameters() {
return new Object[][] {
{
SparkCatalogConfig.SPARK.catalogName(),
SparkCatalogConfig.SPARK.implementation(),
SparkCatalogConfig.SPARK.properties()
}
};
}
| AddPartitionFieldExec(catalog, ident, transform, name) :: Nil | ||
|
|
||
| case CreateBranch(IcebergCatalogAndIdentifier(catalog, ident), _, _, _, _, _) => | ||
| CreateBranchExec(catalog, ident, plan.asInstanceOf[CreateBranch]) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this pass the logical plan rather than passing the necessary information? Is it just to avoid a longer line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I passed in the CreateBranch instance directly to reduce the argument count.
| sql("ALTER TABLE %s CREATE BRANCH %s", tableName, branchName); | ||
| table.refresh(); | ||
| SnapshotRef ref = table.refs().get(branchName); | ||
| Assert.assertNotNull(ref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should assert the state of the branch. I think it would use the current snapshot of main, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in #6637.
| @Test | ||
| public void testCreateBranch() throws NoSuchTableException { | ||
| Table table = createDefaultTableAndInsert2Row(); | ||
| long snapshotId = table.currentSnapshot().snapshotId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use a snapshot other than the default to test the clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. updated in #6637.
| tableName, branchName, snapshotId, maxRefAge, minSnapshotsToKeep, maxSnapshotAge); | ||
| table.refresh(); | ||
| SnapshotRef ref = table.refs().get(branchName); | ||
| Assert.assertNotNull(ref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs an assertion about the snapshot referenced by the branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in #6637.
| AssertHelpers.assertThrows( | ||
| "Illegal statement", | ||
| IllegalFormatConversionException.class, | ||
| "d != java.lang.String", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks suspicious. Why is this not a mismatched input?
It looks like this isn't reaching the parser and is instead failing in String.format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the latest version. #6637
| public void testCreateBranchUseCustomMinSnapshotsToKeepAndMaxSnapshotAge() | ||
| throws NoSuchTableException { | ||
| Integer minSnapshotsToKeep = 2; | ||
| long maxSnapshotAge = 2L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using variables like this, it really helps to name them with a unit. That makes it easier to validate the uses, like %d DAYS", ... maxSnapshotAgeDays, ... and TimeUnit.DAYS.toMillis(maxSnapshotAgeDays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the unit to a variable. Using a fixed unit suffix is not appropriate, any suggestions?
| AssertHelpers.assertThrows( | ||
| "Illegal statement", | ||
| IcebergParseException.class, | ||
| "mismatched input 'SECONDS' expecting {'DAYS', 'HOURS', 'MINUTES'}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HOURS and MINUTES are never tested, but should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using the fourth param to test different units of time.
@Parameterized.Parameters(
name = "catalogName = {0}, implementation = {1}, config = {2}, timeUnit = {3}")
Co-authored-by: Amogh Jahagirdar <[email protected]> Co-authored-by: chidayong <[email protected]>
Co-authored-by: Amogh Jahagirdar [email protected]
Co-authored-by: chidayong [email protected]
What is the purpose of the change
Implement the syntax in the following documents:
https://docs.google.com/document/d/1tbATFPrKF3vNlzkgZQdaW8CAJmbjvryfrlg6C2Ci_aA/edit
As suggested by @jackye1995, further split the DDL into smaller PR.
Create Branches
e.g.
cc @jackye1995 @amogh-jahagirdar