@hililiwei
Contributor

@hililiwei hililiwei commented Jan 18, 2023

Co-authored-by: Amogh Jahagirdar [email protected]
Co-authored-by: chidayong [email protected]

What is the purpose of the change

Implement the syntax in the following documents:
https://docs.google.com/document/d/1tbATFPrKF3vNlzkgZQdaW8CAJmbjvryfrlg6C2Ci_aA/edit

As suggested by @jackye1995, the DDL is further split into smaller PRs.

Create Branches

ALTER TABLE tableName
{CREATE | REPLACE} BRANCH branchName [AS OF {VERSION snapshotId}]
[RETAIN interval {DAYS | HOURS | MINUTES}]
[WITH SNAPSHOT RETENTION {[num_snapshots SNAPSHOTS] [interval {DAYS | HOURS | MINUTES}]}]

e.g.

ALTER TABLE prod.db.my_table 
CREATE BRANCH test_branch AS OF VERSION 100 
RETAIN 12 MONTHS
WITH SNAPSHOT RETENTION 5 SNAPSHOTS 7 DAYS

cc @jackye1995 @amogh-jahagirdar

@github-actions github-actions bot added the spark label Jan 18, 2023
@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch 2 times, most recently from 44b8082 to d8728d0 Compare January 18, 2023 15:27
snapshotRetentionClause
: WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS
| WITH SNAPSHOT RETENTION snapshotRetain snapshotRetainTimeUnit
| WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit
Contributor

@jackye1995 jackye1995 Jan 18, 2023

Can we simplify these 3 cases to be WITH SNAPSHOT RETENTION (numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)??

Contributor Author

Listing the three cases rejects the illegal statement that has nothing after the keywords: WITH SNAPSHOT RETENTION
Alternatively, we could use
WITH SNAPSHOT RETENTION ((numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)? | numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit )?
but it's not intuitive.

Contributor

I see, I will leave it here to see if anyone has better suggestions. I am not an Antlr expert 😝
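
To make the trade-off in this thread concrete, here is a minimal sketch that uses Java regular expressions as a stand-in for the two rule shapes. It is not the PR's ANTLR grammar; the class name and the patterns are assumptions made only for illustration. It shows that making both parts optional also accepts the bare clause, while three explicit alternatives do not.

```java
import java.util.regex.Pattern;

// Regular expressions stand in for the ANTLR rules here; this is not the PR's grammar.
final class RetentionClauseShapes {
  private static final String NUM = "\\d+ SNAPSHOTS";
  private static final String AGE = "\\d+ (DAYS|HOURS|MINUTES)";

  // Shape suggested in review: both parts optional, so the bare clause also matches.
  static final Pattern OPTIONAL_PARTS =
      Pattern.compile("WITH SNAPSHOT RETENTION( " + NUM + ")?( " + AGE + ")?");

  // Shape with three explicit alternatives: every alternative carries at least one part.
  static final Pattern EXPLICIT_ALTERNATIVES =
      Pattern.compile(
          "WITH SNAPSHOT RETENTION (" + NUM + " " + AGE + "|" + NUM + "|" + AGE + ")");

  static boolean acceptedByOptionalParts(String clause) {
    return OPTIONAL_PARTS.matcher(clause).matches();
  }

  static boolean acceptedByExplicitAlternatives(String clause) {
    return EXPLICIT_ALTERNATIVES.matcher(clause).matches();
  }

  public static void main(String[] args) {
    String bare = "WITH SNAPSHOT RETENTION";
    String full = "WITH SNAPSHOT RETENTION 5 SNAPSHOTS 7 DAYS";
    System.out.println("optional parts accept bare clause: " + acceptedByOptionalParts(bare));
    System.out.println("alternatives accept bare clause:   " + acceptedByExplicitAlternatives(bare));
    System.out.println("alternatives accept full clause:   " + acceptedByExplicitAlternatives(full));
  }
}
```

With the optional form, rejecting the bare clause would have to happen in a later validation step instead of in the grammar.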

Comment on lines +51 to +61
case table =>
throw new UnsupportedOperationException(s"Cannot add branch to non-Iceberg table: $table")
Contributor

I think this case will be common for all the reference based operations. We may want to see about extracting to a common parent. Not needed at this point, but we may revisit in later DDL implementations.

Contributor

This seems like an existing pattern for all extensions, so I think it is probably fine to leave it like this.

@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch from d8728d0 to 33375b0 Compare January 19, 2023 06:38
@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch from 33375b0 to 5d815f6 Compare January 19, 2023 06:42
@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch from 5d815f6 to 7e72948 Compare January 19, 2023 07:53
: MONTHS
| DAYS
| HOURS
| MINUTES
Contributor

missing SECONDS?

Contributor Author

Should we support it? I would prefer minutes as the finest granularity.

Contributor

Sure, I am fine with minute-level granularity. We can always add more units if needed.
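
As a rough illustration of the outcome of this thread, the sketch below converts an interval and a unit keyword into milliseconds and rejects anything finer than minutes. The helper name `toMillis` and the class are hypothetical and are not taken from the PR; only `java.util.concurrent.TimeUnit` is a real API.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

final class RetentionUnits {
  // Hypothetical helper: accepts only the units discussed in the review (DAYS, HOURS, MINUTES).
  static long toMillis(long interval, String unit) {
    switch (unit.toUpperCase(Locale.ROOT)) {
      case "DAYS":
        return TimeUnit.DAYS.toMillis(interval);
      case "HOURS":
        return TimeUnit.HOURS.toMillis(interval);
      case "MINUTES":
        return TimeUnit.MINUTES.toMillis(interval);
      default:
        // SECONDS and anything else is rejected, mirroring the minute-level decision.
        throw new IllegalArgumentException("Unsupported time unit: " + unit);
    }
  }

  public static void main(String[] args) {
    System.out.println(toMillis(7, "DAYS"));     // 604800000
    System.out.println(toMillis(30, "MINUTES")); // 1800000
  }
}
```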

@jackye1995
Contributor

Ping some people for thoughts around the syntax: @rdblue @RussellSpitzer @nastra

@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch from f7dc678 to a29615b Compare January 20, 2023 02:27
@hililiwei hililiwei force-pushed the spark_sql_extend_for_create_branch branch from 0149e01 to e29c1a1 Compare January 20, 2023 07:15
Contributor

@jackye1995 jackye1995 left a comment

looks good to me! Waiting for some feedback from other committers

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Took another pass, LGTM! thanks for contributing this

@jackye1995
Contributor

jackye1995 commented Jan 20, 2023

I think this PR is mostly ready to go. I see there is a comment in design doc from @flyrain:

"VERSION" is used in Iceberg to indicate any table change, including table property changes, which don't create a new snapshot. So versions and snapshots don't match. Can we call it "SNAPSHOT" to remove the ambiguity?

Requesting one more review from him

@jackye1995 jackye1995 requested a review from flyrain January 20, 2023 22:08
@jackye1995
Contributor

It looks like #6637 contains quite a few duplicated changes as long as this one is not merged. Given that there are 2 committer votes, I will merge this one first to unblock that PR, and will request @flyrain's review on the next PR. Thanks for the work, @hililiwei, and thanks for all the reviews, @yyanyy and @amogh-jahagirdar!

@jackye1995 jackye1995 merged commit fc07921 into apache:master Jan 21, 2023
: number
;

snapshotRefRetain
Contributor

Why are there so many aliases for number? Are these rules useful?

Contributor Author

@jackye1995 asked the same question.

I originally added snapshotRefRetain and snapshotRetain to make the statement parsing code more readable. Removing them is technically feasible. In the new version, I have removed them (including from create branch).
ref: #6637 (comment)


public class TestCreateBranch extends SparkExtensionsTestBase {

@Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
Contributor

Why is this testing with multiple catalogs? This is a table-level operation that shouldn't be affected by the catalog, so it should test just one.

Contributor Author

Yes, we now test only the SparkCatalogConfig.SPARK catalog.

  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
  public static Object[][] parameters() {
    return new Object[][] {
      {
        SparkCatalogConfig.SPARK.catalogName(),
        SparkCatalogConfig.SPARK.implementation(),
        SparkCatalogConfig.SPARK.properties()
      }
    };
  }

AddPartitionFieldExec(catalog, ident, transform, name) :: Nil

case CreateBranch(IcebergCatalogAndIdentifier(catalog, ident), _, _, _, _, _) =>
CreateBranchExec(catalog, ident, plan.asInstanceOf[CreateBranch]) :: Nil
Contributor

Why does this pass the logical plan rather than passing the necessary information? Is it just to avoid a longer line?

Contributor Author

I passed in the CreateBranch instance directly to reduce the argument count.

sql("ALTER TABLE %s CREATE BRANCH %s", tableName, branchName);
table.refresh();
SnapshotRef ref = table.refs().get(branchName);
Assert.assertNotNull(ref);
Contributor

This should assert the state of the branch. I think it would use the current snapshot of main, right?

Contributor Author

updated in #6637.

@Test
public void testCreateBranch() throws NoSuchTableException {
Table table = createDefaultTableAndInsert2Row();
long snapshotId = table.currentSnapshot().snapshotId();
Contributor

This should use a snapshot other than the default to test the clause.

Contributor Author

Makes sense. updated in #6637.

tableName, branchName, snapshotId, maxRefAge, minSnapshotsToKeep, maxSnapshotAge);
table.refresh();
SnapshotRef ref = table.refs().get(branchName);
Assert.assertNotNull(ref);
Contributor

This needs an assertion about the snapshot referenced by the branch.

Contributor Author

updated in #6637.
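
The review points about asserting branch state can be sketched as follows. This is a self-contained illustration only: `BranchRef` and `firstMismatch` are hypothetical stand-ins and are not Iceberg's `SnapshotRef` or the assertions added in #6637. The idea is that a not-null check passes even when the branch points at the wrong snapshot, so the check compares the snapshot id (ideally one that differs from the current snapshot) and the retention settings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class BranchAssertionSketch {
  // Hypothetical stand-in for a branch reference; it only mirrors the fields discussed in review.
  static final class BranchRef {
    final long snapshotId;
    final Integer minSnapshotsToKeep;
    final Long maxSnapshotAgeMs;

    BranchRef(long snapshotId, Integer minSnapshotsToKeep, Long maxSnapshotAgeMs) {
      this.snapshotId = snapshotId;
      this.minSnapshotsToKeep = minSnapshotsToKeep;
      this.maxSnapshotAgeMs = maxSnapshotAgeMs;
    }
  }

  // Returns null when the branch matches expectations, otherwise a description of the first mismatch.
  static String firstMismatch(
      Map<String, BranchRef> refs,
      String branch,
      long expectedSnapshotId,
      Integer expectedMinSnapshotsToKeep,
      Long expectedMaxSnapshotAgeMs) {
    BranchRef ref = refs.get(branch);
    if (ref == null) {
      return "branch not found: " + branch;
    }
    if (ref.snapshotId != expectedSnapshotId) {
      return "snapshot id: expected " + expectedSnapshotId + " but was " + ref.snapshotId;
    }
    if (!Objects.equals(ref.minSnapshotsToKeep, expectedMinSnapshotsToKeep)) {
      return "min snapshots to keep: expected " + expectedMinSnapshotsToKeep;
    }
    if (!Objects.equals(ref.maxSnapshotAgeMs, expectedMaxSnapshotAgeMs)) {
      return "max snapshot age: expected " + expectedMaxSnapshotAgeMs;
    }
    return null;
  }

  public static void main(String[] args) {
    long firstSnapshotId = 100L;   // an older snapshot, deliberately not the current one
    long currentSnapshotId = 200L;
    Map<String, BranchRef> refs = new HashMap<>();
    refs.put("b1", new BranchRef(firstSnapshotId, 2, 172_800_000L));

    System.out.println(firstMismatch(refs, "b1", firstSnapshotId, 2, 172_800_000L));
    System.out.println(firstMismatch(refs, "b1", currentSnapshotId, 2, 172_800_000L));
  }
}
```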

AssertHelpers.assertThrows(
"Illegal statement",
IllegalFormatConversionException.class,
"d != java.lang.String",
Contributor

@rdblue rdblue Jan 23, 2023

This looks suspicious. Why is this not a mismatched input?

It looks like this isn't reaching the parser and is instead failing in String.format?

Contributor Author

fixed in the latest version. #6637
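
The failure mode flagged in this thread can be reproduced without Spark. In the minimal sketch below, the statement template and the class name are assumptions for illustration: passing a `String` for a `%d` placeholder makes `String.format` throw `IllegalFormatConversionException` before any SQL text exists, so a parser would never see the statement. Using `%s` for the bad value lets the text through, which is what a parser-level test needs.

```java
import java.util.IllegalFormatConversionException;

final class FormatBeforeParse {
  // Returns the exception message if formatting fails, or null if formatting succeeds.
  static String formatFailure(String template, Object... args) {
    try {
      String.format(template, args);
      return null;
    } catch (IllegalFormatConversionException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    // "%d" with a String argument fails inside String.format, not in a SQL parser.
    System.out.println(
        formatFailure("ALTER TABLE %s CREATE BRANCH %s RETAIN %d DAYS", "db.tbl", "b1", "abc"));
    // prints: d != java.lang.String

    // "%s" formats fine, so the invalid value would actually reach the parser.
    System.out.println(
        formatFailure("ALTER TABLE %s CREATE BRANCH %s RETAIN %s DAYS", "db.tbl", "b1", "abc"));
    // prints: null
  }
}
```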

public void testCreateBranchUseCustomMinSnapshotsToKeepAndMaxSnapshotAge()
throws NoSuchTableException {
Integer minSnapshotsToKeep = 2;
long maxSnapshotAge = 2L;
Contributor

When using variables like this, it really helps to name them with a unit. That makes it easier to validate the uses, like %d DAYS", ... maxSnapshotAgeDays, ... and TimeUnit.DAYS.toMillis(maxSnapshotAgeDays).

Contributor Author

I changed the unit to a variable, so a fixed unit suffix in the name is no longer appropriate. Any suggestions?
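
One possible way to reconcile the two comments above, sketched under assumptions: when the unit itself is a test parameter, the value can be kept unit-free and paired with a `TimeUnit`, so the SQL text and the expected milliseconds come from the same pair. The class and method names here are hypothetical; the statement text follows the syntax in the PR description.

```java
import java.util.concurrent.TimeUnit;

final class UnitPairedRetention {
  // Builds the DDL text from a value and its unit, so the two cannot drift apart.
  static String retentionSql(
      String table, String branch, int minSnapshotsToKeep, long maxSnapshotAge, TimeUnit timeUnit) {
    return String.format(
        "ALTER TABLE %s CREATE BRANCH %s WITH SNAPSHOT RETENTION %d SNAPSHOTS %d %s",
        table, branch, minSnapshotsToKeep, maxSnapshotAge, timeUnit.name());
  }

  // The expected value for assertions is derived from the same value and unit.
  static long expectedMaxSnapshotAgeMs(long maxSnapshotAge, TimeUnit timeUnit) {
    return timeUnit.toMillis(maxSnapshotAge);
  }

  public static void main(String[] args) {
    System.out.println(retentionSql("db.tbl", "b1", 2, 2L, TimeUnit.DAYS));
    System.out.println(expectedMaxSnapshotAgeMs(2L, TimeUnit.DAYS)); // 172800000
  }
}
```

When the unit is fixed, the reviewer's suggestion of a suffixed name such as maxSnapshotAgeDays remains the simpler option.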

AssertHelpers.assertThrows(
"Illegal statement",
IcebergParseException.class,
"mismatched input 'SECONDS' expecting {'DAYS', 'HOURS', 'MINUTES'}",
Contributor

HOURS and MINUTES are never tested, but should be.

Contributor Author

I added a fourth test parameter to cover the different time units.

  @Parameterized.Parameters(
      name = "catalogName = {0}, implementation = {1}, config = {2}, timeUnit = {3}")

@hililiwei
Contributor Author

hililiwei commented Jan 23, 2023

hi @rdblue, thank you for your review. I tried to address your comments in #6637. I can raise a separate PR if you think it's necessary, of course.

@hililiwei hililiwei deleted the spark_sql_extend_for_create_branch branch February 11, 2023 02:21
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
Co-authored-by: Amogh Jahagirdar <[email protected]>
Co-authored-by: chidayong <[email protected]>