Skip to content

Conversation

@zhangbutao
Copy link
Contributor

@zhangbutao zhangbutao commented May 12, 2023

  1. create a new empty iceberg table using Spark3.3
    create table testicespark(id int) using iceberg;

  2. create a brancn on the empty table:
    alter table hivedblake.testicespark create branch "branch1";

  3. NullPointerException

ERROR SparkSQLDriver: Failed in [alter table testicespark create branch testbranch]
java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.v2.CreateOrReplaceBranchExec.$anonfun$run$1(CreateOrReplaceBranchExec.scala:44)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.execution.datasources.v2.CreateOrReplaceBranchExec.run(CreateOrReplaceBranchExec.scala:44)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
....

I think if table is empty, we should throw a friendly exception instead of null exception when creating branch.

@github-actions github-actions bot added the spark label May 12, 2023
Comment on lines 44 to 49
if (iceberg.table.currentSnapshot() == null) {
throw new UnsupportedOperationException(s"The Iceberg table: $iceberg" + " has no snapshots")
}
Copy link
Contributor

@dramaticlly dramaticlly May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code belowiceberg.table.currentSnapshot().snapshotId always assume the currentSnapshot is not null but it's not the valid chaining case.

Probably need a way to probably handle that instead of branching out and throw a new exception. FYI @amogh-jahagirdar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very familiar with design of iceberg snapshots. I found many annoying code snippet snapshot == null , maybe the issue will be gone if every new table has a default snapshot. Just my immature thought. :)

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying this issue and raising this PR @zhangbutao! Have some comments on the fix

Comment on lines 44 to 49
if (iceberg.table.currentSnapshot() == null) {
throw new UnsupportedOperationException(s"The Iceberg table: $iceberg" + " has no snapshots")
}
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should throw a better exception here thanks for identifying this issue! Few points:

1.) I don't think the fix is correct since it is a valid case that there is no main branch and thus table.currentSnapshot() is null, and the user specifies an explicit snapshot to create from some other non-main branch. We don't want to throw unnecessarily

So the commit graph looks something like

(empty main table state)

                     S4 (Branch "b")
                         /
S1 -> S2 -> S3 (Branch "a")

Creating branch b should still succeed but with this implementation we'll fail unnecessarily.

So in the core library invalid snapshots are prevented here https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1181

but this case is a bit different just because in the spark procedure we default to choosing the latest main snapshot if a snapshot is not specified. I think one simple way forward is assign snapshotId to a Long which can be null in case table.currentSnapshot() is null. Then before the API call to create the branch, throw our own exception in case the current snapshot is null.

2.) UnsupportedOperationException is not the right exception imo. I think we should be throwing an IllegalArgumentException to indicate that there is no latest main snapshot, and that the user should specify an explicit snapshot.

3.) Could we just keep the PR focused to 3.4 and also add unit tests to cover this case? We can later backport and address if any other DDLs are impacted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amogh-jahagirdar Thanks for detailed explanation! I might prefer to throw UnsupportedOperationException if user does not specify the snapshot and the main latest snapshot is null.
But if you think it is more reasonable to pass null snapshotId to this API https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1181, I'm happy to make further changes.

I have refined the PR. Please take a look again if you have a chance.

@zhangbutao zhangbutao force-pushed the fix_null_snapshot branch from dc69de1 to da9a19a Compare May 14, 2023 08:05
@zhangbutao zhangbutao changed the title Spark: Throw a friendly exception if table is empty when creating branch Spark3.4: Throw a friendly exception if table is empty when creating branch May 14, 2023
Copy link
Contributor

@dramaticlly dramaticlly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what @amogh-jahagirdar suggested, looks like we want to let failure/validation happen in java code instead of scala, so suggested change to allow for snapshotId pass through instead of throwing exception.

Comment on lines +44 to +49
var snapshotId = branchOptions.snapshotId.getOrElse(-1L)
if (snapshotId == -1) {
val currentSnapshot = Option(iceberg.table().currentSnapshot()).getOrElse(throw new IllegalArgumentException(
s"Please specify an explicit snapshot as table: $iceberg" + " has no latest main snapshot"))
snapshotId = currentSnapshot.snapshotId()
}
Copy link
Contributor

@dramaticlly dramaticlly May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var snapshotId = branchOptions.snapshotId.getOrElse(-1L)
if (snapshotId == -1) {
val currentSnapshot = Option(iceberg.table().currentSnapshot()).getOrElse(throw new IllegalArgumentException(
s"Please specify an explicit snapshot as table: $iceberg" + " has no latest main snapshot"))
snapshotId = currentSnapshot.snapshotId()
}
val snapshotId: java.lang.Long = branchOptions.snapshotId
.orElse(Option(iceberg.table.currentSnapshot()).map(_.snapshotId()))
.map(java.lang.Long.valueOf)
.orNull

}

private Table createEmptyTable() {
return validationCatalog.loadTable(tableIdent);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not create table but rather load a table from catalog.

I think empty table creation is already handled in before method at line 43 so this method is not needed at all

@dramaticlly
Copy link
Contributor

Actually I just realized that we also need to fix for its counterpart in CreateTag, so I ended up just doing it in #7652

@zhangbutao
Copy link
Contributor Author

Actually I just realized that we also need to fix for its counterpart in CreateTag, so I ended up just doing it in #7652

@dramaticlly Thanks for the fix. I am ok to close this PR if your change is merged. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants