Conversation

@RussellSpitzer
Member

Requires #1342

Removes the need for a cache in ExpireSnapshotsAction. Instead, it uses a static view of the TableMetadata, enabled by #1342, to preserve a listing of all metadata/data files from a previous point in time.

Instead of using a cache to preserve the state from before the
expireSnapshots command, we preserve the table metadata via a
StaticTable reference. This reference doesn't change when the
Snapshots are expired and allows us to look up all the files
referenced by the prior version of the table without holding
everything in memory.
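The commit message above describes swapping an in-memory cache for an immutable metadata reference taken before expiration. A rough self-contained sketch of the bookkeeping this enables (stand-in types, not Iceberg's actual classes; a frozen view is modeled here as a map of snapshot id to referenced files):

```java
import java.util.*;

// Minimal sketch of the idea in this PR: instead of caching the file listing,
// hold an immutable "static" view of the table metadata taken before
// expiration, and diff file sets against the post-expiration view.
final class StaticViewSketch {

    // All files reachable from a frozen metadata view
    // (snapshot id -> files referenced by that snapshot).
    static Set<String> allFiles(Map<Long, List<String>> metadataView) {
        Set<String> files = new HashSet<>();
        for (List<String> snapshotFiles : metadataView.values()) {
            files.addAll(snapshotFiles);
        }
        return files;
    }

    // Files referenced by the pre-expiration view but not the current one:
    // exactly the set an expire-snapshots action may safely delete.
    static Set<String> expiredFiles(Map<Long, List<String>> beforeView,
                                    Map<Long, List<String>> afterView) {
        Set<String> expired = new HashSet<>(allFiles(beforeView));
        expired.removeAll(allFiles(afterView));
        return expired;
    }
}
```

Because the "before" view is an immutable reference rather than a materialized cache, nothing beyond the metadata pointer needs to be held in memory.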
@RussellSpitzer force-pushed the ExpireSnapshotsActionWithoutCache branch from c7a583d to 0596f8a on August 19, 2020 at 19:23

- protected Dataset<Row> buildManifestFileDF(SparkSession spark) {
+ protected Dataset<Row> buildManifestFileDF(SparkSession spark, String tableName) {
    String allManifestsMetadataTable = metadataTableName(MetadataTableType.ALL_MANIFESTS);
Member Author

We add these two-arg versions so that we can specify metadata JSON files directly; the single-arg versions just use the current table state as before.
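The one-/two-arg pairing described here can be sketched with stand-in types (hypothetical names, not the actual action code): the single-arg variant resolves the current table state and delegates to the two-arg variant, which can also be pointed at an arbitrary metadata JSON file.

```java
// Hypothetical sketch of the overload delegation pattern: one variant uses
// the current state, the other accepts an explicit metadata JSON location
// (e.g. the pre-expiration version of the table).
final class OverloadSketch {

    // stand-in for "the table's current metadata pointer"
    static final String CURRENT_METADATA = "metadata/v3.metadata.json";

    // single-arg: use the current table state, as before
    static String manifestSource(String tableName) {
        return manifestSource(tableName, CURRENT_METADATA);
    }

    // two-arg: read from a specific metadata JSON file
    static String manifestSource(String tableName, String metadataJsonLocation) {
        return tableName + "@" + metadataJsonLocation;
    }
}
```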

@RussellSpitzer
Member Author

@aokolnychyi + @rdblue this is the follow-up to the StaticTableOperations ticket #1342, using the improvement inside ExpireSnapshotsAction, now with no caching!

- protected Dataset<Row> buildManifestListDF(SparkSession spark, Table table) {
-   List<String> manifestLists = getManifestListPaths(table);
+ protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) {
+   Table snapshot = new BaseTable(ops, tableName);
Contributor


I think it is misleading to use snapshot here, since that term usually refers to a version of a table, not a table itself.

Member Author


True, I'll just switch it to table
Table table = new BaseTable(ops, tableName);

    .union(appendTypeString(buildManifestFileDF(spark), MANIFEST))
    .union(appendTypeString(buildManifestListDF(spark, table), MANIFEST_LIST));

private Dataset<Row> buildValidFileDF(TableMetadata metadata) {
    StaticTableOperations staticOps = new StaticTableOperations(metadata.metadataFileLocation(), table.io());
Contributor


Minor: the metadata file location is passed to buildManifestFileDF and buildValidDataFileDF, but StaticTableOperations is passed into buildManifestListDF. I think it would make a more consistent API if the location were also passed to buildManifestListDF.

I know that the difference is that the method accepts a Table and doesn't use a metadata table. But it would be a bit cleaner to pass the base Table and metadata location, then create the StaticTableOperations in that method rather than here.

Member Author


I think I understand what you are asking for here, but I'm not sure I like how it looks, since I end up with two methods: one takes a metadataFileLocation and one takes a Table.

The metadataFileLocation version builds the StaticTableOperations and BaseTable from the location and passes them to the Table method, while the Table method is the version used by the orphan-files action.

Take a look at the new version and see if we are on the same page.

- List<String> manifestLists = getManifestListPaths(table);
+ protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) {
+   Table snapshot = new BaseTable(ops, tableName);
+   List<String> manifestLists = getManifestListPaths(snapshot);
Contributor


What about changing getManifestListPaths to accept Iterable<Snapshot>? Then you wouldn't need to create a BaseTable out of a StaticTableOperations. Instead you could just pass staticOps.current().snapshots() here and table.snapshots() elsewhere.
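The suggested refactor can be sketched in a self-contained form (Snapshot here is a minimal stand-in for Iceberg's interface of the same name): taking Iterable<Snapshot> lets callers pass either a live table's snapshots or a frozen view's snapshots, with no BaseTable needed.

```java
import java.util.*;

// Sketch of getManifestListPaths over Iterable<Snapshot>: callers supply
// staticOps.current().snapshots() or table.snapshots() interchangeably.
final class ManifestListSketch {

    // Minimal stand-in for the Snapshot interface.
    interface Snapshot {
        String manifestListLocation(); // may be null for snapshots without a manifest list
    }

    static List<String> getManifestListPaths(Iterable<Snapshot> snapshots) {
        List<String> paths = new ArrayList<>();
        for (Snapshot snapshot : snapshots) {
            if (snapshot.manifestListLocation() != null) {
                paths.add(snapshot.manifestListLocation());
            }
        }
        return paths;
    }
}
```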

Member Author


Sounds good to me

@rdblue
Contributor

rdblue commented Aug 19, 2020

This looks great, @RussellSpitzer! I had a few minor comments about structure, but overall it has no major issues.

Refactor parameters of getManifestListPaths so it no longer requires Ops.
Refactor buildManifestListDF so that it matches the signatures of the other ExpireSnapshots methods.
return spark.createDataset(manifestLists, Encoders.STRING()).toDF("file_path");
}

protected Dataset<Row> buildManifestListDF(SparkSession spark, String metadataFileLocation) {
Member Author


You cannot pass a pure table name here since we aren't looking up the table using Spark; this path is for metadataFileLocation-based tables only.

@rdblue rdblue merged commit d46c7d6 into apache:master Aug 20, 2020
@rdblue
Contributor

rdblue commented Aug 20, 2020

Merging. Looks good.

@RussellSpitzer
Member Author

Thanks!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 20, 2020