Conversation

@RussellSpitzer
Member

Requires #1342

Removes the need for a cache in ExpireSnapshotsAction. Instead, it uses a static view of the TableMetadata, enabled by #1342, to preserve a listing of all metadata/data files from a previous point in time.

Instead of using a cache to preserve the state from before the
expireSnapshots command, we preserve the table metadata via a
StaticTable reference. This reference doesn't change when the
Snapshots are expired and allows us to look up all the files
referenced by the prior version of the table without holding
everything in memory.
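The commit message above describes swapping an in-memory cache for an immutable metadata reference taken before expiration. A rough self-contained sketch of the bookkeeping this enables (stand-in types, not Iceberg's actual classes; a frozen view is modeled here as a map of snapshot id to referenced files):

```java
import java.util.*;

// Minimal sketch of the idea in this PR: instead of caching the file listing,
// hold an immutable "static" view of the table metadata taken before
// expiration, and diff file sets against the post-expiration view.
final class StaticViewSketch {

    // All files reachable from a frozen metadata view
    // (snapshot id -> files referenced by that snapshot).
    static Set<String> allFiles(Map<Long, List<String>> metadataView) {
        Set<String> files = new HashSet<>();
        for (List<String> snapshotFiles : metadataView.values()) {
            files.addAll(snapshotFiles);
        }
        return files;
    }

    // Files referenced by the pre-expiration view but not the current one:
    // exactly the set an expire-snapshots action may safely delete.
    static Set<String> expiredFiles(Map<Long, List<String>> beforeView,
                                    Map<Long, List<String>> afterView) {
        Set<String> expired = new HashSet<>(allFiles(beforeView));
        expired.removeAll(allFiles(afterView));
        return expired;
    }
}
```

Because the "before" view is an immutable reference rather than a materialized cache, nothing beyond the metadata pointer needs to be held in memory.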
@RussellSpitzer force-pushed the ExpireSnapshotsActionWithoutCache branch from c7a583d to 0596f8a on August 19, 2020 at 19:23

- protected Dataset<Row> buildManifestFileDF(SparkSession spark) {
+ protected Dataset<Row> buildManifestFileDF(SparkSession spark, String tableName) {
    String allManifestsMetadataTable = metadataTableName(MetadataTableType.ALL_MANIFESTS);
Member Author

We add these two-arg versions so that we can specify metadata JSON files directly; the single-arg versions just use the current table state as before.
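The one-/two-arg pairing described here can be sketched with stand-in types (hypothetical names, not the actual action code): the single-arg variant resolves the current table state and delegates to the two-arg variant, which can also be pointed at an arbitrary metadata JSON file.

```java
// Hypothetical sketch of the overload delegation pattern: one variant uses
// the current state, the other accepts an explicit metadata JSON location
// (e.g. the pre-expiration version of the table).
final class OverloadSketch {

    // stand-in for "the table's current metadata pointer"
    static final String CURRENT_METADATA = "metadata/v3.metadata.json";

    // single-arg: use the current table state, as before
    static String manifestSource(String tableName) {
        return manifestSource(tableName, CURRENT_METADATA);
    }

    // two-arg: read from a specific metadata JSON file
    static String manifestSource(String tableName, String metadataJsonLocation) {
        return tableName + "@" + metadataJsonLocation;
    }
}
```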

@RussellSpitzer
Member Author

@aokolnychyi + @rdblue this is the follow-up to the StaticTableOperations ticket #1342, using the improvement inside ExpireSnapshotsAction, now with no caching!

- protected Dataset<Row> buildManifestListDF(SparkSession spark, Table table) {
-   List<String> manifestLists = getManifestListPaths(table);
+ protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) {
+   Table snapshot = new BaseTable(ops, tableName);
Contributor


I think it is misleading to use snapshot here, since that term usually refers to a version of a table, not a table itself.

Member Author


True, I'll just switch it to table
Table table = new BaseTable(ops, tableName);

    .union(appendTypeString(buildManifestFileDF(spark), MANIFEST))
    .union(appendTypeString(buildManifestListDF(spark, table), MANIFEST_LIST));

private Dataset<Row> buildValidFileDF(TableMetadata metadata) {
    StaticTableOperations staticOps = new StaticTableOperations(metadata.metadataFileLocation(), table.io());
Contributor


Minor: the metadata file location is passed to buildManifestFileDF and buildValidDataFileDF, but StaticTableOperations is passed into buildManifestListDF. I think it would make a more consistent API if the location were also passed to buildManifestListDF.

I know that the difference is that the method accepts a Table and doesn't use a metadata table. But it would be a bit cleaner to pass the base Table and metadata location, then create the StaticTableOperations in that method rather than here.

Member Author


I think I understand what you are asking for here, but I'm not sure I like how it looks, since I end up with two methods: one takes a metadataFileLocation and one takes a Table.

The metadataFileLocation version builds the StaticTableOperations and BaseTable from the location and passes them to the Table method, while the Table method is the version used by the orphan-files action.

Take a look at the new version and see if we are on the same page.

- List<String> manifestLists = getManifestListPaths(table);
+ protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) {
+   Table snapshot = new BaseTable(ops, tableName);
+   List<String> manifestLists = getManifestListPaths(snapshot);
Contributor


What about changing getManifestListPaths to accept Iterable<Snapshot>? Then you wouldn't need to create a BaseTable out of a StaticTableOperations. Instead you could just pass staticOps.current().snapshots() here and table.snapshots() elsewhere.
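The suggested refactor can be sketched in a self-contained form (Snapshot here is a minimal stand-in for Iceberg's interface of the same name): taking Iterable<Snapshot> lets callers pass either a live table's snapshots or a frozen view's snapshots, with no BaseTable needed.

```java
import java.util.*;

// Sketch of getManifestListPaths over Iterable<Snapshot>: callers supply
// staticOps.current().snapshots() or table.snapshots() interchangeably.
final class ManifestListSketch {

    // Minimal stand-in for the Snapshot interface.
    interface Snapshot {
        String manifestListLocation(); // may be null for snapshots without a manifest list
    }

    static List<String> getManifestListPaths(Iterable<Snapshot> snapshots) {
        List<String> paths = new ArrayList<>();
        for (Snapshot snapshot : snapshots) {
            if (snapshot.manifestListLocation() != null) {
                paths.add(snapshot.manifestListLocation());
            }
        }
        return paths;
    }
}
```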

Member Author


Sounds good to me

@rdblue
Contributor

rdblue commented Aug 19, 2020

This looks great, @RussellSpitzer! I had a few minor comments about structure, but overall it has no major issues.

Refactor parameters of getManifestListPaths so it no longer requires Ops.
Refactor buildManifestListDF so that it matches the signatures of the other ExpireSnapshots methods.
return spark.createDataset(manifestLists, Encoders.STRING()).toDF("file_path");
}

protected Dataset<Row> buildManifestListDF(SparkSession spark, String metadataFileLocation) {
Member Author


You cannot pass a pure table name here since we aren't looking up the table using Spark; this path is for metadataFileLocation-based tables only.

@rdblue rdblue merged commit d46c7d6 into apache:master Aug 20, 2020
@rdblue
Contributor

rdblue commented Aug 20, 2020

Merging. Looks good.

@RussellSpitzer
Member Author

Thanks!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 20, 2020