-
Notifications
You must be signed in to change notification settings - Fork 3k
Expire snapshots action without cache #1344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expire snapshots action without cache #1344
Conversation
Instead of using a cache to preserve the state from before the expireSnapshots command, we preserve the table metadata via a StaticTable reference. This reference doesn't change when the Snapshosts are expired and allows us to look up all the files referenced by the prior version of the table without holding everything in memory.
c7a583d to
0596f8a
Compare
|
|
||
| protected Dataset<Row> buildManifestFileDF(SparkSession spark) { | ||
| String allManifestsMetadataTable = metadataTableName(MetadataTableType.ALL_MANIFESTS); | ||
| protected Dataset<Row> buildManifestFileDF(SparkSession spark, String tableName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We add these 2 arg versions so that we can specify metadata Json files directly, the single arg versions just use the current table state as before.
|
@aokolnychyi + @rdblue this is the followup to the StaticTableOperation ticket #1342 using the improvement inside ExpireSnapshotActions, now with no caching! |
| protected Dataset<Row> buildManifestListDF(SparkSession spark, Table table) { | ||
| List<String> manifestLists = getManifestListPaths(table); | ||
| protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) { | ||
| Table snapshot = new BaseTable(ops, tableName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is misleading to use snapshot here, since that term usually refers to a version of a table, not a table itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I'll just switch it to table
Table table = Table
| .union(appendTypeString(buildManifestFileDF(spark), MANIFEST)) | ||
| .union(appendTypeString(buildManifestListDF(spark, table), MANIFEST_LIST)); | ||
| private Dataset<Row> buildValidFileDF(TableMetadata metadata) { | ||
| StaticTableOperations staticOps = new StaticTableOperations(metadata.metadataFileLocation(), table.io()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: the metadata file location is passed to buildManifestFileDF and buildValidDataFileDF, but StaticTableOperations is passed into buildManifestListDF. I think it would make a more consistent API if the location were also passed to buildManifestListDF.
I know that the difference is that the method accepts a Table and doesn't use a metadata table. But it would be a bit cleaner to pass the base Table and metadata location, then create the StaticTableOperations in that method rather than here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what you are asking for here, but I'm not sure I like how it looks since I end up with two methods, one of which takes metadataFileLocation and one which takes "Table"
The metadataFileLocation version makes the StaticOps and BaseTable from them and passes to the table method,
while the "table" method is used for the version used by Orphan files.
Take a look at the new version and see if we are on the same page
| List<String> manifestLists = getManifestListPaths(table); | ||
| protected Dataset<Row> buildManifestListDF(SparkSession spark, String tableName, TableOperations ops) { | ||
| Table snapshot = new BaseTable(ops, tableName); | ||
| List<String> manifestLists = getManifestListPaths(snapshot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing getManifestListPaths to accept Iterable<Snapshot>? Then you wouldn't need to create a BaseTable out of a StaticTableOperations. Instead you could just pass staticOps.current().snapshots() here and table.snapshots() elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
|
This looks great, @RussellSpitzer! I had a few minor comments about structure, but overall it has no major issues. |
Refactor parameters of getManifestLists, no longer requires Ops
Refactor of buildManifestListsDF so that it matches the signatures of the other ExpireSnapshot methods.
| return spark.createDataset(manifestLists, Encoders.STRING()).toDF("file_path"); | ||
| } | ||
|
|
||
| protected Dataset<Row> buildManifestListDF(SparkSession spark, String metadataFileLocation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot pass a pure table name here since we aren't looking up the table using Spark, this path is for metadataFileLocation based tables only.
|
Merging. Looks good. |
|
Thanks! |
Instead of using a cache to preserve the state from before the expireSnapshots command, we preserve the table metadata via a StaticTable reference. This reference doesn't change when the Snapshosts are expired and allows us to look up all the files referenced by the prior version of the table without holding everything in memory.
* Implement SupportsComet in SparkScan * fix test failure --------- Co-authored-by: huaxingao <[email protected]>
) This reverts commit 895e2ca. Co-authored-by: huaxingao <[email protected]>
Requires #1342
Removes the necessity of a Cache from ExpireSnapshotsAction. Instead uses a Static view of the TableMetadata enabled by #1342 to preserve a listing of all metadata/data files from a previous point in time.