Spark: Add an action to remove all referenced files #2415

Conversation
```java
  return new ExpireSnapshotsAction(delegate);
}

public BaseDropTableSparkAction dropTableAction() {
```
Let's not add it to the old Actions API which we are about to deprecate.
I think I know why this is added here. We don't have an implementation of ActionsProvider yet. I am about to add it so we will move this logic there. It is ok to keep it here for now but we will need to wait for a new entry point before merging this.
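For context, the provider-based entry point being referred to might look roughly like this (a sketch only; the real ActionsProvider is added in a follow-up PR and its final shape may differ):

```java
// Hypothetical provider-style entry point replacing the old Actions API.
public interface ActionsProvider {

  default ExpireSnapshots expireSnapshots(Table table) {
    throw new UnsupportedOperationException(
        this.getClass().getName() + " does not implement expireSnapshots");
  }
}
```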
We will have to decide what to do with the root table dir and other dirs where our data can be. We have an API for file deletion only(
spark/src/main/java/org/apache/iceberg/spark/actions/BaseRemoveFilesSparkAction.java
Outdated
Show resolved
Hide resolved
```java
}

private Dataset<Row> appendTypeString(Dataset<Row> ds, String type) {
  return ds.select(new Column("file_path"), functions.lit(type).as("file_type"));
```
Not important here, but let's raise an issue to change this sort of thing to constants. I think this would involve making some things public in the metadata table APIs, so it's probably out of scope here.
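Something along these lines, assuming the metadata-table column names stay stable (illustrative only):

```java
// Hypothetical shared constants for metadata table column names.
private static final String FILE_PATH = "file_path";
private static final String FILE_TYPE = "file_type";
```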
This might be able to live in BaseSparkAction, since it's used in BaseExpireSnapshotsSparkAction too.
The name is also a bit weird, since it not only adds the file type but also projects file_path.
We could call it projectFilePathWithType or something.
Also, I'd use functions.col() instead of new Column for consistency.
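A minimal sketch of the suggested refactor (the name projectFilePathWithType is the reviewer's suggestion, not the final API):

```java
// Projects file_path and tags each row with a file type.
private Dataset<Row> projectFilePathWithType(Dataset<Row> ds, String type) {
  return ds.select(functions.col("file_path"), functions.lit(type).as("file_type"));
}
```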
RussellSpitzer left a comment:
Just a few minor comments. I think the Spark UI job name needs a change, and we should probably only do the shuffle when required rather than preemptively in the base class.
```java
checkDropTableResults(3L, 4L, 3L, results);

Assert.assertTrue(
    String.format("Expected total jobs to be equal to total number of shuffle partitions", SHUFFLE_PARTITIONS, totalJobsRun),
```
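Note that, as quoted, the format string contains no placeholders, so SHUFFLE_PARTITIONS and totalJobsRun are never interpolated. A corrected form (illustrative, including an assumed assertion condition since the diff is truncated) would be:

```java
Assert.assertTrue(
    String.format("Expected total jobs (%d) to equal total shuffle partitions (%d)",
        totalJobsRun, SHUFFLE_PARTITIONS),
    totalJobsRun == SHUFFLE_PARTITIONS);
```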
@aokolnychyi @RussellSpitzer Since we dedup the values as part of the dataset, the number of elements in the iterator will be equal to the number of shuffle partitions.
```java
/**
 * Passes an alternative delete implementation that will be used for manifests and data files.
 * <p>
```
nit: I don't think we need the <p> at the end of the doc.
```java
 * Passes an alternative delete implementation that will be used for manifests and data files.
 * <p>
 *
 * @param deleteFunc a function that will be called to delete manifests and data files
```
nit: could you add more description of what the string passed to the consumer means for the function? It might not be clear to readers.
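One possible wording, assuming the consumer receives the file's location (a sketch, not the merged doc):

```java
/**
 * Passes an alternative delete implementation that will be used for manifests and data files.
 *
 * @param deleteFunc a function called to delete manifests and data files; the string
 *                   passed to the consumer is the fully qualified location of the file
 *                   that should be deleted
 */
```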
```java
long deletedManifestListsCount();

/**
 * Returns the number of files deleted other than data, manifest and manifest list.
```
How are we treating delete files? Why do we not have another long deletedDeleteFilesCount()?
I know we don't have utils like buildValidDeleteFileDF yet, but I think it would be good to start adding those, and this action seems like a good one to add them to; maybe collaborate a bit on the timing with #2518.
I think the doc here should be a bit more descriptive. If I understand correctly, this count will represent things like the version hint and metadata JSON files, but not delete files. The removed delete files count should be a separate top-level method.
Let's refine the doc to reflect the concern @jackye1995 mentioned.
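One way the refined doc could read (the method name and wording are illustrative, not the merged text):

```java
/**
 * Returns the number of deleted metadata files other than data files, delete files,
 * manifests, and manifest lists (for example, version files and metadata JSON files).
 */
long deletedOtherFilesCount();
```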
```java
 * Returns the number of deleted data files.
 */
long deletedDataFilesCount();
```
ContentFiles, since I guess positional/equality delete files will also be deleted?
I'd expose the positional/equality delete counts in a separate method.
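A minimal sketch of the separate counters being proposed (method names are illustrative, not the final API):

```java
/** Returns the number of deleted equality delete files. */
long deletedEqualityDeleteFilesCount();

/** Returns the number of deleted position delete files. */
long deletedPositionDeleteFilesCount();
```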
```java
}

private Result doExecute() {
  boolean streamResults = PropertyUtil.propertyAsBoolean(options(), STREAM_RESULTS, false);
```
Sorry, I'm not super familiar with the other actions' code base; how would this be set? It looks like it's used in ExpireSnapshotsAction, but seemingly just to pass a parameter to BaseExpireSnapshotsSparkAction; it's not something the user can control, and here we don't really use that base class?
The user can pass this config as part of the options:

```java
Actions.forTable(table).removeFilesAction().option("stream-results", "true").execute();
```
```java
}
long minTimeStamp = Long.MAX_VALUE;
String minMetadataLocation = null;
TableMetadata metadata = TableMetadataParser.read(io, metadataFileLocation);
```
Since this seems to be running recursively, is it possible that a previous metadata file's previousFiles entries have already been cleaned up and no longer exist, so that reading a non-existent file here will throw an exception?
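One possible guard against that, assuming InputFile.exists() is available on the FileIO in use (a sketch, not the PR's actual code):

```java
// Skip metadata files that an earlier (possibly failed) cleanup already removed.
InputFile metadataFile = io.newInputFile(metadataFileLocation);
if (metadataFile.exists()) {
  TableMetadata metadata = TableMetadataParser.read(io, metadataFileLocation);
  // ... continue walking metadata.previousFiles() as before
}
```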
```diff
 public Table createTable(TableIdentifier ident, Schema schema, PartitionSpec spec) {
   TestIcebergSourceHiveTables.currentIdentifier = ident;
-  return TestIcebergSourceHiveTables.catalog.createTable(ident, schema, spec);
+  Table table = TestIcebergSourceHiveTables.catalog.createTable(ident, schema, spec);
```
do we need this change?
```java
private Consumer<String> deleteFunc = defaultDelete;
private ExecutorService deleteExecutorService = DEFAULT_DELETE_EXECUTOR_SERVICE;

public BaseRemoveFilesSparkAction(SparkSession spark, Table table) {
```
The main purpose of this action is to remove files once a table is dropped. How do we see that happening? Will we load the table before the drop, then drop the table, then call this action to clean up by passing the table object we loaded before the drop? That way we get the correct FileIO?
cc @RussellSpitzer @rdblue too
We could accept a metadata file location as an alternative:

```java
actions.removeFiles(jsonFileLocation)
    .fileIO(io)
    .execute();
```
I think the second option sends a message that the table has been dropped and we just point to a version file. In addition, this will enable removing files not only during DROP but also later. Users won't need to construct StaticTableOperations at all.
I like passing the JSON file location, but we would probably want to make it clear that it deletes the whole tree referenced by the metadata file. I don't think that's clear from just the code sample above.
@rdblue, any ideas on naming or other ways to convey that?
removeFileTree(jsonFileLocation) or something?
Hi, I was wondering too: if we call this to drop the files without the table itself being dropped, doesn't it leave behind a broken table in the catalog that can no longer be loaded?
Best would be if we could drop the table itself in the same call, but that's probably not possible. Maybe, as another option, this could still be an action on Table that commits back a single empty metadata file, like create-table? That keeps atomicity, and we could also expose this action via a Spark SQL procedure to run before dropping a big table, without worrying about a broken table if the user fails to actually drop it.
The metadata JSON option could be a good indication that the table needs to be dropped first, but there's no opportunity to sanity-check that (since it breaks the table otherwise)? It's also a bit harder to eventually expose this action to Spark or Hive users via procedures, given that the metadata location is harder to get there.
That said, I think this is a nice feature (catalog.dropTable with purge=true can often time out), and dropping a table is rare enough that I might be over-complicating it; I was just considering that Iceberg actions are generally atomic and do not have the potential to break the table.
I consider this action something we would mostly use from Iceberg code. For example, we currently drop the table and then call CatalogUtil.dropTableData, which tries to clean up as much as possible but may also fail at some point. I see this as a replacement for that logic that should scale much better, since we will use metadata tables.
I hope users will not really interact with this action and will just use the regular DROP with or without purge. The only case where someone would need to invoke this manually is if something went wrong during DROP. In that case, it seems reasonable to ask the user to check whether the table was dropped and find a recent metadata file to use for cleaning up the file tree. By calling this RemoveReachableFiles and accepting a metadata location instead of a table, we send a message that it is something invoked after DROP.
Oh I see, yeah, if we can use this code as part of drop, that would be great.
I did hit issues where the drop fails and leaves behind files while the table is already gone. In that case, deleting the metadata JSON last makes good sense for recovery; thanks for the context. I suppose we can add this to the action's documentation if it's not there already.
Force-pushed from 8206e30 to aa2ea82.
aokolnychyi left a comment:
This looks good to me except the public method name. I think it should match the action name.
```java
/**
 * Instantiates an action to remove all the files referenced by given metadata location.
 */
default RemoveReachableFiles removeFiles(String metadataLocation) {
```
I think the name of the method should match the name of the action: removeReachableFiles.
```java
 * Instantiates an action to remove all the files referenced by given metadata location.
 */
default RemoveReachableFiles removeFiles(String metadataLocation) {
  throw new UnsupportedOperationException(this.getClass().getName() + " does not implement removeFiles");
```
Comment should refer to removeReachableFiles as well.
```java
package org.apache.iceberg.actions;

public class TestExpireSnapshotsAction24 extends TestExpireSnapshotsAction {

```
not needed?
```java
 */
default RemoveReachableFiles removeReachableFiles(String metadataLocation) {
  throw new UnsupportedOperationException(this.getClass().getName() + " does not implement " +
      RemoveReachableFiles.class.toString());
```
nit: the others use the method name in the API, not the class name of the API.
+1
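For reference, the suggested form would look like this, matching the other defaults in the interface:

```java
default RemoveReachableFiles removeReachableFiles(String metadataLocation) {
  throw new UnsupportedOperationException(
      this.getClass().getName() + " does not implement removeReachableFiles");
}
```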
```java
/**
 * Passes an alternative executor service that will be used for files removal.
 * <p>
 * If this method is not called, files will still be deleted in the current thread.
```
nit: will still be deleted -> will be deleted
```java
@Override
public Result execute() {
  Preconditions.checkArgument(io != null, "File IO cannot be null");
```
Probably fine here, but we could put the precondition in the io(FileIO fileIO) method for slightly earlier erroring.
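That alternative might look like this (the class name is illustrative, and the return type is assumed to follow the action's fluent style):

```java
// Hypothetical setter that fails fast on a null FileIO.
public BaseRemoveReachableFilesSparkAction io(FileIO fileIO) {
  Preconditions.checkArgument(fileIO != null, "File IO cannot be null");
  this.io = fileIO;
  return this;
}
```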
RussellSpitzer left a comment:
Tiny nits, looks good to me.
```java
private Consumer<String> removeFunc = defaultDelete;
private ExecutorService removeExecutorService = DEFAULT_DELETE_EXECUTOR_SERVICE;
private FileIO io = new HadoopFileIO();
```
I think we cannot use this constructor for HadoopFileIO. It will result in an NPE, since the no-arg constructor is only meant for dynamic instantiation. It should use the underlying Hadoop conf from the session:

```java
new HadoopFileIO(spark().sessionState().newHadoopConf())
```
Can we add a test that will use the default IO? I think it would currently fail.
```java
 */
default RemoveReachableFiles removeReachableFiles(String metadataLocation) {
  throw new UnsupportedOperationException(this.getClass().getName() + " does not implement" +
      " removeReachableFiles");
```
Can this be on one line?
Thanks a lot, @karuppayya! This is going to be a valuable contribution. Looking forward to PRs that will use this logic next. Thanks everyone for reviewing!
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR address?
Merges upstream/master to bring in recent bug fixes and optimizations.

## What does this MR change?
Key PRs of interest:
> Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: writing an empty dataset raised an error; simply skip it instead, apache#2960
> Flink: add uidPrefix to operators in the Flink UI to make it easier to track multiple Iceberg sink jobs, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to gradually support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip-delete-snapshots configuration property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415

## How was this MR tested?
UT
No description provided.