
Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles#4307

Closed
ulmako wants to merge 15 commits into apache:master from kasasi:master

Conversation


@ulmako ulmako commented Mar 10, 2022

Closes #4249


@rdblue rdblue left a comment

Thanks for submitting this, @ulmako! It looks really close to being ready, but I left a few comments to address.

Also, do you think that this should automatically detect whether hidden paths should be ignored? We should be able to do that by checking for partition field names starting with . or _ in the table's partition specs. If there is a partition field like _name then we know that we should not ignore hidden paths.
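A minimal sketch of the detection described above, assuming the partition field names are available as plain strings. The class and method names here are invented for illustration, not actual Iceberg API:

```java
import java.util.List;

public class HiddenPartitionDetector {

  // Returns true if any partition field name would itself be treated as
  // "hidden" by a HiddenPathFilter-style check (leading '.' or '_').
  // In that case the action should NOT ignore hidden paths, because real
  // partition directories like "_name=x" would otherwise be skipped.
  static boolean hasHiddenPartitionField(List<String> partitionFieldNames) {
    for (String name : partitionFieldNames) {
      if (name.startsWith(".") || name.startsWith("_")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasHiddenPartitionField(List.of("_name", "ts_day")));   // true
    System.out.println(hasHiddenPartitionField(List.of("category", "ts_day"))); // false
  }
}
```

For a table with multiple partition specs, the real check would need to scan the field names of every spec, not just the current one.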


ulmako commented Mar 14, 2022

Hey @rdblue,

thanks for all your feedback.
The formatting changes slipped in before I set up the code-style scheme, and I did not catch them. I will of course revert them.

Your suggestion to use the options in BaseSparkAction is definitely the better choice.

Regarding the automatic detection based on partition field names, I have one question: would you still ignore all other hidden paths that are not related to the partitions, or disable the HiddenPathFilter completely? For illustration:
Given the partition name _part and the following paths:

-- /data/
      | --  _part=AA
      | --  _part=BB
      | --  _part=CC
      | --  _some-folder

Would you include only the _part=* paths and still ignore _some-folder, or include all paths?
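The first of the two behaviors in the question above (keep _part=* but still ignore _some-folder) can be sketched as a partition-aware hidden-path check. The names below are invented for illustration and are not the actual Iceberg implementation:

```java
import java.util.Set;

public class PartitionAwareHiddenCheck {

  // hiddenPartitionPrefixes would be derived from the table's partition
  // field names, e.g. field "_part" becomes prefix "_part=".
  static boolean accept(String dirName, Set<String> hiddenPartitionPrefixes) {
    boolean hidden = dirName.startsWith("_") || dirName.startsWith(".");
    if (!hidden) {
      return true; // non-hidden paths are always listed
    }
    // Keep a hidden directory only when it belongs to a known partition field.
    for (String prefix : hiddenPartitionPrefixes) {
      if (dirName.startsWith(prefix)) {
        return true;
      }
    }
    return false; // unrelated hidden paths stay filtered, e.g. "_some-folder"
  }

  public static void main(String[] args) {
    Set<String> prefixes = Set.of("_part=");
    System.out.println(accept("_part=AA", prefixes));     // true
    System.out.println(accept("_some-folder", prefixes)); // false
  }
}
```

The second behavior (disable the filter completely) would simply return true unconditionally, which is what a plain boolean option on the action amounts to.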

Also, where should I document the new option?


@kbendick kbendick left a comment


Hi @ulmako! I left some review comments, but I forgot to click submit 🙈 So sorry about that!

I largely agree with @rdblue that this is likely not a good candidate for a table property.

I also looked into the source code for the Hadoop file system list-path filtering, and it looks like all of the filtering is done on the client side: as far as I can tell, the listing is still queried from HDFS and then filtered within the program.

We should likely just make this an option on the action or not, but I thought I'd leave that finding as a comment in case it makes a difference to others.
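The client-side point above can be sketched like this: a Hadoop PathFilter has no server-side analogue, so the listing comes back in full and is filtered in the client JVM. The sketch mimics that pattern with java.nio so it runs without Hadoop on the classpath; ClientSideHiddenFilter, listVisible, and demo are made-up names, not Iceberg or Hadoop APIs:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClientSideHiddenFilter {

  // Lists child names of dir, dropping "hidden" entries (leading '.' or '_')
  // only after the full listing has come back, mirroring how a Hadoop
  // PathFilter is applied in the client rather than on the namenode.
  static List<String> listVisible(Path dir) {
    List<String> visible = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path child : stream) {
        String name = child.getFileName().toString();
        if (!name.startsWith(".") && !name.startsWith("_")) {
          visible.add(name);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Collections.sort(visible); // deterministic order for the demo below
    return visible;
  }

  // Builds a small throwaway layout like the /data/ examples in this thread.
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("data");
      Files.createDirectory(dir.resolve("part=AA"));
      Files.createDirectory(dir.resolve("_temporary"));
      return listVisible(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [part=AA]
  }
}
```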

Thanks again for your work on this and sorry for my slow response! Please feel free to reach out on Slack (or here) if you need anything 😄


kbendick commented Mar 14, 2022

Regarding the automatic detection based on partition field names, I have one question: would you still ignore all other hidden paths that are not related to the partitions, or disable the HiddenPathFilter completely? For illustration: Given the partition name _part and the following paths:

-- /data/
      | --  _part=AA
      | --  _part=BB
      | --  _part=CC
      | --  _some-folder

Would you include only the _part=* paths and still ignore _some-folder, or include all paths?

I originally had the same comment as Ryan, about checking for partitions specifically as that was the source of the original problem.

However, thinking on this question, I would think we would still want to allow removing directories/paths that start with a leading _ if the option were enabled, even if they aren't part of partitions. In particular, some file output committers use a leading _ for in-progress output before moving the data over via a rename operation, which might be something that needs to be cleaned up using RemoveOrphanFiles if the job fails at the wrong time.

Please correct me if I'm mistaken about the file output committers, as I personally tend to use S3 (and then either S3FileIO or S3A) and the magic output committer when using S3A.

But if we're detecting partition paths automatically and not filtering on those, then possibly we'd want to keep this second scenario as an additional option.

I have also not historically used HDFS much in production scenarios, so possibly I'm mistaken about the benefits / utility of having hidden paths or how they'd end up in the directory structure of the table (outside of the previously two mentioned situations). Please feel free to let me know if I'm mistaken. 🙂

Also, where should I document the new option?

For documenting the new option, let's worry about that once we get it confirmed into the right place. But definitely let's be sure to document it.

I would start with adding it as a JavaDoc comment, assuming that there is a JavaDoc comment for the class already that documents the options for the action (I believe there is). Once we're closer to the final solution, we can also find a place in the docs themselves to put this option if necessary.

@github-actions github-actions bot added the API label Mar 14, 2022

@kbendick kbendick left a comment


Left some more feedback. It seems like it's getting pretty close!


@rdblue rdblue left a comment


Thanks, @ulmako! This looks close to being ready to go to me. I think we just want to add = to the set of partition names and add a test for it. There's more detail in my comment.

Makes sure hidden paths that are at the beginning of a partition name (i.e. "_hidden" is a prefix of "_hiddenPartition") are not accidentally accepted.
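A hedged sketch of why appending "=" matters for the prefix match (invented names, not the actual code): bare field names conflate a field with any longer name that shares its prefix, while adding the "=" that Hive-style layouts place between field and value makes the match unambiguous:

```java
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionPrefixMatch {

  // Accepts dirName only if it starts with "<fieldName>=" for some known
  // partition field; matching on the bare field name would be too loose.
  static boolean matchesPartition(String dirName, Set<String> fieldNames) {
    Set<String> prefixes =
        fieldNames.stream().map(f -> f + "=").collect(Collectors.toSet());
    return prefixes.stream().anyMatch(dirName::startsWith);
  }

  public static void main(String[] args) {
    Set<String> fields = Set.of("_hidden");
    // A real partition directory for field "_hidden" is accepted.
    System.out.println(matchesPartition("_hidden=AA", fields));          // true
    // "_hiddenPartition=AA" merely shares the "_hidden" prefix; a naive
    // startsWith("_hidden") check would wrongly accept it too.
    System.out.println(matchesPartition("_hiddenPartition=AA", fields)); // false
  }
}
```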

ulmako commented Apr 1, 2022

Thanks @rdblue, @kbendick for your feedback. I really appreciate it. There's a lot to learn from you guys.
I added your suggestions to the code. Sorry it took a few days, but I was pretty occupied with other tasks. Employers tend to get mad when you ignore the tasks they give you ;)


@rdblue rdblue left a comment


This looks good to me other than the use of Thread.sleep in tests. Could you fix that and remove the Draft marker, @ulmako? Thank you!

@ulmako ulmako changed the title WIP: Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles Apr 1, 2022
@ulmako ulmako marked this pull request as ready for review April 1, 2022 19:53

ulmako commented Apr 1, 2022

Actually the sleep was not needed.

There are other tests in the same class that also use Thread.sleep(). Should I create another PR after this one is merged where I replace those usages with the waitUntilAfter method?

Also, do I need to make a PR for documentation?


rdblue commented Apr 1, 2022

Yeah, if you don't mind it would be great to get rid of those Thread.sleep calls! And it would still be a good idea to add waitUntilAfter here. Since this is using System.currentTimeMillis in the next line, I suspect that the tests may be flaky without waiting until after the current millisecond before the expiration call.
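A sketch of what such a waitUntilAfter helper can look like, reconstructed from context (the actual Iceberg test utility may differ): busy-wait until the clock has passed the captured timestamp, so an "older than now" check cannot race within the same millisecond the files were written:

```java
public class WaitUntilAfter {

  // Spins until System.currentTimeMillis() has advanced strictly past
  // timestampMillis, returning the first later timestamp observed. Unlike
  // Thread.sleep(n), this proceeds as soon as the clock ticks over, so no
  // fixed padding (e.g. +1000) is needed.
  static long waitUntilAfter(long timestampMillis) {
    long current = System.currentTimeMillis();
    while (current <= timestampMillis) {
      current = System.currentTimeMillis();
    }
    return current;
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    long after = waitUntilAfter(start);
    System.out.println(after > start); // true
  }
}
```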


rdblue commented Apr 4, 2022

@ulmako, sorry to ask for another change, but we actually don't need the +1000 when using waitUntilAfter. That method should allow us to proceed as quickly as possible to the rest of the test. Could you remove those and then I'll test/merge?

Thanks!


@aokolnychyi aokolnychyi left a comment


Looks reasonable to me (apart from other comments). I had two optional minor nits on styling.


ulmako commented Apr 14, 2022

@rdblue, sorry it took me so long to remove the +1000. I somehow missed your last comment.


@kbendick kbendick left a comment


Outside of some style issues with indentation in the tests, this looks good to me.

Thank you @ulmako for working on this!


kbendick commented Apr 22, 2022

Oh I see. The checkstyle check is failing. If you click on the failing check, Java CI / build-checks, you can see this error.

Error: [checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java:276: Line is longer than 120 characters (found 121). [LineLength]

If you run ./gradlew -DflinkVersions=1.13,1.14 -DsparkVersions=2.4,3.0,3.1,3.2 -DhiveVersions=2,3 build -x test -x javadoc -x integrationTest locally and continue to fix the errors, then everything will be good to go 😄

ulmako added 2 commits April 25, 2022 12:22
Move Javadoc from 'forSpecs' method to PartitionAwareHiddenPathFilter class

rdblue commented Apr 28, 2022

I merged this in #4655. Thanks for fixing this, @ulmako!


ulmako commented May 4, 2022

I merged this in #4655. Thanks for fixing this, @ulmako!

Closed because it was merged as part of #4655.

@ulmako ulmako closed this May 4, 2022

rdblue commented May 4, 2022

Sorry, I thought I'd closed this one. Thanks for fixing this!



Successfully merging this pull request may close these issues.

Allow disabling HiddenPathFilter from RemoveOrphanFiles

4 participants