
Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles#4307

Closed
ulmako wants to merge 15 commits into apache:master from kasasi:master

Conversation


@ulmako ulmako commented Mar 10, 2022

Closes #4249


@rdblue rdblue left a comment

Thanks for submitting this, @ulmako! It looks really close to being ready, but I left a few comments to address.

Also, do you think that this should automatically detect whether hidden paths should be ignored? We should be able to do that by checking for partition field names starting with . or _ in the table's partition specs. If there is a partition field like _name then we know that we should not ignore hidden paths.
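A minimal sketch of the detection described above, assuming the partition field names are available as plain strings. The class and method names here are invented for illustration, not actual Iceberg API:

```java
import java.util.List;

public class HiddenPartitionDetector {

  // Returns true if any partition field name would itself be treated as
  // "hidden" by a HiddenPathFilter-style check (leading '.' or '_').
  // In that case the action should NOT ignore hidden paths, because real
  // partition directories like "_name=x" would otherwise be skipped.
  static boolean hasHiddenPartitionField(List<String> partitionFieldNames) {
    for (String name : partitionFieldNames) {
      if (name.startsWith(".") || name.startsWith("_")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasHiddenPartitionField(List.of("_name", "ts_day")));   // true
    System.out.println(hasHiddenPartitionField(List.of("category", "ts_day"))); // false
  }
}
```

For a table with multiple partition specs, the real check would need to scan the field names of every spec, not just the current one.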


ulmako commented Mar 14, 2022

Hey @rdblue,

thanks for all your feedback.
The formatting changes slipped in before I set up the code-style scheme, and I did not catch them. I will of course revert them.

Your suggestion to use the options in BaseSparkAction is definitely the better choice.

Regarding the automatic detection based on partition field names, I have one question: would you still ignore all other hidden paths that are not related to the partitions, or disable the HiddenPathFilter completely? For illustration:
Given the partition name _part and the following paths:

-- /data/
      | --  _part=AA
      | --  _part=BB
      | --  _part=CC
      | --  _some-folder

Would you include only the _part=* paths and still ignore _some-folder, or include all paths?
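The first of the two behaviors in the question above (keep _part=* but still ignore _some-folder) can be sketched as a partition-aware hidden-path check. The names below are invented for illustration and are not the actual Iceberg implementation:

```java
import java.util.Set;

public class PartitionAwareHiddenCheck {

  // hiddenPartitionPrefixes would be derived from the table's partition
  // field names, e.g. field "_part" becomes prefix "_part=".
  static boolean accept(String dirName, Set<String> hiddenPartitionPrefixes) {
    boolean hidden = dirName.startsWith("_") || dirName.startsWith(".");
    if (!hidden) {
      return true; // non-hidden paths are always listed
    }
    // Keep a hidden directory only when it belongs to a known partition field.
    for (String prefix : hiddenPartitionPrefixes) {
      if (dirName.startsWith(prefix)) {
        return true;
      }
    }
    return false; // unrelated hidden paths stay filtered, e.g. "_some-folder"
  }

  public static void main(String[] args) {
    Set<String> prefixes = Set.of("_part=");
    System.out.println(accept("_part=AA", prefixes));     // true
    System.out.println(accept("_some-folder", prefixes)); // false
  }
}
```

The second behavior (disable the filter completely) would simply return true unconditionally, which is what a plain boolean option on the action amounts to.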

Also, where should I document the new option?


@kbendick kbendick left a comment


Hi @ulmako! I left some review comments, but I forgot to click submit 🙈 So sorry about that!

I largely agree with @rdblue that this is likely not a good candidate for a table property.

I also looked into the source code for the Hadoop file system list-path filtering, and it looks like all of the filtering is done on the client side: as far as I can tell, the listing is still queried from HDFS and then filtered within the program.

We should likely just make this an option on the action or not, but I thought I'd leave that finding as a comment in case it makes a difference to others.
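The client-side point above can be sketched like this: a Hadoop PathFilter has no server-side analogue, so the listing comes back in full and is filtered in the client JVM. The sketch mimics that pattern with java.nio so it runs without Hadoop on the classpath; ClientSideHiddenFilter, listVisible, and demo are made-up names, not Iceberg or Hadoop APIs:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClientSideHiddenFilter {

  // Lists child names of dir, dropping "hidden" entries (leading '.' or '_')
  // only after the full listing has come back, mirroring how a Hadoop
  // PathFilter is applied in the client rather than on the namenode.
  static List<String> listVisible(Path dir) {
    List<String> visible = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path child : stream) {
        String name = child.getFileName().toString();
        if (!name.startsWith(".") && !name.startsWith("_")) {
          visible.add(name);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Collections.sort(visible); // deterministic order for the demo below
    return visible;
  }

  // Builds a small throwaway layout like the /data/ examples in this thread.
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("data");
      Files.createDirectory(dir.resolve("part=AA"));
      Files.createDirectory(dir.resolve("_temporary"));
      return listVisible(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [part=AA]
  }
}
```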

Thanks again for your work on this and sorry for my slow response! Please feel free to reach out on Slack (or here) if you need anything 😄


kbendick commented Mar 14, 2022

Regarding the automatic detection based on partition field names, I have one question: would you still ignore all other hidden paths that are not related to the partitions, or disable the HiddenPathFilter completely? For illustration: Given the partition name _part and the following paths:

-- /data/
      | --  _part=AA
      | --  _part=BB
      | --  _part=CC
      | --  _some-folder

Would you include only the _part=* paths and still ignore _some-folder, or include all paths?

I originally had the same comment as Ryan, about checking for partitions specifically as that was the source of the original problem.

However, thinking on this question, I would think we would still want to allow removing directories/paths that start with a leading _ if the option were enabled, even if they aren't part of partitions. In particular, some file output committers use a leading _ for in-progress output before moving the data over via a rename operation, which might be something that needs to be cleaned up using RemoveOrphanFiles if the job fails at the wrong time.

Please correct me if I'm mistaken about the file output committers, as I personally tend to use S3 (and then either S3FileIO or S3A) and the magic output committer when using S3A.

But if we're detecting partition paths automatically and not filtering on those, then possibly we'd want to keep this second scenario as an additional option.

I have also not historically used HDFS much in production scenarios, so possibly I'm mistaken about the benefits / utility of having hidden paths or how they'd end up in the directory structure of the table (outside of the previously two mentioned situations). Please feel free to let me know if I'm mistaken. 🙂

Also, where should I document the new option?

For documenting the new option, let's worry about that once we get it confirmed into the right place. But definitely let's be sure to document it.

I would start with adding it as a JavaDoc comment, assuming that there is a JavaDoc comment for the class already that documents the options for the action (I believe there is). Once we're closer to the final solution, we can also find a place in the docs themselves to put this option if necessary.

@github-actions github-actions bot added the API label Mar 14, 2022

@kbendick kbendick left a comment


Left some more feedback. It seems like it's getting pretty close!


@rdblue rdblue left a comment


Thanks, @ulmako! This looks close to being ready to go to me. I think we just want to add = to the set of partition names and add a test for it. There's more detail in my comment.

Makes sure hidden paths that are at the beginning of a partition name (i.e. "_hidden" is a prefix of "_hiddenPartition") are not accidentally accepted.
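A hedged sketch of why appending "=" matters for the prefix match (invented names, not the actual code): bare field names conflate a field with any longer name that shares its prefix, while adding the "=" that Hive-style layouts place between field and value makes the match unambiguous:

```java
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionPrefixMatch {

  // Accepts dirName only if it starts with "<fieldName>=" for some known
  // partition field; matching on the bare field name would be too loose.
  static boolean matchesPartition(String dirName, Set<String> fieldNames) {
    Set<String> prefixes =
        fieldNames.stream().map(f -> f + "=").collect(Collectors.toSet());
    return prefixes.stream().anyMatch(dirName::startsWith);
  }

  public static void main(String[] args) {
    Set<String> fields = Set.of("_hidden");
    // A real partition directory for field "_hidden" is accepted.
    System.out.println(matchesPartition("_hidden=AA", fields));          // true
    // "_hiddenPartition=AA" merely shares the "_hidden" prefix; a naive
    // startsWith("_hidden") check would wrongly accept it too.
    System.out.println(matchesPartition("_hiddenPartition=AA", fields)); // false
  }
}
```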

ulmako commented Apr 1, 2022

Thanks @rdblue, @kbendick for your feedback. I really appreciate it. There's a lot to learn from you guys.
I added your suggestions to the code. Sorry it took a few days, but I was pretty occupied with other tasks. Employers tend to get mad when you ignore the tasks they give you ;)


@rdblue rdblue left a comment


This looks good to me other than the use of Thread.sleep in tests. Could you fix that and remove the Draft marker, @ulmako? Thank you!

@ulmako ulmako changed the title WIP: Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles Apr 1, 2022
@ulmako ulmako marked this pull request as ready for review April 1, 2022 19:53

ulmako commented Apr 1, 2022

Actually the sleep was not needed.

There are other tests in the same class that also use Thread.sleep(). Should I create another PR after this one is merged where I replace those usages with the waitUntilAfter method?

Also, do I need to make a PR for documentation?


rdblue commented Apr 1, 2022

Yeah, if you don't mind it would be great to get rid of those Thread.sleep calls! And it would still be a good idea to add waitUntilAfter here. Since this is using System.currentTimeMillis in the next line, I suspect that the tests may be flaky without waiting until after the current millisecond before the expiration call.
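A sketch of what such a waitUntilAfter helper can look like, reconstructed from context (the actual Iceberg test utility may differ): busy-wait until the clock has passed the captured timestamp, so an "older than now" check cannot race within the same millisecond the files were written:

```java
public class WaitUntilAfter {

  // Spins until System.currentTimeMillis() has advanced strictly past
  // timestampMillis, returning the first later timestamp observed. Unlike
  // Thread.sleep(n), this proceeds as soon as the clock ticks over, so no
  // fixed padding (e.g. +1000) is needed.
  static long waitUntilAfter(long timestampMillis) {
    long current = System.currentTimeMillis();
    while (current <= timestampMillis) {
      current = System.currentTimeMillis();
    }
    return current;
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    long after = waitUntilAfter(start);
    System.out.println(after > start); // true
  }
}
```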


rdblue commented Apr 4, 2022

@ulmako, sorry to ask for another change, but we actually don't need the +1000 when using waitUntilAfter. That method should allow us to proceed as quickly as possible to the rest of the test. Could you remove those and then I'll test/merge?

Thanks!


@aokolnychyi aokolnychyi left a comment


Looks reasonable to me (apart from other comments). I had two optional minor nits on styling.


ulmako commented Apr 14, 2022

@rdblue, sorry it took me so long to remove the +1000. I somehow missed your last comment.


@kbendick kbendick left a comment


Outside of some style issues with indentation in the tests, this looks good to me.

Thank you @ulmako for working on this!


kbendick commented Apr 22, 2022

Oh I see. The checkstyle check is failing. If you click on the failing check, Java CI / build-checks, you can see this error.

Error: [checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java:276: Line is longer than 120 characters (found 121). [LineLength]

If you run ./gradlew -DflinkVersions=1.13,1.14 -DsparkVersions=2.4,3.0,3.1,3.2 -DhiveVersions=2,3 build -x test -x javadoc -x integrationTest locally and continue to fix the errors, then everything will be good to go 😄

ulmako added 2 commits April 25, 2022 12:22
Move Javadoc from 'forSpecs' method to PartitionAwareHiddenPathFilter class

rdblue commented Apr 28, 2022

I merged this in #4655. Thanks for fixing this, @ulmako!


ulmako commented May 4, 2022

I merged this in #4655. Thanks for fixing this, @ulmako!

Closed because it was merged as part of #4655.

@ulmako ulmako closed this May 4, 2022

rdblue commented May 4, 2022

Sorry, I thought I'd closed this one. Thanks for fixing this!



Successfully merging this pull request may close these issues.

Allow disabling HiddenPathFilter from RemoveOrphanFiles

4 participants