Spark 3.2: Allow disabling HiddenPathFilter in RemoveOrphansFiles#4307
ulmako wants to merge 15 commits into apache:master
rdblue left a comment
Thanks for submitting this, @ulmako! It looks really close to being ready but I had a few comments to fix.
Also, do you think that this should automatically detect whether hidden paths should be ignored? We should be able to do that by checking for partition field names starting with . or _ in the table's partition specs. If there is a partition field like _name then we know that we should not ignore hidden paths.
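As a rough illustration of the detection idea above, here is a minimal sketch that works on plain partition field names. The class `HiddenPartitionDetector` and its method are hypothetical names made up for this example; in Iceberg the names would presumably be collected from the fields of each `PartitionSpec` of the table, which is an assumption and not something shown in this thread.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the Iceberg API.
final class HiddenPartitionDetector {
  private HiddenPartitionDetector() {}

  // Returns true if any partition field name starts with '.' or '_',
  // i.e. its partition directories would look like hidden paths.
  static boolean hasHiddenPartitionField(List<String> partitionFieldNames) {
    return partitionFieldNames.stream()
        .anyMatch(name -> name.startsWith("_") || name.startsWith("."));
  }

  public static void main(String[] args) {
    // A field named "_name" means hidden paths must not be ignored blindly.
    System.out.println(hasHiddenPartitionField(List.of("id", "_name")));  // prints "true"
    System.out.println(hasHiddenPartitionField(List.of("id", "ts_day"))); // prints "false"
  }
}
```

If the check returns true, the action would know that at least some hidden-looking directories hold table data and must not be skipped.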
Hey @rdblue, thanks for all your feedback. Your suggestion to use the
Regarding the automatic detection by looking into the partition field names, I have one question: would you still ignore all other hidden paths that are not related to the partitions, or disable the HiddenPathFilter completely? For illustration purposes: would you only include the paths
Also, where should I document the new option?
kbendick left a comment
Hi @ulmako! I left some review comments, but I forgot to click submit 🙈 So sorry about that!
I largely agree with @rdblue that this is likely not a good candidate for a table property.
I also looked into the source code for the hadoop file system list path filtering, and it looks like all of the filtering is done on the client side. Meaning that the data is still queried from HDFS and then filtered within the program as far as I can tell.
We should likely just make this an option (or not) for the action, but I thought I'd leave that finding as a comment in case it makes a difference to others.
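To make the client-side filtering point concrete, here is a simplified model of it. This is not Hadoop's code: `ClientSideListing` is a hypothetical class, and the in-memory list stands in for whatever the remote listing call returns. It only illustrates that every entry is fetched before the filter is applied, so the filter saves no listing work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model for illustration only; not the Hadoop FileSystem implementation.
final class ClientSideListing {
  // Stand-in for the remote store: all child names of one directory.
  private final List<String> remoteChildren;
  private int entriesFetched = 0;

  ClientSideListing(List<String> remoteChildren) {
    this.remoteChildren = remoteChildren;
  }

  // Fetches every child, then applies the filter locally.
  List<String> list(Predicate<String> filter) {
    List<String> accepted = new ArrayList<>();
    for (String name : remoteChildren) {
      entriesFetched++;          // the entry has already been transferred...
      if (filter.test(name)) {   // ...before the filter gets to see it
        accepted.add(name);
      }
    }
    return accepted;
  }

  int entriesFetched() {
    return entriesFetched;
  }
}
```

Under this model, listing a directory with children `data`, `_tmp` and `.hidden` through a filter that rejects hidden names returns only `data`, but still fetches all three entries.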
Thanks again for your work on this and sorry for my slow response! Please feel free to reach out on Slack (or here) if you need anything 😄
I originally had the same comment as Ryan about checking for partitions specifically, as that was the source of the original problem. However, thinking on this question, I would think we would still want to allow for removing directories / paths that start with a leading
Please correct me if I'm mistaken about the file output committers, as I tend to personally use S3 (and then either S3FileIO or S3A) and the magic output committers if using s3a. But if we're detecting partition paths automatically and not filtering on those, then possibly we'd want to keep this second scenario as an additional option.
I have also not historically used HDFS much in production scenarios, so possibly I'm mistaken about the benefits / utility of having hidden paths or how they'd end up in the directory structure of the table (outside of the two previously mentioned situations). Please feel free to let me know if I'm mistaken. 🙂
For documenting the new option, let's worry about that once we get it confirmed into the right place. But definitely let's be sure to document it. I would start with adding it as a JavaDoc comment, assuming that there is a JavaDoc comment for the class already that documents the options for the action (I believe there is). Once we're closer to the final solution, we can also find a place in the docs themselves to put this option if necessary.
…denPathFilter in DeleteOrphanFiles
kbendick left a comment
Left some more feedback. It seems like it's getting pretty close!
Makes sure hidden paths that form the beginning of a partition name (e.g. "_hidden" as a prefix of "_hiddenPartition") are not accidentally accepted.
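The prefix problem this commit describes can be sketched as follows. This is an illustration and not the code from the PR: `PartitionAwareHiddenPathSketch` is a hypothetical name, directory names are plain strings, and the sketch assumes Hive-style partition directories of the form `name=value`. Matching on the field name plus `=` is one way to keep `_hidden` from matching `_hiddenPartition`.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch for illustration only; not the filter class from the PR.
final class PartitionAwareHiddenPathSketch {
  private PartitionAwareHiddenPathSketch() {}

  // Builds a filter over directory names that accepts visible names and
  // hidden-looking names that belong to a known partition field.
  static Predicate<String> forPartitionNames(List<String> partitionFieldNames) {
    // Assumption: partition directories are named "<field>=<value>".
    // Appending '=' means "_hidden" cannot match "_hiddenPartition=".
    Set<String> hiddenPartitionPrefixes = partitionFieldNames.stream()
        .filter(name -> name.startsWith("_") || name.startsWith("."))
        .map(name -> name + "=")
        .collect(Collectors.toSet());

    return dirName -> {
      boolean looksHidden = dirName.startsWith("_") || dirName.startsWith(".");
      boolean isPartitionDir =
          hiddenPartitionPrefixes.stream().anyMatch(dirName::startsWith);
      return !looksHidden || isPartitionDir;
    };
  }

  public static void main(String[] args) {
    Predicate<String> filter = forPartitionNames(List.of("_hiddenPartition"));
    System.out.println(filter.test("_hiddenPartition=a")); // prints "true"
    System.out.println(filter.test("_hidden"));            // prints "false"
    System.out.println(filter.test("data"));               // prints "true"
  }
}
```

A plain `startsWith` check on the field name alone, in either direction, would be the kind of comparison that lets a shorter hidden directory name slip through.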
Actually the sleep was not needed. There are other tests in the same class also using Thread.sleep(). Should I create another PR after this one is merged where I replace the usages with the
Also, do I need to make a PR for documentation?
Yeah, if you don't mind it would be great to get rid of those Thread.sleep calls! And it would still be a good idea to add
@ulmako, sorry to ask for another change, but we actually don't need the
Thanks!
@rdblue, sorry it took me so long to remove the
Oh I see. The checkstyle check is failing. If you click on the failing check,
If you run
Move Javadoc from 'forSpecs' method to PartitionAwareHiddenPathFilter Class
Sorry, I thought I'd closed this one. Thanks for fixing this!
Closes #4249