Fix RemoveOrphanFilesAction when file_path is not a qualified path #1052
Conversation
```diff
   Dataset<Row> actualFileDF = buildActualFileDF();

-  Column joinCond = validFileDF.col("file_path").equalTo(actualFileDF.col("file_path"));
+  Column joinCond = actualFileDF.col("file_path").contains(validFileDF.col("file_path"));
```
Isn't this going to cause Spark to use a nested loop join (a full cross join), because there is no way to partition the data for this expression?

To fix it, what about using just the file name as well? File names should be unique because we embed the write UUID, partition, and task ID. If we add both checks, the file name can be used to distribute the data without many collisions, and `contains` can be used for final correctness:

```java
Column nameEqual = filename(actualFileDF.col("file_path")).equalTo(filename(validFileDF.col("file_path")));
Column actualContains = actualFileDF.col("file_path").contains(validFileDF.col("file_path"));
Column joinCond = nameEqual.and(actualContains);
```

FYI @aokolnychyi.
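For reference, here is a minimal sketch of what such a `filename` helper could look like with Spark's Java UDF API. The registration below is an illustrative assumption, not Iceberg's actual implementation:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

class FilenameUdf {
  // Hypothetical sketch: register a UDF that keeps only the portion of the
  // path after the last '/', so two paths can be compared by file name alone.
  static void register(SparkSession spark) {
    spark.udf().register("filename", (UDF1<String, String>) path -> {
      int idx = path.lastIndexOf('/');
      return idx >= 0 ? path.substring(idx + 1) : path;
    }, DataTypes.StringType);
  }

  // Wraps the registered UDF so it can be used like a regular Column function.
  static Column filename(Column pathCol) {
    return functions.callUDF("filename", pathCol);
  }
}
```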
Ok, I see. If the file name is unique, then I think it would be fine to change it this way. Let me update the code.
Even if it isn't unique, we don't expect many duplicates because writers operate in parallel. And the worst case is that all files have the same name and all get joined in a single task, which is pretty much the same as using a nested loop join.
Looks great! Thanks for fixing this, @jerryshao! FYI @aokolnychyi.
Yeah, I've seen this problem but didn't get time to fix it. Thanks, @jerryshao. I believe the problem is not about having a qualified path; it is about the table's location not having a scheme. That's why I am not sure how this UDF will help us. Also, switching to `contains` alone could mean a nested loop join.
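To illustrate the scheme point with a small, self-contained Hadoop example (the location below is hypothetical): a path written without a scheme stays scheme-less until it is qualified against a file system.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathSchemeDemo {
  public static void main(String[] args) throws Exception {
    // A table location given without a scheme keeps no scheme on its own.
    Path location = new Path("/temp/test_db/table");
    FileSystem fs = location.getFileSystem(new Configuration());

    System.out.println(location);                    // /temp/test_db/table
    // makeQualified resolves it against the default file system.
    System.out.println(fs.makeQualified(location));  // file:/temp/test_db/table
  }
}
```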
Actually, I take it back. I missed the point that the UDF we added actually fetches the file name only, and having an equality predicate will avoid the nested loop join.
Yes, the key problem is that we don't always change the path to a qualified one; it depends on the path the user provides. If we don't use a qualified path (e.g. `file:/temp/test_db`) to create or save into a (Hadoop) table, then the `file_path` queried out of the table metadata is not a qualified path.
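For illustration, with a table created under an unqualified location, the two sides of the join could look like this (hypothetical values):

```
file_path from table metadata:   /temp/test_db/table/data/00000-0-...-00001.parquet
path from directory listing:     file:/temp/test_db/table/data/00000-0-...-00001.parquet
```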
But the code here, `file.getPath().toString()` in `RemoveOrphanFilesAction#listDirRecursively`, returns a qualified path. So the `equalTo` join condition fails to match valid files against the listing, and the action may mistakenly delete files that are still referenced. So here I propose fixing the join condition to `contains`. Another solution would be to change the relative path to a qualified one everywhere.
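A minimal sketch of the combined join condition discussed above, assuming a `filename` helper like the one sketched earlier and the `actualFileDF`/`validFileDF` datasets from the action; the anti-join at the end is an assumption about how the orphans are collected:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Equality on the file name lets Spark hash-partition the join ...
Column nameEqual = FilenameUdf.filename(actualFileDF.col("file_path"))
    .equalTo(FilenameUdf.filename(validFileDF.col("file_path")));
// ... while contains keeps correctness for unqualified vs. qualified paths.
Column actualContains = actualFileDF.col("file_path")
    .contains(validFileDF.col("file_path"));
Column joinCond = nameEqual.and(actualContains);

// Hypothetical: actual files with no matching valid file are orphans.
Dataset<Row> orphanFileDF = actualFileDF.join(validFileDF, joinCond, "leftanti");
```

With the equality predicate present, Spark can plan a shuffle hash or sort-merge join keyed on the file name instead of a nested loop join, which is the performance concern raised in the review.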