-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: Remove all delete files in RewriteFiles action. #2303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @Override | ||
| public RewriteDataFilesActionResult execute() { | ||
| CloseableIterable<FileScanTask> fileScanTasks = null; | ||
| CloseableIterable<FileScanTask> fileScanTasks = CloseableIterable.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this name RewriteDataFilesActionResult should be renamed to RewriteFilesActionResult because the rewrite action is removing all the deletions from files set, it also rewrite the delete files actually.
| Map<K, V> copy = Maps.newHashMapWithExpectedSize(map.size()); | ||
| copy.putAll(map); | ||
| return Collections.unmodifiableMap(copy); | ||
| return copy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a patch to address the kryo serialization issue: #2343
|
@aokolnychyi , looks like @rdblue is absent for the community in recent days, would you mind to take a look this PR if you have a chance ? Thanks. |
| private boolean isPartialFileScan(CombinedScanTask task) { | ||
| private boolean doPartitionNeedRewrite(Collection<FileScanTask> partitionTasks) { | ||
| int files = 0; | ||
| for (FileScanTask scanTask : partitionTasks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this would break splitting large files, since a file scan task with only a single file will never be marked for rewrite.
| RewriteResult.Builder resultBuilder = RewriteResult.builder(); | ||
| for (FileScanTask scanTask : task.files()) { | ||
| resultBuilder.addDataFilesToDelete(scanTask.file()); | ||
| resultBuilder.addDeleteFilesToDelete(scanTask.deletes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the spec allows for a delete file to reference multiple data files, which I think means that just because a delete file is associated with a data file that is being rewritten, it doesn't mean that the file can removed. Instead you would need to check that there are no longer any live data files which are referenced by the delete. This probably requires getting a hold of a reversed delete file index.
|
Sorry for the delay. I'm back from parental leave now. I agree with @RussellSpitzer's comments on this. I don't think that we can remove delete files just because data files were rewritten. We need to ensure that there are no data files that are still referenced by the delete files. This is probably going to require some work and may require reading the delete file. We can use the For now, I'd recommend letting the sequence numbers handle this. Because rewriting files will move them to a newer sequence number, the delete file won't be added to the new file's scan task when reading. It would still be considered during job planning, but I think that is okay and we don't need to aggressively drop them. |
|
As I've cleaned the old forked github repo, so I cannot update this PR now. Let's just address those comments in a new PR. Closing this now. |
This PR is building on top of #2294, resolving the issue #1028