Member

@openinx openinx commented Mar 8, 2021

This PR builds on top of #2294 and resolves issue #1028.

  @Override
  public RewriteDataFilesActionResult execute() {
-   CloseableIterable<FileScanTask> fileScanTasks = null;
+   CloseableIterable<FileScanTask> fileScanTasks = CloseableIterable.empty();
Member Author

I'm not sure whether RewriteDataFilesActionResult should be renamed to RewriteFilesActionResult: the rewrite action removes all deletions from the file set, so it actually rewrites the delete files as well.

  Map<K, V> copy = Maps.newHashMapWithExpectedSize(map.size());
  copy.putAll(map);
- return Collections.unmodifiableMap(copy);
+ return copy;
Member Author

There is a patch to address the kryo serialization issue: #2343

Member Author

openinx commented Apr 8, 2021

@aokolnychyi, it looks like @rdblue has been away from the community in recent days. Would you mind taking a look at this PR if you have a chance? Thanks.

- private boolean isPartialFileScan(CombinedScanTask task) {
+ private boolean doPartitionNeedRewrite(Collection<FileScanTask> partitionTasks) {
    int files = 0;
    for (FileScanTask scanTask : partitionTasks) {
Member

I believe this would break splitting large files, since a file scan task with only a single file will never be marked for rewrite.
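To illustrate the concern, here is a minimal sketch of a partition-level check that still marks a lone oversized file for rewrite. It is not the code in this PR: `RewritePlanningSketch`, `partitionNeedsRewrite`, and the `targetSizeInBytes` parameter are hypothetical names, and file sizes stand in for real `FileScanTask` objects.

```java
import java.util.List;

// Hypothetical stand-in for a per-partition rewrite check; not Iceberg's API.
final class RewritePlanningSketch {
  private RewritePlanningSketch() {}

  /**
   * Returns true when the files planned for one partition should be rewritten.
   *
   * @param fileSizesInBytes sizes of the data files planned for the partition
   * @param targetSizeInBytes assumed target size of a rewritten file
   */
  static boolean partitionNeedsRewrite(List<Long> fileSizesInBytes, long targetSizeInBytes) {
    // More than one file: candidates for being combined.
    if (fileSizesInBytes.size() > 1) {
      return true;
    }
    // A single file still needs a rewrite if it is large enough to be split.
    for (long size : fileSizesInBytes) {
      if (size > targetSizeInBytes) {
        return true;
      }
    }
    return false;
  }
}
```

A check that only counts files would return false for the single-large-file case, which is the regression described above.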

RewriteResult.Builder resultBuilder = RewriteResult.builder();
for (FileScanTask scanTask : task.files()) {
resultBuilder.addDataFilesToDelete(scanTask.file());
resultBuilder.addDeleteFilesToDelete(scanTask.deletes());
Member

I believe the spec allows a delete file to reference multiple data files. That means a delete file cannot be removed just because it is associated with a data file that is being rewritten. Instead, you would need to check that no live data files are still referenced by the delete file. This probably requires getting hold of a reversed delete file index.
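One way to picture the reversed index is a map from each delete file to the live data files it may still apply to. The sketch below assumes file paths as plain strings; `ReverseDeleteIndexSketch` and its methods are hypothetical and are not part of Iceberg.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reversed delete file index; not Iceberg's API.
final class ReverseDeleteIndexSketch {
  // delete file path -> paths of live data files it may still apply to
  private final Map<String, Set<String>> liveDataFilesByDelete = new HashMap<>();

  /** Records that deleteFile may apply to dataFile. */
  void addReference(String deleteFile, String dataFile) {
    liveDataFilesByDelete.computeIfAbsent(deleteFile, k -> new HashSet<>()).add(dataFile);
  }

  /**
   * Returns the delete files whose referenced data files are all being rewritten,
   * meaning no live data file would still need them afterwards.
   */
  Set<String> removableDeleteFiles(Collection<String> rewrittenDataFiles) {
    Set<String> rewritten = new HashSet<>(rewrittenDataFiles);
    Set<String> removable = new HashSet<>();
    for (Map.Entry<String, Set<String>> entry : liveDataFilesByDelete.entrySet()) {
      if (rewritten.containsAll(entry.getValue())) {
        removable.add(entry.getKey());
      }
    }
    return removable;
  }
}
```

With this shape, a delete file shared by `data-a` and `data-b` stays in place when only `data-a` is rewritten.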

Contributor

rdblue commented May 20, 2021

Sorry for the delay. I'm back from parental leave now.

I agree with @RussellSpitzer's comments on this. I don't think that we can remove delete files just because data files were rewritten. We need to ensure that there are no data files that are still referenced by the delete files. This is probably going to require some work and may require reading the delete file. We can use the _file stats in some cases but we need to be careful.

For now, I'd recommend letting the sequence numbers handle this. Because rewriting files will move them to a newer sequence number, the delete file won't be added to the new file's scan task when reading. It would still be considered during job planning, but I think that is okay and we don't need to aggressively drop them.
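As a rough sketch of why sequence numbers are enough here: as I read the table spec, a position delete applies to data files with a sequence number less than or equal to its own, and an equality delete applies only to data files with a strictly smaller one. The class and method names below are hypothetical, and partition matching is left out.

```java
// Hypothetical helper illustrating sequence-number filtering; not Iceberg's API.
// Partition matching, which the spec also requires, is omitted for brevity.
final class SequenceNumberSketch {
  private SequenceNumberSketch() {}

  /** Position deletes apply to data files at the same or an older sequence number. */
  static boolean positionDeleteApplies(long dataSequenceNumber, long deleteSequenceNumber) {
    return dataSequenceNumber <= deleteSequenceNumber;
  }

  /** Equality deletes apply only to data files at a strictly older sequence number. */
  static boolean equalityDeleteApplies(long dataSequenceNumber, long deleteSequenceNumber) {
    return dataSequenceNumber < deleteSequenceNumber;
  }
}
```

For example, a data file at sequence number 3 is affected by a delete file at 5, but once it is rewritten at sequence number 6 neither check matches, so the old delete file is no longer attached to the new file's scan task.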

Member Author

openinx commented Apr 2, 2022

Since I've cleaned up my old forked GitHub repo, I can no longer update this PR. Let's address those comments in a new PR. Closing this now.

@openinx openinx closed this Apr 2, 2022
Linked issue: Add an action to rewrite data files and remove deleted rows

3 participants