Skip to content

Conversation

@tabmatfournier
Copy link

introduces a spark action to tackle two corrupt manifest issues:

  • duplicate files existing within the same manifest, or between manifests
  • missing files referenced by a manifest (configurable, for emergency purposes only)
  • implements a dryRun option

introduces a spark action to tackle two corrupt manifest issues:

- duplicate files existing within the same manifest, or between manifests
- missing files referenced by a manifest (configurable, for emergency purposes only)
- implements a dryRun option
@szehon-ho
Copy link
Member

It looks similar to my attempt in #2608

@danielcweeks
Copy link
Contributor

It looks similar to my attempt in #2608

@szehon-ho would you have any issue if we proceed with this PR? I think there's overlap between the two, but this one addresses two of the main issues we've seen which are missing file references and duplicated files (the latter causes some interesting problems). I think you version would also require a rebase as it looks a bit old.

@danielcweeks danielcweeks self-requested a review July 12, 2024 21:27
@szehon-ho
Copy link
Member

yea i think that makes sense, let's do it in a way that we can add more functionality later.

@szehon-ho
Copy link
Member

szehon-ho commented Jul 13, 2024

@danielcweeks @tabmatfournier while we are discussing, do you guys think it makes sense later to integrate my functionality into this Repair action? its been awhile, but iirc it was about fixing the manifest metadata (ie, file sizes, and metrics).

File sizes to fix a bug #1980 that at the time seemed important, but its probably rare.

Re-calculating metrics based on new table configs is a more common case. Ive seen a lot of users with huge metadatas that OOM planning, and realize they need to tune to prune out un-used metrics in existing metadata, and so far there's no mechanism to do this.

Another option for these is additonal flag on RewriteManifests, but I think there were some opinions in #2608 that it should be a separate action, because rewrite is more like 'rewrite-as-is'.

If we agree, I can add these funcitons in a follow up. Maybe it makes sense to have a bit array of options, as it seems Repair has many options then.

@amogh-jahagirdar
Copy link
Contributor

@szehon-ho @danielcweeks I did a comparison between the two PRs, and actually seems like @szehon-ho PR is more about repairing manifest entry details from the actual data file on disk. So the concerns being addressed by the two repair functions are a bit different. I think what I'd propose is (I think @szehon-ho is saying the same thing, but let me know if I'm misinterpreting):

1.) we take this forward to handle missing file references and duplicate files.
2.)later on add an option to the procedure to do the repair manifest entry statistics from the actual data file on disk. This is a bit "Deeper" of a repair imo in the sense that we actually need to read from disk. Although arguably here in the current implementation we still need to do the disk read for a file existence check for missing files anyways? I think it'll become clear as we tighten up this implementation.

Another aspect I'd propose is a SupportsFileRecovery mixin for FileIO. For example for S3FileIO, if the bucket is version enabled, we could attempt a best effort to recover in the case a live manifest entry points to a file which for whatever reason no longer exists on disk.

@danielcweeks
Copy link
Contributor

Thanks for the assessment @amogh-jahagirdar. @szehon-ho, yes I think we do want to support the work you did and add it to this action. Overall, this was focused on fixing broken tables, but we want this to support restoring a table to a "healthy state" and I know metrics is another area where people have run into issues.

@amogh-jahagirdar
Copy link
Contributor

Cool, I'll separate the action interface changes from this PR for easier review for folks.

@github-actions
Copy link

github-actions bot commented Nov 6, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 6, 2024
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants