Repair manifest action #10445
introduces a spark action to tackle two corrupt manifest issues:
- duplicate files existing within the same manifest, or between manifests
- missing files referenced by a manifest (configurable, for emergency purposes only)
- implements a dryRun option
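For readers unfamiliar with the two failure modes, here is a minimal sketch of the detection and repair logic in plain Python. It is not the PR's Spark implementation: manifests are modeled as a hypothetical dict of manifest name to data file paths, and `scan_manifests`, `repair`, `check_missing`, and `dry_run` are illustrative names, not Iceberg APIs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RepairReport:
    # file path -> manifests that reference it (only paths referenced more than once)
    duplicates: dict = field(default_factory=dict)
    # (manifest, file path) pairs whose file does not exist
    missing: list = field(default_factory=list)


def scan_manifests(manifests, existing_files, check_missing=False):
    """Find duplicate and (optionally) missing file references.

    manifests: hypothetical dict of manifest name -> list of data file paths.
    existing_files: set of paths that actually exist in storage.
    """
    seen = defaultdict(list)
    report = RepairReport()
    for manifest, paths in manifests.items():
        for path in paths:
            seen[path].append(manifest)
            if check_missing and path not in existing_files:
                report.missing.append((manifest, path))
    report.duplicates = {p: ms for p, ms in seen.items() if len(ms) > 1}
    return report


def repair(manifests, existing_files, check_missing=False, dry_run=True):
    """Return (report, manifests); with dry_run=True the input is returned untouched."""
    report = scan_manifests(manifests, existing_files, check_missing)
    if dry_run:
        return report, manifests
    missing_paths = {path for _, path in report.missing}
    kept = set()
    repaired = {}
    for manifest, paths in manifests.items():
        out = []
        for path in paths:
            # Keep only the first reference to each file and drop missing ones.
            if path in kept or path in missing_paths:
                continue
            kept.add(path)
            out.append(path)
        repaired[manifest] = out
    return report, repaired
```

Defaulting `dry_run` to `True` and `check_missing` to `False` in this sketch mirrors the intent that removing missing-file references is an opt-in, emergency-only operation.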
|
It looks similar to my attempt in #2608 |
@szehon-ho would you have any issue if we proceed with this PR? I think there's overlap between the two, but this one addresses two of the main issues we've seen, which are missing file references and duplicated files (the latter causes some interesting problems). I think your version would also require a rebase, as it looks a bit old. |
|
yea i think that makes sense, let's do it in a way that we can add more functionality later. |
|
@danielcweeks @tabmatfournier while we are discussing, do you think it makes sense to later integrate my functionality into this Repair action? It's been a while, but IIRC it was about fixing the manifest metadata (i.e., file sizes and metrics). The file-size fix was for bug #1980, which seemed important at the time but is probably rare. Recalculating metrics based on new table configs is a more common case: I've seen a lot of users whose huge metadata causes OOMs during planning, who then realize they need to tune the configs to prune unused metrics from existing metadata, and so far there's no mechanism to do this. Another option is an additional flag on RewriteManifests, but I think there were some opinions in #2608 that it should be a separate action, because rewrite is more like 'rewrite-as-is'. If we agree, I can add these functions in a follow-up. Maybe it makes sense to have a bit array of options, since Repair seems to have many options. |
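As a rough illustration of the "bit array of options" idea, the sketch below models repair options as combinable bit flags using Python's `enum.Flag`. The option names are hypothetical, picked from the repairs discussed in this thread; nothing here reflects an actual Iceberg interface.

```python
from enum import Flag, auto


class RepairOption(Flag):
    # Hypothetical option names; each member occupies one bit.
    REMOVE_DUPLICATE_FILES = auto()
    REMOVE_MISSING_FILES = auto()
    REPAIR_FILE_SIZES = auto()
    RECOMPUTE_METRICS = auto()


def enabled(options):
    """List the names of the individual options set in a combined flag value."""
    return [opt.name for opt in RepairOption if opt in options]


# Options combine with bitwise OR and are tested with `in`.
selected = RepairOption.REMOVE_DUPLICATE_FILES | RepairOption.RECOMPUTE_METRICS
```

A single flag value keeps the action's signature stable as more repair modes are added, at the cost of being less self-documenting than one named setter per option.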
|
@szehon-ho @danielcweeks I did a comparison between the two PRs, and it actually seems like @szehon-ho's PR is more about repairing manifest entry details from the actual data file on disk. So the concerns being addressed by the two repair functions are a bit different. I think what I'd propose is (I think @szehon-ho is saying the same thing, but let me know if I'm misinterpreting): 1.) we take this forward to handle missing file references and duplicate files. Another aspect I'd propose is a
|
Thanks for the assessment @amogh-jahagirdar. @szehon-ho, yes I think we do want to support the work you did and add it to this action. Overall, this was focused on fixing broken tables, but we want this to support restoring a table to a "healthy state" and I know metrics is another area where people have run into issues. |
|
Cool, I'll separate the action interface changes from this PR for easier review for folks. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |