API: Define RepairManifests action interface #14341
Conversation
Co-authored-by: Amogh Jahagirdar <[email protected]>
Co-authored-by: Mathew Fournier <[email protected]>
| * Configuration method for removing duplicate file entries and removing files which no longer
| * exist in storage
[doubt] My understanding is that it's fine to ignore duplicate files, since it may happen that a writer committed the same file twice (maybe it didn't checkpoint correctly). But can we just ignore files which are no longer there in storage? Let's say someone accidentally deleted a file; removing its reference is like acknowledging this is incorrect and moving on. Should ignoring deleted files be a separate API, rather than both being under the same API?
Hey @singhpk234!! In regards to duplicates, I've seen that if we have duplicates committed across snapshots, it results in duplicate rows in Spark. As for duplicates in the same operation, they are deduplicated with the DataFileSet during the operation. But from the previous PR, there are some users who can't even read their tables right now because their table has duplicate manifests across snapshots. For deleted files, the FileIO will throw an exception when it can't open a missing file, but that's at the engine level.
I agree that duplicates and missing files are different types of issues, but I still think they should be handled together in one action. The main goal is to restore the table to a queryable state and return the manifests to the most correct state possible. Most users just want to be able to query their table again; that's the priority. Once the table is readable, they can inspect it, understand what went wrong, and decide whether it's worth doing a deeper fix.
But, I'd like to hear what others think about this!
Yeah, fundamentally the goal of this repair procedure is to unblock users so they can access their tables.
When there are files referenced in metadata that don't actually exist in storage (typically caused by object storage retention policies), there's a choice:
- Either the file can be recovered. If it can be, I think we should always try for that.
- Or the file cannot be recovered, in which case we should unblock the user's table by removing that file, and indicate to the user that a file was missing, could not be recovered, and was removed as a result.
We could treat best effort file recovery as separate but I don't really see the value in that. If you're running this procedure, you're probably running it because you hit some failure and you invoke this procedure with the intent to get the table back to a readable state.
The key thing is that the output should indicate what happened, so that a user can at least reason about the fact that after this procedure there may be more records (in the case of recovery) than before.
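For readers skimming the thread, here is a minimal sketch of how such a missing-file check could be done with the existing FileIO API. This is an illustration only, not code from this PR; the helper class and method names are made up.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.InputFile;

// Hypothetical helper for illustration; not part of the proposed interface.
class MissingFileCheckSketch {
  // Returns true if the data file referenced in metadata can still be opened from storage.
  static boolean existsInStorage(Table table, DataFile file) {
    InputFile input = table.io().newInputFile(file.location());
    return input.exists();
  }
}
```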
I covered missing files in my reply above, but for duplicates, the same principle applies.
We know that duplicate entries are a problem; no client ever intends to do that, it just happens (zombies in KC, etc.) due to cases we don't handle yet. Even though the table is still readable, we know it's a correctness issue in this case because scan planning would return the files twice. We could maybe address this in the scan planning implementation, but that is just more state that would need to be retained in nodes doing planning, and it still really just masks the issue for other implementations. So I think the right solution here is also to just fix up the manifest and report to the user.
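To make the scan-planning point concrete, here is a rough sketch of dropping duplicate data file entries while rewriting a single manifest. It is my own illustration using iceberg-core's ManifestFiles helpers, not this PR's implementation, and it assumes format version 2 and the table's current spec for simplicity.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestReader;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.OutputFile;

// Illustration only: rewrite one manifest, keeping the first entry seen per data file path.
class DedupManifestSketch {
  static ManifestFile rewriteWithoutDuplicates(
      Table table, ManifestFile manifest, OutputFile output, long snapshotId) throws IOException {
    Set<String> seen = new HashSet<>();
    // format version 2 and the table's current spec are assumed here
    ManifestWriter<DataFile> writer = ManifestFiles.write(2, table.spec(), output, snapshotId);
    try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
      for (DataFile file : reader) {
        if (seen.add(file.location())) {
          writer.add(file); // keep the first occurrence of each path
        }
        // later duplicates of the same path are dropped
      }
    } finally {
      writer.close();
    }
    return writer.toManifestFile(); // valid only after the writer is closed
  }
}
```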
| /** Returns the number of manifest entries for which stats were incorrect */
| long entryStatsIncorrectCount();
|
| /** Returns the number of manifest entries for which stats were corrected */
| long entryStatsRepairedCount();
Hm, when would these two not be the same? If it's incorrect, we'd always correct it, no?
Yeah, truly we only care about what was corrected here. I initially had the distinction because if there were failures (though I'm not sure that's a real case here), we could document them. But more importantly, for the case where we run without the repair flag, we could report how many incorrect entry stats you have.
But looking back now, it wouldn't make sense to differentiate, since if repair fails the operation would likely fail entirely rather than partially succeed, and in a normal op the incorrect count is the same as the repaired count.
| Iterable<ManifestFile> rewrittenManifests();
|
| /** Returns the number of duplicate file entries that were removed from manifests */
| long duplicateFilesRemoved();
|
| /** Returns the number of missing file references that were removed from manifests */
| long missingFilesRemoved();
|
| /** Returns the number of missing files that were successfully recovered. */
| long missingFilesRecovered();
I think last time most of the debate was on the output, and I'd like to advance that conversation a bit more. The way I look at this is: as a user, I really want to see the numbers of files that were repaired in the different categories.
I don't think we particularly care about the actual paths of data files, and besides, that result set could be unfortunately large. Even if we do output, for example, which files were removed because they just don't exist and could not be recovered, that's not actionable output to a user; that file is just gone.
The numbers would clearly indicate how many files have been recovered, how many files have not been recovered, and how many duplicates existed.
Then we could surface the higher level in the tree: which manifests were rewritten. If someone really cares about which particular files were there, they could go through those. And even later on in the API it's not a one-way door; if there's a need to see which particular files, we could add an option to output those files to some other file.
I agree with this approach. For the most part, the user will need actionable information to help them understand the impact of running the repair operation. Whether they just want counts or are curious about the actual fixes, we can take a step up the tree and surface the rewritten manifests, which encapsulate the leaves (the content files) that the operation touched.
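Assembling the fragments quoted in this thread, the result shape under discussion looks roughly like the following. This is a sketch of the proposed interface as quoted above, not necessarily its final form.

```java
import org.apache.iceberg.ManifestFile;

/** Sketch of the proposed RepairManifests.Result, assembled from the fragments quoted above. */
interface Result {
  /** Returns rewritten manifests. */
  Iterable<ManifestFile> rewrittenManifests();

  /** Returns the number of duplicate file entries that were removed from manifests. */
  long duplicateFilesRemoved();

  /** Returns the number of missing file references that were removed from manifests. */
  long missingFilesRemoved();

  /** Returns the number of missing files that were successfully recovered. */
  long missingFilesRecovered();

  /** Returns the number of manifest entries for which stats were corrected. */
  long entryStatsRepairedCount();
}
```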
| /** The action result that contains a summary of the execution. */
| interface Result {
|   /** Returns rewritten manifests. */
|   Iterable<ManifestFile> rewrittenManifests();
I went back and forth on this, but I do think returning the whole manifest is reasonable, though it's not completely perfect for someone who wants to identify the number of records that may be impacted (because the rolled-up manifest summaries could contain files that never needed to be repaired).
Alternatively, we could just return the paths and record-level counts, but I don't think we need to complicate it.
Besides, those would be rolled up in the produced snapshot anyway. Furthermore, we can always add that info later.
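As a rough illustration of the roll-up point, assuming the result exposes rewrittenManifests() as quoted, a caller could approximate the affected row counts from the manifest summaries. Note this over-counts, since a rewritten manifest also carries entries that never needed repair; the helper below is hypothetical.

```java
import org.apache.iceberg.ManifestFile;

// Hypothetical caller-side helper, not part of the proposed API.
class RepairResultSummarySketch {
  // Sums the row counts recorded in the rewritten manifests' summaries.
  static long approximateAffectedRows(Iterable<ManifestFile> rewrittenManifests) {
    long rows = 0L;
    for (ManifestFile manifest : rewrittenManifests) {
      // these summary counts are nullable on ManifestFile
      if (manifest.addedRowsCount() != null) {
        rows += manifest.addedRowsCount();
      }
      if (manifest.existingRowsCount() != null) {
        rows += manifest.existingRowsCount();
      }
    }
    return rows;
  }
}
```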
| *
| * @return this for method chaining
| */
| RepairManifests repairFileEntries();
From the description, the interface only handles the missing or duplicated data files and does not include the missing manifest files.
This bug (#12792) accidentally deleted a manifest file, resulting in a corrupted table. I think we also need a separate interface, repairManifestEntries(), to fix corrupted tables caused by manifest issues.
Hey @hangc0276, this is a great point about the scope of this operation. You're right that the current RepairManifests action focuses on repairing issues within manifests (duplicate data file entries, missing data files, incorrect stats), but it doesn't handle the case where manifest files themselves are missing or corrupted.
This issue seems a level above what we're addressing here, unless we expand the scope to make this operation more about repairing metadata in general.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

I'm still watching this, and hoping it will land. (i.e. not-stale)

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
We've been seeing a need for repair-manifests functionality at AWS to fix some corrupted Iceberg tables, and we've been using a workaround based on the original PR to fix them. @amogh-jahagirdar mentioned it'd be good for someone to drive this forward since he doesn't currently have the bandwidth, so I've picked up the work on RepairManifests (#10784), which was originally opened by @tabmatfournier in #10445.
You can check out the original discussions in #10784.
In this reopened PR, I've updated RepairManifests so that the result returns counts instead of full paths for duplicated, recovered, and removed files. This helps avoid extremely large lists in the results when huge portions of the table's manifests have been impacted, for instance when users have S3 retention policies or partitioned paths that are deleted from storage. In most cases, I think it's safer to return stats based on the count of operations than to inspect each file. The updated result still includes the new set of manifest lists, so users can inspect the changes if needed.
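For anyone following along, a hypothetical end-to-end invocation might look like the sketch below once an engine-side implementation lands. The repairManifests(table) entrypoint is assumed by analogy with existing actions such as rewriteManifests(table); only repairFileEntries() and the count getters come from the interface discussed in this PR.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RepairManifests; // package assumed, matching other actions
import org.apache.iceberg.spark.actions.SparkActions;

class RepairManifestsUsageSketch {
  static void repair(Table table) {
    RepairManifests.Result result =
        SparkActions.get()
            .repairManifests(table) // assumed entrypoint, not defined in this PR
            .repairFileEntries() // opt in to duplicate/missing-file handling (from this PR)
            .execute();

    System.out.printf(
        "duplicates removed: %d, missing removed: %d, missing recovered: %d%n",
        result.duplicateFilesRemoved(),
        result.missingFilesRemoved(),
        result.missingFilesRecovered());
  }
}
```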