-
Notifications
You must be signed in to change notification settings - Fork 3k
API: Define RepairManifests action interface #10784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| package org.apache.iceberg.actions; | ||
|
|
||
| import org.apache.iceberg.ManifestFile; | ||
|
|
||
| /** An action that will repair manifests. Implementations should produce a new set of manifests. */ | ||
| public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> { | ||
|
|
||
| /** Configuration method for repairing manifest entry statistics */ | ||
| RepairManifests repairEntryStats(); | ||
|
|
||
| /** | ||
| * Configuration method for removing duplicate file entries and removing files which no longer | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Typo? |
||
| * exist in storage | ||
| */ | ||
| RepairManifests repairFileEntries(); | ||
|
Comment on lines
+27
to
+33
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Based on the previous discussion @szehon-ho @RussellSpitzer the API now has 2 configuration methods for repairing file entries and reparing entry stats. The first one encompasses both the duplicate + missing case (I don't think there's a need to differentiate those options)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good, for completness I wonder if we need to dinstinguish what to repair.. ie, file_size_in_bytes vs metrics ,etc. Can just read the filesystem metadata and save from reading the footer, for example. Probably it is too fine grained, and this option is fine?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like if we're going through the process of hitting the file (to check size or read footer), we should just repair everything. I can't see a case where you would want to rewrite the information but possibly carryover incorrect information. |
||
|
|
||
| /** | ||
| * Configuration option for determining the rewritten and added manifests without actually | ||
| * committing the operation to the table | ||
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RepairManifests dryRun(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example,
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer this inherits I think you're thinking of the other The purpose of the Actions API is to enable engines to provide distributed/other custom implementations of procedures, and I think here we could just start by having it as an action and if there's a desire to add it to the typical API we could do it at that time?
Good point, I missed that this is what orphan file removal does. If I think through it a bit more: by defining dryRun() at the interface level, what we're saying is implementations must have a way of supporting that (the implementation could ignore the option technically but that would be a bad implementation of the interface). If we look at what orphan file does, the action interface has a "deleteWith" API and the procedure will pass in a no-op deletion function. What we could do is have an action API "commitRepairWith()" which will accept some sort of commit function to perform the manifest rewrite or if the procedure is a dry-run, some no-op commit. I think the real fix is to return Result as part of the dryRun...let me double check if this is feasible.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, I think it's good as it is @nastra. If a user wants to get the results, in the current API they'd do: |
||
|
|
||
| interface Result { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The one thing I feel is missing here is stats on how many entires were fixed for stats. Do we just assume everything was corrected? It's unclear from a dry run result if running the operation would result in any change if there are no duplicate/missing/recovered files, but you do need to run for missing stats. Unfortunately, I would expect that fixed stats could be a very large number of files, which wouldn't work with the iterable model for the other categories of issues.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, I don't think we should assume everything was corrected since it involves a footer read, some computation, and in all that flow things can go wrong in that for an individual file but not for another. I've added two result methods for this: I think it's important to distinguish between how many were repaired and how many were incorrect to begin with. WIth only the repair count, if it's 0 a user wouldn't be able to distinguish "was everything actually OK with the stats" vs "Hey a bunch of entry stats are incorrect but we couldn't actually do the repair".
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've also removed "added manifests", because as I was working through the implementation a bit more, I came to the conclusion that it's not needed. I think it was just a copy from the rewrite manifests results but |
||
| /** Returns rewritten manifests. */ | ||
| Iterable<ManifestFile> rewrittenManifests(); | ||
|
|
||
| /** Returns the duplicate file paths removed */ | ||
| Iterable<String> duplicateFilesRemoved(); | ||
|
|
||
| /** Returns the paths of the missing files which were removed */ | ||
| Iterable<String> missingFilesRemoved(); | ||
|
|
||
| /** Returns the paths of the missing files which were recovered */ | ||
| Iterable<String> missingFilesRecovered(); | ||
|
|
||
| /** Returns the number of manifest entries for which stats were incorrect */ | ||
| long entryStatsIncorrectCount(); | ||
|
|
||
| /** Returns the number of manifest entries for which stats were corrected */ | ||
| long entryStatsRepairedCount(); | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this maybe extend
Actionlike the other actions?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
SnapshotUpdateis the correct one to extend, here's my thinking:1.) SnapshotUpdate already extends Action
2.) I think implementations necessarily will produce snapshots when they perform repairs (putting aside the dry run API option). The new snapshot that would be produced, at least in my mind in the reference implementation would be essentially a "rewrite manifests"-like snapshot.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah nvm, this seems to be correct. I thought it's extending
org.apache.iceberg.SnapshotUpdatewhile it's actually extendingorg.apache.iceberg.actions.SnapshotUpdate