Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -70,4 +70,10 @@ default RewritePositionDeleteFiles rewritePositionDeletes(Table table) {
throw new UnsupportedOperationException(
this.getClass().getName() + " does not implement rewritePositionDeletes");
}

/** Instantiates an action to repair manifests */
default RepairManifests repairManifests(Table table) {
throw new UnsupportedOperationException(
this.getClass().getName() + " does not implement repairManifests");
}
}
62 changes: 62 additions & 0 deletions api/src/main/java/org/apache/iceberg/actions/RepairManifests.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.actions;

import org.apache.iceberg.ManifestFile;

/** An action that will repair manifests. Implementations should produce a new set of manifests. */
public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this maybe extend Action like the other actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SnapshotUpdate is the correct one to extend, here's my thinking:

1.) SnapshotUpdate already extends Action
2.) I think implementations necessarily will produce snapshots when they perform repairs (putting aside the dry run API option). The new snapshot that would be produced, at least in my mind in the reference implementation would be essentially a "rewrite manifests"-like snapshot.

Copy link
Contributor

@nastra nastra Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah nvm, this seems to be correct. I thought it's extending org.apache.iceberg.SnapshotUpdate while it's actually extending org.apache.iceberg.actions.SnapshotUpdate


/** Configuration method for repairing manifest entry statistics */
RepairManifests repairEntryStats();

/**
* Configuration method for removing duplicate file entries and removing files which no longer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

* exist in storage
*/
RepairManifests repairFileEntries();
Comment on lines +27 to +33
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the previous discussion @szehon-ho @RussellSpitzer the API now has 2 configuration methods for repairing file entries and reparing entry stats. The first one encompasses both the duplicate + missing case (I don't think there's a need to differentiate those options)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, for completness I wonder if we need to dinstinguish what to repair.. ie, file_size_in_bytes vs metrics ,etc. Can just read the filesystem metadata and save from reading the footer, for example. Probably it is too fine grained, and this option is fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like if we're going through the process of hitting the file (to check size or read footer), we should just repair everything. I can't see a case where you would want to rewrite the information but possibly carryover incorrect information.


/**
* Configuration option for determining the rewritten and added manifests without actually
* committing the operation to the table
*
* @return this for method chaining
*/
RepairManifests dryRun();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option part of the procedure, but not as part of the DeleteOrphanFiles API

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?

@RussellSpitzer this inherits SnapshotUpdate in the Actions API, there's no apply() in that.

I think you're thinking of the other SnapshotUpdate that's part of the "normal" API. This brings up a good point of should this also have an interface there or not. RewriteManifests is defined in the regular API and then there's also an action for that.

The purpose of the Actions API is to enable engines to provide distributed/other custom implementations of procedures, and I think here we could just start by having it as an action and if there's a desire to add it to the typical API we could do it at that time?

should this be rather part of the actual Spark procedure rather than in the API? As I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option part of the procedure, but not as part of the DeleteOrphanFiles API

Good point, I missed that this is what orphan file removal does. If I think through it a bit more: by defining dryRun() at the interface level, what we're saying is implementations must have a way of supporting that (the implementation could ignore the option technically but that would be a bad implementation of the interface).

If we look at what orphan file does, the action interface has a "deleteWith" API and the procedure will pass in a no-op deletion function.

What we could do is have an action API "commitRepairWith()" which will accept some sort of commit function to perform the manifest rewrite or if the procedure is a dry-run, some no-op commit.
It feels like it complicates the API more to do that.

I think the real fix is to return Result as part of the dryRun...let me double check if this is feasible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's good as it is @nastra. If a user wants to get the results, in the current API they'd do:

RepairManifests.Result result = actionsProvider.repairManifests(table).repairFileEntries().repairEntryStats().dryRun().execute();

// Do whatever with result


interface Result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one thing I feel is missing here is stats on how many entires were fixed for stats. Do we just assume everything was corrected? It's unclear from a dry run result if running the operation would result in any change if there are no duplicate/missing/recovered files, but you do need to run for missing stats.

Unfortunately, I would expect that fixed stats could be a very large number of files, which wouldn't work with the iterable model for the other categories of issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I don't think we should assume everything was corrected since it involves a footer read, some computation, and in all that flow things can go wrong in that for an individual file but not for another.

I've added two result methods for this:

    /** Returns the number of manifest entries for which stats were incorrect */
    long entryStatsIncorrect();

    /** Returns the number of manifest entries for which stats were corrected */
    long entryStatsRepairedCount();

I think it's important to distinguish between how many were repaired and how many were incorrect to begin with. WIth only the repair count, if it's 0 a user wouldn't be able to distinguish "was everything actually OK with the stats" vs "Hey a bunch of entry stats are incorrect but we couldn't actually do the repair".

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also removed "added manifests", because as I was working through the implementation a bit more, I came to the conclusion that it's not needed. I think it was just a copy from the rewrite manifests results but
my thinking is that for repairing we're always working with some existing manifests and producing new manifests with rewritten contents but the "source" is always some existing manifest for repair. For repair there isn't really an "added" manifest in the sense of how the rewrite procedure defines it. cc @danielcweeks @szehon-ho @RussellSpitzer . It's also something we could add later on with a default implementation if we realize we need it.

/** Returns rewritten manifests. */
Iterable<ManifestFile> rewrittenManifests();

/** Returns the duplicate file paths removed */
Iterable<String> duplicateFilesRemoved();

/** Returns the paths of the missing files which were removed */
Iterable<String> missingFilesRemoved();

/** Returns the paths of the missing files which were recovered */
Iterable<String> missingFilesRecovered();

/** Returns the number of manifest entries for which stats were incorrect */
long entryStatsIncorrectCount();

/** Returns the number of manifest entries for which stats were corrected */
long entryStatsRepairedCount();
}
}