Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

This change adds a SupportsRecoveryOperations mixin for FileIO which offers a boolean recoverFile(String path) API. The purpose of this interface is to enable FileIO implementations to attempt a best effort recovery of a given file in case of corruption cases. This could be used as a primitive in repair functions such as #10445.

This change also implements this mixin for S3FileIO which will get object versions for the specific object and take the latest non-deleted version. If one is not found, the recoverFile API returns false since this mixin is meant for FileIO to perform a best effort recovery, not neccessarily guarantee recovery since that depends on a lot of factors of the underlying storage configuration.

@RussellSpitzer
Copy link
Member

I think usually we would separate out the Implementation from the API change here. Mostly I just want to make sure we have a good description for what this API should attempt to do, what it shouldn't do, etc.

I'll re-check when the javadoc is added

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jul 17, 2024

Agreed, @RussellSpitzer I'll separate out the interface and implementation PRs since that's whats typically done. I'll update this PR to just be the interface PR and have the implementation PR just be dependent on this one, in case someone wants to look at the implementation in parallel while folks talk about the interface.

@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin branch from 3d59a66 to 4ff289e Compare July 18, 2024 00:15
@github-actions github-actions bot removed the AWS label Jul 18, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin branch from 4ff289e to 3cc099b Compare July 18, 2024 00:17
public interface SupportsRecoveryOperations {

/**
* Perform a best-effort recovery of a file at a given path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How recovery is performed is left up to implementations but I think the one thing this interface is prescriptive on is that the expectation that it is best effort. It's not guaranteed that just because FileIO implementations implement the mixin that they will recover the file no matter what (e.g. there's a limit to versioning etc); as a result the interface doesn't surface an exception, rather just a boolean indicating success or not.

I guess we may want to qualify what it means for a file to be recovered (it may be obvious but since it's an interface, it's better to be clear):

It means that assuming no concurrent deletes of a file at that path, fileIO.inputFile(path).exists() must be true after this method call.

If that definition makes sense to folks: I can update @RussellSpitzer @danielcweeks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other thought I had:

We could offer another method which uplevels the manifestFile and attempts to recover the manifest and if that's successful, attempt recovery of the data files. I discarded this case since someone could just traverse the metadata tree on their own and just use recoverFile(String path). Also in that API, probably would need a response that's higher fidelity (the simplest is Map<String, Boolean>) than just a boolean since maybe a caller wants to take action if the manifest was recovered but data files weren't etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to add to the notes that implementations should do their best not to throw any runtime exceptions and that this method is essentially assumed to be a no-op on false or something like that.

I think we also need to describe how this method will be used. For example, "When reading a file if any exception is thrown this method will be called in an attempt to remedy the exception."

Although if we use that description should we also passthrough a Throwable?

A bigger question I have is whether or not this is needed as part of the general API or if it should be part of a Repair action that we talked about before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting point @RussellSpitzer, I think initially we were planning to use this in something like the proposed repair manifests procedure. I would be a little concerned about putting it in the normal read path because it could hide more serious issues if we're trying to repair table state on the fly.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on the comment that we should note that impls should do their best to avoid throwing runtime exceptions.

I think we also need to describe how this method will be used. For example, "When reading a file if any exception is thrown this method will be called in an attempt to remedy the exception."

I actually don't think this should be recommended for the read path like @danielcweeks said performing recovery's on the fly probably will mask that the table was corrupted in the first place; I feel a user should always determine that their table is corrupted and then explicitly opt-in to recover the file by running the repair procedure. This procedure would indicate "Hey file A was missing but I recovered it". The user can then reason about better how their dataset may change after the recovery.

A bigger question I have is whether or not this is needed as part of the general API or if it should be part of a Repair action that we talked about before.

On the repair action, I'm not sure functions could technically reside specifically because the challenge is the actual implementations will look different for different FileIOs which is what is driving defining an interface.

To me even if the interface itself is used primarily in the context of repair actions, there's enough merit to generalizing the interface in the public API because different FileIO's will have different ways of recovering a given file so we probably want to standardize the method for that.

@amogh-jahagirdar amogh-jahagirdar changed the title API, AWS: Add SupportsRecoveryOperations mixin for FileIO and implement for S3FileIO API: Add SupportsRecoveryOperations mixin for FileIO Jul 18, 2024
Comment on lines +24 to +25
* reachable files missing from disk. (e.g. a live manifest points to data file entry which no
* longer exists on disk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example you mention (manifest points to non existing data files) isn't specific to a FileIO implementation, but each FileIO implementation would have to implement it. Such a scenario can be handled in places that know how to deal with manifests, manifest-lists, data-files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue isn't specific to the FileIO, but the remediation is specific to the FileIO and each requires a specific implementation.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snazy basically what Dan said, for example S3FileIO would have it's own implementation for remediation by going through s3 object versions, same principle for GCS/Azure/HDFS. The example provided is providing an example from the perspective of a user of the interfcace, like a repair function.

For example, with this interface:

List<String> missingDataFiles = determineMissingDataFiles(manifest);
successfullyRecoveredFiles = new AtomicInteger(0);
if (fileIO instanceOf SupportsRecoveryOperations) {
missingDataFiles.foreach(missingFile -> {
   if ((SupportsRecoveryOperations)fileIO.recoverFile(missingFile) {
        successfullyRecoveredFiles.getAndIncrement();
   }
}
} 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, can you clarify the javadoc saying that the function's purpose is to do "its best" to recover a single file/object. It can be misinterpreted in a way that one can pass in a manifest-file to repair/recover data files referenced from that manifest-file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that sounds reasonable to me!

* @param path Absolute path of file to attempt recovery for
* @return true if recovery was successful, false otherwise
*/
boolean recoverFile(String path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any examples for scenarios specific to FileIO implementations that those can fix themselves?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any underlying storage that supports some sort of versioning (e.g. S3) or trash concept (HDFS) can provide the ability to restore a file. The interface is necessary because these implementations are somewhat specific to the storage implementation.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/iceberg/pull/10721/files#diff-322d2e2be6cbad87106f05c0456cf953d99f396c71c99a1d8c7bc281022f8b6bR432 for an example with s3
Yes, for others like Gcs/azure, the implementations would need to go through those clients/those providers concepts of versioning. HDFS has a trash directory which someone could attempt recovery from (assuming it wasn't purged) https://stackoverflow.com/questions/51267089/how-to-restore-a-deleted-folder-from-hdfs

@snazy
Copy link
Member

snazy commented Jul 19, 2024

WDYT of adding the function directly to FileIO as a default function:

public interface FileIO extends Serializable, Closeable {
  boolean recoverFile(String path) {
    return false;
  }
}

All (major) FileIO implementations implement all mixins/extensions. Unrelated to this PR, I think it would be easier to have just FileIO with default implementations - otherwise we end up in situations where the actual backend impl (e.g. S3FileIO "looses" functionality, because the wrapper/delegate (think EncryptingFileIO) does not properly implement "all the interfaces". And if the wrapper/delegate has to implement all the interfaces, there's maybe no point in having distinct interfaces?

@danielcweeks
Copy link
Contributor

WDYT of adding the function directly to FileIO as a default function

We prefer to keep the FileIO as simple as possible and use various mixin interfaces for additional behavior (e.g. SupportsPrefixOperations, SupportsBulkOperations, etc.). These capabilities aren't necessary for iceberg to function, but does provide the ability for adding utilities to support maintaining the tables.

* @param path Absolute path of file to attempt recovery for
* @return true if recovery was successful, false otherwise
*/
boolean recoverFile(String path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on having the signature be
recoverFile(String path, Throwable exeception)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific use case you had in mind here? In practice, I've always found that either you have the versioned object to recover or you don't. There may be some cases where there's a permission issue or a reason that the operation can't be invoked, but that tends to be rare.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in mind, just feels like it would be better to add it now than later. I was thinking the easiest thing would be being able to fail fast if the error is something the IO knows it can't recover from. If no one uses it then it shouldn't be a huge expense either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the intent is to use this in a normal read path, more like a procedure (the rational I cover in an earlier comment). Since this is a "best effort" operation, I feel it would be adding unnecessary complexity that we can address later if there's a need (e.g. we can always extend with default implementations in the future).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking in generalities since I am not familiar with the planned implementations so I'm ok with that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think this interface generalizes in the sense that it provides a common method for recovering a file that different FileIOs can implement.

Now does it generalize beyond the Repair action itself? I can see the interface being used as a primitive if someone doesn't want to do the full repair action. For instance if a user already knows a single corrupted manifest with entries of data files some of which may/may not be on disk, then they can just write their own logic to recover it via this interface.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jul 22, 2024

I'll leave this up for a bit, as I carry forward #10445 maybe it'll be helpful to have more of that implementation in place before we merge this interface? Maybe that'll give us some more confidence on the utility of the interface cc @RussellSpitzer @danielcweeks @rdblue @snazy

@amogh-jahagirdar
Copy link
Contributor Author

Since there's utility in this mixin beyond the repair manifests implementation (someone can make use of this mixin in their own code/FileIO implementation for recovering individual files), and it seems as though there's consensus, I will go ahead and merge.

@amogh-jahagirdar amogh-jahagirdar merged commit e9364fa into apache:main Aug 5, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants