Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.io;

/**
* This interface is intended as an extension for FileIO implementations to provide additional
* best-effort recovery operations that can be useful for repairing corrupted tables where there are
* reachable files missing from disk. (e.g. a live manifest points to data file entry which no
* longer exists on disk)
Comment on lines +24 to +25
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example you mention (manifest points to non existing data files) isn't specific to a FileIO implementation, but each FileIO implementation would have to implement it. Such a scenario can be handled in places that know how to deal with manifests, manifest-lists, data-files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue isn't specific to the FileIO, but the remediation is specific to the FileIO and each requires a specific implementation.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snazy basically what Dan said, for example S3FileIO would have it's own implementation for remediation by going through s3 object versions, same principle for GCS/Azure/HDFS. The example provided is providing an example from the perspective of a user of the interfcace, like a repair function.

For example, with this interface:

List<String> missingDataFiles = determineMissingDataFiles(manifest);
successfullyRecoveredFiles = new AtomicInteger(0);
if (fileIO instanceOf SupportsRecoveryOperations) {
missingDataFiles.foreach(missingFile -> {
   if ((SupportsRecoveryOperations)fileIO.recoverFile(missingFile) {
        successfullyRecoveredFiles.getAndIncrement();
   }
}
} 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, can you clarify the javadoc saying that the function's purpose is to do "its best" to recover a single file/object. It can be misinterpreted in a way that one can pass in a manifest-file to repair/recover data files referenced from that manifest-file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that sounds reasonable to me!

*/
public interface SupportsRecoveryOperations {

/**
* Perform a best-effort recovery of a file at a given path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How recovery is performed is left up to implementations but I think the one thing this interface is prescriptive on is that the expectation that it is best effort. It's not guaranteed that just because FileIO implementations implement the mixin that they will recover the file no matter what (e.g. there's a limit to versioning etc); as a result the interface doesn't surface an exception, rather just a boolean indicating success or not.

I guess we may want to qualify what it means for a file to be recovered (it may be obvious but since it's an interface, it's better to be clear):

It means that assuming no concurrent deletes of a file at that path, fileIO.inputFile(path).exists() must be true after this method call.

If that definition makes sense to folks: I can update @RussellSpitzer @danielcweeks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other thought I had:

We could offer another method which uplevels the manifestFile and attempts to recover the manifest and if that's successful, attempt recovery of the data files. I discarded this case since someone could just traverse the metadata tree on their own and just use recoverFile(String path). Also in that API, probably would need a response that's higher fidelity (the simplest is Map<String, Boolean>) than just a boolean since maybe a caller wants to take action if the manifest was recovered but data files weren't etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to add to the notes that implementations should do their best not to throw any runtime exceptions and that this method is essentially assumed to be a no-op on false or something like that.

I think we also need to describe how this method will be used. For example, "When reading a file if any exception is thrown this method will be called in an attempt to remedy the exception."

Although if we use that description should we also passthrough a Throwable?

A bigger question I have is whether or not this is needed as part of the general API or if it should be part of a Repair action that we talked about before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting point @RussellSpitzer, I think initially we were planning to use this in something like the proposed repair manifests procedure. I would be a little concerned about putting it in the normal read path because it could hide more serious issues if we're trying to repair table state on the fly.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on the comment that we should note that impls should do their best to avoid throwing runtime exceptions.

I think we also need to describe how this method will be used. For example, "When reading a file if any exception is thrown this method will be called in an attempt to remedy the exception."

I actually don't think this should be recommended for the read path like @danielcweeks said performing recovery's on the fly probably will mask that the table was corrupted in the first place; I feel a user should always determine that their table is corrupted and then explicitly opt-in to recover the file by running the repair procedure. This procedure would indicate "Hey file A was missing but I recovered it". The user can then reason about better how their dataset may change after the recovery.

A bigger question I have is whether or not this is needed as part of the general API or if it should be part of a Repair action that we talked about before.

On the repair action, I'm not sure functions could technically reside specifically because the challenge is the actual implementations will look different for different FileIOs which is what is driving defining an interface.

To me even if the interface itself is used primarily in the context of repair actions, there's enough merit to generalizing the interface in the public API because different FileIO's will have different ways of recovering a given file so we probably want to standardize the method for that.

*
* @param path Absolute path of file to attempt recovery for
* @return true if recovery was successful, false otherwise
*/
boolean recoverFile(String path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any examples for scenarios specific to FileIO implementations that those can fix themselves?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any underlying storage that supports some sort of versioning (e.g. S3) or trash concept (HDFS) can provide the ability to restore a file. The interface is necessary because these implementations are somewhat specific to the storage implementation.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/iceberg/pull/10721/files#diff-322d2e2be6cbad87106f05c0456cf953d99f396c71c99a1d8c7bc281022f8b6bR432 for an example with s3
Yes, for others like Gcs/azure, the implementations would need to go through those clients/those providers concepts of versioning. HDFS has a trash directory which someone could attempt recovery from (assuming it wasn't purged) https://stackoverflow.com/questions/51267089/how-to-restore-a-deleted-folder-from-hdfs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on having the signature be
recoverFile(String path, Throwable exeception)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific use case you had in mind here? In practice, I've always found that either you have the versioned object to recover or you don't. There may be some cases where there's a permission issue or a reason that the operation can't be invoked, but that tends to be rare.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in mind, just feels like it would be better to add it now than later. I was thinking the easiest thing would be being able to fail fast if the error is something the IO knows it can't recover from. If no one uses it then it shouldn't be a huge expense either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the intent is to use this in a normal read path, more like a procedure (the rational I cover in an earlier comment). Since this is a "best effort" operation, I feel it would be adding unnecessary complexity that we can address later if there's a need (e.g. we can always extend with default implementations in the future).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking in generalities since I am not familiar with the planned implementations so I'm ok with that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think this interface generalizes in the sense that it provides a common method for recovering a file that different FileIOs can implement.

Now does it generalize beyond the Repair action itself? I can see the interface being used as a primitive if someone doesn't want to do the full repair action. For instance if a user already knows a single corrupted manifest with entries of data files some of which may/may not be on disk, then they can just write their own logic to recover it via this interface.

}