
@ChenSammi
Contributor

What changes were proposed in this pull request?

Implement file recovery support in OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9638

How was this patch tested?

New unit tests and existing tests.

@ChenSammi ChenSammi force-pushed the HDDS-9638 branch 2 times, most recently from fffceae to f474013 on December 21, 2023 09:32
@jojochuang
Contributor

There's a compilation error that needs attention.

Contributor

@jojochuang jojochuang left a comment

I'm halfway through the review.

RecoverLeaseResponse recoverLeaseResponse =
handleError(submitRequest(omRequest)).getRecoverLeaseResponse();
return recoverLeaseResponse.getResponse();
ArrayList list = new ArrayList();
Contributor

ArrayList has no type parameter
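
For illustration, a minimal sketch of the parameterized form this comment asks for. `KeyInfoStub` is a hypothetical stand-in for `OmKeyInfo`, used only to keep the example self-contained; it is not the real class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for OmKeyInfo; it only carries a name.
final class KeyInfoStub {
    final String name;

    KeyInfoStub(String name) {
        this.name = name;
    }
}

class TypedListExample {
    // Declaring the element type lets the compiler reject wrong element
    // types and removes the need for casts at the call site.
    static List<KeyInfoStub> buildResult(KeyInfoStub keyInfo, KeyInfoStub openKeyInfo) {
        List<KeyInfoStub> list = new ArrayList<>();
        list.add(keyInfo);
        list.add(openKeyInfo);
        return list;
    }

    public static void main(String[] args) {
        List<KeyInfoStub> result =
            buildResult(new KeyInfoStub("keyInfo"), new KeyInfoStub("openKeyInfo"));
        // No cast is needed to read a field of an element.
        System.out.println(result.get(0).name + ", " + result.get(1).name);
    }
}
```

With the raw `ArrayList`, the same read would require a cast from `Object`, and a wrongly typed element would only fail at run time.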

return recoverLeaseResponse.getResponse();
ArrayList list = new ArrayList();
list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getKeyInfo()));
list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getOpenKeyInfo()));
Contributor

Looks like the second element is not used by callers.

Contributor Author

@jojochuang, openKeyInfo is returned for the client. Consider one case: a file has been hsynced, a new block is allocated to it, the client writes some data to this new block, and then crashes before calling hsync for the data on this new block. In that case openKeyInfo will have one more block than keyInfo. If we want to recover the length of that last new block, we need openKeyInfo. If we only recover up to the last block on which hsync was ever called, keyInfo is enough. The question is, what does the user expect? Is recovering only up to the last hsynced block what the user expects?

Thread.sleep(1000);
}
// The lease should have been recovered.
assertTrue("File should be closed", fs.recoverLease(file));
Contributor

Does recoverLease() throw an exception if the file is already closed? If so, it would be a deviation from HDFS.

Contributor Author

Yes, the second call to recoverLease will fail if the first has succeeded: the file is already committed, so the checks on the OM side will fail. Should we keep the same behavior as HDFS? I remember @szetszwo mentioned a case where, after an HBase region server fails, two new HBase region servers start on the same server. If the second region server calls recoverLease and gets a successful result, there could be two region servers started and running on the same server. I'm not sure whether HBase has other checks that prevent the second region server from starting, so that the two never run together.
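
As a sketch of how a caller could stay independent of this difference, the loop below checks `isFileClosed` before each `recoverLease` call, so it never relies on what a repeated `recoverLease` does on an already closed file. `LeaseRecoverableFs`, `recoverAndWait`, and the retry parameters are hypothetical names for this example, not part of the patch.

```java
import java.io.IOException;

class RecoverLeaseRetryExample {
    // Hypothetical minimal view of a file system that supports lease recovery.
    interface LeaseRecoverableFs {
        boolean recoverLease(String path) throws IOException;

        boolean isFileClosed(String path) throws IOException;
    }

    // Returns true once the file is closed, false if the attempts run out.
    static boolean recoverAndWait(LeaseRecoverableFs fs, String path,
                                  int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Check first, so recoverLease is never called on a closed file.
            if (fs.isFileClosed(path)) {
                return true;
            }
            if (fs.recoverLease(path)) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return fs.isFileClosed(path);
    }
}
```

Whether a second `recoverLease` should throw or return true is still the open question in this thread; the sketch only shows that a caller can avoid depending on the answer.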

}

message RecoverLeaseResponse {
optional bool response = 1;
Contributor

This is potentially incompatible. But given that the feature is disabled by default, I agree this is acceptable.

return value != null ? value.getCacheValue() : null;
}

@NotNull
Contributor

This is moved from OMKeyCommitRequest.

Contributor Author

@ChenSammi ChenSammi Jan 5, 2024

This can move back. It was moved in the first version of the implementation, where the key commit was called from OMRecoverLeaseRequest. Later I found that the key commit in OMRecoverLeaseRequest failed to handle many things, such as the bucket quota check, the bucket used-bytes update, and the release of preallocated but unused blocks. All of these are already handled in OMKeyCommitRequestWithFSO, so using OMKeyCommitRequestWithFSO for the final key commit is better than committing in OMRecoverLeaseRequest. That is the current version.

Contributor

@ashishkumar50 ashishkumar50 left a comment

@ChenSammi Thanks for the patch. Please find a few comments inline.

throws IOException, InterruptedException;

boolean recoverLease(String pathStr) throws IOException;
List<OmKeyInfo> recoverFilePrepare(String pathStr) throws IOException;
Contributor

Do we need to return a list here? I think only the openKey info is required.

Contributor Author

openKeyInfo has all the allocated blocks. If the file has multiple preallocated blocks, we don't know how many of them are actually used, because each block in openKeyInfo carries only the allocated block length, not the real data length. keyInfo has the real data length for each block, but it does not include blocks that are preallocated but not yet used.
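
To make the reasoning concrete, here is a minimal sketch of how a client could combine the two block lists. Blocks that appear in keyInfo take the data length recorded there; blocks that appear only in openKeyInfo are marked as having an unknown length. `Block`, `PlanEntry`, and `plan` are hypothetical simplifications for this example, and how an unknown length would then be determined is outside the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BlockLengthMergeExample {
    // Hypothetical simplified block: an ID plus a length in bytes.
    static final class Block {
        final long id;
        final long length;

        Block(long id, long length) {
            this.id = id;
            this.length = length;
        }
    }

    // Hypothetical plan entry: either a known data length or a recovery flag.
    static final class PlanEntry {
        final long blockId;
        final long knownLength; // -1 when the real data length is unknown
        final boolean needsLengthRecovery;

        PlanEntry(long blockId, long knownLength, boolean needsLengthRecovery) {
            this.blockId = blockId;
            this.knownLength = knownLength;
            this.needsLengthRecovery = needsLengthRecovery;
        }
    }

    static List<PlanEntry> plan(List<Block> keyInfoBlocks, List<Block> openKeyInfoBlocks) {
        // Real data lengths, as recorded in keyInfo.
        Map<Long, Long> recordedLengths = new HashMap<>();
        for (Block b : keyInfoBlocks) {
            recordedLengths.put(b.id, b.length);
        }
        List<PlanEntry> plan = new ArrayList<>();
        // openKeyInfo lists every allocated block, used or not.
        for (Block b : openKeyInfoBlocks) {
            Long recorded = recordedLengths.get(b.id);
            if (recorded != null) {
                plan.add(new PlanEntry(b.id, recorded, false));
            } else {
                // Only the allocated length is known; the data length is not.
                plan.add(new PlanEntry(b.id, -1L, true));
            }
        }
        return plan;
    }
}
```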


message RecoverLeaseResponse {
optional bool response = 1;
optional KeyInfo keyInfo = 1;
Contributor

Is keyInfo required, or would openKeyInfo alone suffice here?

Contributor Author

Please refer to my answer above to Wei-Chiu's comment.

if (isHSync) {
boolean isHSync = commitKeyRequest.hasHsync() && commitKeyRequest.getHsync();
boolean isRecovery = commitKeyRequest.hasRecovery() && commitKeyRequest.getRecovery();
boolean realCommit = (!isHSync) || (isHSync && isRecovery);
Contributor

Can we decouple hsync and recovery? We can use just the recovery flag to determine that the request is for a commit.

Contributor Author

The current design only recovers hsynced files, which is why both the hsync and recovery flags are checked.
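
As a side note on the `realCommit` expression quoted in this thread, the sketch below checks all four flag combinations and shows that `(!isHSync) || (isHSync && isRecovery)` is logically the same as `!isHSync || isRecovery`. The class and method names are hypothetical and exist only for this example.

```java
class RealCommitExample {
    // The expression as quoted in the review snippet.
    static boolean realCommitOriginal(boolean isHSync, boolean isRecovery) {
        return (!isHSync) || (isHSync && isRecovery);
    }

    // Simplified form: the inner isHSync test is redundant, because the
    // right-hand side is only evaluated when isHSync is already true.
    static boolean realCommitSimplified(boolean isHSync, boolean isRecovery) {
        return !isHSync || isRecovery;
    }

    // Exhaustively compares the two forms over every flag combination.
    static boolean formsAgree() {
        boolean[] values = {false, true};
        for (boolean isHSync : values) {
            for (boolean isRecovery : values) {
                if (realCommitOriginal(isHSync, isRecovery)
                        != realCommitSimplified(isHSync, isRecovery)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("forms agree: " + formsAgree());
    }
}
```

The only combination that is not a real commit is an hsync request without the recovery flag.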

Contributor

I see it's changed now.

parentId);
OMRequestTestUtils.addDirKeyToDirTable(true, omDirInfo,
volumeName, bucketName, txnID, omMetaMgr);
volumeName, bucketName, ++txnID, omMetaMgr);
Contributor

Is this change (++txnID) required?

Contributor Author

From the transaction ID point of view, the ID should be different for every update operation. Reusing the same ID here doesn't cause any problem, since the test doesn't check it. But if new tests later start checking the ID, they will have problems.

OzoneObj.ResourceType.class)))
.thenReturn("user");
InetSocketAddress address = new InetSocketAddress("localhost", 10000);
when(ozoneManager.getOmRpcServerAddr()).thenReturn(address);
Contributor

The ACL and address setup above seems to be repeated in every new test method. Maybe we can move it to init or another shared method.

Contributor Author

They are actually not needed, so I removed them.

private String doWork(OzoneManager ozoneManager, long transactionLogIndex)
throws IOException {

private RecoverLeaseResponse doWork(OzoneManager ozoneManager,
Contributor

I think it would be better to enforce a minimum time gap between recoverLease and the latest hsync call. Otherwise recoverLease can be called immediately after an hsync call.

Contributor Author

Since the patch is already quite big, support for soft and hard limits is not implemented in this patch. We can file a new JIRA for that.

Contributor

Yes, makes sense.
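
For reference, a minimal sketch of what such a time-gap restriction might look like. This is not implemented in this patch; `recoveryAllowed` and the `minGap` parameter are hypothetical names, and the timestamp of the last hsync is assumed to be available to the caller.

```java
import java.time.Duration;

class RecoveryGapExample {
    // Hypothetical check: recovery is allowed only when at least minGap
    // has elapsed since the last hsync on the file.
    static boolean recoveryAllowed(long lastHsyncMillis, long nowMillis, Duration minGap) {
        return nowMillis - lastHsyncMillis >= minGap.toMillis();
    }

    public static void main(String[] args) {
        long lastHsync = System.currentTimeMillis();
        // Immediately after an hsync, a one-minute gap has not elapsed yet.
        System.out.println(recoveryAllowed(lastHsync, lastHsync, Duration.ofMinutes(1)));
    }
}
```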

"Calls of isFileClosed()"),
INVOCATION_RECOVER_LEASE("op_recover_lease",
"Calls of recoverLease()"),
INVOCATION_COMMIT("op_commit", "Calls of commit()"),
Contributor

Should it be called INVOCATION_RECOVER_FILE instead?

Contributor Author

Right.

@ashishkumar50
Contributor

@ChenSammi, thanks for updating the patch. Overall the patch LGTM.

Contributor

@jojochuang jojochuang left a comment

+1 let's merge and proceed.

@jojochuang jojochuang merged commit c45449c into apache:HDDS-7593 Jan 9, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jan 17, 2024
Comment on lines +218 to +220
// Add to cache.
omMetadataManager.getOpenKeyTable(getBucketLayout()).addCacheEntry(
new CacheKey<>(dbOpenFileKey), CacheValue.get(transactionLogIndex, openKeyInfo));
Contributor

@smengcl smengcl Jan 22, 2024

i.e.

Suggested change
// Add to cache.
omMetadataManager.getOpenKeyTable(getBucketLayout()).addCacheEntry(
new CacheKey<>(dbOpenFileKey), CacheValue.get(transactionLogIndex, openKeyInfo));
// Add to cache.
omMetadataManager.getOpenKeyTable(getBucketLayout()).addCacheEntry(
dbOpenFileKey, openKeyInfo, transactionLogIndex);
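
The suggested three-argument call reads as a convenience overload over the two-argument form. Below is a minimal sketch of how such an overload can delegate. `CacheKey`, `CacheValue`, `Table`, and `InMemoryTable` here are simplified stand-ins written for this example, not the real Ozone classes.

```java
import java.util.HashMap;
import java.util.Map;

class CacheEntryOverloadExample {
    // Simplified stand-in for a cache key wrapper.
    static final class CacheKey<K> {
        final K key;

        CacheKey(K key) {
            this.key = key;
        }
    }

    // Simplified stand-in for a cache value tagged with a transaction index.
    static final class CacheValue<V> {
        final long epoch;
        final V value;

        private CacheValue(long epoch, V value) {
            this.epoch = epoch;
            this.value = value;
        }

        static <V> CacheValue<V> get(long epoch, V value) {
            return new CacheValue<>(epoch, value);
        }
    }

    interface Table<K, V> {
        void addCacheEntry(CacheKey<K> key, CacheValue<V> value);

        // Convenience overload: wraps the raw key and value for the caller.
        default void addCacheEntry(K key, V value, long epoch) {
            addCacheEntry(new CacheKey<>(key), CacheValue.get(epoch, value));
        }
    }

    // Toy in-memory table used only to exercise the overload.
    static final class InMemoryTable<K, V> implements Table<K, V> {
        final Map<K, CacheValue<V>> cache = new HashMap<>();

        @Override
        public void addCacheEntry(CacheKey<K> key, CacheValue<V> value) {
            cache.put(key.key, value);
        }
    }
}
```

The shorter call keeps the wrapping in one place, so call sites cannot mix up the argument order of the value factory.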

@jojochuang jojochuang added the hbase HBase on Ozone support label Jan 23, 2024
chungen0126 pushed a commit to chungen0126/ozone that referenced this pull request May 3, 2024