@ChenSammi
Contributor

What changes were proposed in this pull request?

Implement the client-side function for file lease recovery.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10044

How was this patch tested?

New unit tests.

@ChenSammi ChenSammi force-pushed the HDDS-10044 branch 2 times, most recently from 9e0b98e to 087ff8d on January 11, 2024 04:09

// TODO: query DN to get the final block length

OmKeyInfo keyInfo = infoList.get(0);
// finalize the final block and get block length

Contributor

It looks like we can extract this into a method so it can be reused by OzoneFileSystem.recoverLease().


@Override
public long finalizeBlock(OmKeyLocationInfo block) throws IOException {
  incrementCounter(Statistic.INVOCATION_FINALIZE_BLOCK, 1);

Contributor

This method appears to be exactly the same as BasicOzoneClientAdapterImpl.finalizeBlock().

Contributor Author

Yes, most of the function implementations in BasicOzoneClientAdapterImpl and BasicRootedOzoneClientAdapterImpl are the same; one serves BasicOzoneFileSystem, the other BasicRootedOzoneFileSystem.

Contributor Author

@ChenSammi ChenSammi Jan 16, 2024

There is a chance that BasicOzoneClientAdapterImpl and BasicRootedOzoneClientAdapterImpl can be refactored as a whole to remove the duplicated code.
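
One possible shape for such a refactoring is a shared base class that holds the common logic once, with each adapter supplying only what differs. The sketch below is illustrative only: `SharedAdapterBase`, `BucketScopedAdapter`, `RootedAdapter`, and the fake `doFinalizeBlock` body are hypothetical names and stand-ins, not existing Ozone classes or calls.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical types for illustration; none of them exist in Ozone.
abstract class SharedAdapterBase {
  private final AtomicLong finalizeBlockInvocations = new AtomicLong();

  // Logic common to both adapters is written once here.
  final long finalizeBlock(String blockId) {
    finalizeBlockInvocations.incrementAndGet();
    return doFinalizeBlock(blockId);
  }

  long getFinalizeBlockInvocations() {
    return finalizeBlockInvocations.get();
  }

  // Stand-in for the datanode call; returns a fake length (assumption).
  protected long doFinalizeBlock(String blockId) {
    return blockId.length();
  }

  // Hook for whatever really differs between the two adapters.
  abstract String describe();
}

final class BucketScopedAdapter extends SharedAdapterBase {
  @Override
  String describe() {
    return "bucket-scoped";
  }
}

final class RootedAdapter extends SharedAdapterBase {
  @Override
  String describe() {
    return "rooted";
  }
}
```

With this layout a fix to the shared method applies to both adapters, at the cost of introducing an inheritance relationship that the current classes may not have.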

if (recoverLeaseResponse.hasKeyInfo()) {
  list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getKeyInfo()));
} else if (recoverLeaseResponse.hasOpenKeyInfo()) {
  list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getOpenKeyInfo()));

Contributor

Instead of a list we can just use OmKeyInfo, since the caller only uses get(0).
Also, the caller may not be able to distinguish whether the returned keyInfo comes from the open key table or the key table. Instead of a list, can we add a class containing the open key/key info, so that this ambiguity will not arise in the future?
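
A minimal sketch of what such a holder class could look like follows. It assumes nothing about the class actually adopted in the patch: `LeaseRecoveryResult` and its `Source` enum are hypothetical names, and the type parameter `K` stands in for OmKeyInfo so the example compiles on its own.

```java
import java.util.Objects;

// Hypothetical holder; not part of the Ozone API.
final class LeaseRecoveryResult<K> {
  // Records which OM table the key info was read from.
  enum Source { KEY_TABLE, OPEN_KEY_TABLE }

  private final K keyInfo;
  private final Source source;

  private LeaseRecoveryResult(K keyInfo, Source source) {
    this.keyInfo = Objects.requireNonNull(keyInfo, "keyInfo");
    this.source = Objects.requireNonNull(source, "source");
  }

  static <K> LeaseRecoveryResult<K> fromKeyTable(K keyInfo) {
    return new LeaseRecoveryResult<>(keyInfo, Source.KEY_TABLE);
  }

  static <K> LeaseRecoveryResult<K> fromOpenKeyTable(K keyInfo) {
    return new LeaseRecoveryResult<>(keyInfo, Source.OPEN_KEY_TABLE);
  }

  K getKeyInfo() {
    return keyInfo;
  }

  Source getSource() {
    return source;
  }

  boolean isOpenKey() {
    return source == Source.OPEN_KEY_TABLE;
  }
}
```

The caller then reads a single value instead of calling get(0) on a list, and can branch on `isOpenKey()` when the two cases need different handling.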

Pipeline.Builder builder = Pipeline.newBuilder().setReplicationConfig(newConfig).setId(PipelineID.randomId())
    .setNodes(block.getPipeline().getNodes()).setState(Pipeline.PipelineState.OPEN);
try {
  client = xceiverClientFactory.acquireClient(builder.build());

Contributor

Should this use acquireClientForReadData instead of acquireClient?

Contributor Author

Right, acquireClientForReadData is better.

FileStatus fileStatus = fs.getFileStatus(file);
assertEquals(dataSize, fileStatus.getLen());
// make sure the writer can not write again.
// TODO: write does not fail here. Looks like a bug. HDDS-8439 to fix it.

Contributor

TODO: resolve HDDS-8439

return null;
}

@Nullable

Contributor

@Nullable is not required.

} catch (Throwable e) {
}
cluster.getOzoneManager().restart();
cluster.waitForClusterToBeReady();

Contributor

Can we verify that recovery works correctly after the OM restart?

OzoneTestUtils.closeContainer(scm, container);
GenericTestUtils.waitFor(() -> {
  try {
    return scm.getPipelineManager().getPipeline(container.getPipelineID()).isClosed();

Contributor

Do we need to wait here to check whether the pipeline is CLOSED? I see that other test cases do not check it. closeContainer() waits for the container to go into the CLOSED state.

Contributor Author

The container is closed before the pipeline is closed, so there is a chance the pipeline is still OPEN when the container is already closed. The explicit check here makes sure the pipeline is closed too.
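
The explicit check amounts to polling a condition until it holds or a timeout expires. The helper below is a standalone sketch of that pattern, not the GenericTestUtils implementation: `PollingWait.waitFor` is a hypothetical name, and returning `false` on timeout (instead of throwing) is a simplification chosen for this example.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper illustrating the wait-until-closed pattern.
final class PollingWait {
  private PollingWait() { }

  /**
   * Polls the condition every checkEveryMillis until it returns true
   * or timeoutMillis has elapsed. Returns true if the condition held.
   */
  static boolean waitFor(BooleanSupplier condition, long checkEveryMillis, long timeoutMillis) {
    final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(checkEveryMillis);
      } catch (InterruptedException e) {
        // Preserve the interrupt status and stop waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In a test, the condition would be the pipeline state lookup, so the assertions that follow only run once both the container and the pipeline have reached the closed state.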

    return BlockData.getFromProtoBuf(finalizeBlockResponseProto.getBlockData()).getSize();
  }
} catch (IOException e) {
  LOG.warn("Failed to execute finalizeBlock command", e);

Contributor

Currently, for every exception we proceed to get the block length from the DN. There may be cases where the container is still not CLOSED.
I think we should get the block length from the DN only when the container replica is CLOSED.

Contributor Author

Checking for a CLOSED replica was my initial idea too. Later I found a more concise way: we can rely on this getCommittedBlockLength call regardless of the replica state, given that Ratis is used to update the replica data.

The implementation of getCommittedBlockLength compares the bcsid of the block involved with the replica's bcsid. If the replica's bcsid is no less than the block's bcsid, then the block info in this replica is a consensus result of Raft and is trustworthy.
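
The rule described above can be written as a single comparison. The sketch below is a simplified model of that rule only, not the datanode's getCommittedBlockLength code: `CommittedLengthSketch`, `ReplicaView`, and `firstTrustedLength` are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.OptionalLong;

// Simplified model of the bcsid rule; all names here are hypothetical.
final class CommittedLengthSketch {
  private CommittedLengthSketch() { }

  // What one replica reports: its bcsid and the length it holds for the block.
  static final class ReplicaView {
    final long replicaBcsId;
    final long blockLength;

    ReplicaView(long replicaBcsId, long blockLength) {
      this.replicaBcsId = replicaBcsId;
      this.blockLength = blockLength;
    }
  }

  // A replica is trusted when its bcsid is no less than the block's bcsid.
  static boolean isTrustworthy(long replicaBcsId, long blockBcsId) {
    return replicaBcsId >= blockBcsId;
  }

  // Returns the length from the first trusted replica, if there is one.
  static OptionalLong firstTrustedLength(List<ReplicaView> replicas, long blockBcsId) {
    for (ReplicaView replica : replicas) {
      if (isTrustworthy(replica.replicaBcsId, blockBcsId)) {
        return OptionalLong.of(replica.blockLength);
      }
    }
    return OptionalLong.empty();
  }
}
```

Under this model a replica that lags behind the block's bcsid is skipped whatever its container state, which matches the argument that the state check is unnecessary.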

Contributor

@jojochuang jojochuang left a comment

LGTM.

On a side note, it looks like we missed verifying the modification time after recovery in the test code. I'll open a Jira to investigate that.

The test failure looks unrelated. Let me retrigger it.

Contributor

@ashishkumar50 ashishkumar50 left a comment

LGTM +1

@ChenSammi ChenSammi merged commit 04b6aa5 into apache:HDDS-7593 Jan 19, 2024
@ChenSammi
Contributor Author

Thanks @jojochuang and @ashishkumar50 for the review.

@jojochuang jojochuang added the hbase HBase on Ozone support label Jan 23, 2024
chungen0126 pushed a commit to chungen0126/ozone that referenced this pull request May 3, 2024