HDDS-10044. [hsync] File recovery support in Client #5978

ChenSammi · 2024-01-11T03:19:20Z

What changes were proposed in this pull request?

Implement the client-side function of file lease recovery.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10044

How was this patch tested?

New unit tests.

jojochuang · 2024-01-16T05:47:17Z

hadoop-ozone/ozonefs/src/main/java/org/apache/hadoop/fs/ozone/RootedOzoneFileSystem.java

-    // TODO: query DN to get the final block length
+
    OmKeyInfo keyInfo = infoList.get(0);
+    // finalize the final block and get block length


Looks like we can extract this method so it can be reused by OzoneFileSystem.recoverLease()

jojochuang · 2024-01-16T05:52:52Z

...onefs-common/src/main/java/org/apache/hadoop/fs/ozone/BasicRootedOzoneClientAdapterImpl.java


+  @Override
+  public long finalizeBlock(OmKeyLocationInfo block) throws IOException {
+    incrementCounter(Statistic.INVOCATION_FINALIZE_BLOCK, 1);


This method appears to be exactly the same as BasicOzoneClientAdapterImpl.finalizeBlock().

Yes, most function implementations in BasicOzoneClientAdapterImpl and BasicRootedOzoneClientAdapterImpl are the same; one serves BasicOzoneFileSystem, the other BasicRootedOzoneFileSystem.

There is a chance that BasicOzoneClientAdapterImpl and BasicRootedOzoneClientAdapterImpl as a whole can be refactored to remove the duplicated code.
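
One possible shape for such a refactoring is to hoist the shared logic into a common base class. The following is only a minimal sketch: `FinalizeBlockCall`, `AdapterBaseSketch`, `OzoneAdapterSketch` and `RootedOzoneAdapterSketch` are hypothetical names, and the plain `long` parameters stand in for the real `OmKeyLocationInfo` argument. It is not the actual Ozone adapter code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the datanode round trip that finalizes a block. */
@FunctionalInterface
interface FinalizeBlockCall {
  long finalizeAndGetLength(long containerId, long localId) throws IOException;
}

/** Hypothetical base class: logic shared by both adapters is written once here. */
abstract class AdapterBaseSketch {
  private final FinalizeBlockCall datanode;
  // Plays the role of the INVOCATION_FINALIZE_BLOCK statistic in the diff.
  private final AtomicLong finalizeBlockInvocations = new AtomicLong();

  protected AdapterBaseSketch(FinalizeBlockCall datanode) {
    this.datanode = datanode;
  }

  /** Counts the invocation, then asks the datanode for the finalized length. */
  public final long finalizeBlock(long containerId, long localId)
      throws IOException {
    finalizeBlockInvocations.incrementAndGet();
    return datanode.finalizeAndGetLength(containerId, localId);
  }

  public final long getFinalizeBlockInvocations() {
    return finalizeBlockInvocations.get();
  }
}

/** Assumed counterpart of the non-rooted adapter; inherits finalizeBlock. */
final class OzoneAdapterSketch extends AdapterBaseSketch {
  OzoneAdapterSketch(FinalizeBlockCall datanode) {
    super(datanode);
  }
}

/** Assumed counterpart of the rooted adapter; inherits finalizeBlock. */
final class RootedOzoneAdapterSketch extends AdapterBaseSketch {
  RootedOzoneAdapterSketch(FinalizeBlockCall datanode) {
    super(datanode);
  }
}
```

With this layout, only the methods that genuinely differ between the two file systems would need to be overridden in the subclasses.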

ashishkumar50 · 2024-01-16T05:20:47Z

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

+    if (recoverLeaseResponse.hasKeyInfo()) {
+      list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getKeyInfo()));
+    } else if (recoverLeaseResponse.hasOpenKeyInfo()) {
+      list.add(OmKeyInfo.getFromProtobuf(recoverLeaseResponse.getOpenKeyInfo()));


Instead of a list we can just return an OmKeyInfo, since the caller only uses get(0).
Also, the caller may not be able to distinguish whether the returned keyInfo comes from the open key table or the key table. Instead of a list, can we add a class containing the open key/key info, so that this ambiguity will not arise in the future?
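
The suggestion above could look roughly like the following. This is a sketch under assumptions: `KeyInfoSketch` is a hypothetical stand-in for `OmKeyInfo`, and `RecoverLeaseResultSketch` is an invented name for the proposed holder class, not a type from this PR.

```java
import java.util.Objects;

/** Hypothetical stand-in for OmKeyInfo, reduced to two fields. */
final class KeyInfoSketch {
  private final String keyName;
  private final long dataSize;

  KeyInfoSketch(String keyName, long dataSize) {
    this.keyName = Objects.requireNonNull(keyName, "keyName");
    this.dataSize = dataSize;
  }

  String getKeyName() {
    return keyName;
  }

  long getDataSize() {
    return dataSize;
  }
}

/** Hypothetical result type: a single key info plus the table it came from. */
final class RecoverLeaseResultSketch {
  private final KeyInfoSketch keyInfo;
  private final boolean fromOpenKeyTable;

  private RecoverLeaseResultSketch(KeyInfoSketch keyInfo,
      boolean fromOpenKeyTable) {
    this.keyInfo = Objects.requireNonNull(keyInfo, "keyInfo");
    this.fromOpenKeyTable = fromOpenKeyTable;
  }

  /** Mirrors the hasKeyInfo() branch of the translator. */
  static RecoverLeaseResultSketch fromKeyTable(KeyInfoSketch keyInfo) {
    return new RecoverLeaseResultSketch(keyInfo, false);
  }

  /** Mirrors the hasOpenKeyInfo() branch of the translator. */
  static RecoverLeaseResultSketch fromOpenKeyTable(KeyInfoSketch keyInfo) {
    return new RecoverLeaseResultSketch(keyInfo, true);
  }

  KeyInfoSketch getKeyInfo() {
    return keyInfo;
  }

  boolean isFromOpenKeyTable() {
    return fromOpenKeyTable;
  }
}
```

A caller would then read `result.getKeyInfo()` instead of `infoList.get(0)` and could branch on `isFromOpenKeyTable()` where the origin matters.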

ashishkumar50 · 2024-01-16T06:57:09Z

...onefs-common/src/main/java/org/apache/hadoop/fs/ozone/BasicRootedOzoneClientAdapterImpl.java

+    Pipeline.Builder builder = Pipeline.newBuilder().setReplicationConfig(newConfig).setId(PipelineID.randomId())
+        .setNodes(block.getPipeline().getNodes()).setState(Pipeline.PipelineState.OPEN);
+    try {
+      client = xceiverClientFactory.acquireClient(builder.build());


Should this use acquireClientForReadData instead of acquireClient?

Right, acquireClientForReadData is better.

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/storage/ContainerProtocolCalls.java

jojochuang · 2024-01-16T07:21:20Z

...p-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/debug/TestLeaseRecoverer.java

    FileStatus fileStatus = fs.getFileStatus(file);
    assertEquals(dataSize, fileStatus.getLen());
-    // make sure the writer can not write again.
-    // TODO: write does not fail here. Looks like a bug. HDDS-8439 to fix it.


TODO: resolve HDDS-8439

ashishkumar50 · 2024-01-16T08:50:02Z

...a/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java

-    return null;
+  }
+
+  @Nullable


@Nullable is not required.

ashishkumar50 · 2024-01-16T09:00:04Z

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/TestLeaseRecovery.java

+      } catch (Throwable e) {
+      }
+      cluster.getOzoneManager().restart();
+      cluster.waitForClusterToBeReady();


Can we verify that recovery works fine after the OM restart?

ashishkumar50 · 2024-01-16T09:11:12Z

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/TestLeaseRecovery.java

+      OzoneTestUtils.closeContainer(scm, container);
+      GenericTestUtils.waitFor(() -> {
+        try {
+          return scm.getPipelineManager().getPipeline(container.getPipelineID()).isClosed();


Do we need to wait here to check whether the pipeline is CLOSED? I see that other test cases do not check it. closeContainer() waits for the container to go into the CLOSED state.

The container is closed before the pipeline is closed, so there is a chance the pipeline is still OPEN when the container is closed. The explicit check here makes sure the pipeline is closed too.
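
The explicit wait described above boils down to polling a condition until it holds or a timeout expires. A minimal, self-contained sketch of that pattern follows; `WaitSketch` and its `waitFor` signature are assumptions made for illustration and are not the `GenericTestUtils` helper used in the test.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper illustrating the wait-until-closed pattern. */
final class WaitSketch {
  private WaitSketch() {
  }

  /**
   * Re-evaluates the check every intervalMillis until it returns true,
   * and gives up with a TimeoutException once timeoutMillis has elapsed.
   */
  static void waitFor(BooleanSupplier check, long intervalMillis,
      long timeoutMillis) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (check.getAsBoolean()) {
        return;
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "Condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(intervalMillis);
    }
  }
}
```

In the test, the condition would be the pipeline's closed check, so the test only proceeds once both the container and its pipeline have left the OPEN state.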

ashishkumar50 · 2024-01-16T10:03:07Z

...one/ozonefs-common/src/main/java/org/apache/hadoop/fs/ozone/BasicOzoneClientAdapterImpl.java

+        return BlockData.getFromProtoBuf(finalizeBlockResponseProto.getBlockData()).getSize();
+      }
+    } catch (IOException e) {
+      LOG.warn("Failed to execute finalizeBlock command", e);


Currently, for every exception we proceed to get the block length from the DN. There may be cases where the container is still not CLOSED.
I think we should get the block length from the DN only when the container replica is CLOSED.

Checking for a CLOSED replica was my initial idea too. Later I found a more concise way: we can use the getCommittedBlockLength call regardless of the replica state, given that Ratis is used to update the replica data.

The implementation of getCommittedBlockLength compares the bcsid of the block involved with the replica's bcsid. If the replica's bcsid is no less than the block's bcsid, then the block info in this replica is a Raft consensus result and is trustworthy.
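
The reasoning in this thread can be condensed into a small sketch: try to finalize the block first, fall back to the committed length on failure, and only trust a replica whose bcsid has caught up with the block's. All names below (`LengthCall`, `BlockLengthSketch` and its methods) are hypothetical, and the check is a simplified reading of the description above rather than the datanode implementation.

```java
import java.io.IOException;

/** Hypothetical stand-in for a datanode call that returns a block length. */
@FunctionalInterface
interface LengthCall {
  long get() throws IOException;
}

/** Hypothetical illustration of the block-length resolution discussed above. */
final class BlockLengthSketch {
  private BlockLengthSketch() {
  }

  /** A replica is trusted when its bcsid is no less than the block's bcsid. */
  static boolean isReplicaTrustworthy(long replicaBcsId, long blockBcsId) {
    return replicaBcsId >= blockBcsId;
  }

  /** Returns the replica's length for the block, or fails if it lags behind. */
  static long committedLength(long replicaBcsId, long blockBcsId,
      long replicaBlockLength) throws IOException {
    if (!isReplicaTrustworthy(replicaBcsId, blockBcsId)) {
      throw new IOException("Replica bcsid " + replicaBcsId
          + " is behind block bcsid " + blockBcsId);
    }
    return replicaBlockLength;
  }

  /** Tries the finalize call first; on failure falls back to the committed length. */
  static long resolveLength(LengthCall finalizeBlock,
      LengthCall committedBlockLength) throws IOException {
    try {
      return finalizeBlock.get();
    } catch (IOException e) {
      // Assumed fallback: the replica state is not inspected here because
      // the bcsid comparison already guards the returned length.
      return committedBlockLength.get();
    }
  }
}
```

The point of the design is that the fallback does not need a separate CLOSED-state check, since a replica that has applied the block's commit index reports a length agreed on through Raft.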

jojochuang

LGTM.

On a side note, it looks like we missed verifying the modification time after recovery in the test code. I'll open a JIRA to investigate that.

The test failure looks unrelated. Let me retrigger it.

ashishkumar50

LGTM +1

ChenSammi · 2024-01-19T08:47:52Z

Thanks @jojochuang and @ashishkumar50 for the review.

ChenSammi force-pushed the HDDS-10044 branch 2 times, most recently from 9e0b98e to 087ff8d Compare January 11, 2024 04:09

jojochuang reviewed Jan 16, 2024

ashishkumar50 reviewed Jan 16, 2024

jojochuang reviewed Jan 16, 2024

ashishkumar50 reviewed Jan 16, 2024

jojochuang force-pushed the HDDS-7593 branch from 074b6cc to 1c20d84 Compare January 17, 2024 14:37

HDDS-10044. [hsync] File recovery support in Client

db607c9

ChenSammi force-pushed the HDDS-10044 branch from ba9e0d4 to db607c9 Compare January 18, 2024 07:58

fix failed TestSecureOzoneRpcClient#testFileRecovery

17b87e3

jojochuang approved these changes Jan 19, 2024

ashishkumar50 approved these changes Jan 19, 2024

ChenSammi merged commit 04b6aa5 into apache:HDDS-7593 Jan 19, 2024

jojochuang added the hbase HBase on Ozone support label Jan 23, 2024

chungen0126 pushed a commit to chungen0126/ozone that referenced this pull request May 3, 2024

HDDS-10044. [hsync] File recovery support in Client (apache#5978)

8ebadf8

chungen0126 pushed a commit to chungen0126/ozone that referenced this pull request May 3, 2024

HDDS-10044. [hsync] File recovery support in Client (apache#5978)

3ac5b32
