@ivandika3
Contributor

@ivandika3 ivandika3 commented Mar 10, 2023

What changes were proposed in this pull request?

Currently, Ozone Manager (OM) enables raft.server.log.purge.upto.snapshot.index by default.
However, in an OM cluster with a large metadata store, the OM leader may purge its Ratis logs before a slow follower has replicated them. The follower then has to download the whole metadata store from the OM leader, which can be problematic if the leader's metadata store is very large.

We should add two configurations in OM to control the Ratis log purge parameters:

  • raft.server.log.purge.upto.snapshot.index
    • Disabling this guarantees that the OM leader will not purge its Ratis log until the log entries have been replicated to all the followers (tracked through commitIndex).
    • This effectively means that a slow follower should never need to download the full metadata from the leader, so there is no snapshot download by the follower. Note, however, that for small OM metadata it can be faster for a follower to download the leader's metadata snapshot than to replicate and apply the outstanding logs.
    • For a very slow or downed follower, the OM leader cannot purge the log until the follower catches up. This might increase disk space usage on the OM leader.
    • The default would be true to preserve the current OM snapshot behavior.
  • raft.server.log.purge.preservation.log.num
    • RATIS-1626 introduces logic to preserve the latest n log entries, which will not be purged.
    • Setting n > 0 while still enabling raft.server.log.purge.upto.snapshot.index should balance the cost of preserving and transferring logs against the cost of transferring a snapshot.
    • The default would be 0 to preserve the current OM snapshot behavior.
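
As a rough illustration of the two settings listed above, the following minimal Java sketch builds them as plain java.util.Properties entries. Only the two raft.server.log.purge.* property names come from this description; the class name, the helper method, and the example value 100000 are hypothetical, and the sketch does not show how OM actually maps its own configuration keys onto Ratis.

```java
import java.util.Properties;

public class OmRatisPurgeConfigSketch {
    // Ratis property names as quoted in this PR description.
    static final String PURGE_UPTO_SNAPSHOT_INDEX =
        "raft.server.log.purge.upto.snapshot.index";
    static final String PURGE_PRESERVATION_LOG_NUM =
        "raft.server.log.purge.preservation.log.num";

    /**
     * Hypothetical helper (not part of Ozone or Ratis): collects the two
     * purge settings into a Properties object.
     */
    static Properties purgeProperties(boolean purgeUptoSnapshotIndex,
                                      long preservationLogNum) {
        if (preservationLogNum < 0) {
            throw new IllegalArgumentException(
                "preservationLogNum must be >= 0: " + preservationLogNum);
        }
        Properties p = new Properties();
        p.setProperty(PURGE_UPTO_SNAPSHOT_INDEX,
            Boolean.toString(purgeUptoSnapshotIndex));
        p.setProperty(PURGE_PRESERVATION_LOG_NUM,
            Long.toString(preservationLogNum));
        return p;
    }

    public static void main(String[] args) {
        // Proposed defaults: keep the current OM behavior.
        Properties defaults = purgeProperties(true, 0L);
        // Assumed tuning: still purge up to the snapshot index,
        // but keep the latest 100000 entries (illustrative value only).
        Properties tuned = purgeProperties(true, 100_000L);

        for (Properties p : new Properties[] {defaults, tuned}) {
            System.out.println(PURGE_UPTO_SNAPSHOT_INDEX + "="
                + p.getProperty(PURGE_UPTO_SNAPSHOT_INDEX));
            System.out.println(PURGE_PRESERVATION_LOG_NUM + "="
                + p.getProperty(PURGE_PRESERVATION_LOG_NUM));
        }
    }
}
```

Keeping the defaults at true and 0 means that an operator who does not touch these settings sees no change in purge behavior.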

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8131

How was this patch tested?

Should already be covered by the Ratis tests.

@ivandika3
Contributor Author

Hi @Xushaohong @szetszwo could you take a look?

@ivandika3
Contributor Author

ivandika3 commented Mar 10, 2023

Hi @prashantpogde @hemantk-12 @GeorgeJahad I see that there is already an effort to introduce incremental checkpoints in the OM snapshot process. Our cluster is currently encountering an issue in which a slow OM follower has to download a large OM metadata store because the leader's log has been purged. This pull request seeks to work around the issue by allowing raft.server.log.purge.upto.snapshot.index to be disabled, so that the leader only purges the log once the followers have replicated it. Could you take a look?

However, this carries a risk: the OM leader's disk space could fill up quickly if a follower is down or very slow. Therefore, I think the long-term solution would be to integrate incremental checkpoints. May I know the progress of that feature? We might be interested in integrating it into our cluster.
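
To make the trade-off concrete, here is a simplified Java model of how the two settings could bound the index up to which the leader purges. It is a sketch under stated assumptions, not the Ratis implementation: the class, method names, and sample indexes are hypothetical, and it assumes that entries with an index at or below the returned value are removed.

```java
import java.util.stream.LongStream;

public class PurgeIndexModel {
    /**
     * Simplified model (assumption, not Ratis code) of the purge bound.
     * A negative result means nothing can be purged yet.
     */
    static long purgeIndex(boolean purgeUptoSnapshotIndex,
                           long preservationLogNum,
                           long snapshotIndex,
                           long lastLogIndex,
                           long... followerCommitIndexes) {
        long candidate = snapshotIndex;
        if (!purgeUptoSnapshotIndex) {
            // Do not purge past what the slowest follower has committed.
            candidate = LongStream.concat(
                    LongStream.of(snapshotIndex),
                    LongStream.of(followerCommitIndexes))
                .min().getAsLong();
        }
        if (preservationLogNum > 0) {
            // Always keep the latest n entries in the log.
            candidate = Math.min(candidate, lastLogIndex - preservationLogNum);
        }
        return candidate;
    }

    /** A follower whose next needed entry was purged must fetch a snapshot. */
    static boolean needsSnapshot(long followerCommitIndex, long purgeIndex) {
        return followerCommitIndex < purgeIndex;
    }

    public static void main(String[] args) {
        long snapshotIndex = 1000L;
        long lastLogIndex = 1200L;
        long[] followers = {1190L, 400L}; // one healthy, one slow follower

        long current = purgeIndex(true, 0L, snapshotIndex, lastLogIndex, followers);
        long disabled = purgeIndex(false, 0L, snapshotIndex, lastLogIndex, followers);
        long preserved = purgeIndex(true, 500L, snapshotIndex, lastLogIndex, followers);

        System.out.println("purge.upto.snapshot.index=true,  n=0:   " + current);
        System.out.println("purge.upto.snapshot.index=false, n=0:   " + disabled);
        System.out.println("purge.upto.snapshot.index=true,  n=500: " + preserved);
        System.out.println("slow follower needs snapshot (current): "
            + needsSnapshot(400L, current));
        System.out.println("slow follower needs snapshot (disabled): "
            + needsSnapshot(400L, disabled));
    }
}
```

With these sample numbers the model yields purge bounds of 1000, 400, and 700. Disabling the flag keeps every entry the slow follower still needs, at the cost of the leader retaining entries 401 to 1200 until that follower catches up.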

Contributor

@szetszwo szetszwo left a comment

+1 the change looks good.

@szetszwo szetszwo merged commit 5f6547f into apache:master Mar 13, 2023
@ivandika3
Contributor Author

@szetszwo Thank you for the review and merge.

errose28 added a commit to errose28/ozone that referenced this pull request Mar 16, 2023
* master: (262 commits)
  HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391)
  HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357)
  HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402)
  HDDS-8065. Provide GNU long options (apache#4394)
  HDDS-7930. [addendum] input stream does not refresh expired block token.
  HDDS-7930. input stream does not refresh expired block token. (apache#4378)
  HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244)
  HDDS-8076. Use container cache in Key listing API. (apache#4346)
  HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374)
  HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382)
  HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386)
  HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388)
  HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371)
  HDDS-8133. Create ozone sh key checksum command (apache#4375)
  HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379)
  HDDS-8118. Fail container delete on non empty chunks dir (apache#4367)
  HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315)
  HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370)
  HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368)
  close db connection (apache#4366)
  ...
@ivandika3 ivandika3 deleted the HDDS-8131 branch January 28, 2024 05:29
@ivandika3 ivandika3 self-assigned this Apr 23, 2024