Skip to content

Conversation

@duongkame
Copy link
Contributor

@duongkame duongkame commented Apr 6, 2023

What changes were proposed in this pull request?

Integrate secret keys to SCM snapshot. The taken approach is during snapshot installation, the follower makes a separate call to get secrets from SCM leader.
Protocol description of a new node joining an existing SCM (ratis) ring:

  1. The new node connects to the current leader, and the leader decides if the new node needs to download a snapshot.
  2. The new node download the snapshot from the leader, also make a separate call to download (latest) secret keys. These states are apply to the new node state.
  3. The new node then joins the ring.
  4. The new node receives ratis logs (after the index of the downloaded snapshot). The logs may contain secret keys sync. If it does, the latest secret keys will overwrite the current secret keys state.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7945

How was this patch tested?

Integration test & CI run.

@duongkame duongkame changed the base branch from master to HDDS-7733-Symmetric-Tokens April 6, 2023 19:20
@duongkame duongkame marked this pull request as ready for review April 6, 2023 22:35
@kerneltime
Copy link
Contributor

kerneltime commented May 1, 2023

I will go over the integration tests, but is there a need for the follower to fetch secrets twice? Once before the snapshot install and once after to cover any window where the leader created a secret that did not get reflected on the follower?

…ozone/scm/TestSecretKeySnapshot.java

Co-authored-by: Ritesh H Shukla <[email protected]>
@kerneltime
Copy link
Contributor

Can you write down the sequence of events that is expected to happen to check if the code is meeting the intended design.

@duongkame
Copy link
Contributor Author

duongkame commented May 1, 2023

I will go over the integration tests, but is there a need for the follower to fetch secrets twice? Once before the snapshot install and once after to cover any window where the leader created a secret that did not get reflected on the follower?

I've updated the PR description to include the step. Basically, before the snapshot is installed, the new follower doesn't fetch secret keys from the leader. After the snapshot is installed, any new secret keys update may be synchronized again ( from leader to follower) if there a new rotation happens since the snapshot index. This is fine and expected.

@duongkame
Copy link
Contributor Author

I will go over the integration tests, but is there a need for the follower to fetch secrets twice? Once before the snapshot install and once after to cover any window where the leader created a secret that did not get reflected on the follower?

I've updated the PR description to include the step. Basically, before the snapshot is installed, the new follower doesn't fetch secret keys from the leader.

@kerneltime
Copy link
Contributor

The overall code looks fine, I will go over the integration test a few more times.

// Wait & retry for follower to update transactions to leader
// snapshot index.
// Timeout error if follower does not load update within 3s
GenericTestUtils.waitFor(() ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a race but it would be nice if we could validate that Ratis did not have any transactions on start and no keys and then post-sync caught up.

Copy link
Contributor Author

@duongkame duongkame May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a race and the test will fail if the time it takes for the follower to start >= secret key rotation duration (which is now at 30s for this test).
It'd probably not fail on my laptop, but it will a lot in the CI env, in which hardware is much slower.

We can increase the rotation duration to increase the chance of success, e.g. to 90s, but that will slow the test down. I'm not sure if we wanna do that.

@kerneltime
Copy link
Contributor

Minor nits and some additional tests. LGTM.
Once it is rebased and addressed we can merge it in.

@kerneltime kerneltime merged this pull request into apache:HDDS-7733-Symmetric-Tokens May 9, 2023
duongkame added a commit to duongkame/ozone that referenced this pull request May 10, 2023
duongkame added a commit to duongkame/ozone that referenced this pull request Jun 8, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jun 10, 2023
* master: (73 commits)
  HDDS-8587. Test that CertificateClient can store multiple rootCA certificates (apache#4852)
  HDDS-8801. ReplicationManager: Add metric to count how often replication is throttled (apache#4864)
  HDDS-8477. Unit test for Snapdiff using tombstone entries (apache#4678)
  HDDS-7507. [Snapshot] Implement List Snapshot API Pagination (apache#4065) (apache#4861)
  HDDS-8373. Document that setquota doesn't accept decimals (apache#4856)
  HDDS-8779. Recon - Expose flag for enable/disable of heatmap. (apache#4845)
  HDDS-8677. Ozone admin OM CLI command for block tokens (apache#4760)
  HDDS-8164. Authorize secret key APIs (apache#4597)
  HDDS-7945. Integrate secret keys to SCM snapshot (apache#4549)
  HDDS-8003. E2E integration test cases for block tokens (apache#4547)
  HDDS-7831. Use symmetric secret key to sign and verify token (apache#4417)
  HDDS-7830. SCM API for OM and Datanode to get secret keys (apache#4345)
  HDDS-7734. Implement symmetric SecretKeys lifescycle management in SCM (apache#4194)
  HDDS-8679. Add dedicated, configurable thread pool for OM gRPC server (apache#4771)
  HDDS-8790. Split EC acceptance tests (apache#4855)
  HDDS-8714. TestScmHAFinalization: mark testFinalizationWithRestart as flaky, enable other test cases
  HDDS-8787. Reduce ozone sh calls in robot tests (apache#4854)
  HDDS-8774. Log allocation stack trace for leaked CodecBuffer (apache#4840)
  HDDS-8729. Add metric for count of blocks pending deletion on datanode (apache#4800)
  HDDS-8780. Leak of ManagedChannel in HASecurityUtils (apache#4850)
  ...
errose28 added a commit to errose28/ozone that referenced this pull request Jun 10, 2023
* tmp-dir-refactor: (73 commits)
  HDDS-8587. Test that CertificateClient can store multiple rootCA certificates (apache#4852)
  HDDS-8801. ReplicationManager: Add metric to count how often replication is throttled (apache#4864)
  HDDS-8477. Unit test for Snapdiff using tombstone entries (apache#4678)
  HDDS-7507. [Snapshot] Implement List Snapshot API Pagination (apache#4065) (apache#4861)
  HDDS-8373. Document that setquota doesn't accept decimals (apache#4856)
  HDDS-8779. Recon - Expose flag for enable/disable of heatmap. (apache#4845)
  HDDS-8677. Ozone admin OM CLI command for block tokens (apache#4760)
  HDDS-8164. Authorize secret key APIs (apache#4597)
  HDDS-7945. Integrate secret keys to SCM snapshot (apache#4549)
  HDDS-8003. E2E integration test cases for block tokens (apache#4547)
  HDDS-7831. Use symmetric secret key to sign and verify token (apache#4417)
  HDDS-7830. SCM API for OM and Datanode to get secret keys (apache#4345)
  HDDS-7734. Implement symmetric SecretKeys lifescycle management in SCM (apache#4194)
  HDDS-8679. Add dedicated, configurable thread pool for OM gRPC server (apache#4771)
  HDDS-8790. Split EC acceptance tests (apache#4855)
  HDDS-8714. TestScmHAFinalization: mark testFinalizationWithRestart as flaky, enable other test cases
  HDDS-8787. Reduce ozone sh calls in robot tests (apache#4854)
  HDDS-8774. Log allocation stack trace for leaked CodecBuffer (apache#4840)
  HDDS-8729. Add metric for count of blocks pending deletion on datanode (apache#4800)
  HDDS-8780. Leak of ManagedChannel in HASecurityUtils (apache#4850)
  ...
@duongkame duongkame deleted the HDDS-7945 branch April 12, 2025 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants