-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13234. Expired secret key can abort leader OM startup. #8601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Change-Id: I0643c8fa8cbdf0b4be37afcfc04b7893e0c02d9e
Change-Id: I962b2ca950779533bd4f69fa47158f59a7f1ae29
| byte[] password; | ||
| if (StringUtils.isNotEmpty(identifier.getSecretKeyId())) { | ||
| ManagedSecretKey signKey = secretKeyClient.getSecretKey(UUID.fromString(identifier.getSecretKeyId())); | ||
| if (signKey == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jojochuang , thanks for working on this improvements. For these token whose sign key is not available, we'd better put them in a map/list, and delete them at then end of loadTokenSecretState.
BTW, do you have idea how long a secret key will be deleted from SCM after it's creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Secret key default life time is 1 day.
The background thread checks for validity every 10 minutes so a secret key can be removed at most 1 day + 10 minutes after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A delegation token associated with expired secret key does not mean the dt expired.
Secret key life time is 1 day. Delegation token life time is 1 day but can be renewed for up to 7 days.
Also, what if SCM is unreachable? Exception doesn't imply the secret key is invalid and does not imply dt is invalid.
I'd suggest to let the OM background thread (OzoneDelegationTokenSecretManager) to take care of expired dt later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: secret life time is 7 days. hdds.secret.key.expiry.duration
there's a property hdds.secret.key.rotate.duration which let SCM create a new secret key every 1 day. (SCM will keep a few secret keys in memory)
The SCM background checks for expired secret key every 10 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible a dt is created near the secret key rotation duration (1 day), in which case the dt could stay valid much longer than the secret key.
Say a secret key is created at June 1st, 00:00am.
A dt is created using this secret key at June 1st 23:59pm.
The secret key would expire at June 8th, 00:00am and removed from SCM memory by June 8th, 00:10am.
The dt would expire at June 8th, 23:59pm and removed by June 9th 00:09am.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to let the OM background thread (OzoneDelegationTokenSecretManager) to take care of expired dt later.
It makes sense. Besides, for those dt whose fetcing secret key failed and also it is expired, those dt will be not put int memory map, so will not cleaned up by background thread, we delete them directly during load, thoughts?
Change-Id: I41410e2b5bc3a47a373475d8f590afa2bcfceff4
|
As suggested by @ChenSammi, handle the second case: secret key is null and that delegation token expires, in which case, delete the dt from rocksdb. |
Change-Id: I6e8ae80c3575c3b0d25167648ec2655c343ccc18
ChenSammi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jojochuang , the last patch LGTM.
|
Merged. Thanks @ChenSammi |
…239-container-reconciliation Commits: 62 da53b5b HDDS-13299. Fix failures related to delete (apache#8665) 8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662) 7329859 HDDS-13023. Container checksum is missing after container import (apache#8459) a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657) f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638) e9c0a45 HDDS-13262. Simplify key name validation (apache#8619) f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647) b574709 HDDS-12924. datanode used space calculation optimization (apache#8365) de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620) 97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639) 9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601) d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648) 6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503) a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645) 9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646) 48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630) 052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559) 90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633) 9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003) ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617) 3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631) 7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575) 5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626) 77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621) ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629) 36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628) d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627) 3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623) 8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625) a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624) 7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600) fd13300 HDDS-10775. Support bucket ownership verification (apache#8558) 3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584) ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554) 38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618) f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590) 0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612) ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616) 5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599) 1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614) 6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606) 4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596) bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607) 6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305) 09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611) 931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605) 3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest 6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573) 752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349) 8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586) ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602) bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571) 0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592) 516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530) 5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598) ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595) e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562) 3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516) f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547) b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532) 203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583) 2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594) Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java
) (cherry picked from commit 9d2b415)
) (cherry picked from commit 9d2b415)
) (cherry picked from commit 9d2b415)
) (cherry picked from commit 9d2b415)
) (cherry picked from commit 9d2b415)
…p. (apache#8601) Change-Id: I6bbaa48dd942908f5cdecc1e3f585aa19c32ae0f
What changes were proposed in this pull request?
HDDS-13234. Expired secret key can abort leader OM startup.
Please describe your PR in detail:
This pull request enhances error handling and introduces a new test case to improve the robustness of the
OzoneDelegationTokenSecretManagerclass. Key changes include adding exception handling during token loading, validating secret keys, and testing scenarios with expired secret keys.Error Handling Enhancements:
loadTokenSecretStateto log and ignore exceptions when loading delegation tokens from the database during startup. This ensures the service continues to start even if some tokens fail to load. (hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/security/OzoneDelegationTokenSecretManager.java, hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/security/OzoneDelegationTokenSecretManager.javaR516-R520)addPersistedDelegationTokento throw anIOExceptionif a secret key is not found for a token, preventing potential issues with invalid tokens. (hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/security/OzoneDelegationTokenSecretManager.java, hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/security/OzoneDelegationTokenSecretManager.javaR535-R538)Testing Improvements:
spy,doReturn,verify, etc.) to facilitate mocking and verification in tests. (hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/security/TestOzoneDelegationTokenSecretManager.java, hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/security/TestOzoneDelegationTokenSecretManager.javaR81-R85)testExpiredSecretKey, to simulate scenarios where a secret key is unavailable (e.g., expired or missing). This ensures the system handles such cases gracefully and validates the secret key retrieval logic. (hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/security/TestOzoneDelegationTokenSecretManager.java, hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/security/TestOzoneDelegationTokenSecretManager.javaR239-R259)What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13234
(Please replace this section with the link to the Apache JIRA)
How was this patch tested?
Unit test
https://github.com/jojochuang/ozone/actions/runs/15570256118