-
Notifications
You must be signed in to change notification settings - Fork 593
HDDS-6235. Empty KeyValueContainers are replicated without chunks directory. #3052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-6235. Empty KeyValueContainers are replicated without chunks directory. #3052
Conversation
* master: (27 commits) HDDS-6206. Application errors must not flood system log (apache#3001) HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992) HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788) HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005) HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029) HDDS-6227. Test helpers should observe naming conditions (apache#3020) HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013) HDDS-6192. feature/Observability.md translated to Chinese (apache#2994) HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014) HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007) HDDS-6177. Extend container info command to include replica details (apache#2995) HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987) HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015) HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011) HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000) HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983) HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989) HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972) HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998) HDDS-6163. Fix PATH in docker image (apache#2967) ...
.../src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java
Show resolved
Hide resolved
hanishakoneru
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @errose28 for working on fixing this.
Wouldn't marking a container Unhealthy, if it's metadata scan fails, be too aggressive?
This container would never be replicated and if for some reason this replica is also lost, it could lead to data loss, right? If the container is still recoverable, I think we shouldn't mark it as unhealthy. It would be safer to replicate the container and try fixing it later.
Let me know your thoughts.
nandakumar131
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @errose28 for working on this.
* master: (43 commits) HDDS-6212. SCM Container DB bootstrap on Recon startup for secure cluster (apache#3027) HDDS-6234. Repair containers affected by incorrect used bytes and block count. (apache#3042) HDDS-6262. ozone insight log stops working after OM DBUpdates message (apache#3044) HDDS-6290. operational-state and node-state options in datanode list CLI not working correctly (apache#3105) HDDS-6314. ConcurrentModificationException getting SCMContainerMetrics (apache#3101) HDDS-6284. Add BlockDeletingService worker size config (apache#3056) HDDS-6324. Do not trigger CI by reopening PR (apache#3092) HDDS-6283. Change ContainerStateMachine ContainerOpExecutor name (apache#3055) HDDS-6331. Remove toString in debug log parameters within SCMCommonPlacementPolicy (apache#3098) HDDS-6330. Remove unnecessary duplicate semicolons (apache#3097) HDDS-6305: Add metrics - number of FSO bucket creates (apache#3077) HDDS-6311. Fix number of keys displayed in Recon Overview. (apache#3081) HDDS-6325. Fix interface ClientProtocol methods typo setThreadLocalS3Auth and clearThreadLocalS3Auth (apache#3093) HDDS-6322. Fix Recon getting inccorrect sequenceNumber from OM (apache#3090) HDDS-5913. Avoid integer overflow when setting dfs.container.ratis.lo… (apache#2785) HDDS-6313. Remove replicas in ContainerStateMap when a container is deleted (apache#3086) HDDS-6186. Selective checks: skip integration check for unit test changes (apache#3061) HDDS-6310. Update json-smart to 2.4.7. (apache#3080) HDDS-6190. Cleanup unnecessary datanode id path checks. (apache#2993) HDDS-6304. Add enforcer to make sure ozone.version equals project.version (apache#3075) ...
|
Thanks for reviews @nandakumar131 @hanishakoneru @guihecheng. After some offline discussion we decided to remove the metadata scan on import/export and the unhealthy mark if there's an issue with the container on export. I filed HDDS-6344 for gentler handling of incorrect but recoverable containers, which could be used to flag container content errors on import/export. |
|
After this patch, missing chunks directories will be recreated in |
|
LGTM. +1 pending CI. |
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @errose28 for the patch, LGTM. I only have two minor suggestions.
Also, would you mind adding a unit test in TestTarContainerPacker?
...-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainer.java
Show resolved
Hide resolved
...ner-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/TarContainerPacker.java
Outdated
Show resolved
Hide resolved
|
Thanks for the review @adoroszlai. I was initially confused by your test comments, but then realized the merge had deleted my unit test, so that was a good catch. I restored the test so hopefully that part makes more sense now. |
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @errose28 for updating the patch.
...ner-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/TarContainerPacker.java
Outdated
Show resolved
Hide resolved
|
@errose28 Thanks for the patch, our cluster does have the same issue while doing replication containers. One thing I noticed while running the unit test is even without the changes in TarContainerPacker, the test also passed. Does that mean the key point to this issue is the creation of the CHUNK_PATH in |
| archiveOutput.putArchiveEntry(entry); | ||
|
|
||
| // Add files in the directory. | ||
| try (Stream<Path> dirEntries = Files.list(dir)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error is thrown here, so it still needs the CHUNK file to exist?
Reproduced by adding new File(data.getChunksPath()).delete(); after checking the state of original container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@symious the error is thrown because the chunk directory does not exist. Not having chunk files in the directory should not be a problem here.
KeyValueContainerUtil change fixes existing containers with missing chunks directory. This could have occurred when a container with empty chunks directory was replicated before. The change in TarContainerPacker is needed so that when containers with empty chunks directories are replicated, the replica container get the chunks directory even though it does not have any contents. |
|
Thanks @errose28 for fixing this and @nandakumar131, @adoroszlai , @avijayanhwx and @symious for reviews. |
|
Hi @symious following up on your questions:
There's two parts to this PR: preventing the issue and fixing existing containers. I think this comment might answer the question.
I cannot repro this. Running master from 68c5ac5 (before this patch was merged), I copy/pasted in the changes to TestKeyValueContainer and both new tests failed. Can you please check again? |
|
@hanishakoneru Thanks for the explaination. @errose28 I created a branch with the changes in TarContainerPacker removed here https://github.com/symious/ozone/tree/HDDS-6235, the unit test passed, could you help to check? |
|
I think the process sequence are as follows:
So even without the changes in TarContainerPacker, the unit test still passed. |
What changes were proposed in this pull request?
What is the link to the Apache JIRA
HDDS-6235
How was this patch tested?
ozone admin container create