Skip to content

Conversation

@errose28
Copy link
Contributor

@errose28 errose28 commented Feb 7, 2022

What changes were proposed in this pull request?

  • Include empty chunks directory in the tarball when replicating a container.
  • Recreate directory entries when unpacking the tarball.
  • On datanode startup, create an empty chunks directory if no chunks directory is present.
    • Repairs existing containers now that the bug is fixed.

What is the link to the Apache JIRA

HDDS-6235

How was this patch tested?

  • Unit test added, which passed only after the fix was applied.
  • Manually tested chunks directory creation on startup in docker:
    1. Create empty container with ozone admin container create
    2. Delete its chunks directory.
    3. Restart the datanode.
    4. Check that empty chunks directory is created.

* master: (27 commits)
  HDDS-6206. Application errors must not flood system log (apache#3001)
  HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992)
  HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788)
  HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005)
  HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029)
  HDDS-6227. Test helpers should observe naming conditions (apache#3020)
  HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013)
  HDDS-6192. feature/Observability.md translated to Chinese (apache#2994)
  HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014)
  HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007)
  HDDS-6177. Extend container info command to include replica details  (apache#2995)
  HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987)
  HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015)
  HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011)
  HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000)
  HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983)
  HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989)
  HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972)
  HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998)
  HDDS-6163. Fix PATH in docker image (apache#2967)
  ...
Copy link
Contributor

@hanishakoneru hanishakoneru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @errose28 for working on fixing this.
Wouldn't marking a container Unhealthy, if it's metadata scan fails, be too aggressive?
This container would never be replicated and if for some reason this replica is also lost, it could lead to data loss, right? If the container is still recoverable, I think we shouldn't mark it as unhealthy. It would be safer to replicate the container and try fixing it later.
Let me know your thoughts.

Copy link
Contributor

@nandakumar131 nandakumar131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @errose28 for working on this.

* master: (43 commits)
  HDDS-6212. SCM Container DB bootstrap on Recon startup for secure cluster (apache#3027)
  HDDS-6234. Repair containers affected by incorrect used bytes and block count. (apache#3042)
  HDDS-6262. ozone insight log stops working after OM DBUpdates message (apache#3044)
  HDDS-6290. operational-state and node-state options in datanode list CLI not working correctly (apache#3105)
  HDDS-6314. ConcurrentModificationException getting SCMContainerMetrics (apache#3101)
  HDDS-6284. Add BlockDeletingService worker size config (apache#3056)
  HDDS-6324. Do not trigger CI by reopening PR (apache#3092)
  HDDS-6283. Change ContainerStateMachine ContainerOpExecutor name (apache#3055)
  HDDS-6331. Remove toString in debug log parameters within SCMCommonPlacementPolicy (apache#3098)
  HDDS-6330. Remove unnecessary duplicate semicolons (apache#3097)
  HDDS-6305: Add metrics - number of FSO bucket creates (apache#3077)
  HDDS-6311. Fix number of keys displayed in Recon Overview. (apache#3081)
  HDDS-6325. Fix interface ClientProtocol methods typo setThreadLocalS3Auth and clearThreadLocalS3Auth (apache#3093)
  HDDS-6322. Fix Recon getting inccorrect sequenceNumber from OM (apache#3090)
  HDDS-5913. Avoid integer overflow when setting dfs.container.ratis.lo… (apache#2785)
  HDDS-6313. Remove replicas in ContainerStateMap when a container is deleted (apache#3086)
  HDDS-6186. Selective checks: skip integration check for unit test changes (apache#3061)
  HDDS-6310. Update json-smart to 2.4.7. (apache#3080)
  HDDS-6190. Cleanup unnecessary datanode id path checks. (apache#2993)
  HDDS-6304. Add enforcer to make sure ozone.version equals project.version (apache#3075)
  ...
@errose28
Copy link
Contributor Author

errose28 commented Feb 17, 2022

Thanks for reviews @nandakumar131 @hanishakoneru @guihecheng. After some offline discussion we decided to remove the metadata scan on import/export and the unhealthy mark if there's an issue with the container on export. I filed HDDS-6344 for gentler handling of incorrect but recoverable containers, which could be used to flag container content errors on import/export.

@errose28
Copy link
Contributor Author

After this patch, missing chunks directories will be recreated in KeyValueContainerData#parseKVContainerData. This means the test in the recently merged metadata inspector patch to fix missing chunks directories fails and is not needed. I removed the metadata inspector test, and replaced it with a new unit test that makes sure missing chunks directories are created when KeyValueContainerData#parseKVContainerData is called.

@hanishakoneru
Copy link
Contributor

LGTM. +1 pending CI.
Unit test failure is unrelated. Retriggering CI.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @errose28 for the patch, LGTM. I only have two minor suggestions.

Also, would you mind adding a unit test in TestTarContainerPacker?

@errose28
Copy link
Contributor Author

Thanks for the review @adoroszlai. I was initially confused by your test comments, but then realized the merge had deleted my unit test, so that was a good catch. I restored the test so hopefully that part makes more sense now.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @errose28 for updating the patch.

@symious
Copy link
Contributor

symious commented Feb 24, 2022

@errose28 Thanks for the patch, our cluster does have the same issue while doing replication containers.

One thing I noticed while running the unit test is even without the changes in TarContainerPacker, the test also passed. Does that mean the key point to this issue is the creation of the CHUNK_PATH in KeyValueContainerUtil?

archiveOutput.putArchiveEntry(entry);

// Add files in the directory.
try (Stream<Path> dirEntries = Files.list(dir)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error is thrown here, so it still needs the CHUNK file to exist?

Reproduced by adding new File(data.getChunksPath()).delete(); after checking the state of original container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious the error is thrown because the chunk directory does not exist. Not having chunk files in the directory should not be a problem here.

@hanishakoneru
Copy link
Contributor

One thing I noticed while running the unit test is even without the changes in TarContainerPacker, the test also passed. Does that mean the key point to this issue is the creation of the CHUNK_PATH in KeyValueContainerUtil?

KeyValueContainerUtil change fixes existing containers with missing chunks directory. This could have occurred when a container with empty chunks directory was replicated before. The change in TarContainerPacker is needed so that when containers with empty chunks directories are replicated, the replica container get the chunks directory even though it does not have any contents.

@hanishakoneru
Copy link
Contributor

Thanks @errose28 for fixing this and @nandakumar131, @adoroszlai , @avijayanhwx and @symious for reviews.
Merging the patch with 4 +1s.

@hanishakoneru hanishakoneru merged commit a53a3a3 into apache:master Feb 24, 2022
@errose28
Copy link
Contributor Author

Hi @symious following up on your questions:

Does that mean the key point to this issue is the creation of the CHUNK_PATH in KeyValueContainerUtil?

There's two parts to this PR: preventing the issue and fixing existing containers. I think this comment might answer the question.

One thing I noticed while running the unit test is even without the changes in TarContainerPacker, the test also passed

I cannot repro this. Running master from 68c5ac5 (before this patch was merged), I copy/pasted in the changes to TestKeyValueContainer and both new tests failed. Can you please check again?

@symious
Copy link
Contributor

symious commented Feb 24, 2022

@hanishakoneru Thanks for the explaination.

@errose28 I created a branch with the changes in TarContainerPacker removed here https://github.com/symious/ozone/tree/HDDS-6235, the unit test passed, could you help to check?

@symious
Copy link
Contributor

symious commented Feb 24, 2022

I think the process sequence are as follows:

  1. container exported to FileOutputStream
  2. peer use importContainerData to import the contaienr.
  3. TarContainerPacker unpack the container from the FileInputStream. After the unpack, the directory of "chunks" shouldn't be there.
  4. But in KeyValueContainerUtil.parseKVContainerData(containerData, config); since the new logic was added that if the chunk directory doesn't exists, we create it.

So even without the changes in TarContainerPacker, the unit test still passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants