Conversation

@symious (Contributor) commented Aug 3, 2022

What changes were proposed in this pull request?

The default container-copy directory is "/tmp". When several replication jobs are running, the performance of the disk holding "/tmp" degrades badly, and if the Ratis directory is set on the same disk, the Ratis work will be affected as well.

This ticket adds a configuration to spread the "container-copy" work across different volumes to mitigate this issue.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7083

How was this patch tested?

unit test.

@errose28 (Contributor) commented Aug 3, 2022

Thanks for the proposal, Symious. I agree we should not use /tmp for intermediate datanode operations like container import. I have a few questions.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.
  2. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?
  3. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

FYI, HDDS-6449 is under active development by @neils-dev; it will add a tmp directory to each volume to be used for intermediate container operations. Containers can be moved out of that directory to their final destination with an atomic rename since they reside on the same file system, and artifacts can be periodically cleaned out of the temp directory on restart or after other failures. I think we can implement this after HDDS-6449 so it uses the same directory.

@symious (Contributor, Author) commented Aug 4, 2022

@errose28 Thanks for the review.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.

It's because of the config "hdds.datanode.replication.work.dir": if we enable the spread config by default, users might be confused because the value of "hdds.datanode.replication.work.dir" is not used. If the user actively enables the spread config, we can assume the user already knows that "hdds.datanode.replication.work.dir" won't be used.

  2. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?

It's because of the steps of the import process:

  1. Download the container from other datanodes.
  2. Initialize an InputStream based on the downloaded file; KeyValueContainerHandler will use this InputStream to import the container.

The destination volume is chosen in step 2, so in step 1 we don't know which volume the container will be stored on. We could also pass the download path to the KeyValueContainerHandler (which would require an interface change), but I tested the copy speed of inter-volume versus intra-volume copies and there was essentially no difference, so I left the original handler import process unchanged.
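A rough sketch of those two steps (the helper names below are illustrative only and do not match the real SimpleContainerDownloader / KeyValueContainerHandler signatures):

```java
// Hypothetical sketch of the two-step flow described above.
Path tarball = downloadContainerFromReplicas(containerId, workDir);   // step 1: into the work dir
try (InputStream input = Files.newInputStream(tarball)) {
  importContainerFromStream(containerId, input);                      // step 2: handler picks the volume
}
```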

  3. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

If exceptions happen during the import, the downloaded files are deleted by the handler's implementation.

@ChenSammi (Contributor) commented:

Thanks @symious for improving the replication feature. I agree with @errose28's suggestions. We can create a proper sub-directory under each "hdds.datanode.dir" for downloaded containers, instead of introducing a new "hdds.datanode.replication.work.dir" configuration, and the downloaded container can then be renamed into the data directory on the same volume, avoiding the cost of copying the container between volumes.

Considering that HDDS-6449 has quite a few tasks and its timeline is uncertain, maybe we can proceed with this JIRA first and let HDDS-6449 leverage the sub-directory if possible (HDDS-6449 introduces a background deleting service to periodically delete content under the directory, so it might not be a good idea to share the same directory). What do you think, @errose28?

@symious (Contributor, Author) commented Aug 9, 2022

@ChenSammi Thank you for the review.
Updated the PR to enable the feature by default.

@errose28 (Contributor) commented Aug 9, 2022

Thanks for the input @ChenSammi. The subtasks in HDDS-6449 were actually created before we had design discussions. There won't need to be a background service, which greatly simplifies the task, and it could probably be done in one Jira. But I agree that progress on that task is uncertain and we have a proposal to fix the import issue now, so let's go ahead and introduce the new directory here.

Some thoughts on introducing the new directory:

  • I think we should have one tmp working directory per volume, with different subdirectories for different tasks as you mentioned. Container delete, import, and create would be examples, although we only need to handle import in this Jira.
  • Containers should be moved from the tmp directory to their intended location using a directory rename that is atomic at the FS level (sketched after this comment).
  • IMO hdds.datanode.replication.work.dir should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see HDDS-6441 and HDDS-6449), so any non-atomic FS operation that could compromise that, such as RocksDB creation (schema < V3), copying from the import directory, or deletion, must be finished with an atomic directory rename on the same filesystem.
  • The tmp directory needs to be chosen with care.
    • Currently all volume state is contained in the hdds subdirectory of each volume directory/disk mount. I think this feature should maintain that practice.
    • We cannot add a new subdirectory immediately under hdds, as this will break downgrade without an upgrade layout feature. Ozone 1.1.0 has a check that the hdds directory only has one subdirectory.
    • IMO hdds/<clusterID>/tmp would be the best place to put it, but we need to make sure this will not affect datanode startup on downgrade. If it does and there is no better directory, we may need to add a simple HDDS layout feature for this change.

Let me know your thoughts on choosing a location for the tmp directory.
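To make the per-volume layout and the atomic rename concrete, a minimal sketch assuming the hdds/<clusterID>/tmp location discussed above (the final container path is simplified; the real layout is defined by the patch):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PerVolumeTmpDirSketch {

  // Staging dir for a container being imported, under the proposed
  // hdds/<clusterID>/tmp location on the chosen volume.
  static Path stagingDir(Path volumeRoot, String clusterId, long containerId)
      throws IOException {
    Path tmp = volumeRoot.resolve("hdds").resolve(clusterId)
        .resolve("tmp").resolve("container-copy");
    return Files.createDirectories(tmp.resolve(String.valueOf(containerId)));
  }

  // Moves a fully staged container into its final location; the rename is
  // atomic only because both paths live on the same volume/filesystem.
  static Path commit(Path stagedDir, Path finalContainerDir) throws IOException {
    Files.createDirectories(finalContainerDir.getParent());
    return Files.move(stagedDir, finalContainerDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
```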

@ChenSammi (Contributor) commented:

* IMO `hdds.datanode.replication.work.dir` should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see [HDDS-6441](https://issues.apache.org/jira/browse/HDDS-6441) and [HDDS-6449](https://issues.apache.org/jira/browse/HDDS-6449)), so any non-atomic FS operation that could compromise that, such as RocksDB creation (schema < V3), copying from the import directory, or deletion, must be finished with an atomic directory rename on the same filesystem.

We can hide this property from ozone-default.xml and mark it as deprecated. Then new Ozone users will not be aware of this property. For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Let me know your thoughts on choosing a location for the tmp directory.

@errose28 agree, hdds/<clusterID>/tmp is a good choice. BTW, currently how do we verify whether datanode downgrade will be impacted or not?

@errose28 (Contributor) commented Aug 10, 2022

how do we verify whether datanode downgrade will be impacted or not?

The docker-based upgrade/downgrade tests run a downgrade and upgrade back to the previous release as part of each CI run. The full matrix covering older-version downgrades like 1.1.0 needs to be run manually, either by the release manager or when a change is suspected of having downgrade implications, like this one. See the docs and test runner. The tests are already slow and complex, so if you are using an ARM machine like an M1 Mac, modifying the GitHub Actions workflow and test.sh to run the tests on your Ozone fork might be better, since we only publish x86 release images right now.

For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Ignoring the config, possibly with an associated log message, will not be backwards incompatible, as a datanode will not try to resume its in-progress container imports/creates/deletes after a restart. However, using the old config exposes the cluster to bugs like the one we saw in this HDDS-6441 comment:

  1. A container delete is in progress.
    • The same would be true of a container create, or of copying an import from the working dir to its destination.
  2. An I/O exception or datanode restart occurs.
  3. On startup the datanode finds incomplete container pieces in its volume working directories and logs an error.
Normally I would be in favor of respecting old configs, but I think this case is different as the config causes a bug.

@ChenSammi (Contributor) commented:

@errose28 thanks for the info about how the downgrade should be verified. Right, it makes sense to deprecate hdds.datanode.replication.work.dir considering all its known issues.

Hey @symious, could you update the patch accordingly?

@symious (Contributor, Author) commented Aug 16, 2022

@ChenSammi @errose28 Updated the patch, please help review.

public static final String OZONE_HTTP_FILTER_INITIALIZERS_KEY =
"ozone.http.filter.initializers";

public static final String OZONE_CONTAINER_COPY_WORKDIR =
Review comment (Contributor):

Please add this key to OzoneConfiguration#addDeprecatedKeys.

Reply (Contributor, Author):

Updated, please have a look.

this.conf = conf;
securityConfig = new SecurityConfig(conf);
this.certClient = certClient;
this.volumeSet = volumeSet;
@ChenSammi (Contributor) commented Aug 16, 2022:

@symious, the new proposed flow should be (a rough sketch follows this comment):

  1. Choose a volume and download the container tar into the temp directory.
  2. Untar the container into the temp directory.
  3. Move the container directory to the destination directory and finish the container import, so that if the import fails there will be no container residue in the data volume.

So not only SimpleContainerDownloader but also the container import flow should be updated.
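A high-level sketch of that flow; chooseNextVolume and importContainer(long, Path, HddsVolume) appear later in the patch, while getTmpDir, downloadContainerTar, and untarContainer are hypothetical names used here for illustration:

```java
// Hypothetical outline of the proposed flow.
HddsVolume volume = chooseNextVolume();                            // 1. pick a volume
Path tmpDir = getTmpDir(volume);
Path tarFile = downloadContainerTar(containerId, sources, tmpDir); // 1. download into tmp
Path untarDir = untarContainer(tarFile, tmpDir);                   // 2. untar under tmp
importContainer(containerId, untarDir, volume);                    // 3. atomic rename into the data dir
```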

Reply (Contributor, Author):

I see, should we apply the new proposal in a new ticket or the current one?

Reply (Contributor):

We can implement the proposal in more than one ticket. If so, please change the JIRA to a feature JIRA and create sub-JIRAs for the different tasks.

Reply (Contributor, Author):

Noted with thanks.


public Path getWorkingDirectory() {
Path defaultWorkingDirectory =
Paths.get(System.getProperty("java.io.tmpdir"));
@ChenSammi (Contributor) commented Aug 18, 2022:

  1. Since data volumes must be configured, we don't need to fall back to "java.io.tmpdir".
  2. Move this volume-choosing logic outside DownloadAndImportReplicator, so the chosen volume can be reused later in the container import process.
  3. Define a constant field for the "tmp" directory (see the sketch below).
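For suggestion 3, a sketch of what the constants might look like (these names match the ones that appear later in the patch):

```java
// Constants for the temporary download/untar locations.
public static final String CONTAINER_COPY_DIR = "container-copy";
public static final String CONTAINER_COPY_TMP_DIR = "tmp";
```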

<description>Temporary which is used during the container replication
betweeen datanodes. Should have enough space to store multiple container
(in compressed format), but doesn't require fast io access such as SSD.
<description>This configuration is deprecated. Default directory which is used
@ChenSammi (Contributor) commented Aug 18, 2022:

This configuration is deprecated. A temporary sub-directory under each hdds.datanode.dir will be used during container replication between datanodes to save the downloaded container (in compressed format).

@symious changed the title from "HDDS-7083. Spread container-copy directories" to "[WIP] HDDS-7083. Spread container-copy directories" on Aug 25, 2022
@symious (Contributor, Author) commented Aug 25, 2022

@ChenSammi I have a draft of the new proposal, please have a look.
I'll flesh out the PR further if the implementation is not too far from the proposal.

@symious changed the title from "[WIP] HDDS-7083. Spread container-copy directories" to "HDDS-7083. Spread container-copy directories" on Sep 2, 2022
@symious (Contributor, Author) commented Sep 9, 2022

@ChenSammi Please help review.


this.containerReaderMap = new HashMap<>();
for (HddsVolume hddsVolume: getHddsVolumesList()) {
containerReaderMap.put(hddsVolume,
Review comment (Contributor):

This containerReaderMap is not used after it is initialized.

if (!Files.exists(destContainerDir)) {
Files.createDirectories(destContainerDir);
}
Files.move(containerUntarDir, destContainerDir,
Review comment (Contributor):

We should not use REPLACE_EXISTING here. If destContainerDir already exists, we should throw an exception, just like the current behavior.
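A sketch of the suggested behavior (the exception type and result code are taken from the snippet discussed later in this thread; exact structure is up to the author):

```java
// Sketch only: keep the "fail if present" semantics instead of REPLACE_EXISTING.
if (Files.exists(destContainerDir)) {
  throw new StorageContainerException(
      "Container unpack failed because " + destContainerDir + " already exists",
      CONTAINER_ALREADY_EXISTS);
}
Files.createDirectories(destContainerDir.getParent());
// Atomic because the source and destination are on the same volume.
Files.move(containerUntarDir, destContainerDir, StandardCopyOption.ATOMIC_MOVE);
```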

*/
public void verifyAndFixupContainerData(ContainerData containerData)
throws IOException {
verifyAndFixupContainerData(containerData, false);
Review comment (Contributor):

Are these ContainerReader and KeyValueContainerUtil changes related to the spread-directory goal? If not, could you put them into a separate patch?

@symious (Contributor, Author) commented Dec 14, 2022

@ChenSammi Thank you for the detailed review.

This ticket has been open for a long time, and it used to be a big one that was not easy to review.

After all this time, I find it hard to remember the earlier ideas on this topic. I think I'd better create some smaller tickets instead of this one for easier review.

@symious (Contributor, Author) commented Dec 20, 2022

@ChenSammi Updated the patch, please have a look.


private static final Logger LOG =
LoggerFactory.getLogger(HddsVolumeUtil.class);

@ChenSammi (Contributor) commented Jan 9, 2023:

This LOG is not used anywhere. We can revert this change in this file.

Follow-up (Contributor):

This comment is not addressed yet.


long containerID = kvContainerData.getContainerID();

// Verify Checksum
Review comment (Contributor):

Please revert this change.


if (downloadDir == null) {
downloadDir = Paths.get(System.getProperty("java.io.tmpdir"))
.resolve("container-copy");
Review comment (Contributor):

We can use CONTAINER_COPY_DIR here to replace the string "container-copy".
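i.e. something along these lines (sketch of the same fallback, only swapping the literal for the constant):

```java
// Sketch: use the named constant instead of repeating the literal "container-copy".
if (downloadDir == null) {
  downloadDir = Paths.get(System.getProperty("java.io.tmpdir"))
      .resolve(CONTAINER_COPY_DIR);
}
```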

LoggerFactory.getLogger(SimpleContainerDownloader.class);

private final Path workingDirectory;
private ConfigurationSource conf;
@ChenSammi (Contributor) commented Jan 9, 2023:

Looks like this conf is no longer used after it is initialized.

private HddsVolume chooseNextVolume() throws IOException {
return volumeChoosingPolicy.chooseVolume(
StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
containerSize * 2);
@ChenSammi (Contributor) commented Jan 9, 2023:

Can you add an inline comment to explain why containerSize * 2 is used here?
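A possible version of the requested comment; the rationale stated below is an assumption, not something confirmed in this thread:

```java
private HddsVolume chooseNextVolume() throws IOException {
  // Reserve roughly twice the container size: during the import the downloaded
  // tarball and the untarred container data may occupy the volume at the same
  // time. (This rationale is assumed; confirm against the actual implementation.)
  return volumeChoosingPolicy.chooseVolume(
      StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
      containerSize * 2);
}
```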

public static final String CONTAINER_COPY_DIR = "container-copy";
public static final String CONTAINER_COPY_TMP_DIR = "tmp";

private final ConfigurationSource conf;
Review comment (Contributor):

This conf is also not used anywhere after it is initialized.

HddsVolume containerVolume = volumeChoosingPolicy.chooseVolume(
StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
container.getContainerData().getMaxSize());
if (hddsVolume == null) {
@ChenSammi (Contributor) commented Jan 9, 2023:

hddsVolume should not be null here. We need to make sure that in container.importContainerData, tmpDir and DownloadAndImportReplicator.getUntarDirectory(hddsVolume) are on the same volume, so that the atomic Files.move will not fail.

} catch (CompressorException e) {
throw new IOException("Can't uncompress to dbRoot: " + dbRoot +
", chunksRoot: " + chunksRoot, e);
}
Review comment (Contributor):

Please delete the input file and the output directory in case any exception is thrown.

Reply (Contributor, Author):

The deletion of the destination directory is handled in KeyValueContainer#importContainerData(InputStream, ContainerPacker).
The deletion of the tar file is handled in DownloadAndImportReplicator#importContainer(long, Path, HddsVolume).

Reply (Contributor):

There are three files/directories involved:
1) The downloaded tar file, which is deleted in DownloadAndImportReplicator#importContainer.
2) The source untarred container directory under "tmp", which should be deleted when it cannot be renamed because the target untarred container directory already exists. This case is missing (see the sketch below).
3) The target untarred container directory, which will be deleted in KeyValueContainer#importContainerData if the container already exists.
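A sketch of the missing cleanup in case 2, deleting the staged directory when the rename fails (org.apache.commons.io.FileUtils is assumed to be available here, as it is commonly used in the codebase):

```java
// Sketch: remove the staged (untarred) directory when it cannot be renamed into
// place, e.g. because the target container directory already exists.
try {
  Files.move(containerUntarDir, destContainerDir, StandardCopyOption.ATOMIC_MOVE);
} catch (IOException e) {
  FileUtils.deleteDirectory(containerUntarDir.toFile());
  throw e;
}
```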

Reply (Contributor, Author):

Got it, I have updated the PR, please have a look.

@ChenSammi (Contributor) commented:

@symious, thanks for continuing to work on this. The overall patch looks good. I have left some inline comments.

@symious (Contributor, Author) commented Jan 10, 2023

@ChenSammi Thank you for the review. Updated the PR, please have a look.

@ChenSammi (Contributor) commented:

The TestKeyValueContainer.testContainerImportExport failure could be related. @symious, could you check it?

throw new IOException("Unpack destination directory " + destContainerDir
+ " is not empty.");
throw new StorageContainerException("Container unpack failed because " +
"ContainerFile already exists", CONTAINER_ALREADY_EXISTS);
@ChenSammi (Contributor) commented Jan 11, 2023:

Please add the container ID and destContainerDir to the error message for easier diagnosis.
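For example (sketch only; the exact wording is up to the author):

```java
// Error message carrying both the container ID and the destination path.
throw new StorageContainerException("Container " + containerID
    + " unpack failed because ContainerFile already exists under "
    + destContainerDir, CONTAINER_ALREADY_EXISTS);
```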

@ChenSammi (Contributor) commented:

@symious, except for some comments that were missed and still need to be addressed, the patch is close to ready now.

@ChenSammi (Contributor) commented:

Thanks @symious , the last patch LGTM, +1.

ChenSammi merged commit 9d5cfd6 into apache:master on Jan 12, 2023.