Conversation

@symious (Contributor) commented Aug 3, 2022

What changes were proposed in this pull request?

The default container-copy directory is "/tmp". When several replication jobs are running, the performance of the disk holding "/tmp" degrades badly, and if the Ratis directory is set on the same disk, the Ratis work will be affected as well.

This ticket adds a configuration to spread the "container-copy" work across different volumes to mitigate this issue.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7083

How was this patch tested?

unit test.

@errose28 (Contributor) commented Aug 3, 2022

Thanks for the proposal, Symious. I agree we should not use /tmp for intermediate datanode operations like container import. I have a few questions.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.
  2. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?
  3. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

FYI, HDDS-6449 is under active development by @neils-dev; it will add a tmp directory to each volume to be used for intermediate container operations. Containers can be moved out of that directory to their final destination with an atomic rename since they reside on the same file system, and artifacts can be periodically cleaned out of the temp directory on restart or after other failures. I think we can implement this after HDDS-6449 so it uses the same directory.

@symious (Contributor, Author) commented Aug 4, 2022

@errose28 Thanks for the review.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.

It's because of the config "hdds.datanode.replication.work.dir": if we enable the spread config by default, users might be confused because the value of "hdds.datanode.replication.work.dir" is not used. If the user actively enables the spread config, we can assume the user already knows that "hdds.datanode.replication.work.dir" won't be used.

  2. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?

It's because of the steps of the import process:

  1. Download the container from other datanodes.
  2. Initialize an InputStream based on the downloaded file; KeyValueContainerHandler will use this InputStream to import the container.

The destination volume is chosen in step 2, so in step 1 we don't know which volume the container will be stored on. We could also pass the download path to the KeyValueContainerHandler (which would require an interface change), but I tested the copy speed of inter-volume versus intra-volume copies and there was essentially no difference, so I left the original handler import process unchanged.
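A rough sketch of those two steps (the helper names below are illustrative only and do not match the real SimpleContainerDownloader / KeyValueContainerHandler signatures):

```java
// Hypothetical sketch of the two-step flow described above.
Path tarball = downloadContainerFromReplicas(containerId, workDir);   // step 1: into the work dir
try (InputStream input = Files.newInputStream(tarball)) {
  importContainerFromStream(containerId, input);                      // step 2: handler picks the volume
}
```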

  3. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

If exceptions happen during the import, the downloaded files are deleted by the handler's implementation.

@ChenSammi (Contributor) commented:

Thanks @symious for improving the replication feature. I agree with @errose28's suggestions. We can create a proper sub-directory under each "hdds.datanode.dir" for downloaded containers, instead of introducing a new "hdds.datanode.replication.work.dir" configuration, and the downloaded container can then be renamed into the data directory on the same volume, avoiding the cost of copying the container between volumes.

Considering that HDDS-6449 has quite a few tasks and its timeline is uncertain, maybe we can proceed with this JIRA first and let HDDS-6449 leverage the sub-directory if possible (HDDS-6449 introduces a background deleting service to periodically delete content under the directory, so it might not be a good idea to share the same directory). What do you think, @errose28?

@symious (Contributor, Author) commented Aug 9, 2022

@ChenSammi Thank you for the review.
Updated the PR to enable the feature by default.

@errose28 (Contributor) commented Aug 9, 2022

Thanks for the input @ChenSammi. The subtasks in HDDS-6449 were actually created before we had design discussions. There won't need to be a background service, which greatly simplifies the task, and it could probably be done in one Jira. But I agree that progress on that task is uncertain and we have a proposal to fix the import issue now, so let's go ahead and introduce the new directory here.

Some thoughts on introducing the new directory:

  • I think we should have one tmp working directory per volume, with different subdirectories for different tasks as you mentioned. Container delete, import, and create would be examples, although we only need to handle import in this Jira.
  • Containers should be moved from the tmp directory to their intended location using a directory rename that is atomic at the FS level (sketched after this comment).
  • IMO hdds.datanode.replication.work.dir should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see HDDS-6441 and HDDS-6449), so any non-atomic FS operation that could compromise that, such as RocksDB creation (schema < V3), copying from the import directory, or deletion, must be finished with an atomic directory rename on the same filesystem.
  • The tmp directory needs to be chosen with care.
    • Currently all volume state is contained in the hdds subdirectory of each volume directory/disk mount. I think this feature should maintain that practice.
    • We cannot add a new subdirectory immediately under hdds, as this will break downgrade without an upgrade layout feature. Ozone 1.1.0 has a check that the hdds directory only has one subdirectory.
    • IMO hdds/<clusterID>/tmp would be the best place to put it, but we need to make sure this will not affect datanode startup on downgrade. If it does and there is no better directory, we may need to add a simple HDDS layout feature for this change.

Let me know your thoughts on choosing a location for the tmp directory.
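To make the per-volume layout and the atomic rename concrete, a minimal sketch assuming the hdds/<clusterID>/tmp location discussed above (the final container path is simplified; the real layout is defined by the patch):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PerVolumeTmpDirSketch {

  // Staging dir for a container being imported, under the proposed
  // hdds/<clusterID>/tmp location on the chosen volume.
  static Path stagingDir(Path volumeRoot, String clusterId, long containerId)
      throws IOException {
    Path tmp = volumeRoot.resolve("hdds").resolve(clusterId)
        .resolve("tmp").resolve("container-copy");
    return Files.createDirectories(tmp.resolve(String.valueOf(containerId)));
  }

  // Moves a fully staged container into its final location; the rename is
  // atomic only because both paths live on the same volume/filesystem.
  static Path commit(Path stagedDir, Path finalContainerDir) throws IOException {
    Files.createDirectories(finalContainerDir.getParent());
    return Files.move(stagedDir, finalContainerDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
```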

@ChenSammi (Contributor) commented:

* IMO `hdds.datanode.replication.work.dir` should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see [HDDS-6441](https://issues.apache.org/jira/browse/HDDS-6441) and [HDDS-6449](https://issues.apache.org/jira/browse/HDDS-6449)), so any non-atomic FS operation that could compromise that, such as RocksDB creation (schema < V3), copying from the import directory, or deletion, must be finished with an atomic directory rename on the same filesystem.

We can hide this property from ozone-default.xml and mark it as deprecated. Then new Ozone users will not be aware of this property. For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Let me know your thoughts on choosing a location for the tmp directory.

@errose28 agree, hdds/<clusterID>/tmp is a good choice. BTW, currently how do we verify whether datanode downgrade will be impacted or not?

@errose28 (Contributor) commented Aug 10, 2022

how do we verify whether datanode downgrade will be impacted or not?

The docker-based upgrade/downgrade tests run a downgrade and upgrade back to the previous release as part of each CI run. The full matrix covering older-version downgrades like 1.1.0 needs to be run manually, either by the release manager or when a change is suspected of having downgrade implications, like this one. See the docs and test runner. The tests are already slow and complex, so if you are using an ARM machine like an M1 Mac, modifying the GitHub Actions workflow and test.sh to run the tests on your Ozone fork might be better, since we only publish x86 release images right now.

For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Ignoring the config, possibly with an associated log message, will not be backwards incompatible, as a datanode will not try to resume its in-progress container imports/creates/deletes after a restart. However, using the old config exposes the cluster to bugs like the one we saw in this HDDS-6441 comment:

  1. A container delete is in progress.
    • The same would be true of a container create, or of copying an import from the working dir to its destination.
  2. An I/O exception or datanode restart occurs.
  3. On startup the datanode finds incomplete container pieces in its volume working directories and logs an error.
Normally I would be in favor of respecting old configs, but I think this case is different as the config causes a bug.

@ChenSammi (Contributor) commented:

@errose28 thanks for the info about how the downgrade should be verified. Right, it makes sense to deprecate hdds.datanode.replication.work.dir considering all its known issues.

Hey @symious, could you update the patch accordingly?

@symious (Contributor, Author) commented Aug 16, 2022

@ChenSammi @errose28 Updated the patch, please help review.

public static final String OZONE_HTTP_FILTER_INITIALIZERS_KEY =
"ozone.http.filter.initializers";

public static final String OZONE_CONTAINER_COPY_WORKDIR =
Review comment (Contributor):

Please add this key to OzoneConfiguration#addDeprecatedKeys.

Reply (Contributor, Author):

Updated, please have a look.

this.conf = conf;
securityConfig = new SecurityConfig(conf);
this.certClient = certClient;
this.volumeSet = volumeSet;
@ChenSammi (Contributor) commented Aug 16, 2022:

@symious, the new proposed flow should be (a rough sketch follows this comment):

  1. Choose a volume and download the container tar into the temp directory.
  2. Untar the container into the temp directory.
  3. Move the container directory to the destination directory and finish the container import, so that if the import fails there will be no container residue in the data volume.

So not only SimpleContainerDownloader but also the container import flow should be updated.
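A high-level sketch of that flow; chooseNextVolume and importContainer(long, Path, HddsVolume) appear later in the patch, while getTmpDir, downloadContainerTar, and untarContainer are hypothetical names used here for illustration:

```java
// Hypothetical outline of the proposed flow.
HddsVolume volume = chooseNextVolume();                            // 1. pick a volume
Path tmpDir = getTmpDir(volume);
Path tarFile = downloadContainerTar(containerId, sources, tmpDir); // 1. download into tmp
Path untarDir = untarContainer(tarFile, tmpDir);                   // 2. untar under tmp
importContainer(containerId, untarDir, volume);                    // 3. atomic rename into the data dir
```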

Reply (Contributor, Author):

I see, should we apply the new proposal in a new ticket or the current one?

Reply (Contributor):

We can implement the proposal in more than one ticket. If so, please change the JIRA to a feature JIRA and create sub-JIRAs for the different tasks.

Reply (Contributor, Author):

Noted with thanks.


public Path getWorkingDirectory() {
Path defaultWorkingDirectory =
Paths.get(System.getProperty("java.io.tmpdir"));
@ChenSammi (Contributor) commented Aug 18, 2022:

  1. Since data volumes must be configured, we don't need to fall back to "java.io.tmpdir".
  2. Move this volume-choosing logic outside DownloadAndImportReplicator, so the chosen volume can be reused later in the container import process.
  3. Define a constant field for the "tmp" directory (see the sketch below).
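For suggestion 3, a sketch of what the constants might look like (these names match the ones that appear later in the patch):

```java
// Constants for the temporary download/untar locations.
public static final String CONTAINER_COPY_DIR = "container-copy";
public static final String CONTAINER_COPY_TMP_DIR = "tmp";
```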

<description>Temporary which is used during the container replication
betweeen datanodes. Should have enough space to store multiple container
(in compressed format), but doesn't require fast io access such as SSD.
<description>This configuration is deprecated. Default directory which is used
@ChenSammi (Contributor) commented Aug 18, 2022:

This configuration is deprecated. A temporary sub-directory under each hdds.datanode.dir will be used during container replication between datanodes to save the downloaded container (in compressed format).

@symious changed the title from "HDDS-7083. Spread container-copy directories" to "[WIP] HDDS-7083. Spread container-copy directories" on Aug 25, 2022
@symious (Contributor, Author) commented Aug 25, 2022

@ChenSammi I have a draft of the new proposal, please have a look.
I'll flesh out the PR further if the implementation is not too far from the proposal.

@symious changed the title from "[WIP] HDDS-7083. Spread container-copy directories" to "HDDS-7083. Spread container-copy directories" on Sep 2, 2022
@symious (Contributor, Author) commented Sep 9, 2022

@ChenSammi Please help review.


this.containerReaderMap = new HashMap<>();
for (HddsVolume hddsVolume: getHddsVolumesList()) {
containerReaderMap.put(hddsVolume,
Review comment (Contributor):

This containerReaderMap is not used after it is initialized.

if (!Files.exists(destContainerDir)) {
Files.createDirectories(destContainerDir);
}
Files.move(containerUntarDir, destContainerDir,
Review comment (Contributor):

We should not use REPLACE_EXISTING here. If destContainerDir already exists, we should throw an exception, just like the current behavior.
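A sketch of the suggested behavior (the exception type and result code are taken from the snippet discussed later in this thread; exact structure is up to the author):

```java
// Sketch only: keep the "fail if present" semantics instead of REPLACE_EXISTING.
if (Files.exists(destContainerDir)) {
  throw new StorageContainerException(
      "Container unpack failed because " + destContainerDir + " already exists",
      CONTAINER_ALREADY_EXISTS);
}
Files.createDirectories(destContainerDir.getParent());
// Atomic because the source and destination are on the same volume.
Files.move(containerUntarDir, destContainerDir, StandardCopyOption.ATOMIC_MOVE);
```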

*/
public void verifyAndFixupContainerData(ContainerData containerData)
throws IOException {
verifyAndFixupContainerData(containerData, false);
Review comment (Contributor):

Are these ContainerReader and KeyValueContainerUtil changes related to the spread-directory goal? If not, could you put them into a separate patch?

@symious (Contributor, Author) commented Dec 14, 2022

@ChenSammi Thank you for the detailed review.

This ticket has been open for a long time, and it used to be a big one that was not easy to review.

After all this time, I find it hard to remember the earlier ideas on this topic. I think I'd better create some smaller tickets instead of this one for easier review.

@symious (Contributor, Author) commented Dec 20, 2022

@ChenSammi Updated the patch, please have a look.


private static final Logger LOG =
LoggerFactory.getLogger(HddsVolumeUtil.class);

@ChenSammi (Contributor) commented Jan 9, 2023:

This LOG is not used anywhere. We can revert this change in this file.

Follow-up (Contributor):

This comment is not addressed yet.


long containerID = kvContainerData.getContainerID();

// Verify Checksum
Review comment (Contributor):

Please revert this change.


if (downloadDir == null) {
downloadDir = Paths.get(System.getProperty("java.io.tmpdir"))
.resolve("container-copy");
Review comment (Contributor):

We can use CONTAINER_COPY_DIR here to replace the string "container-copy".
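i.e. something along these lines (sketch of the same fallback, only swapping the literal for the constant):

```java
// Sketch: use the named constant instead of repeating the literal "container-copy".
if (downloadDir == null) {
  downloadDir = Paths.get(System.getProperty("java.io.tmpdir"))
      .resolve(CONTAINER_COPY_DIR);
}
```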

LoggerFactory.getLogger(SimpleContainerDownloader.class);

private final Path workingDirectory;
private ConfigurationSource conf;
@ChenSammi (Contributor) commented Jan 9, 2023:

Looks like this conf is no longer used after it is initialized.

private HddsVolume chooseNextVolume() throws IOException {
return volumeChoosingPolicy.chooseVolume(
StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
containerSize * 2);
@ChenSammi (Contributor) commented Jan 9, 2023:

Can you add an inline comment to explain why containerSize * 2 is used here?
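A possible version of the requested comment; the rationale stated below is an assumption, not something confirmed in this thread:

```java
private HddsVolume chooseNextVolume() throws IOException {
  // Reserve roughly twice the container size: during the import the downloaded
  // tarball and the untarred container data may occupy the volume at the same
  // time. (This rationale is assumed; confirm against the actual implementation.)
  return volumeChoosingPolicy.chooseVolume(
      StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
      containerSize * 2);
}
```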

public static final String CONTAINER_COPY_DIR = "container-copy";
public static final String CONTAINER_COPY_TMP_DIR = "tmp";

private final ConfigurationSource conf;
Review comment (Contributor):

This conf is also not used anywhere after it is initialized.

HddsVolume containerVolume = volumeChoosingPolicy.chooseVolume(
StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()),
container.getContainerData().getMaxSize());
if (hddsVolume == null) {
@ChenSammi (Contributor) commented Jan 9, 2023:

hddsVolume should not be null here. We need to make sure that in container.importContainerData, tmpDir and DownloadAndImportReplicator.getUntarDirectory(hddsVolume) are on the same volume, so that the atomic Files.move will not fail.

} catch (CompressorException e) {
throw new IOException("Can't uncompress to dbRoot: " + dbRoot +
", chunksRoot: " + chunksRoot, e);
}
Review comment (Contributor):

Please delete the input file and the output directory in case any exception is thrown.

Reply (Contributor, Author):

The deletion of the destination directory is handled in KeyValueContainer#importContainerData(InputStream, ContainerPacker).
The deletion of the tar file is handled in DownloadAndImportReplicator#importContainer(long, Path, HddsVolume).

Reply (Contributor):

There are three files/directories involved:
1) The downloaded tar file, which is deleted in DownloadAndImportReplicator#importContainer.
2) The source untarred container directory under "tmp", which should be deleted when it cannot be renamed because the target untarred container directory already exists. This case is missing (see the sketch below).
3) The target untarred container directory, which will be deleted in KeyValueContainer#importContainerData if the container already exists.
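A sketch of the missing cleanup in case 2, deleting the staged directory when the rename fails (org.apache.commons.io.FileUtils is assumed to be available here, as it is commonly used in the codebase):

```java
// Sketch: remove the staged (untarred) directory when it cannot be renamed into
// place, e.g. because the target container directory already exists.
try {
  Files.move(containerUntarDir, destContainerDir, StandardCopyOption.ATOMIC_MOVE);
} catch (IOException e) {
  FileUtils.deleteDirectory(containerUntarDir.toFile());
  throw e;
}
```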

Reply (Contributor, Author):

Got it, I have updated the PR, please have a look.

@ChenSammi (Contributor) commented:

@symious, thanks for continuing to work on this. The overall patch looks good. I have left some inline comments.

@symious (Contributor, Author) commented Jan 10, 2023

@ChenSammi Thank you for the review. Updated the PR, please have a look.

@ChenSammi (Contributor) commented:

The TestKeyValueContainer.testContainerImportExport failure could be related. @symious, could you check it?

throw new IOException("Unpack destination directory " + destContainerDir
+ " is not empty.");
throw new StorageContainerException("Container unpack failed because " +
"ContainerFile already exists", CONTAINER_ALREADY_EXISTS);
@ChenSammi (Contributor) commented Jan 11, 2023:

Please add the container ID and destContainerDir to the error message for easier diagnosis.
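For example (sketch only; the exact wording is up to the author):

```java
// Error message carrying both the container ID and the destination path.
throw new StorageContainerException("Container " + containerID
    + " unpack failed because ContainerFile already exists under "
    + destContainerDir, CONTAINER_ALREADY_EXISTS);
```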

@ChenSammi (Contributor) commented:

@symious, except for some comments that were missed and still need to be addressed, the patch is close to ready now.

@ChenSammi (Contributor) commented:

Thanks @symious , the last patch LGTM, +1.

ChenSammi merged commit 9d5cfd6 into apache:master on Jan 12, 2023.