HDDS-7083. Spread container-copy directories #3648
Conversation
|
Thanks for the proposal Symious. I agree we should not use /tmp for intermediate datanode operations like container import. I have a few questions.
FYI HDDS-6449 is under active development by @neils-dev, which will add a tmp directory to each volume to be used for intermediate container operations. Containers can be moved out of the directory to their final destination with an atomic rename since they reside on the same file system, and artifacts can be periodically cleaned out of the temp directory on restart or other failures. I think we can implement this after HDDS-6449 to use the same directory. |
|
@errose28 Thanks for the review.
It's because of the "hdds.datanode.replication.work.dir" config: if we enabled the spread config by default, users might be confused since the value of "hdds.datanode.replication.work.dir" would not be used. If the user actively enables the spread config, we assume the user already knows "hdds.datanode.replication.work.dir" won't be used.
It's because of the steps of the import process: the destination volume is chosen in step 2, so in step 1 we don't know which volume the container will be stored on. We could also pass the download path to the KeyValueContainerHandler (which would require an interface change), but I tested inter-volume and intra-volume copy speeds and found essentially no difference, so I left the original Handler import process unchanged.
If exceptions happen during the import, the downloaded files are deleted by the handler's implementation. |
|
Thanks @symious for improving the replication feature. I agree with @errose28's suggestions. We can create a proper sub-directory under each "hdds.datanode.dir" for downloaded containers, instead of introducing a new "hdds.datanode.replication.work.dir" configuration, and the downloaded container can then be renamed into the data directory on the same volume, saving the cost of copying the container between volumes. Considering that HDDS-6449 has quite a few tasks and its timeline is uncertain, maybe we can proceed with this JIRA first and let HDDS-6449 leverage the sub-directory if possible (HDDS-6449 introduces a background deleting service to periodically delete content under the directory, so sharing the same directory might not be a good idea). What do you think @errose28 ? |
|
@ChenSammi Thank you for the review. |
|
Thanks for the input @ChenSammi. The subtasks in HDDS-6449 were actually created before we had design discussions. There won't need to be a background service, which greatly simplifies the task, and it could probably be done in one Jira. But I agree that progress on that task is uncertain and we have a proposal to fix the import issue now, so let's go ahead and introduce the new directory here. Some thoughts on introducing the new directory:
Let me know your thoughts on choosing a location for the tmp directory. |
We can hide this property from ozone-default.xml and mark it as deprecated in a comment. Then new Ozone users will not be aware of this property. For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation will not be too complex.
@errose28 agree, hdds//tmp is a good choice. BTW, how do we currently verify whether a datanode downgrade will be impacted or not? |
The Docker-based upgrade/downgrade tests run a downgrade to and an upgrade back from the previous release as part of each CI run. The full matrix covering older-version downgrades like 1.1.0 needs to be run manually, either by the release manager or when a change is suspected of having downgrade implications, like this one. See the docs and test runner. The tests are already slow and complex, so on an ARM machine like an M1 Mac it might be better to modify the GitHub Actions workflow and test.sh to run the tests on your Ozone fork, since we only publish x86 release images right now.
Ignoring the config, possibly with an associated log message, will not be backwards incompatible, as a datanode will not try to resume its in-progress container imports/creates/deletes after a restart. However, using the old config exposes the cluster to bugs like the one we saw in this HDDS-6441 comment:
Normally I would be in favor of respecting old configs, but I think this case is different as the config causes a bug. |
|
@ChenSammi @errose28 Updated the patch, please help to review. |
| public static final String OZONE_HTTP_FILTER_INITIALIZERS_KEY = | ||
| "ozone.http.filter.initializers"; | ||
|
|
||
| public static final String OZONE_CONTAINER_COPY_WORKDIR = |
Please add this key to OzoneConfiguration#addDeprecatedKeys.
Updated, please have a look.
| this.conf = conf; | ||
| securityConfig = new SecurityConfig(conf); | ||
| this.certClient = certClient; | ||
| this.volumeSet = volumeSet; |
@symious, the new proposed flow should be:
- Choose a volume and download the container tar into the temp directory.
- Untar the container into the temp directory.
- Move the container directory to the destination directory and finish the container import, so that if the import fails, no container residue is left in the data volume.

So not only SimpleContainerDownloader but also the container import flow should be updated.
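The proposed flow can be sketched as a minimal, runnable illustration. All names here (chooseVolume, the "tmp" and "hdds" sub-directory names, and the marker file standing in for real untar logic) are assumptions for illustration, not the actual Ozone code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the proposed flow: stage the download under a
// per-volume "tmp" directory, then atomically rename into the final
// location, which works because both paths are on the same file system.
public class ImportFlowSketch {

  // Stand-in for the real volume-choosing policy.
  public static Path chooseVolume(Path[] volumes) {
    return volumes[0];
  }

  public static Path importContainer(Path volume, byte[] data,
      String containerId) throws IOException {
    // 1. "Download" into the per-volume tmp directory.
    Path untarDir = Files.createDirectories(
        volume.resolve("tmp").resolve(containerId));
    // 2. "Untar" (simulated here by writing a marker file).
    Files.write(untarDir.resolve("container.data"), data);
    // 3. Atomic rename to the final destination on the same volume.
    Path dest = volume.resolve("hdds").resolve(containerId);
    Files.createDirectories(dest.getParent());
    return Files.move(untarDir, dest, StandardCopyOption.ATOMIC_MOVE);
  }

  public static void main(String[] args) throws Exception {
    Path volume = chooseVolume(new Path[] {Files.createTempDirectory("vol")});
    Path dest = importContainer(volume, new byte[] {1, 2}, "42");
    System.out.println(Files.exists(dest.resolve("container.data")));
  }
}
```

The key property is step 3: because the tmp directory lives on the chosen volume itself, the final move is a rename rather than a cross-device copy.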
I see, should we apply the new proposal in a new ticket or the current one?
We can implement the proposal in more than one ticket. If so, please change the JIRA to a feature JIRA and create sub-JIRAs for the different tasks.
Noted with thanks.
|
|
||
| public Path getWorkingDirectory() { | ||
| Path defaultWorkingDirectory = | ||
| Paths.get(System.getProperty("java.io.tmpdir")); |
- Since a data volume must be configured, we don't need to fall back to "java.io.tmpdir".
- Move the volume-choosing logic outside DownloadAndImportReplicator, so we can reuse the chosen volume later in the container import process.
- Define a constant field for the "tmp" directory.
| <description>Temporary directory used during the container replication | ||
| between datanodes. Should have enough space to store multiple containers | ||
| (in compressed format), but doesn't require fast IO access such as SSD. | ||
| <description>This configuration is deprecated. Default directory which is used |
This configuration is deprecated. A temporary sub-directory under each hdds.datanode.dir will be used during container replication between datanodes to store the downloaded container (in compressed format).
|
@ChenSammi Had a draft of the new proposal, please have a look. |
|
@ChenSammi Please help to review. |
|
|
||
| this.containerReaderMap = new HashMap<>(); | ||
| for (HddsVolume hddsVolume: getHddsVolumesList()) { | ||
| containerReaderMap.put(hddsVolume, |
This containerReaderMap is not used after being initialized.
| if (!Files.exists(destContainerDir)) { | ||
| Files.createDirectories(destContainerDir); | ||
| } | ||
| Files.move(containerUntarDir, destContainerDir, |
We should not use REPLACE_EXISTING here. If destContainerDir already exists, an exception should be thrown, just like the current behavior.
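To illustrate the point: java.nio.file.Files.move without REPLACE_EXISTING throws FileAlreadyExistsException when the destination already exists, which is exactly the protection wanted here. A self-contained sketch (tryMove is an illustrative helper, not Ozone code):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Without REPLACE_EXISTING, Files.move refuses to clobber an existing
// destination, so an already-imported container cannot be overwritten.
public class MoveSemantics {

  public static boolean tryMove(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst); // no REPLACE_EXISTING: fails if dst exists
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("demo");
    Path src = Files.createFile(dir.resolve("src"));
    Path dst = Files.createFile(dir.resolve("dst"));
    System.out.println(tryMove(src, dst)); // destination exists: prints false
  }
}
```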
| */ | ||
| public void verifyAndFixupContainerData(ContainerData containerData) | ||
| throws IOException { | ||
| verifyAndFixupContainerData(containerData, false); |
Are these ContainerReader and KeyValueContainerUtil changes related to the spread-directory goal? If not, could you put them into a separate patch?
|
@ChenSammi Thank you for the detailed review. This ticket has been open for a long time, and it used to be a big one that was not easy to review. After all this time, I find it hard to remember the earlier ideas on this topic. I think I'd better create some smaller tickets instead of this one for easier review. |
|
@ChenSammi Updated the patch, please have a look. |
|
|
||
| private static final Logger LOG = | ||
| LoggerFactory.getLogger(HddsVolumeUtil.class); | ||
|
|
This LOG is not used anywhere. We can revert the change to this file.
This comment has not been addressed yet.
|
|
||
| long containerID = kvContainerData.getContainerID(); | ||
|
|
||
| // Verify Checksum |
Please revert this change.
|
|
||
| if (downloadDir == null) { | ||
| downloadDir = Paths.get(System.getProperty("java.io.tmpdir")) | ||
| .resolve("container-copy"); |
We can use CONTAINER_COPY_DIR here to replace the string "container-copy".
| LoggerFactory.getLogger(SimpleContainerDownloader.class); | ||
|
|
||
| private final Path workingDirectory; | ||
| private ConfigurationSource conf; |
Looks like this conf is no longer used after being initialized.
| private HddsVolume chooseNextVolume() throws IOException { | ||
| return volumeChoosingPolicy.chooseVolume( | ||
| StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()), | ||
| containerSize * 2); |
Can you add an inline comment to explain why containerSize * 2 is used here?
| public static final String CONTAINER_COPY_DIR = "container-copy"; | ||
| public static final String CONTAINER_COPY_TMP_DIR = "tmp"; | ||
|
|
||
| private final ConfigurationSource conf; |
This conf is also not used anywhere after being initialized.
| HddsVolume containerVolume = volumeChoosingPolicy.chooseVolume( | ||
| StorageVolumeUtil.getHddsVolumesList(volumeSet.getVolumesList()), | ||
| container.getContainerData().getMaxSize()); | ||
| if (hddsVolume == null) { |
hddsVolume should not be null here. We need to make sure that in container.importContainerData, tmpDir and DownloadAndImportReplicator.getUntarDirectory(hddsVolume) are on the same volume so that the atomic Files.move will not fail.
| } catch (CompressorException e) { | ||
| throw new IOException("Can't uncompress to dbRoot: " + dbRoot + | ||
| ", chunksRoot: " + chunksRoot, e); | ||
| } |
Please delete the input file and output directory in case any exception is thrown.
The deletion of the dest directory is handled in KeyValueContainer#importContainerData(InputStream, ContainerPacker).
The deletion of the tar file is handled in DownloadAndImportReplicator#importContainer(long, Path, HddsVolume).
There are three files/directories involved:
1) The downloaded tar file, which is deleted in DownloadAndImportReplicator#importContainer.
2) The untarred source container directory under "tmp", which should be deleted when it cannot be renamed because the target untarred container directory already exists. This one is missing.
3) The target untarred container directory, which is deleted in KeyValueContainer#importContainerData if the container already exists.
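A sketch of this cleanup contract under assumptions: the class and method names are illustrative stand-ins (not the actual Ozone KeyValueContainer/DownloadAndImportReplicator code), and the untar step is simulated by a forced failure so the cleanup path runs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// On failure, both intermediate artifacts (the downloaded tar and the
// untarred directory under "tmp") must be removed before rethrowing.
public class ImportCleanupSketch {

  public static void deleteRecursively(Path p) throws IOException {
    if (!Files.exists(p)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(p)) {
      // Delete children before parents.
      walk.sorted(Comparator.reverseOrder()).forEach(q -> q.toFile().delete());
    }
  }

  public static void importContainer(Path tarFile, Path untarDir)
      throws IOException {
    try {
      // Real code would untar tarFile into untarDir and move it into place;
      // a failure is forced here to exercise the cleanup path.
      throw new IOException("simulated untar failure");
    } catch (IOException e) {
      Files.deleteIfExists(tarFile);  // artifact 1: downloaded tar
      deleteRecursively(untarDir);    // artifact 2: untarred tmp directory
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("demo");
    Path tar = Files.createFile(dir.resolve("c.tar"));
    Path untar = Files.createDirectories(dir.resolve("tmp").resolve("c"));
    try {
      importContainer(tar, untar);
    } catch (IOException expected) {
      // import failed; cleanup already ran
    }
    System.out.println(!Files.exists(tar) && !Files.exists(untar));
  }
}
```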
Got it, have updated the PR, please have a look.
|
@symious, thanks for continuously working on this. The overall patch looks good. I have left some inline comments. |
|
@ChenSammi Thank you for the review. Updated the PR, please have a look. |
|
TestKeyValueContainer.testContainerImportExport failure could be related. @symious could you check it? |
| throw new IOException("Unpack destination directory " + destContainerDir | ||
| + " is not empty."); | ||
| throw new StorageContainerException("Container unpack failed because " + | ||
| "ContainerFile already exists", CONTAINER_ALREADY_EXISTS); |
Please add the container ID and destContainerDir to the error message for easier diagnosis.
|
@symious, apart from some missed comments that need to be addressed, the patch is close to ready now. |
|
Thanks @symious , the last patch LGTM, +1. |
What changes were proposed in this pull request?
The default container-copy directory is "/tmp". When several replication jobs are running, the performance of the disk holding "/tmp" becomes quite bad; if the Ratis directory is on the same disk, Ratis work is also affected.
This ticket adds a configuration to spread the "container-copy" work across different volumes to mitigate this issue.
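As a rough sketch of the spreading idea (a simplified round-robin policy over hypothetical volume paths; Ozone's actual volume-choosing policies also account for available space):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Rotate the container-copy working directory across the configured data
// volumes so concurrent replication jobs land on different disks instead
// of all hammering the single disk that holds /tmp.
public class SpreadCopyDirs {
  private final List<String> volumes;
  private final AtomicInteger next = new AtomicInteger();

  public SpreadCopyDirs(List<String> volumes) {
    this.volumes = volumes;
  }

  public String nextCopyDir() {
    int i = Math.floorMod(next.getAndIncrement(), volumes.size());
    return volumes.get(i) + "/container-copy";
  }

  public static void main(String[] args) {
    SpreadCopyDirs s = new SpreadCopyDirs(List.of("/data1", "/data2"));
    System.out.println(s.nextCopyDir()); // /data1/container-copy
    System.out.println(s.nextCopyDir()); // /data2/container-copy
  }
}
```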
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7083
How was this patch tested?
Unit tests.