
Conversation

@aswinshakil
Member

What changes were proposed in this pull request?

When we hard-link files across different disks/partitions, we hit an "Invalid cross-device link" error, because hard links cannot span file systems. In this patch, instead of creating a hard link, we move the file to the target directory.
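
The approach described above can be sketched with `java.nio.file`: check whether the two parent directories resolve to the same `FileStore` (i.e. the same partition), and hard-link or move accordingly. This is an illustrative sketch under my own class and method names, not the actual Ozone code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LinkOrMoveSketch {
  // Two paths are on the same partition iff they resolve to the same FileStore.
  static boolean isSamePartition(Path a, Path b) throws IOException {
    return Files.getFileStore(a).equals(Files.getFileStore(b));
  }

  // Hard-link when both parents are on the same partition; otherwise move,
  // since Files.createLink would fail with "Invalid cross-device link".
  static void linkOrMove(Path from, Path to) throws IOException {
    if (isSamePartition(from.getParent(), to.getParent())) {
      Files.createLink(to, from);
    } else {
      Files.move(from, to);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("link-or-move");
    Path src = Files.write(dir.resolve("src.sst"), "data".getBytes());
    linkOrMove(src, dir.resolve("dst.sst"));
    System.out.println(Files.exists(dir.resolve("dst.sst")));
  }
}
```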

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10300

How was this patch tested?

Existing Tests

Contributor

@Galsza Galsza left a comment


Thank you for working on this. I've left two minor comments; please fix them.

Contributor


Please write a test case that asserts the new functionality (the new file is in a different partition than the old file).

Contributor

@hemantk-12 hemantk-12 left a comment


Thanks @aswinshakil for the quick fix.

Left some comments.

Contributor

@hemantk-12 hemantk-12 left a comment


LGTM.

Regarding @Galsza's comment to add a test: it would be great if it can be added as part of this change; otherwise, create a follow-up task.

Contributor

@swamirishi swamirishi left a comment


@aswinshakil Thanks for the patch. Overall the patch looks good to me; apart from a few nitpicky changes, a test case for the same function would be great.

```java
if (isSamePartition(fullFromPath.getParent(), fullToPath.getParent())) {
  Files.createLink(fullToPath, fullFromPath);
} else {
  Files.copy(fullFromPath, fullToPath, StandardCopyOption.REPLACE_EXISTING);
}
```
Contributor


Is REPLACE_EXISTING correct? Can a situation occur where we are re-copying files? We shouldn't inadvertently delete data.

Contributor


Can we do some kind of checksum verification if fullToPath exists?

  1. Copy to a tmp file.
  2. Verify the checksum if the file exists.
  3. If the checksums don't match, raise an exception.
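
A minimal sketch of this copy-and-verify flow, assuming SHA-256 as the checksum; the class and helper names here are hypothetical, not Ozone APIs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumCopy {
  static byte[] sha256(Path p) throws IOException {
    try {
      return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(p));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is always available
    }
  }

  static void copyVerified(Path from, Path to) throws IOException {
    Path tmp = to.resolveSibling(to.getFileName() + ".tmp");
    // 1. Copy to a tmp file first.
    Files.copy(from, tmp, StandardCopyOption.REPLACE_EXISTING);
    // 2. If the target already exists, compare checksums.
    if (Files.exists(to) && !MessageDigest.isEqual(sha256(tmp), sha256(to))) {
      Files.delete(tmp);
      // 3. Mismatch means we would silently clobber different data: fail loudly.
      throw new IOException("Checksum mismatch for " + to);
    }
    Files.move(tmp, to, StandardCopyOption.REPLACE_EXISTING);
  }
}
```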

Member Author


The file shouldn't exist at all; I'm doing a replace just as a precaution. I'm following the normal OM bootstrap process, and we don't do a checksum check for that either.

```diff
  assertFalse(f1Link.exists());

- OmSnapshotUtils.linkFiles(tree1, tree2);
+ OmSnapshotUtils.linkFilesOrCopy(tree1, tree2);
```
Contributor


Please add a similar test case for the copy case, when the two files are in different partitions. You can achieve this by mocking that layer.

Contributor


Mocking the isSamePartition function should help. As far as I know, you cannot mock the Files class with Mockito, but PowerMockito can do it.

@errose28
Contributor

Copying seems like a bad solution, since we would lose the space efficiency of snapshots and snapshot diff. Additionally, there is no indication to the user that this will happen if they set up their configuration this way, and the state machines, although functionally the same, would differ across OM replicas when only one node installs a snapshot.

Why not use either of these simpler options:

  1. Rebuild hardlinks after moving the DB from the snapshot install directory to the OM DB directory.
  2. Deprecate configs for setting snapshot install location to a different directory and put it directly in the OM DB directory. I don't see an advantage to doing the initial download in a different place.

@aswinshakil
Member Author

@errose28 Thanks for taking a look at it. For the first approach, moving and then rebuilding hard links would be doing double the work. Instead, why not move the files to the correct directory and avoid the hard-linking step altogether?

I agree with the second option; I'm not sure why the snapshot download directory would be on a different disk. Ideally, both should be on the same disk.

@hemantk-12
Contributor

hemantk-12 commented Feb 24, 2024

@errose28 regarding point#2

> 2. Deprecate configs for setting snapshot install location to a different directory and put it directly in the OM DB directory. I don't see an advantage to doing the initial download in a different place.

@aswinshakil, @swamirishi and I had a discussion about it, and we don't see any problem if the candidate dir just resides inside ozone.om.db.dirs.

I looked into the code: the config for the snapshot dir is ozone.om.ratis.snapshot.dir, and if it is not set, it falls back to ozone.metadata.dirs. There is no call-out or recommendation to have ozone.om.ratis.snapshot.dir or ozone.metadata.dirs in the same partition as ozone.om.db.dirs. See the original PR#703 discussion.

If we can agree that ozone.om.ratis.snapshot.dir should be in the same partition as ozone.om.db.dirs, I think we should add it to the release notes and make it a general recommendation. Or we can simply deprecate the ozone.om.ratis.snapshot.dir config, as you said, by moving the snapshot dir inside ozone.om.db.dirs.

@errose28
Contributor

I was talking about this with @aswinshakil and there is another problem we should consider: the size of the Ratis snapshot might be far greater than the size of the final OM DB. This is because the Ratis snapshot has "inflated" all the hardlinks, so if there are many filesystem snapshots on dense buckets, the OM DB device needs an undetermined amount of space greater than the current DB in order to install the ratis snapshot. If this space is not available and we remove the configs, there is no good way out of the situation. We already don't have great handling to make sure there is always room to install a ratis snapshot, but the hardlink inflation makes the problem worse.

One idea to mitigate the problem:

  1. Allow ratis snapshot to be installed wherever is configured, as it is currently.
  2. Copy files one-by-one to the OM DB dir. If the file is in the hardlink list, create a link instead of a new file.
    • This keeps space required on the OM DB device to no more than double that of the current DB, regardless of the amount of snapshots and hardlinks. It still does not make sure there is always space, but it is not worse than the current system behavior.
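
The one-by-one idea above could look roughly like the sketch below: copy the non-linked files first, then recreate hard links against the copies already placed in the DB dir. All names are made up for illustration; the real hardlink-list format in Ozone may differ.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class OneByOneInstallSketch {
  // hardLinks maps a link's file name to the name of the file it should link to.
  static void install(Path snapshotDir, Path dbDir,
                      Map<String, String> hardLinks) throws IOException {
    // Pass 1: copy every file that is NOT in the hardlink list into the DB dir.
    try (DirectoryStream<Path> files = Files.newDirectoryStream(snapshotDir)) {
      for (Path f : files) {
        String name = f.getFileName().toString();
        if (!hardLinks.containsKey(name)) {
          Files.copy(f, dbDir.resolve(name));
        }
      }
    }
    // Pass 2: recreate hard links against the copies already in the DB dir,
    // so linked files consume no additional space on the DB device.
    for (Map.Entry<String, String> e : hardLinks.entrySet()) {
      Files.createLink(dbDir.resolve(e.getKey()), dbDir.resolve(e.getValue()));
    }
  }
}
```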

@aswinshakil
Member Author

Discussed with @hemantk-12 and @swamirishi offline. Right now installSnapshotFromLeader is synchronized; this is where the follower downloads the checkpoint (candidate) from the leader and moves it to the actual OM DB. There will be only one download happening at any point in time.

This can happen in the same partition, because we would be doing the same work anyway if we downloaded it to a different partition and then moved it into the OM DB's partition. Instead, download the checkpoint (candidate) directly in the OM DB partition.

We need to do a few things: deprecate the ozone.om.ratis.snapshot.dir config, which allows the user to specify a different partition, and also stop falling back to ozone.metadata.dirs. Instead, use ozone.om.db.dirs to download the checkpoint and avoid this problem altogether.
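
For context, the configs discussed in this thread relate roughly as follows; this is an illustrative ozone-site.xml fragment and the paths are made up:

```xml
<!-- The OM DB location; the proposal is to download the checkpoint here. -->
<property>
  <name>ozone.om.db.dirs</name>
  <value>/data/om/db</value>
</property>
<!-- Today the snapshot download dir; if unset it falls back to
     ozone.metadata.dirs, which may sit on a different partition than
     ozone.om.db.dirs and trigger the cross-device link error. -->
<property>
  <name>ozone.om.ratis.snapshot.dir</name>
  <value>/data/om/ratis-snapshot</value>
</property>
```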

@adoroszlai adoroszlai marked this pull request as draft March 3, 2024 09:06
@errose28
Contributor

errose28 commented Mar 5, 2024

Regarding this previous comment, it turns out that the snapshot install only copies one copy of each SST file needed, and then builds the hardlinks from there. That means the Ratis snapshot will not be bigger than the OM DB and we should be able to put it right on the main DB disk without any more space concerns than what we already have.

I'm +1 for deprecating the configs.

@weimingdiit
Contributor

@errose28 @aswinshakil Hi, we have encountered a similar issue, and we would like to know whether this issue will continue to progress according to the discussion above.

@errose28
Contributor

Hi @weimingdiit based on the discussion here I think we would like to simplify the snapshot install configuration by deprecating ozone.om.ratis.snapshot.dir and putting the snapshot directly in ozone.om.db.dir, or ozone.om.metadata.dir if the DB dir is not configured. This makes sure that hardlinks and directory renames always work. I'll close this PR since I don't think it lines up with the current proposal and I'll update the Jira. This should be a minor change if you want to take it on.

@errose28 errose28 closed this Sep 27, 2024
