
Conversation

@guihecheng (Contributor) commented Apr 11, 2022

What changes were proposed in this pull request?

Per-disk DB location management.
More details about the db location can be found in the JIRA linked below.
Here are short descriptions of the 3 separate commits:

  • Add some LayoutFeature definitions, but not the whole non-rolling-upgrade machinery, so we can check whether we have to init the per-disk db instances.
  • A new StorageVolume type, DbVolume, for optional dedicated SSDs that host db instances to speed up metadata operations.
  • Format db instances on the DN's first registration and load db instances on DN startup.

We have one subdirectory for each data volume under a db volume, with the StorageID(UUID) of the data volume as the name of the subdirectory.

When extra SSDs are used

  • Should create a DbVolume to manage it, for bad-disk checking
  • Should create a dedicated directory structure including the clusterID, isolated from other content on the disk, e.g. /ssd1/db/CID-<clusterID>/
  • Should have a configuration item: “hdds.datanode.container.db.dir”
  • Each HddsVolume should be mapped to a dedicated subdirectory named after its StorageID, e.g. DS-b559933f-9de3-4da4-a634-07d3a94f7438/ (see the layout sketch after this list)
  • The container metafile (e.g. 1.container under metadata) does not have to record the dbPath anymore, since the db is now bound to the HddsVolume and we already know which HddsVolume the container resides on.
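
For illustration, the resulting layout on a dedicated db disk would look roughly like this (the clusterID and the second StorageID below are placeholders based on the examples above):

             /ssd1/db/CID-<clusterID>/
                 DS-b559933f-9de3-4da4-a634-07d3a94f7438/   <- db instance for one HddsVolume
                 DS-<StorageID of another HddsVolume>/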

(Screenshot: Screen Shot 2022-04-14 at 6 55 42 PM)

When no SSD is available, the same disk as the data is used by default

  • Containers are created on the same disk where their block files reside
  • No DbVolume is created, i.e. hddsVolume.dbVolume = null
  • The configuration item is left unspecified
  • The db could easily be migrated to a newly added SSD later, e.g.

             mv /data1/hdds/CID-<clusterID>/DS-<StorageID>  /ssd1/db/CID-<clusterID>/

(Screenshot: Screen Shot 2022-04-14 at 6 55 53 PM)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6541

How was this patch tested?

New UTs.

@guihecheng (Contributor, Author):

@ChenSammi @nandakumar131 PTAL~

<value/>
<tag>OZONE, CONTAINER, STORAGE, MANAGEMENT</tag>
<description>Determines where the per-disk rocksdb instances will be
stored. Defaults to empty if not specified, then rocksdb instances
@ChenSammi (Contributor) commented Apr 12, 2022:

Defaults to empty if not specified -> This setting is optional. If unspecified

context, VolumeType.META_VOLUME, volumeChecker);
if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
.equals(OzoneConsts.SCHEMA_V3)) {
dbVolumeSet = HddsServerUtil.getDatanodeDbDirs(conf).isEmpty() ? null :
@ChenSammi (Contributor) commented Apr 12, 2022:

  1. Shut down this dbVolumeSet in the #stop function.
  2. #getNodeReport should handle this new dbVolumeSet.
  3. Need a new scanner for dbVolumeSet.

Contributor Author:

  1. Oh, yes, we should shut it down there (see the sketch below).
  2. Sure.
  3. That's a good idea; we could have a new scanner to check the db instances in another PR.
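
A minimal sketch of point 1, assuming dbVolumeSet stays null when no dedicated db disks are configured (the rest of the stop logic is elided):

```java
// Sketch only: stop the optional dbVolumeSet together with the other volume
// sets; the null check covers deployments without dedicated db disks.
public void stop() {
  // ... existing shutdown of services and data/metadata volume sets ...
  if (dbVolumeSet != null) {
    dbVolumeSet.shutdown();
  }
}
```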

Contributor:

I created HDDS-6616 to track this new db scanner requirement.

if (storageDirs == null) {
LOG.error("IO error for the db volume {}, skipped loading",
getStorageDir());
return false;
@ChenSammi (Contributor) commented Apr 12, 2022:

Please throw an exception in all of the false cases.

Contributor Author:

OK, I'll follow the handling of those in HddsVolume.


if (!getStorageDir().exists()) {
// Not formatted yet
return true;
@ChenSammi (Contributor) commented Apr 12, 2022:

  1. Please do the format action here. Don't delay it to the later #checkVolume function.
  2. Check whether the storageDir is a directory or not. Refer to HddsVolume#analyzeVolumeState.
  3. A VERSION file is needed under the storageDir. Refer to HddsVolume#createVersionFile.

Contributor Author:

Yes, as I understand it, generally we should have a similar volume state lifecycle here to what HddsVolume has.
I think I'll try to extract the VolumeState and VERSION file handling into a common place such as StorageVolume,
so that both HddsVolume and DbVolume can share the same management.
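
A rough sketch of the shared VERSION-file handling being proposed here, following the existing HddsVolume#createVersionFile pattern; the getters and the DatanodeVersionFile usage below are assumptions for illustration, not the final patch:

```java
// Inside the common StorageVolume base class (sketch): write the VERSION file
// and move the volume to NORMAL state, so HddsVolume and DbVolume share the
// same lifecycle. Helper names are assumed, mirroring HddsVolume.
protected void createVersionFile() throws IOException {
  File versionFile = new File(getStorageDir(), "VERSION");
  DatanodeVersionFile dnVersionFile = new DatanodeVersionFile(
      getStorageID(), getClusterID(), getDatanodeUuid(),
      Time.now(), getLayoutVersion());
  dnVersionFile.createVersionFile(versionFile);
  setState(VolumeState.NORMAL);
}
```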

volume.getStorageDir().getAbsolutePath(),
clusterIdDir.getAbsolutePath());
}
continue;
Contributor:

Please throw exception instead.

Contributor Author:

Here in this function we are iterating over the db instances of all HddsVolumes, so we don't just throw and bail out.
We should go on trying to load as many db instances as possible.

@ChenSammi (Contributor) commented Apr 13, 2022:

For storage, data integrity and security are very important, so we had better be conservative and fail fast if something happens that requires the admin's notice and intervention. So please do throw the exception here and in the other places.

Besides, the default value of failedDbVolumeTolerated is -1, which is too optimistic for a storage system. We should change the default value to 0 for every failedVolumeTolerated setting, but we can do that in a follow-up PR.

@guihecheng (Contributor, Author) commented Apr 13, 2022:

Just a simple question here: say we have 100 disks for a DN and no extra SSDs, so db instances are on the same disks as the data. If a disk failure causes the db init to fail, do we continue with the other disks or just throw and bail out?
I don't think we stop DN startup on a disk failure for now with per-container db instances, since disk failure is common and it does not really hurt data integrity or security in a distributed system.

failedDbVolumeTolerated defaulting to -1 means we'll keep going as long as we have at least one dbVolume left (if any are configured); please check #hasEnoughVolumes (a sketch of the semantics follows below).
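
For reference, a minimal sketch of the -1 semantics described above; field and method names are approximations of #hasEnoughVolumes, not the exact code:

```java
// -1 means "no fixed limit": the set is healthy as long as at least one
// volume survives; otherwise compare failures against the tolerated count.
public boolean hasEnoughVolumes() {
  if (maxVolumeFailuresTolerated == -1) {
    return !getVolumesList().isEmpty();
  }
  return getFailedVolumesList().size() <= maxVolumeFailuresTolerated;
}
```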

@ChenSammi (Contributor) commented Apr 13, 2022:

It depends on how many disk failures are tolerated. For example, in HDFS the default number of tolerated disk failures is 0, which means that as soon as one disk fails, the DN cannot start up. The admin may change the configuration based on his/her estimate of the disk failure rate. That's acceptable, and the risk is on the admin. So "at least one disk volume" is not acceptable; the risk is too high.

Think about this case: all HDDS volumes function well while one of the two configured dbVolumes is down at DN startup. How will the impacted HDDS volumes behave? Create a new RocksDB instance on the remaining dbVolume and then go on providing service?
I think we should persist this HDDS volume to RocksDB instance relation in the HDDS VERSION file. And we also need a "replicate container metadata" command to recover the rocksdb content for a given container.

Contributor Author:

And for the case of 2 DbVolumes with 1 failed, first I should make it clear that the case you raised is already prevented by the current code (plus the stricter check that I proposed): the HddsVolumes whose db instances sit on the failed dbVolume still have their clusterID dir created, so no new db instance can be created, because each HddsVolume goes through checkVolume only once and that has already happened.

Contributor Author:

Lastly, I think I really should add dedicated tests for the cases you raised; thanks for checking them seriously.

Contributor:

Another case I just thought of is upgrading. If we enable this feature on an existing cluster, hddsFiles.length will be 2 in that case, so which piece of code will create the per-disk RocksDB instance for us?

Contributor Author:

We'll have a dedicated patch for non-rolling upgrade in the end, and we can define non-rolling upgrade hooks to create the db instances upon FinalizeUpgrade.
We could implement those hooks by referring to the hooks defined for SCM HA.
(BTW, in our internal version we disabled the non-rolling upgrade feature, so the db instance logic there is a bit different from the code presented here.)

Contributor:

OK, sounds good.

* @param clusterID
* @param conf
*/
public static void formatDbStoreForHddsVolume(HddsVolume hddsVolume,
Contributor:

formatDbStoreForHddsVolume -> createDbStoreForHddsVolume

Contributor Author:

OK

}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
.equals(OzoneConsts.SCHEMA_V3)) {
Contributor:

Add a check that hddsVolume's dbParentDir is already set, to avoid duplicate rocksdb initialization.

Contributor Author:

Actually we don't have to, because of the check above that hddsFiles.length == 1: it guarantees that this HddsVolume is fresh.
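
A sketch of the freshness check being referred to, under the assumption that a freshly formatted HddsVolume root contains only its VERSION file (the createDbStoreForHddsVolume call uses the parameters from the javadoc quoted above):

```java
// hddsFiles.length == 1 -> only the VERSION file exists, i.e. the volume is
// fresh, so the per-disk RocksDB instance cannot have been created before.
File[] hddsFiles = hddsRoot.listFiles();
if (hddsFiles != null && hddsFiles.length == 1
    && VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
        .equals(OzoneConsts.SCHEMA_V3)) {
  createDbStoreForHddsVolume(hddsVolume, clusterID, conf);
}
```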

}
continue;
}
volume.setDbParentDir(storageIdDir);
Contributor:

Move this statement to the end of this function.

Contributor Author:

OK, makes sense.

} catch (IOException e) {
if (logger != null) {
logger.error("Can't load db instance under path {} for volume {}",
containerDBPath, volume.getStorageDir().getAbsolutePath());
Contributor:

Throw exception here.

Contributor Author:

Well, if you don't really like the way we log an error and go on with the other instances, I could split this big function into smaller ones and throw inside while catching (and going on) outside.
But generally we are making a best effort here; otherwise we would miss some good db instances.

volume.getStorageDir().getAbsolutePath(),
storageIdDir.getAbsolutePath());
}
continue;
Contributor:

Throw exception here.

Contributor Author:

ditto


dbVolumeList.parallelStream().forEach(dbVolume -> {
String id = VersionedDatanodeFeatures.ScmHA
.chooseContainerPathID(configuration, scmId, clusterId);
Contributor:

Move this statement out of the stream.

Contributor Author:

OK, makes sense.
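
Roughly, the refactor suggested above would look like this (built from the snippet quoted in this thread; the loop body is elided):

```java
// Resolve the container-path id once, before the parallel stream, instead of
// recomputing it for every DbVolume.
String id = VersionedDatanodeFeatures.ScmHA
    .chooseContainerPathID(configuration, scmId, clusterId);
dbVolumeList.parallelStream().forEach(dbVolume -> {
  // ... format/check each dbVolume using the precomputed id ...
});
```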

@ChenSammi (Contributor):

@guihecheng, I just left some comments. We can have a call if some comments confuse you.

// This should be on a test path, normally we should get
// the clusterID from SCM, and it should not be available
// while restarting.
if (clusterID != null) {
Contributor Author:

The clusterID comes from the SCM, but in unit tests we don't really have an SCM, so we always have a global clusterID set and passed to MutableVolumeSet, and all volumes in the set get initialized.
But as discussed above, if we are going to do the format early rather than in checkVolume, then we may not need this.

@guihecheng (Contributor, Author) commented Apr 12, 2022:

@ChenSammi Thanks for your comments, I'll update soon, let's keep in touch.

return false;
}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
Contributor:

Please use a helper function to detect whether SchemaV3 is enabled, and use this helper in all the places that need the detection.

I discussed the feature with Arpit today. His suggestion is that in the first released version we disable this feature by default; users can turn it on manually. So besides this VersionedDatanodeFeatures check, we may need a property which disables or enables this feature, just like we did internally.
When this feature becomes mature enough, we deprecate the property and make the feature always enabled.

Contributor Author:

Yes, of course, I couldn't agree more with Arpit.
It would be easy for me to add this config back to the code.
Having this config item also benefits the compatibility tests, as I found when testing this feature internally.
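
A minimal sketch of such a helper, assuming a boolean property guards the feature; the property key and method name here are placeholders, not the final ones:

```java
// SchemaV3 is active only when the (assumed) config switch is on AND the
// datanode feature check resolves to schema V3.
public static boolean isSchemaV3Enabled(ConfigurationSource conf) {
  boolean enabledByConf =
      conf.getBoolean("hdds.datanode.container.schema.v3.enabled", false);
  return enabledByConf
      && VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
          .equals(OzoneConsts.SCHEMA_V3);
}
```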

@guihecheng (Contributor, Author) commented Apr 19, 2022:

Hi @ChenSammi, I've just pushed a new revision.
Checklist of addressed comments:

  • ozone-default description
  • shutdown dbVolumeSet
  • nodeReport for dbVolumeSet
  • VERSION file logic extracted into a common place (StorageVolume), so DbVolume now has a VERSION file
  • dbParentDir is only set at the end of the function
  • a config switch for schemaV3 with a helper function to check it
  • tests check that no duplicate db instance will be created on dbVolume failures.
  • Some renames like formatXXX -> createXXX

Apart from the comments above, I've made some refactorings to keep related things together.

/**
* Records all HddsVolumes that put its db instance under this DbVolume.
*/
private final Set<String> hddsVolumeIDs;
@ChenSammi (Contributor) commented Apr 20, 2022:

Can we change this Set to a Map holding both the storageID and the db path, so that we can use the db path directly in #closeAllDbStore?

Contributor Author:

That's a good idea; then we can avoid creating some temporary File objects.
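
A small sketch of the Set-to-Map change being agreed on here (field and method names are illustrative):

```java
// Map each HddsVolume's StorageID to the db path living under this DbVolume,
// so closeAllDbStore can reach the db instance without rebuilding File paths.
private final Map<String, String> hddsDbStorePathMap = new ConcurrentHashMap<>();

void closeAllDbStore() {
  for (String containerDBPath : hddsDbStorePathMap.values()) {
    // look up the cached DatanodeStore for containerDBPath and stop it
  }
}
```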


private void scanForHddsVolumeIDs() throws IOException {
// Not formatted yet
if (getClusterID() == null) {
Contributor:

Would it be better if we detect whether state == NORMAL, then continue the process, and otherwise return?

Checking the state is more straightforward.

Contributor Author:

Well, after reading the initialize routine again, I'm sure that checking state == NORMAL is correct, because initialize will be called a second time in format, and then we will have our clusterID.
But that seems less straightforward and a bit harder to understand.
Could I keep the clusterID == null check here?

try {
db.getStore().stop();
} catch (Exception e) {
LOG.warn("Stop DatanodeStore: {} failed", containerDBPath, e);
Contributor:

Use error level.

Contributor Author:

Fine.

public static boolean checkVolume(StorageVolume volume, String scmId,
String clusterId, ConfigurationSource conf, Logger logger,
MutableVolumeSet dbVolumeSet) {
File hddsRoot = volume.getStorageDir();
Contributor:

We should use a more generic name than "hdds" in this function now.

Contributor Author:

OK, then I shall rename it to volumeRoot.

try {
volume.loadDbStore();
} catch (IOException e) {
onFailure(volume);
Contributor:

Shall we just fail the volume, like "hddsVolumeSet.failVolume(volume.getStorageDir().getPath());"?

Contributor Author:

Nope, onFailure and failVolume are different.
Here the db load failed; it may be caused by a failed disk or by other reasons such as not enough memory, rocksdb bugs, etc.
So onFailure checks asynchronously whether this volume is bad or not. Only with the onFailure call can we trigger a potential "max tolerated volumes reached" event; failVolume will not do that.
And we can't just failVolume here, because there may be containers of old schemas (v1, v2) on this volume and we should keep them readable.

@ChenSammi (Contributor):

The last patch LGTM, +1.

Thanks @guihecheng for the contribution.

@ChenSammi merged this pull request into apache:HDDS-3630 on Apr 22, 2022.
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Apr 22, 2022