HDDS-6541. [Merge rocksdb in datanode] Per-disk DB location management. #3292
Conversation
@ChenSammi @nandakumar131 PTAL~
<value/>
<tag>OZONE, CONTAINER, STORAGE, MANAGEMENT</tag>
<description>Determines where the per-disk rocksdb instances will be
stored. Defaults to empty if not specified, then rocksdb instances
"Defaults to empty if not specified" -> "This setting is optional. If unspecified, ..."
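For illustration, a minimal sketch of the "optional" semantics suggested here: an empty or missing list of db directories simply means the per-disk rocksdb instances are created on the data volumes themselves. The helper name is made up for this sketch and is not the actual Ozone API.

```java
import java.util.Collection;
import java.util.Collections;

// Hedged sketch: unspecified db dirs -> no dedicated db volumes,
// so the rocksdb instances fall back to the data disks.
final class DbDirFallbackExample {
  static boolean useDedicatedDbVolumes(Collection<String> configuredDbDirs) {
    return configuredDbDirs != null && !configuredDbDirs.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(useDedicatedDbVolumes(Collections.emptyList()));          // false
    System.out.println(useDedicatedDbVolumes(Collections.singletonList("/ssd1/db"))); // true
  }
}
```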
context, VolumeType.META_VOLUME, volumeChecker);
if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)) {
  dbVolumeSet = HddsServerUtil.getDatanodeDbDirs(conf).isEmpty() ? null :
- Shut down this dbVolumeSet in the #stop function (a sketch follows below).
- #getNodeReport should handle this new dbVolumeSet.
- We need a new scanner for the dbVolumeSet.
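A minimal sketch of the first point, assuming the owning service keeps the volume sets shown in the diff context above as fields; the shutdown() call on MutableVolumeSet is an assumption of this sketch.

```java
// Hedged sketch: dbVolumeSet only exists for SchemaV3, so guard for null.
public void stop() {
  volumeSet.shutdown();
  metaVolumeSet.shutdown();
  if (dbVolumeSet != null) {
    dbVolumeSet.shutdown();
  }
}
```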
- Oh, yes, we should shut it down there.
- Sure.
- That's a good idea; we could add a new scanner to check the db instances in another PR.
I created HDDS-6616 to track this new db scanner requirement.
if (storageDirs == null) {
  LOG.error("IO error for the db volume {}, skipped loading",
      getStorageDir());
  return false;
Please throw an exception in all of the false cases.
OK, I'll follow the handling of those in HddsVolume.
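A hedged sketch of that change against the snippet above; the exception message is illustrative.

```java
if (storageDirs == null) {
  // Surface the failure instead of logging and returning false,
  // mirroring how HddsVolume reports an unreadable storage directory.
  throw new IOException("IO error for the db volume "
      + getStorageDir() + ", skipped loading");
}
```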
if (!getStorageDir().exists()) {
  // Not formatted yet
  return true;
- Please do the format action here. Don't delay it to the later #checkVolume function.
- Check whether the storageDir is a directory or not. Refer to HddsVolume#analyzeVolumeState.
- We need a VERSION file under the storageDir. Refer to HddsVolume#createVersionFile.
Yes, as I understand it, we should generally have a volume state lifecycle here similar to the one HddsVolume has.
I'll try to extract the VolumeState and VERSION file handling into a common place such as StorageVolume,
so that HddsVolume and DbVolume share the same management.
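A rough sketch of that extraction; the enum values and helper names are modeled on HddsVolume but are illustrative here, not a confirmed API.

```java
import java.io.File;

// Hedged sketch: a shared base class owning the VolumeState lifecycle and
// the VERSION file checks, so HddsVolume and DbVolume manage them the same way.
public abstract class StorageVolume {
  public enum VolumeState {
    NORMAL, FAILED, NON_EXISTENT, INCONSISTENT, NOT_FORMATTED, NOT_INITIALIZED
  }

  protected VolumeState analyzeVolumeState() {
    if (!getStorageDir().exists()) {
      return VolumeState.NON_EXISTENT;   // needs format
    }
    if (!getStorageDir().isDirectory()) {
      return VolumeState.INCONSISTENT;   // something else is in the way
    }
    if (!getVersionFile().exists()) {
      return VolumeState.NOT_FORMATTED;  // create the VERSION file first
    }
    return VolumeState.NORMAL;
  }

  protected abstract File getStorageDir();

  protected abstract File getVersionFile();
}
```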
volume.getStorageDir().getAbsolutePath(),
    clusterIdDir.getAbsolutePath());
}
continue;
Please throw an exception instead.
Here in this function we are iterating over the db instances of all HddsVolumes, so we shouldn't just throw and bail out.
We should keep trying to load as many db instances as possible.
For storage, data integrity and security are very important, so we'd better be conservative and fail fast when something happens that requires an admin's notice and intervention. So please do throw the exception here and in the other places.
Besides, the default value of failedDbVolumeTolerated is -1, which is too optimistic for a storage system. We should change the default to 0 for all failedVolumeTolerated settings, but we can do that in a follow-up PR.
Just a simple question here: say we have 100 disks on a DN and no extra SSDs, so the db instances are on the same disks as the data. If a disk failure causes the db init to fail, do we continue with the other disks or just throw and bail out?
I don't think we would stop DN startup on a disk failure for now with per-container db instances, since disk failures are common; it does not really hurt data integrity or security for a distributed system.
failedDbVolumeTolerated defaulting to -1 means we'll still require at least one healthy dbVolume if any are configured; please check #hasEnoughVolumes.
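A rough sketch of the tolerance semantics being discussed; the parameter names are assumptions, not the exact Ozone implementation.

```java
// Hedged sketch: -1 means "tolerate any number of failures as long as at
// least one volume survives"; a non-negative value caps the failed count.
static boolean hasEnoughVolumes(int failuresTolerated,
    int healthyVolumes, int failedVolumes) {
  if (failuresTolerated == -1) {
    return healthyVolumes >= 1;
  }
  return failedVolumes <= failuresTolerated;
}
```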
It depends on how many disk failures are tolerated. For example, in HDFS the default number of tolerated disk failures is 0, which means the DN cannot start up if even one disk fails. An admin may change the configuration based on his or her estimate of the disk failure rate; that's acceptable, and the risk is on the admin. So "at least one db volume" is not acceptable; the risk is too high.
Think about this case: all HDDS volumes function well while one of the two configured dbVolumes is down at DN startup. How will the impacted HDDS volume behave? Will it create a new RocksDB instance on the remaining dbVolume and then go on providing service?
I think we should persist this HDDS-volume-to-RocksDB-instance relation in the HDDS VERSION file. We also need a "replicate container metadata" command to recover the rocksdb content for a dedicated container.
For the case of 2 DbVolumes with 1 failed, I should first make it clear that the scenario you raised is already prevented by the current code (plus the stricter check I proposed): the HddsVolumes whose db instances sit on the failed dbVolume still have their clusterID dir created, so no new db instance can be created for them, because each HddsVolume goes through checkVolume only once and never again.
Finally, I should really add dedicated tests for the cases you raised. Thanks for checking this carefully.
Another case I just thought about is upgrading. If we enable this feature on an existing cluster, hddsFiles.length will be 2 in that case, so which piece of code will create the per-disk rocksdb instance?
We'll have a dedicated patch for non-rolling upgrade in the end, and we could define non-rolling-upgrade hooks to create the db instances upon FinalizeUpgrade.
We could implement those hooks by referring to the hooks defined for SCM HA.
(BTW, in our internal version we disabled the non-rolling upgrade feature, so the db instance logic there is a bit different from the code presented here.)
OK, sounds good.
 * @param clusterID
 * @param conf
 */
public static void formatDbStoreForHddsVolume(HddsVolume hddsVolume,
formatDbStoreForHddsVolume -> createDbStoreForHddsVolume
OK
}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)) {
Add a check on whether hddsVolume's dbParentDir is already set, to avoid duplicate rocksdb initialization.
Actually we don't have to, because the check above that hddsFiles.length == 1 already ensures this HddsVolume is fresh.
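For reference, a hedged sketch of the guard the reviewer suggested; the getDbParentDir() accessor is assumed, and the call's parameters follow the javadoc in the diff above.

```java
// Hedged sketch: the hddsFiles.length == 1 check already guarantees a freshly
// formatted volume; a dbParentDir null-check would be an extra, defensive
// guard against double initialization.
if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)
    && hddsVolume.getDbParentDir() == null) {
  formatDbStoreForHddsVolume(hddsVolume, clusterID, conf);
}
```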
  }
  continue;
}
volume.setDbParentDir(storageIdDir);
Move this statement to the end of this function.
OK, makes sense.
} catch (IOException e) {
  if (logger != null) {
    logger.error("Can't load db instance under path {} for volume {}",
        containerDBPath, volume.getStorageDir().getAbsolutePath());
Throw an exception here.
Well, if you don't really like the way we log an error and go on to the other instances, I could split this big function into smaller ones and throw inside, while catching and continuing outside.
But in general we want to make a best effort, or we'll miss some good db instances.
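A hedged sketch of that split (method names are illustrative): the per-volume helper throws, while the caller catches and moves on, keeping the best-effort behavior.

```java
// Hedged sketch, not the actual patch code.
private static void loadDbStoreForVolume(HddsVolume volume)
    throws IOException {
  // resolve containerDBPath for this volume and open its db instance;
  // any failure is thrown to the caller
}

public static void loadAllHddsVolumeDbStore(
    List<HddsVolume> volumes, Logger logger) {
  for (HddsVolume volume : volumes) {
    try {
      loadDbStoreForVolume(volume);
    } catch (IOException e) {
      if (logger != null) {
        logger.error("Can't load db instance for volume {}",
            volume.getStorageDir().getAbsolutePath(), e);
      }
      // best effort: continue with the remaining volumes
    }
  }
}
```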
volume.getStorageDir().getAbsolutePath(),
    storageIdDir.getAbsolutePath());
}
continue;
Throw an exception here.
ditto
dbVolumeList.parallelStream().forEach(dbVolume -> {
  String id = VersionedDatanodeFeatures.ScmHA
      .chooseContainerPathID(configuration, scmId, clusterId);
Move this statement out of the stream.
OK, makes sense.
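A hedged sketch of the suggested change, based directly on the snippet above:

```java
// Resolve the id once, outside the parallel stream, instead of per volume.
final String id = VersionedDatanodeFeatures.ScmHA
    .chooseContainerPathID(configuration, scmId, clusterId);
dbVolumeList.parallelStream().forEach(dbVolume -> {
  // ... use the precomputed id when checking this dbVolume ...
});
```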
@guihecheng, I just left some comments. We can have a call if any of them confuse you.
// This should be on a test path, normally we should get
// the clusterID from SCM, and it should not be available
// while restarting.
if (clusterID != null) {
The clusterID comes from the SCM, but in unit tests we don't really have an SCM, so we always have a global clusterID set and passed to MutableVolumeSet, and all volumes in the set get initialized.
But as discussed above, if we format early rather than in checkVolume, we may not need this.
@ChenSammi Thanks for your comments, I'll update soon, let's keep in touch.
return false;
}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
Please use a helper function to detect whether SchemaV3 is enabled, and use this helper in all the places that need the detection.
I discussed the feature with Arpit today. His suggestion is that in the first released version we disable this feature by default, and users can turn it on manually. So besides this VersionedDatanodeFeatures check, we may need a property that enables or disables the feature, just like we did internally.
When the feature becomes mature enough, we can deprecate this property and make the feature always enabled.
Yes, of course, I couldn't agree more with Arpit.
It will be easy for me to add this config back to the code.
Having this config item also benefits compatibility testing, as I tested this feature internally.
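A hedged sketch of such a helper plus the on/off property; the property key and method name are made up for illustration and are not the actual Ozone configuration.

```java
// Hedged sketch: one place that decides whether the per-disk rocksdb
// (SchemaV3) feature is active, combining an explicit enable switch with
// the finalized schema version.
public static boolean isSchemaV3Enabled(ConfigurationSource conf) {
  boolean enabled = conf.getBoolean(
      "hdds.datanode.container.schema.v3.enabled", false); // assumed key
  return enabled && OzoneConsts.SCHEMA_V3.equals(
      VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion());
}
```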
Hi @ChenSammi, I've made a new push just now.
Apart from addressing the comments above, I've made some refactors to keep related things together.
/**
 * Records all HddsVolumes that put its db instance under this DbVolume.
 */
private final Set<String> hddsVolumeIDs;
Can we change this Set to a Map that holds both the storageID and the db path, so that we can use the db path directly in #closeAllDbStore?
That's a good idea; then we can avoid creating some temporary File objects.
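A hedged sketch of the Set-to-Map change; the field and method names are illustrative.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: key by the HddsVolume's StorageID and keep the resolved db
// path as the value, so closeAllDbStore() no longer has to rebuild it.
final class DbVolumeRegistryExample {
  private final Map<String, File> hddsDbStorePaths = new HashMap<>();

  void addHddsDbStorePath(String storageID, File containerDBPath) {
    hddsDbStorePaths.put(storageID, containerDBPath);
  }

  void closeAllDbStore() {
    for (File containerDBPath : hddsDbStorePaths.values()) {
      // close/stop the db instance at containerDBPath directly
    }
  }
}
```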
private void scanForHddsVolumeIDs() throws IOException {
  // Not formatted yet
  if (getClusterID() == null) {
Would it be better to check whether state == NORMAL, continue the process if so, and otherwise return?
Checking the state is more straightforward.
Well, after reading the initialize routine again, I'm sure that checking state == NORMAL would be correct, because initialize is called a second time during format, and by then we have our clusterID.
But that seems less straightforward and a bit harder to understand.
Could I keep the clusterID == null check here?
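For context, a hedged sketch of the check being kept; the body after the guard is summarized, not the actual patch code.

```java
// Hedged sketch: before format() completes, the clusterID is still null, so
// there is nothing to scan yet; after format(), initialize() runs again with
// the clusterID available and the scan proceeds.
private void scanForHddsVolumeIDs() throws IOException {
  if (getClusterID() == null) {
    // Not formatted yet
    return;
  }
  // ... scan the clusterID dir for DS-<StorageID> subdirectories and record them ...
}
```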
try {
  db.getStore().stop();
} catch (Exception e) {
  LOG.warn("Stop DatanodeStore: {} failed", containerDBPath, e);
Use error level.
Fine.
public static boolean checkVolume(StorageVolume volume, String scmId,
    String clusterId, ConfigurationSource conf, Logger logger,
    MutableVolumeSet dbVolumeSet) {
  File hddsRoot = volume.getStorageDir();
We should use a more generic name than "hdds" in this function now.
OK, then I shall rename it to volumeRoot.
try {
  volume.loadDbStore();
} catch (IOException e) {
  onFailure(volume);
Shall we just fail the volume, like "hddsVolumeSet.failVolume(volume.getStorageDir().getPath());"?
Nope, onFailure and failVolume are different.
Here the db load failed; it may be caused by a failed disk or by other reasons such as insufficient memory, rocksdb bugs, etc.
So onFailure checks asynchronously whether this volume is bad or not. Only the onFailure call can trigger a potential "max tolerated volume failures reached" event; failVolume will not.
And we can't just failVolume here, because there may be containers with old schemas (V1, V2) on this volume, and we should keep them readable.
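A hedged restatement of the snippet above with the rationale as comments; the logging call is illustrative.

```java
try {
  volume.loadDbStore();
} catch (IOException e) {
  // Hand the volume to the asynchronous checker instead of failing it
  // outright: the disk may be fine (OOM, rocksdb bug, ...), and containers
  // with old schemas (V1/V2) on this volume must stay readable.
  onFailure(volume);
  if (logger != null) {
    logger.error("Failed to load db store for volume {}",
        volume.getStorageDir(), e);
  }
}
```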
The last patch LGTM, +1. Thanks @guihecheng for the contribution.
What changes were proposed in this pull request?
Per-disk DB location management.
More details about the db location can be found in the JIRA below.
Here are brief descriptions of the three separate commits:
- A new StorageVolume type, DbVolume, for optional dedicated SSDs that host the db instances, to speed up meta operations. We have one subdirectory for each data volume under a db volume, with the StorageID (UUID) of the data volume as the name of the subdirectory.
- When extra SSDs are used, a data volume's db instance lives under a db volume, e.g. /ssd1/db/CID-<clusterID>/DS-<StorageID>.
- When no SSD is available, the db instance is placed on the same disk as the data by default, e.g. /data1/hdds/CID-<clusterID>/DS-<StorageID>.

A db instance can later be moved from a data disk onto an SSD with a simple directory move, for example:
mv /data1/hdds/CID-<clusterID>/DS-<StorageID> /ssd1/db/CID-<clusterID>/
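A minimal sketch of the layout described above; the helper and its parameters are assumptions for illustration, not the patch's actual code.

```java
import java.io.File;

// Hedged sketch: the per-volume rocksdb parent dir is
// <root>/CID-<clusterID>/DS-<StorageID>, where <root> is the db volume if
// one is configured, else the data volume itself.
final class DbPathExample {
  static File resolveDbParentDir(File hddsVolumeRoot, File dbVolumeRoot,
      String clusterIdDirName, String storageIdDirName) {
    File root = (dbVolumeRoot != null) ? dbVolumeRoot : hddsVolumeRoot;
    return new File(new File(root, clusterIdDirName), storageIdDirName);
  }

  public static void main(String[] args) {
    // e.g. /ssd1/db/CID-example/DS-example
    System.out.println(resolveDbParentDir(
        new File("/data1/hdds"), new File("/ssd1/db"),
        "CID-example", "DS-example"));
  }
}
```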
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6541
How was this patch tested?
New UTs.