HDDS-6541. [Merge rocksdb in datanode] Per-disk DB location management. #3292
Conversation
@ChenSammi @nandakumar131 PTAL~
<value/>
<tag>OZONE, CONTAINER, STORAGE, MANAGEMENT</tag>
<description>Determines where the per-disk rocksdb instances will be
stored. Defaults to empty if not specified, then rocksdb instances
"Defaults to empty if not specified" -> "This setting is optional. If unspecified, ..."
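For illustration, a minimal sketch of the "optional" semantics suggested here: an empty or missing list of db directories simply means the per-disk rocksdb instances are created on the data volumes themselves. The helper name is made up for this sketch and is not the actual Ozone API.

```java
import java.util.Collection;
import java.util.Collections;

// Hedged sketch: unspecified db dirs -> no dedicated db volumes,
// so the rocksdb instances fall back to the data disks.
final class DbDirFallbackExample {
  static boolean useDedicatedDbVolumes(Collection<String> configuredDbDirs) {
    return configuredDbDirs != null && !configuredDbDirs.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(useDedicatedDbVolumes(Collections.emptyList()));          // false
    System.out.println(useDedicatedDbVolumes(Collections.singletonList("/ssd1/db"))); // true
  }
}
```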
context, VolumeType.META_VOLUME, volumeChecker);
if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)) {
  dbVolumeSet = HddsServerUtil.getDatanodeDbDirs(conf).isEmpty() ? null :
- Shut down this dbVolumeSet in the #stop function (a sketch follows below).
- #getNodeReport should handle this new dbVolumeSet.
- We need a new scanner for the dbVolumeSet.
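A minimal sketch of the first point, assuming the owning service keeps the volume sets shown in the diff context above as fields; the shutdown() call on MutableVolumeSet is an assumption of this sketch.

```java
// Hedged sketch: dbVolumeSet only exists for SchemaV3, so guard for null.
public void stop() {
  volumeSet.shutdown();
  metaVolumeSet.shutdown();
  if (dbVolumeSet != null) {
    dbVolumeSet.shutdown();
  }
}
```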
- Oh, yes, we should shut it down there.
- Sure.
- That's a good idea; we could add a new scanner to check the db instances in another PR.
I created HDDS-6616 to track this new db scanner requirement.
if (storageDirs == null) {
  LOG.error("IO error for the db volume {}, skipped loading",
      getStorageDir());
  return false;
Please throw an exception in all of the false cases.
OK, I'll follow the handling of those in HddsVolume.
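A hedged sketch of that change against the snippet above; the exception message is illustrative.

```java
if (storageDirs == null) {
  // Surface the failure instead of logging and returning false,
  // mirroring how HddsVolume reports an unreadable storage directory.
  throw new IOException("IO error for the db volume "
      + getStorageDir() + ", skipped loading");
}
```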
if (!getStorageDir().exists()) {
  // Not formatted yet
  return true;
- Please do the format action here. Don't delay it to the later #checkVolume function.
- Check whether the storageDir is a directory or not. Refer to HddsVolume#analyzeVolumeState.
- We need a VERSION file under the storageDir. Refer to HddsVolume#createVersionFile.
Yes, as I understand it, we should generally have a volume state lifecycle here similar to the one HddsVolume has.
I'll try to extract the VolumeState and VERSION file handling into a common place such as StorageVolume,
so that HddsVolume and DbVolume share the same management.
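A rough sketch of that extraction; the enum values and helper names are modeled on HddsVolume but are illustrative here, not a confirmed API.

```java
import java.io.File;

// Hedged sketch: a shared base class owning the VolumeState lifecycle and
// the VERSION file checks, so HddsVolume and DbVolume manage them the same way.
public abstract class StorageVolume {
  public enum VolumeState {
    NORMAL, FAILED, NON_EXISTENT, INCONSISTENT, NOT_FORMATTED, NOT_INITIALIZED
  }

  protected VolumeState analyzeVolumeState() {
    if (!getStorageDir().exists()) {
      return VolumeState.NON_EXISTENT;   // needs format
    }
    if (!getStorageDir().isDirectory()) {
      return VolumeState.INCONSISTENT;   // something else is in the way
    }
    if (!getVersionFile().exists()) {
      return VolumeState.NOT_FORMATTED;  // create the VERSION file first
    }
    return VolumeState.NORMAL;
  }

  protected abstract File getStorageDir();

  protected abstract File getVersionFile();
}
```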
volume.getStorageDir().getAbsolutePath(),
    clusterIdDir.getAbsolutePath());
}
continue;
Please throw an exception instead.
Here in this function we are iterating over the db instances of all HddsVolumes, so we shouldn't just throw and bail out.
We should keep trying to load as many db instances as possible.
For storage, data integrity and security are very important, so we'd better be conservative and fail fast when something happens that requires an admin's notice and intervention. So please do throw the exception here and in the other places.
Besides, the default value of failedDbVolumeTolerated is -1, which is too optimistic for a storage system. We should change the default to 0 for all failedVolumeTolerated settings, but we can do that in a follow-up PR.
Just a simple question here: say we have 100 disks on a DN and no extra SSDs, so the db instances are on the same disks as the data. If a disk failure causes the db init to fail, do we continue with the other disks or just throw and bail out?
I don't think we would stop DN startup on a disk failure for now with per-container db instances, since disk failures are common; it does not really hurt data integrity or security for a distributed system.
failedDbVolumeTolerated defaulting to -1 means we'll still require at least one healthy dbVolume if any are configured; please check #hasEnoughVolumes.
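A rough sketch of the tolerance semantics being discussed; the parameter names are assumptions, not the exact Ozone implementation.

```java
// Hedged sketch: -1 means "tolerate any number of failures as long as at
// least one volume survives"; a non-negative value caps the failed count.
static boolean hasEnoughVolumes(int failuresTolerated,
    int healthyVolumes, int failedVolumes) {
  if (failuresTolerated == -1) {
    return healthyVolumes >= 1;
  }
  return failedVolumes <= failuresTolerated;
}
```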
It depends on how many disk failures are tolerated. For example, in HDFS the default number of tolerated disk failures is 0, which means the DN cannot start up if even one disk fails. An admin may change the configuration based on his or her estimate of the disk failure rate; that's acceptable, and the risk is on the admin. So "at least one db volume" is not acceptable; the risk is too high.
Think about this case: all HDDS volumes function well while one of the two configured dbVolumes is down at DN startup. How will the impacted HDDS volume behave? Will it create a new RocksDB instance on the remaining dbVolume and then go on providing service?
I think we should persist this HDDS-volume-to-RocksDB-instance relation in the HDDS VERSION file. We also need a "replicate container metadata" command to recover the rocksdb content for a dedicated container.
For the case of 2 DbVolumes with 1 failed, I should first make it clear that the scenario you raised is already prevented by the current code (plus the stricter check I proposed): the HddsVolumes whose db instances sit on the failed dbVolume still have their clusterID dir created, so no new db instance can be created for them, because each HddsVolume goes through checkVolume only once and never again.
Finally, I should really add dedicated tests for the cases you raised. Thanks for checking this carefully.
Another case I just thought about is upgrading. If we enable this feature on an existing cluster, hddsFiles.length will be 2 in that case, so which piece of code will create the per-disk rocksdb instance?
We'll have a dedicated patch for non-rolling upgrade in the end, and we could define non-rolling-upgrade hooks to create the db instances upon FinalizeUpgrade.
We could implement those hooks by referring to the hooks defined for SCM HA.
(BTW, in our internal version we disabled the non-rolling upgrade feature, so the db instance logic there is a bit different from the code presented here.)
OK, sounds good.
 * @param clusterID
 * @param conf
 */
public static void formatDbStoreForHddsVolume(HddsVolume hddsVolume,
formatDbStoreForHddsVolume -> createDbStoreForHddsVolume
OK
}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)) {
Add a check on whether hddsVolume's dbParentDir is already set, to avoid duplicate rocksdb initialization.
Actually we don't have to, because the check above that hddsFiles.length == 1 already ensures this HddsVolume is fresh.
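For reference, a hedged sketch of the guard the reviewer suggested; the getDbParentDir() accessor is assumed, and the call's parameters follow the javadoc in the diff above.

```java
// Hedged sketch: the hddsFiles.length == 1 check already guarantees a freshly
// formatted volume; a dbParentDir null-check would be an extra, defensive
// guard against double initialization.
if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
    .equals(OzoneConsts.SCHEMA_V3)
    && hddsVolume.getDbParentDir() == null) {
  formatDbStoreForHddsVolume(hddsVolume, clusterID, conf);
}
```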
  }
  continue;
}
volume.setDbParentDir(storageIdDir);
Move this statement to the end of this function.
OK, makes sense.
} catch (IOException e) {
  if (logger != null) {
    logger.error("Can't load db instance under path {} for volume {}",
        containerDBPath, volume.getStorageDir().getAbsolutePath());
Throw an exception here.
Well, if you don't really like the way we log an error and go on to the other instances, I could split this big function into smaller ones and throw inside, while catching and continuing outside.
But in general we want to make a best effort, or we'll miss some good db instances.
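A hedged sketch of that split (method names are illustrative): the per-volume helper throws, while the caller catches and moves on, keeping the best-effort behavior.

```java
// Hedged sketch, not the actual patch code.
private static void loadDbStoreForVolume(HddsVolume volume)
    throws IOException {
  // resolve containerDBPath for this volume and open its db instance;
  // any failure is thrown to the caller
}

public static void loadAllHddsVolumeDbStore(
    List<HddsVolume> volumes, Logger logger) {
  for (HddsVolume volume : volumes) {
    try {
      loadDbStoreForVolume(volume);
    } catch (IOException e) {
      if (logger != null) {
        logger.error("Can't load db instance for volume {}",
            volume.getStorageDir().getAbsolutePath(), e);
      }
      // best effort: continue with the remaining volumes
    }
  }
}
```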
volume.getStorageDir().getAbsolutePath(),
    storageIdDir.getAbsolutePath());
}
continue;
Throw an exception here.
ditto
dbVolumeList.parallelStream().forEach(dbVolume -> {
  String id = VersionedDatanodeFeatures.ScmHA
      .chooseContainerPathID(configuration, scmId, clusterId);
Move this statement out of the stream.
OK, makes sense.
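A hedged sketch of the suggested change, based directly on the snippet above:

```java
// Resolve the id once, outside the parallel stream, instead of per volume.
final String id = VersionedDatanodeFeatures.ScmHA
    .chooseContainerPathID(configuration, scmId, clusterId);
dbVolumeList.parallelStream().forEach(dbVolume -> {
  // ... use the precomputed id when checking this dbVolume ...
});
```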
@guihecheng, I just left some comments. We can have a call if any of them confuse you.
// This should be on a test path, normally we should get
// the clusterID from SCM, and it should not be available
// while restarting.
if (clusterID != null) {
The clusterID comes from the SCM, but in unit tests we don't really have an SCM, so we always have a global clusterID set and passed to MutableVolumeSet, and all volumes in the set get initialized.
But as discussed above, if we format early rather than in checkVolume, we may not need this.
@ChenSammi Thanks for your comments, I'll update soon, let's keep in touch.
return false;
}

if (VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion()
Please use a helper function to detect whether SchemaV3 is enabled, and use this helper in all the places that need the detection.
I discussed the feature with Arpit today. His suggestion is that in the first released version we disable this feature by default, and users can turn it on manually. So besides this VersionedDatanodeFeatures check, we may need a property that enables or disables the feature, just like we did internally.
When the feature becomes mature enough, we can deprecate this property and make the feature always enabled.
Yes, of course, I couldn't agree more with Arpit.
It will be easy for me to add this config back to the code.
Having this config item also benefits compatibility testing, as I tested this feature internally.
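A hedged sketch of such a helper plus the on/off property; the property key and method name are made up for illustration and are not the actual Ozone configuration.

```java
// Hedged sketch: one place that decides whether the per-disk rocksdb
// (SchemaV3) feature is active, combining an explicit enable switch with
// the finalized schema version.
public static boolean isSchemaV3Enabled(ConfigurationSource conf) {
  boolean enabled = conf.getBoolean(
      "hdds.datanode.container.schema.v3.enabled", false); // assumed key
  return enabled && OzoneConsts.SCHEMA_V3.equals(
      VersionedDatanodeFeatures.SchemaV3.chooseSchemaVersion());
}
```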
Hi @ChenSammi, I've made a new push just now.
Apart from addressing the comments above, I've made some refactors to keep related things together.
/**
 * Records all HddsVolumes that put its db instance under this DbVolume.
 */
private final Set<String> hddsVolumeIDs;
Can we change this Set to a Map that holds both the storageID and the db path, so that we can use the db path directly in #closeAllDbStore?
That's a good idea; then we can avoid creating some temporary File objects.
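A hedged sketch of the Set-to-Map change; the field and method names are illustrative.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: key by the HddsVolume's StorageID and keep the resolved db
// path as the value, so closeAllDbStore() no longer has to rebuild it.
final class DbVolumeRegistryExample {
  private final Map<String, File> hddsDbStorePaths = new HashMap<>();

  void addHddsDbStorePath(String storageID, File containerDBPath) {
    hddsDbStorePaths.put(storageID, containerDBPath);
  }

  void closeAllDbStore() {
    for (File containerDBPath : hddsDbStorePaths.values()) {
      // close/stop the db instance at containerDBPath directly
    }
  }
}
```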
private void scanForHddsVolumeIDs() throws IOException {
  // Not formatted yet
  if (getClusterID() == null) {
Would it be better to check whether state == NORMAL, continue the process if so, and otherwise return?
Checking the state is more straightforward.
Well, after reading the initialize routine again, I'm sure that checking state == NORMAL would be correct, because initialize is called a second time during format, and by then we have our clusterID.
But that seems less straightforward and a bit harder to understand.
Could I keep the clusterID == null check here?
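For context, a hedged sketch of the check being kept; the body after the guard is summarized, not the actual patch code.

```java
// Hedged sketch: before format() completes, the clusterID is still null, so
// there is nothing to scan yet; after format(), initialize() runs again with
// the clusterID available and the scan proceeds.
private void scanForHddsVolumeIDs() throws IOException {
  if (getClusterID() == null) {
    // Not formatted yet
    return;
  }
  // ... scan the clusterID dir for DS-<StorageID> subdirectories and record them ...
}
```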
try {
  db.getStore().stop();
} catch (Exception e) {
  LOG.warn("Stop DatanodeStore: {} failed", containerDBPath, e);
Use error level.
Fine.
public static boolean checkVolume(StorageVolume volume, String scmId,
    String clusterId, ConfigurationSource conf, Logger logger,
    MutableVolumeSet dbVolumeSet) {
  File hddsRoot = volume.getStorageDir();
We should use a more generic name than "hdds" in this function now.
OK, then I shall rename it to volumeRoot.
try {
  volume.loadDbStore();
} catch (IOException e) {
  onFailure(volume);
Shall we just fail the volume, like "hddsVolumeSet.failVolume(volume.getStorageDir().getPath());"?
Nope, onFailure and failVolume are different.
Here the db load failed; it may be caused by a failed disk or by other reasons such as insufficient memory, rocksdb bugs, etc.
So onFailure checks asynchronously whether this volume is bad or not. Only the onFailure call can trigger a potential "max tolerated volume failures reached" event; failVolume will not.
And we can't just failVolume here, because there may be containers with old schemas (V1, V2) on this volume, and we should keep them readable.
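A hedged restatement of the snippet above with the rationale as comments; the logging call is illustrative.

```java
try {
  volume.loadDbStore();
} catch (IOException e) {
  // Hand the volume to the asynchronous checker instead of failing it
  // outright: the disk may be fine (OOM, rocksdb bug, ...), and containers
  // with old schemas (V1/V2) on this volume must stay readable.
  onFailure(volume);
  if (logger != null) {
    logger.error("Failed to load db store for volume {}",
        volume.getStorageDir(), e);
  }
}
```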
The last patch LGTM, +1. Thanks @guihecheng for the contribution.
What changes were proposed in this pull request?
Per-disk DB location management.
More details about the db location can be found in the JIRA below.
Here are brief descriptions of the three separate commits:
- A new StorageVolume type, DbVolume, for optional dedicated SSDs that host the db instances, to speed up meta operations. We have one subdirectory for each data volume under a db volume, with the StorageID (UUID) of the data volume as the name of the subdirectory.
- When extra SSDs are used, a data volume's db instance lives under a db volume, e.g. /ssd1/db/CID-<clusterID>/DS-<StorageID>.
- When no SSD is available, the db instance is placed on the same disk as the data by default, e.g. /data1/hdds/CID-<clusterID>/DS-<StorageID>.

A db instance can later be moved from a data disk onto an SSD with a simple directory move, for example:
mv /data1/hdds/CID-<clusterID>/DS-<StorageID> /ssd1/db/CID-<clusterID>/
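A minimal sketch of the layout described above; the helper and its parameters are assumptions for illustration, not the patch's actual code.

```java
import java.io.File;

// Hedged sketch: the per-volume rocksdb parent dir is
// <root>/CID-<clusterID>/DS-<StorageID>, where <root> is the db volume if
// one is configured, else the data volume itself.
final class DbPathExample {
  static File resolveDbParentDir(File hddsVolumeRoot, File dbVolumeRoot,
      String clusterIdDirName, String storageIdDirName) {
    File root = (dbVolumeRoot != null) ? dbVolumeRoot : hddsVolumeRoot;
    return new File(new File(root, clusterIdDirName), storageIdDirName);
  }

  public static void main(String[] args) {
    // e.g. /ssd1/db/CID-example/DS-example
    System.out.println(resolveDbParentDir(
        new File("/data1/hdds"), new File("/ssd1/db"),
        "CID-example", "DS-example"));
  }
}
```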
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6541
How was this patch tested?
New UTs.