Conversation
@sodonnel sodonnel left a comment


This change looks mostly good to me. I just have one question about possibly cleaning up the DB object if we don't need to use it, added inline.

As I understand it, before this patch, the expensive call to create the DB Entry was performed under the lock.

Now, with the patch, you check if the DB entry exists and if it does, return it. Otherwise, drop the lock, create the new DB entry outside the lock, and then reclaim the lock. The change also handles the case where another thread may create the DB entry between dropping and re-taking the lock.
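
For readers following the discussion, a minimal sketch of that pattern could look like the following. This is not the patch itself: the method shape and the open() helper are assumptions for illustration, while lock, get(), put() and incrementReference() follow the names used elsewhere in this thread.

  // Illustrative sketch only -- open() is a hypothetical helper standing in
  // for the expensive MetadataStoreBuilder/RocksDB open.
  public ReferenceCountedDB getDB(String containerDBPath) throws IOException {
    lock.lock();
    try {
      ReferenceCountedDB db = (ReferenceCountedDB) this.get(containerDBPath);
      if (db != null) {
        db.incrementReference();
        return db;                    // fast path: entry already cached
      }
    } finally {
      lock.unlock();                  // drop the lock before the expensive open
    }

    // The expensive DB open happens without holding the lock.
    ReferenceCountedDB newDb = open(containerDBPath);

    lock.lock();
    try {
      ReferenceCountedDB current = (ReferenceCountedDB) this.get(containerDBPath);
      if (current != null) {
        // Another thread created the entry while the lock was released;
        // what to do with newDb in this branch is the question raised below.
        current.incrementReference();
        return current;
      }
      this.put(containerDBPath, newDb);
      newDb.incrementReference();
      return newDb;
    } finally {
      lock.unlock();
    }
  }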


What happens to the ReferenceCountedDB instance created at line 140 if we return here?

Do we need to call db.cleanup() on the newly created instance, and if we don't, could this potentially leak db objects? I note that the remove() call in ContainerCache calls db.cleanup() when removing the entry, which makes me suspect this is important.

If we do need to call cleanup() on it, we could probably capture the return value inside the lock, call cleanup() outside the lock, and then return?
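
A rough sketch of that suggestion, assuming the instance opened outside the lock is held in a local variable newDb (the variable and structure are illustrative, not the final patch):

  ReferenceCountedDB winner;
  lock.lock();
  try {
    // Decide the winner while holding the lock.
    winner = (ReferenceCountedDB) this.get(containerDBPath);
    if (winner == null) {
      this.put(containerDBPath, newDb);
      winner = newDb;
    }
    winner.incrementReference();
  } finally {
    lock.unlock();
  }
  if (winner != newDb) {
    // Another thread won the race; close the instance we no longer need,
    // outside the lock.
    newDb.cleanup();
  }
  return winner;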


Thanks for the review. You're right. The new DB instance should be explicitly closed in this case.

@arp7 arp7 requested a review from bharatviswa504 June 29, 2020 15:36
@bharatviswa504

Hi @ChenSammi
Thanks for the PR.
I too have the same comment as @sodonnel. I think we need to close the new DB instance if we end up getting an instance from the cache.

Also, do you have this version on your deployed cluster?
a2ab8d6

This saves iterating the DB to compute block metadata, since all the required metadata can be fetched with db.get(). We observed this issue during a billion-object test.
For older containers created before HDDS-3217, we still iterate and set block metadata. (But this fixed a bug where we used to iterate three times for open containers.)

@ChenSammi

@bharatviswa504 we have HDDS-3217 deployed.


@bharatviswa504 bharatviswa504 left a comment


I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.

I think, with this change, we will throw an exception if we try to open an already-open database.

java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@ChenSammi

I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.

I think, with this change, we will throw an exception if we try to open an already-open database.

java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@bharatviswa504 , I get your point. It's an issue, but not one introduced by this patch. It's a pre-existing issue, and we need to think carefully about how to fix it in a new JIRA.

@ChenSammi

The failed integration test is unrelated.


I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.
I think, with this change, we will throw an exception if we try to open an already-open database.
java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@bharatviswa504 , I get your point. It's an issue, but not one introduced by this patch. It's a pre-existing issue, and we need to think carefully about how to fix it in a new JIRA.

@ChenSammi can you please clarify why you think it's an existing issue? It seems to me that calling build() outside of the lock introduces this problem.

If I understand correctly, the performance issue is due to using a single lock for all containers. Instead of moving the expensive part outside of the lock, how about using a separate lock per containerDBPath?
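
As an illustration of that suggestion, one hedged sketch of a per-path lock is below; the field name, the computeIfAbsent wiring, and the openOrGetFromCache() helper are assumptions, not the eventual patch:

  // Assumed imports: java.util.concurrent.ConcurrentHashMap,
  // java.util.concurrent.locks.ReentrantLock
  private final ConcurrentHashMap<String, ReentrantLock> pathLocks =
      new ConcurrentHashMap<>();

  private ReferenceCountedDB openUnderPathLock(String containerDBPath)
      throws IOException {
    ReentrantLock pathLock =
        pathLocks.computeIfAbsent(containerDBPath, p -> new ReentrantLock());
    pathLock.lock();
    try {
      // While this per-path lock is held, no other thread can open the DB at
      // the same path, so the RocksDB LOCK error above cannot be caused by
      // two concurrent opens of one containerDBPath.
      return openOrGetFromCache(containerDBPath);   // hypothetical helper
    } finally {
      pathLock.unlock();
    }
  }

Different paths no longer serialize on a single global lock, at the cost of keeping one lock object per path in the map.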


@adoroszlai, OK, I get the point now. During the datanode initialization phase there is actually no chance that a DB will be opened concurrently. But since this function is also used in later phases, there is a possibility of triggering the exception.

@ChenSammi

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

.setDBType(containerDBType)
.build();
db = new ReferenceCountedDB(metadataStore, containerDBPath);
this.put(containerDBPath, db);

@sodonnel sodonnel Jul 15, 2020


If I understand correctly, you have created a boolean to either lock or not lock in this method. In the DN initialization logic, you pass false here to avoid locking, as the same container will not be initialized by multiple threads at once.

I am still not sure this is safe, specifically the line:

this.put(containerDBPath, db);

ContainerCache extends LRUMap, and its JavaDoc says:

Note that LRUMap is not synchronized and is not thread-safe.
If you wish to use this map from multiple threads concurrently, you must use
appropriate synchronization.

Therefore I think you must lock around all reads and writes to the LRU cache.

I also worry about removing locking completely, in case someone mistakenly passes "false" to this method in the future.

@adoroszlai earlier suggested using a lock per container path. We might be able to use a Guava Striped lock here with something like 256 buckets.

Could we do something like:

  public ReferenceCountedDB getDB(long containerID, String containerDBType,
                             String containerDBPath, ConfigurationSource conf)
      throws IOException {
    Preconditions.checkState(containerID >= 0,
        "Container ID cannot be negative.");
    ReferenceCountedDB db = null;
    Lock slock = stripedLock.get(containerDBPath);
    slock.lock();
    try {
      lock.lock();
      try {
        db = (ReferenceCountedDB) this.get(containerDBPath);
      } finally {
        lock.unlock();
      }
      if (db == null) {
        MetadataStore metadataStore =
            MetadataStoreBuilder.newBuilder()
                .setDbFile(new File(containerDBPath))
                .setCreateIfMissing(false)
                .setConf(conf)
                .setDBType(containerDBType)
                .build();
        db = new ReferenceCountedDB(metadataStore, containerDBPath);
        addDB(containerDBPath, db);
      }
      db.incrementReference();
      return db;
    } catch (Exception e) {
      LOG.error("Error opening DB. Container:{} ContainerPath:{}",
          containerID, containerDBPath, e);
      throw e;
    } finally {
      slock.unlock();
    }
  }

We still have the global lock to protect the ContainerCache structure, but now we also have a striped lock to ensure that the same containerDBPath cannot be initialized concurrently.

I am not sure if we need to worry about another thread calling addDB and adding another instance of the same DB to the cache. Perhaps not at DN initialization time, but at runtime. If that is the case, we would need to add back the logic you had before, where you take the global lock again, test whether an entry for the DB has already been added to the cache, and if so close the new instance and return the cached one.


Thinking about this some more, it might be safest to wrap addDB and removeDB with the striped lock too, as that way we ensure only one DB instance can be initialized and added to the map at any given time.
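
A small sketch of what wrapping removeDB could look like, reusing the stripedLock and lock names from the snippet above (the body is an assumption, not the actual ContainerCache code):

  public void removeDB(String containerDBPath) {
    Lock slock = stripedLock.get(containerDBPath);
    slock.lock();
    try {
      lock.lock();
      try {
        ReferenceCountedDB db = (ReferenceCountedDB) this.remove(containerDBPath);
        if (db != null) {
          // Close the underlying store once the entry has left the cache.
          db.cleanup();
        }
      } finally {
        lock.unlock();
      }
    } finally {
      slock.unlock();
    }
  }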

@adoroszlai might have a better idea than me on this too?


Thanks @sodonnel, I will take a look at the Striped lock.

@sodonnel

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

With a lock per containerPath, where does the bottleneck come from? Is it around the lock, or does a single thread load all the containers for a volume, so that the time taken to init all the DB instances is the bottleneck?

@ChenSammi

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

With a lock per containerPath, where does the bottleneck come from? Is it around the lock, or does a single thread load all the containers for a volume, so that the time taken to init all the DB instances is the bottleneck?

The DB open takes time, and the lock makes the DB opens happen one after another. Once a volume holds more than 10K RocksDB instances, the initialization takes considerable time, say more than 30 minutes.
There are several severe consequences of slow datanode initialization:

  1. The service becomes unavailable.
  2. SCM will treat the datanode as DEAD (we have raised the DEAD threshold to 30m in our cluster) and close all the pipelines and containers related to this datanode, which contributed to why there are so many small containers in our cluster. It's a negative cycle.

@ChenSammi ChenSammi requested a review from sodonnel July 17, 2020 04:08
// increment the reference before returning the object
currentDB.incrementReference();
// clean the db created in previous step
db.cleanup();

This cleanup() call may be expensive, and ideally it should be outside the lock. However, I think this scenario will be hit very rarely, so the logic is OK as it is now.


OK.


@sodonnel sodonnel left a comment


These changes with the striped lock LGTM, +1.

One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

@ChenSammi

ChenSammi commented Jul 17, 2020

These changes with the striped lock LGTM, +1.

One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

Thanks @sodonnel .
Yes, I have tested it on a datanode with 20K+ containers; it takes about 30 minutes to finish the ContainerSet build. Before the patch, the time spent was about 10 times longer. If we ignore the lock during the ContainerSet build, it takes about 15 minutes.

@sodonnel

These changes with the striped lock LGTM, +1.
One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

Thanks @sodonnel .
Yes, I have tested it on a datanode with 20K+ containers; it takes about 30 minutes to finish the ContainerSet build. Before the patch, the time spent was about 10 times longer. If we ignore the lock during the ContainerSet build, it takes about 15 minutes.

This seems like a good improvement for a relatively simple change. I think removing the lock completely is risky and may cause other problems, either now or later.

@bharatviswa504 @adoroszlai - have you guys any further comments or concerns with the latest patch?


@adoroszlai adoroszlai left a comment


Thanks @ChenSammi for updating the patch. The locking part looks good. I only have minor code change suggestions in the tests for better feedback in case of failure.


@adoroszlai adoroszlai left a comment


Thanks @ChenSammi for updating the patch.

@adoroszlai adoroszlai dismissed bharatviswa504’s stale review July 21, 2020 09:23

Concurrent opening of same container addressed by using striped lock instead of completely avoiding locks.

@ChenSammi ChenSammi merged commit 783a18c into apache:master Jul 22, 2020
@ChenSammi

Thanks @bharatviswa504 @sodonnel and @adoroszlai for the code review and great suggestions.

ChenSammi added a commit that referenced this pull request Jul 22, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Mar 8, 2021