Conversation
@sodonnel sodonnel left a comment


This change looks mostly good to me. I just have one question about possibly cleaning up the DB object if we don't need to use it, added inline.

As I understand it, before this patch, the expensive call to create the DB Entry was performed under the lock.

Now, with the patch, you check if the DB entry exists and if it does, return it. Otherwise, drop the lock, create the new DB entry outside the lock, and then reclaim the lock. The change also handles the case where another thread may create the DB entry between dropping and re-taking the lock.
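
For readers following the discussion, a minimal sketch of that pattern could look like the following. This is not the patch itself: the method shape and the open() helper are assumptions for illustration, while lock, get(), put() and incrementReference() follow the names used elsewhere in this thread.

  // Illustrative sketch only -- open() is a hypothetical helper standing in
  // for the expensive MetadataStoreBuilder/RocksDB open.
  public ReferenceCountedDB getDB(String containerDBPath) throws IOException {
    lock.lock();
    try {
      ReferenceCountedDB db = (ReferenceCountedDB) this.get(containerDBPath);
      if (db != null) {
        db.incrementReference();
        return db;                    // fast path: entry already cached
      }
    } finally {
      lock.unlock();                  // drop the lock before the expensive open
    }

    // The expensive DB open happens without holding the lock.
    ReferenceCountedDB newDb = open(containerDBPath);

    lock.lock();
    try {
      ReferenceCountedDB current = (ReferenceCountedDB) this.get(containerDBPath);
      if (current != null) {
        // Another thread created the entry while the lock was released;
        // what to do with newDb in this branch is the question raised below.
        current.incrementReference();
        return current;
      }
      this.put(containerDBPath, newDb);
      newDb.incrementReference();
      return newDb;
    } finally {
      lock.unlock();
    }
  }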


What happens to the ReferenceCountedDB instance created at line 140 if we return here?

Do we need to call db.cleanup() on the newly created instance, and if we don't, could this potentially leak db objects? I note that the remove() call in ContainerCache calls db.cleanup() when removing the entry, which makes me suspect this is important.

If we do need to call cleanup() on it, we could probably capture the return value inside the lock, call cleanup() outside the lock, and then return?
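
A rough sketch of that suggestion, assuming the instance opened outside the lock is held in a local variable newDb (the variable and structure are illustrative, not the final patch):

  ReferenceCountedDB winner;
  lock.lock();
  try {
    // Decide the winner while holding the lock.
    winner = (ReferenceCountedDB) this.get(containerDBPath);
    if (winner == null) {
      this.put(containerDBPath, newDb);
      winner = newDb;
    }
    winner.incrementReference();
  } finally {
    lock.unlock();
  }
  if (winner != newDb) {
    // Another thread won the race; close the instance we no longer need,
    // outside the lock.
    newDb.cleanup();
  }
  return winner;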


Thanks for the review. You're right. The new DB instance should be explicitly closed in this case.

@arp7 arp7 requested a review from bharatviswa504 June 29, 2020 15:36
@bharatviswa504

Hi @ChenSammi
Thanks for the PR.
I too have the same comment as @sodonnel. I think we need to close the new DB instance if we end up getting an instance from the cache.

Also, do you have this version on your deployed cluster?
a2ab8d6

This saves iterating the DB to compute block metadata, since all the required metadata can be fetched with db.get(). We observed this issue during a billion-object test.
For older containers created before HDDS-3217, we still iterate and set block metadata. (But this fixed a bug where we used to iterate three times for open containers.)

@ChenSammi

@bharatviswa504 we have HDDS-3217 deployed.


@bharatviswa504 bharatviswa504 left a comment


I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.

I think, with this change, we will throw an exception if we try to open an already-open database.

java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@ChenSammi

I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.

I think, with this change, we will throw an exception if we try to open an already-open database.

java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@bharatviswa504 , I get your point. It's an issue, but not one introduced by this patch. It's a pre-existing issue, and we need to think carefully about how to fix it in a new JIRA.

@ChenSammi

The failed integration test is unrelated.


I see one issue with this approach.
If the database is already open and we try to open it again, we will get this error.
I think, with this change, we will throw an exception if we try to open an already-open database.
java.io.IOException: Failed init RocksDB, db path : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1, exception :org.rocksdb.RocksDBException lock : /Users/bviswanadham/workspace/hadoop-ozone/hadoop-hdds/container-service/target/test-dir/xCkBnsLVrc/cont1/LOCK: No locks available

@bharatviswa504 , I get your point. It's an issue, but not one introduced by this patch. It's a pre-existing issue, and we need to think carefully about how to fix it in a new JIRA.

@ChenSammi can you please clarify why you think it's an existing issue? It seems to me that calling build() outside of the lock introduces this problem.

If I understand correctly, the performance issue is due to using a single lock for all containers. Instead of moving the expensive part outside of the lock, how about using a separate lock per containerDBPath?
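
As an illustration of that suggestion, one hedged sketch of a per-path lock is below; the field name, the computeIfAbsent wiring, and the openOrGetFromCache() helper are assumptions, not the eventual patch:

  // Assumed imports: java.util.concurrent.ConcurrentHashMap,
  // java.util.concurrent.locks.ReentrantLock
  private final ConcurrentHashMap<String, ReentrantLock> pathLocks =
      new ConcurrentHashMap<>();

  private ReferenceCountedDB openUnderPathLock(String containerDBPath)
      throws IOException {
    ReentrantLock pathLock =
        pathLocks.computeIfAbsent(containerDBPath, p -> new ReentrantLock());
    pathLock.lock();
    try {
      // While this per-path lock is held, no other thread can open the DB at
      // the same path, so the RocksDB LOCK error above cannot be caused by
      // two concurrent opens of one containerDBPath.
      return openOrGetFromCache(containerDBPath);   // hypothetical helper
    } finally {
      pathLock.unlock();
    }
  }

Different paths no longer serialize on a single global lock, at the cost of keeping one lock object per path in the map.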


@adoroszlai, OK, I get the point now. During the datanode initialization phase there is actually no chance that a DB will be opened concurrently. But since this function is also used in later phases, there is a possibility of triggering the exception.

@ChenSammi

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

.setDBType(containerDBType)
.build();
db = new ReferenceCountedDB(metadataStore, containerDBPath);
this.put(containerDBPath, db);

@sodonnel sodonnel Jul 15, 2020


If I understand correctly, you have created a boolean to either lock or not lock in this method. In the DN initialization logic, you pass false here to avoid locking, as the same container will not be initialized by multiple threads at once.

I am still not sure this is safe, specifically the line:

this.put(containerDBPath, db);

ContainerCache extends LRUMap, and its JavaDoc says:

Note that LRUMap is not synchronized and is not thread-safe.
If you wish to use this map from multiple threads concurrently, you must use
appropriate synchronization.

Therefore I think you must lock around all reads and writes to the LRU cache.

I also worry about removing locking completely, in case someone mistakenly passes "false" to this method in the future.

@adoroszlai earlier suggested using a lock per container path. We might be able to use a Guava Striped lock here with something like 256 buckets.

Could we do something like:

  public ReferenceCountedDB getDB(long containerID, String containerDBType,
                             String containerDBPath, ConfigurationSource conf)
      throws IOException {
    Preconditions.checkState(containerID >= 0,
        "Container ID cannot be negative.");
    ReferenceCountedDB db = null;
    Lock slock = stripedLock.get(containerDBPath);
    slock.lock();
    try {
      lock.lock();
      try {
        db = (ReferenceCountedDB) this.get(containerDBPath);
      } finally {
        lock.unlock();
      }
      if (db == null) {
        MetadataStore metadataStore =
            MetadataStoreBuilder.newBuilder()
                .setDbFile(new File(containerDBPath))
                .setCreateIfMissing(false)
                .setConf(conf)
                .setDBType(containerDBType)
                .build();
        db = new ReferenceCountedDB(metadataStore, containerDBPath);
        addDB(containerDBPath, db);
      }
      db.incrementReference();
      return db;
    } catch (Exception e) {
      LOG.error("Error opening DB. Container:{} ContainerPath:{}",
          containerID, containerDBPath, e);
      throw e;
    } finally {
      slock.unlock();
    }
  }

We still have the global lock to protect the ContainerCache structure, but now we also have a striped lock to ensure that the same containerDBPath cannot be initialized concurrently.

I am not sure if we need to worry about another thread calling addDB and adding another instance of the same DB to the cache. Perhaps not at DN initialization time, but at runtime. If that is the case, we would need to add back the logic you had before, where you take the global lock again, test whether an entry for the DB has already been added to the cache, and if so close the new instance and return the cached one.


Thinking about this some more, it might be safest to wrap addDB and removeDB with the striped lock too, as that way we ensure only one DB instance can be initialized and added to the map at any given time.
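
A small sketch of what wrapping removeDB could look like, reusing the stripedLock and lock names from the snippet above (the body is an assumption, not the actual ContainerCache code):

  public void removeDB(String containerDBPath) {
    Lock slock = stripedLock.get(containerDBPath);
    slock.lock();
    try {
      lock.lock();
      try {
        ReferenceCountedDB db = (ReferenceCountedDB) this.remove(containerDBPath);
        if (db != null) {
          // Close the underlying store once the entry has left the cache.
          db.cleanup();
        }
      } finally {
        lock.unlock();
      }
    } finally {
      slock.unlock();
    }
  }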

@adoroszlai might have a better idea than me on this too?


Thanks @sodonnel, I will take a look at the Striped lock.

@sodonnel

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

With a lock per containerPath, where does the bottleneck come from? Is it around the lock, or does a single thread load all the containers for a volume, so that the time taken to init all the DB instances is the bottleneck?

@ChenSammi

Thanks @bharatviswa504 and @adoroszlai for the review. A lock per containerDBPath in ContainerCache is a good idea. I tried it, but it cannot solve our problem: given that a volume has nearly 20K containers, it still takes nearly an hour for the datanode to finish initialization.

With a lock per containerPath, where does the bottleneck come from? Is it around the lock, or does a single thread load all the containers for a volume, so that the time taken to init all the DB instances is the bottleneck?

The DB open takes time, and the lock makes the DB opens happen one after another. Once a volume holds more than 10K RocksDB instances, the initialization takes considerable time, say more than 30 minutes.
There are several severe consequences of slow datanode initialization:

  1. The service becomes unavailable.
  2. SCM will treat the datanode as DEAD (we have raised the DEAD threshold to 30m in our cluster) and close all the pipelines and containers related to this datanode, which contributed to why there are so many small containers in our cluster. It's a negative cycle.

@ChenSammi ChenSammi requested a review from sodonnel July 17, 2020 04:08
// increment the reference before returning the object
currentDB.incrementReference();
// clean the db created in previous step
db.cleanup();

This cleanup() call may be expensive, and ideally it should be outside the lock. However, I think this scenario will be hit very rarely, so the logic is OK as it is now.


OK.


@sodonnel sodonnel left a comment


These changes with the striped lock LGTM, +1.

One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

@ChenSammi

ChenSammi commented Jul 17, 2020

These changes with the striped lock LGTM, +1.

One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

Thanks @sodonnel .
Yes, I have tested it on a datanode with 20K+ containers; it takes about 30 minutes to finish the ContainerSet build. Before the patch, the time spent was about 10 times longer. If we ignore the lock during the ContainerSet build, it takes about 15 minutes.

@sodonnel

These changes with the striped lock LGTM, +1.
One question - have you been able to test this change on your DNs with a large number of containers, and did it give a good improvement?

Thanks @sodonnel .
Yes, I have tested it on a datanode with 20K+ containers; it takes about 30 minutes to finish the ContainerSet build. Before the patch, the time spent was about 10 times longer. If we ignore the lock during the ContainerSet build, it takes about 15 minutes.

This seems like a good improvement for a relatively simple change. I think removing the lock completely is risky and may cause other problems, either now or later.

@bharatviswa504 @adoroszlai - have you guys any further comments or concerns with the latest patch?


@adoroszlai adoroszlai left a comment


Thanks @ChenSammi for updating the patch. The locking part looks good. I only have minor code change suggestions in the tests for better feedback in case of failure.


@adoroszlai adoroszlai left a comment


Thanks @ChenSammi for updating the patch.

@adoroszlai adoroszlai dismissed bharatviswa504’s stale review July 21, 2020 09:23

Concurrent opening of same container addressed by using striped lock instead of completely avoiding locks.

@ChenSammi ChenSammi merged commit 783a18c into apache:master Jul 22, 2020
@ChenSammi

Thanks @bharatviswa504 @sodonnel and @adoroszlai for the code review and great suggestions.

ChenSammi added a commit that referenced this pull request Jul 22, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Mar 8, 2021