
Conversation

@swamirishi (Contributor) commented Apr 7, 2025

What changes were proposed in this pull request?

Currently, most operations performed on the tables are agnostic to the order in which keys are iterated, so they can be executed in parallel to speed them up. The proposal here is to introduce a central facility to achieve this.

The idea behind the parallelization is to peek into RocksDB's SST file metadata and read the key ranges (smallestKey and largestKey) from the LiveFileMetaData of the column family backing the table, which yields an approximate set of key ranges present in the table. The number of splits is therefore driven by the number of SST files in the DB.
For instance, consider the case where:
1.sst has key range [k1 - k10]
2.sst has key range [k5 - k15]
3.sst has key range [k15 - k30].
With this approach the table would be split into 4 iterators covering the key ranges:
[k1 - k5), [k5, k10), [k10, k15), [k15, k30]
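
A minimal sketch of how such split points could be derived from the SST metadata (the helper below is illustrative only, not the code in this PR; it uses RocksDB's public LiveFileMetaData API):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.TreeSet;
  import org.rocksdb.LiveFileMetaData;
  import org.rocksdb.RocksDB;

  public final class SstRangeSplitter {
    /** Illustrative helper: derive sorted, de-duplicated split points for one column family. */
    static List<byte[]> computeSplitPoints(RocksDB db, byte[] columnFamilyName) {
      // Collect the smallest and largest key of every SST file belonging to the column family.
      TreeSet<byte[]> boundaries = new TreeSet<>(Arrays::compareUnsigned);
      for (LiveFileMetaData meta : db.getLiveFilesMetaData()) {
        if (Arrays.equals(meta.columnFamilyName(), columnFamilyName)) {
          boundaries.add(meta.smallestKey());
          boundaries.add(meta.largestKey());
        }
      }
      // Consecutive boundaries form the iteration ranges, e.g. [k1, k5), [k5, k10),
      // [k10, k15), [k15, k30] for the three files above.
      return new ArrayList<>(boundaries);
    }
  }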
Each of these ranges can then be iterated by a separate task running in a thread pool.
Similar to a pub-sub model, an operation defined by the caller of this function is applied to each key.
The only con I see with this approach is that, since all of these table iterators are initialized independently, consistency becomes an issue. E.g. if a key is renamed from k to k' in the file table while we are iterating through the DB, then depending on the iterator initialization order we could see both k and k', only one of them, or neither. From what I understand we do not have use cases where we would iterate table ranges while active mutation is happening in those ranges (OM background garbage collection, Recon bootstrap, SCM initialization, etc.). Since I only see use cases where strict consistency is not required, I think we can live with this.

The patch also adds a centralized thread pool responsible for throttling the number of RocksDB table iterators initialized, governed by the configuration "hadoop.hdds.db.rocksdb.parallel.iterator.max.pool.size". This is a guardrail so that such background iterations do not impact main system performance while they are running.
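
As an illustration of the throttled iteration (hypothetical helper names, not the actual patch code), the pool size would come from the configuration above and one task per range would be submitted:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import java.util.function.Consumer;
  import java.util.function.Function;

  public final class ParallelTableIteration {
    /**
     * Illustrative helper: iterate each key range on a shared, size-bounded pool
     * and hand every entry to the caller-supplied consumer.
     */
    static <R, K, V> void forEachInParallel(
        List<R> ranges,
        Function<R, Iterator<Map.Entry<K, V>>> rangeIterators,  // opens an iterator bounded to one range
        Consumer<Map.Entry<K, V>> consumer,                     // operation defined by the caller
        int maxPoolSize) throws Exception {                     // e.g. hadoop.hdds.db.rocksdb.parallel.iterator.max.pool.size
      ExecutorService pool = Executors.newFixedThreadPool(maxPoolSize);
      try {
        List<Future<?>> futures = new ArrayList<>();
        for (R range : ranges) {
          futures.add(pool.submit(() -> {
            rangeIterators.apply(range).forEachRemaining(consumer);
            return null;
          }));
        }
        for (Future<?> f : futures) {
          f.get();  // surface any failure from the range tasks
        }
      } finally {
        pool.shutdown();
      }
    }
  }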

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12779

How was this patch tested?

Adding unit tests.

…h different ranges

Change-Id: Ib08912cc23a4ff4ccc9898b082366a52dd54c363
@swamirishi swamirishi requested review from hemantk-12, sumitagrawl and szetszwo and removed request for sumitagrawl April 7, 2025 10:30
@adoroszlai adoroszlai changed the title HDDS-12779. Parallelize table iteration by parallely iterating through different ranges HDDS-12779. Parallelize table iteration across different ranges Apr 7, 2025
@szetszwo (Contributor) commented Apr 8, 2025

@swamirishi , thanks for working on this!

This is a very big change. Could you describe what you are trying to do in some detail? What is the algorithm you have in mind? The two-line description is way too brief to cover such a large code change.

Change-Id: Iae11938afde4c28ac6459054a582a4ed9c920ac8
@swamirishi (Author) commented:

@swamirishi , thanks for working on this!

This is a very big change. Could you describe what you are trying to do in some details? What is the algorithm in your mind? The two-line description is way too simple to cover such a large code change.

@szetszwo Thank you for starting to review the PR. I have updated the PR description with the details about the patch.

@swamirishi (Author) commented:

@swamirishi , thanks for working on this!
This is a very big change. Could you describe what you are trying to do in some details? What is the algorithm in your mind? The two-line description is way too simple to cover such a large code change.

@szetszwo Thank you for starting to review the PR. I have updated the PR description with the details about the patch. In the meantime I am also working on @errose28's comment about having a single table iterator and just offloading the proto deserialization to the thread pool. While that approach has no consistency issue, I see some performance implications with it, so we would need to benchmark both implementations. Once I have figures comparing the two, I will publish them here.

@errose28 (Contributor) commented Apr 8, 2025

The bulk of this change looks related to parallel iteration, not parallel deserialization and consumption. I would recommend:

  1. Initial PR to implement general purpose serial iteration with deserialization and consumption pushed off to a thread pool.
  2. If benchmarks show significant improvements, add support for parallelizing the iteration step using a different thread pool from the consumers.
    • This would probably be good to expose as a parallel stream, for example by implementing it as a Spliterator and returning a stream from StreamSupport#stream (a sketch follows after this list).
  3. Migrate each use case to the new implementation in individual PRs
    • In HDDS-11777 we would probably create a wrapper around this utility to allow processing all locations in an FSO tree for debug or repair efficiently cc @dombizita
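
For point 2, a minimal sketch (my illustration, not part of this patch) of wrapping a single table iterator as a parallel stream via a Spliterator and StreamSupport#stream:

  import java.util.Iterator;
  import java.util.Map;
  import java.util.Spliterator;
  import java.util.Spliterators;
  import java.util.stream.Stream;
  import java.util.stream.StreamSupport;

  public final class TableStreams {
    /**
     * Wrap a single (serial) table iterator as a parallel Stream. The unknown-size
     * spliterator splits by handing out batches of entries, so deserialization and
     * consumption can run on the common fork-join pool while the underlying RocksDB
     * iteration itself stays serial.
     */
    static <K, V> Stream<Map.Entry<K, V>> parallelEntryStream(Iterator<Map.Entry<K, V>> tableIterator) {
      Spliterator<Map.Entry<K, V>> split = Spliterators.spliteratorUnknownSize(
          tableIterator, Spliterator.ORDERED | Spliterator.NONNULL);
      return StreamSupport.stream(split, /* parallel= */ true);
    }
  }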

@swamirishi (Author) commented:

FYI, even in this patch we work with a RocksDB snapshot, which is very cheap to take, so each iterator reads from a snapshot of the DB. The snapshot can be released once the entire iteration is done, making the whole iteration run over a consistent view.
https://github.com/facebook/rocksdb/blob/d9a441113e6e2942a7a2c2eb459e45611261c91c/java/src/main/java/org/rocksdb/RocksDB.java#L3385-L3402
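
A minimal sketch of the snapshot-scoped iteration described above, using only the public RocksDB Java API (my illustration, not code from this patch):

  import org.rocksdb.ReadOptions;
  import org.rocksdb.RocksDB;
  import org.rocksdb.RocksIterator;
  import org.rocksdb.Snapshot;

  public final class SnapshotIteration {
    /** Iterate against a fixed snapshot so the whole pass sees one point-in-time view. */
    static void iterateConsistently(RocksDB db) {
      final Snapshot snapshot = db.getSnapshot();  // cheap: pins a sequence number, copies no data
      try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot);
           RocksIterator it = db.newIterator(readOptions)) {
        for (it.seekToFirst(); it.isValid(); it.next()) {
          // process it.key() / it.value(); every iterator opened with this
          // ReadOptions sees the same consistent view
        }
      } finally {
        db.releaseSnapshot(snapshot);  // release once the entire iteration is done
      }
    }
  }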
cc: @umamaheswararao

@szetszwo (Contributor) commented Apr 9, 2025

... The number of splits is therefore driven by the number of SST files in the DB. ...

What are the assumptions behind the performance improvement?

  • That reading multiple files with multiple threads is faster than using a single thread?

We have

  • iterator reading files (RocksDB) -> processing the entries (Ozone)

Instead of multi-threading the file reading, it is better to multi-thread the processing of the data.

RocksDB itself is already very good at parallelism. It is unlikely that Ozone could use the internal details of RocksDB to improve performance. Also, Ozone should use only the public APIs of RocksDB; such code is hard to maintain and may even cause silent data corruption.

BTW, you may consider parallelizing your pull requests -- having multiple small PRs instead of having a single large PR. Then, different people can review different PRs at the same time.

@swamirishi (Author) commented:

Instead of multi-threading the file reading, it is better to multi-thread the processing of the data.

RocksDB itself is already very good at parallelism. It is unlikely that Ozone could use the internal details of RocksDB to improve performance. Also, Ozone should use only the public APIs of RocksDB; such code is hard to maintain and may even cause silent data corruption.

As part of this patch we only use the public API that RocksDB already exposes:

  @VisibleForTesting
  public List<LiveFileMetaData> getSstFileList() throws IOException {
    try (UncheckedAutoCloseable ignored = acquire()) {
      return db.get().getLiveFilesMetaData();
    }
  }

The eventual plan is to break this into small PRs. The intention here was to present the overall idea for increasing throughput of the rocksdb iterators.

public final class BigIntegerCodec implements Codec<BigInteger> {

  private static final Codec<BigInteger> INSTANCE = new BigIntegerCodec();
  private static final Comparator<BigInteger> COMPARATOR =
      (o1, o2) -> Objects.compare(o1, o2, BigInteger::compareTo);
Contributor:

@swamirishi , it is hard to make sure the comparators are consistent. It failed below.

  static final Codec<Object> WRAPPER = new Codec<Object>() {
    @Override
    public Class<Object> getTypeClass() { return null; }
    @Override
    public byte[] toPersistedFormat(Object object) throws IOException {
      return BigIntegerCodec.get().toPersistedFormat((BigInteger) object);
    }
    @Override
    public Object fromPersistedFormat(byte[] rawData) { return null; }
    @Override
    public Object copyObject(Object object) {
      return object;
    }
  };

  static final Comparator<Object> codecComparator = WRAPPER.comparator();
  static final Comparator<BigInteger> bigIntegerComparator = BigIntegerCodec.get().comparator();

  static void assertConsistency(BigInteger left, BigInteger right) {
    final int codecDiff = codecComparator.compare(left, right);
    final int bigIntegerDiff = bigIntegerComparator.compare(left, right);
    System.out.println("codecDiff=" + codecDiff + ", bigIntegerDiff=" + bigIntegerDiff);

    if (codecDiff == bigIntegerDiff) {
      return;
    }
    if (codecDiff * (long) bigIntegerDiff <= 0L) {
      // == 0: cannot be both 0 since they are unequal
      //  < 0: different signs
      throw new IllegalStateException("INCONSISTENCY: codecDiff=" + codecDiff
          + " but bigIntegerDiff=" + bigIntegerDiff
          + ", where left=" + left + " and right=" + right);
    }
  }

  public static void main(String[] args) {
    assertConsistency(BigInteger.ZERO, BigInteger.ONE);
    // It won't work for negative numbers !!!
    assertConsistency(BigInteger.ZERO, BigInteger.ONE.negate());
  }

@swamirishi (Author) commented Apr 12, 2025:

@szetszwo If this is the case, our prefix-based iterator is broken, since we rely on RocksDB's ordering being the same as the key codec's ordering. Look at

We break out of the TypedTableIterator, saying the raw iterator doesn't have any more entries corresponding to the typed prefix:

  public boolean hasNext() {
    return rawIterator.hasNext();
  }

@swamirishi (Author):

Yes, I agree we don't have a prefix-based iterator use case for BigInteger, but we have to be careful about claiming that the prefix-based iterator is foolproof and would work with any defined Codec.

Contributor:

Take BigInteger as an example: the iterator always uses byte ordering, not BigInteger ordering. So there is no problem.

Contributor:

@swamirishi , yes, UTF_16LE does not work, but we are not using it. So, what's the point?

Contributor:

We are doing prefix search -- ah, you are saying that if a codec moves the bits around, it won't work. Yes, totally agree.

@swamirishi (Author):

We do similar Long comparisons for one of the Recon APIs; we are only fine there because the container long value is > 0. But I am not sure we are handling overflow cases properly (I know it is practically impossible to run out of long values), just wondering whether we handle any Integer and Long comparisons correctly.

Contributor:

... container long value>0 ...

@swamirishi , are we doing prefix search for container id? If not, it is fine even for negative numbers. The iterator will still work; it just uses the byte-wise ordering rather than the numerical ordering.
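
To illustrate the point about numeric keys (my example, not from the patch): big-endian two's-complement encoding preserves numeric order under byte-wise comparison only for values of the same sign, which is why byte-ordered iteration is unsurprising for non-negative container IDs but differs from numeric order once negative values appear:

  import java.nio.ByteBuffer;
  import java.util.Arrays;

  public final class ByteOrderDemo {
    static byte[] encode(long value) {
      // Fixed-length 8-byte big-endian two's-complement key.
      return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    public static void main(String[] args) {
      // Same sign: unsigned byte-wise order agrees with numeric order.
      System.out.println(Arrays.compareUnsigned(encode(5L), encode(100L)) < 0);  // true: 5 < 100
      // Mixed sign: -1 encodes to 0xFF...FF, which byte-compares greater than 0.
      System.out.println(Arrays.compareUnsigned(encode(-1L), encode(0L)) < 0);   // false: byte order puts -1 after 0
    }
  }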

@szetszwo (Contributor) commented Apr 14, 2025:


As a summary,

  • prefix search requires a prefix-preserving codec.

As an analogy,

  • binary search requires a sorted list.

@swamirishi (Author) commented:

Closing this PR in favour of breaking it into smaller PRs.
