Uses LruCache instead of InversionTree for caching data decode matrices #104
Conversation
@nazar-pc can you please take a look? Thank you.
Force-pushed from f1adaa9 to 6a9e3dd.
Thanks, I'll try to take a look and bench it later this week.
#74 has some context on why an eviction policy is needed. LRU eviction was added in #83. But as implemented, incrementing … The eviction code implemented here: … The two atomic operations and the mutex lock here: … The combination of all these means that when you have a persistent instance of a ReedSolomon encoder/decoder and you keep calling into it, …
I have done a simple benchmark and it is actually slower than the older code.

Before:

```
reconstruct :
shards : 10 / 2
shard length : 1048576
time taken : 0.12414031
byte count : 524288000
MB/s : 4027.700591371167

reconstruct :
shards : 10 / 10
shard length : 1048576
time taken : 0.133304495
byte count : 524288000
MB/s : 3750.811253589011
```

After:

```
reconstruct :
shards : 10 / 2
shard length : 1048576
time taken : 0.150660363
byte count : 524288000
MB/s : 3318.722921170713

reconstruct :
shards : 10 / 10
shard length : 1048576
time taken : 0.154363785
byte count : 524288000
MB/s : 3239.1017102878113
```
I'm now wondering if we should just have a small-ish hashmap (say, 256 entries) and, when it gets full, simply stop adding more entries, since the application will likely have random inputs all the time anyway, rendering caching useless.
Then we don't need LRU at all. What do you think?
Cargo.toml (Outdated)

```diff
@@ -43,6 +43,7 @@ parking_lot = { version = "0.11.2", optional = true }
 smallvec = "1.2"
 # `Mutex` implementation for `no_std` environment with the same high-level API as `parking_lot`
 spin = { version = "0.9.2", default-features = false, features = ["spin_mutex"] }
+lru = "0.7.8"
```
Please keep dependencies sorted.
done!
```rust
use super::Field;
use super::ReconstructShard;

const DATA_DECODE_MATRIX_CACHE_CAPACITY: usize = 254;
```
This is an odd number. Can you elaborate on the rationale behind this number, maybe leaving a comment in the code too?
I just preserved the current cache capacity from the existing code, so as not to introduce any change in behavior or performance due to a cache-size change:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L15
Force-pushed from 6a9e3dd to 90174e9.
It can depend on the setup and the actual benchmark, in particular how many data/parity shards there are and how often you hit or miss the cache, or have to evict entries from it.

So effectively we pay the price of traversing the tree and caching decode matrices, but not only is there no cache hit and hence no gain, the code is also constantly evicting entries from the cache using the very inefficient eviction path I explained earlier. In our codebase I see a very significant gain (50%+) in some of our metrics just by switching from the inversion tree to the LRU cache. I am guessing the setup for your benchmark hits a sweet spot for the current implementation of the inversion tree. If you share the code for the benchmark I can look into it further. Also, please note that besides performance there are other issues with the current implementation, as mentioned earlier here: #104 (comment)
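One way to see why random erasure patterns defeat a small cache: the cache key is effectively the set of missing shards, and the number of such sets grows combinatorially. A quick illustrative calculation (the 10 data / 4 parity shard counts are assumed for the example, not taken from this PR):

```rust
/// Binomial coefficient C(n, k), computed incrementally so each
/// intermediate division is exact.
fn binomial(n: u64, k: u64) -> u64 {
    (0..k).fold(1u64, |acc, i| acc * (n - i) / (i + 1))
}

fn main() {
    // With 10 data + 4 parity shards, up to 4 of the 14 shards can be
    // missing, so the number of distinct erasure patterns is
    // sum over k = 0..=4 of C(14, k).
    let total: u64 = (0..=4).map(|k| binomial(14, k)).sum();
    println!("possible erasure patterns: {}", total); // 1471
    // A 254-entry cache covers only a small fraction of them, so random
    // patterns mostly miss and keep triggering eviction.
    assert!(total > 254);
}
```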
That sounds fair to me, as long as the change of behavior is not a concern. My objective here was to keep the behavior as close as possible to the current intended design, hence preserving the LRU policy.
https://github.com/rust-rse/rse-benchmark is the benchmark I was using. If the change in behavior is positive, I see no reason not to do it. But I'd like to hear some feedback from other maintainers first.
Force-pushed from 90174e9 to 37c06a1.
Looking at that benchmark code, it is always only removing the first shard. With that setting, the benchmark never really hits the performance bottlenecks of the current implementation. I added benchmark code in this same pull request as a separate commit. You can run the benchmark with and without the last commit to compare. On my machine, with the current code … whereas with this change … So there is a massive improvement by switching from `InversionTree` to `LruCache`.
Force-pushed from 37c06a1 to 26b9918.
The current implementation of the LRU eviction policy on InversionTree is wrong and inefficient. This commit removes InversionTree and instead uses LruCache to cache data decode matrices.
Force-pushed from 26b9918 to 3660ce1.
@nazar-pc I updated the benchmark code to test different combinations of the number of data shards and the number of parity shards.
With the current `InversionTree` code: … Using `LruCache`: …
FWIW, I'm fine with this change. I'm also heavily biased, as @behzadnouri is on my team.
Makes sense to me, and thanks for adding the reconstruction bench. Ideally we'd move all the benches in here, maybe with criterion, for easier testing in the future.
@nazar-pc thanks for approving the PR.
@mvines can you update the changelog and version and do another release? I'm lazy today 🙂
Heh, yep. That seems fair to me :)
v6.0.0 is now on crates.io
Thank you both.
Need to pick up: rust-rse/reed-solomon-erasure#104 in order to unblock: solana-labs#27510