new blooms database #8712
Conversation
A couple of questions.
I wrote about this here: Adaptive Enhanced Bloom Filters.
```diff
 use hash::keccak;
-use kvdb::{KeyValueDB, DBTransaction};
+use kvdb::{DBTransaction};
 use kvdb_memorydb;
```
Remove blocks around a single import
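For illustration, the braceless form being suggested here (a sketch, not part of the diff):

```rust
// before: braces around a single import
// use kvdb::{DBTransaction};
// after:
use kvdb::DBTransaction;
```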
```diff
 use rand::Rng;
-use kvdb::{KeyValueDB, DBValue};
+use kvdb::{DBValue};
```
Remove blocks around a single import
```diff
 use snappy;
-use kvdb::{KeyValueDB, DBTransaction};
+use kvdb::{DBTransaction};
 use kvdb_memorydb;
```
Remove blocks around a single import
```diff
 use snapshot::{self, ManifestData, SnapshotService};
 use spec::Spec;
-use test_helpers::{self, generate_dummy_client_with_spec_and_data};
+use test_helpers::{generate_dummy_client_with_spec_and_data};
```
Remove blocks around a single import
```diff
 use io::IoChannel;
-use kvdb_rocksdb::{Database, DatabaseConfig};
+use kvdb_rocksdb::{DatabaseConfig};
```
Remove blocks around a single import
```diff
 use std::sync::Arc;
 use parking_lot::RwLock;
-use kvdb::{KeyValueDB, KeyValueDBHandler};
+use kvdb::{KeyValueDB};
```
Remove blocks around a single import
```rust
cache_manager.collect_garbage(current_size, |ids| {
	for id in &ids {
		match *id {
```
Use if let instead of a match on a single pattern?
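A minimal sketch of the suggested change (with a hypothetical `CacheId` enum, not the PR's actual types):

```rust
enum CacheId {
	Bloom(u64),
	Other,
}

fn note(id: &CacheId) {
	// a match with one interesting arm...
	match *id {
		CacheId::Bloom(key) => println!("bloom {}", key),
		_ => {}
	}
	// ...reads more directly as `if let`:
	if let CacheId::Bloom(key) = *id {
		println!("bloom {}", key);
	}
}
```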
@niklasad1 this pr is still in progress and quite far away from being ready for a review :)
They will, especially top level blooms. But the cost of checking them is so low that it doesn't matter. Currently we use exactly the same bloom filters with three layers for filtering events. The bottleneck is rocksdb io, and this pr is trying to address only that problem. :)
Yes, that's a possible improvement.
You can use transaction traces exactly for that. They already utilize the same bloom filters.
The reason I'm suggesting adding every address to the bloom is to avoid having to request traces. If one is trying to do a full accounting of a particular address (or group of addresses), one needs to visit every trace, but if the blooms included all addresses involved in a block, one could use them to avoid a huge number of trace requests. Requesting traces (especially across the 2016 DDoS transactions) is brutal. Did you have a chance to glance at the paper I mentioned above? With that method, I can collect full lists of transactions per account about 100 times faster than reading every trace. The current blooms don't include every address, so they miss a lot of blocks if one wants a full list of transactions.

Also, one thing I found was that because the block-level blooms "roll up" the transaction-level blooms, two unfortunate things happen: (1) the block-level blooms are becoming saturated, so they report more false positives than you might want (this is exacerbated in the hierarchical scheme you suggest without widening the mid- and upper-level blooms), and (2) the receipt-level blooms are way too undersaturated and therefore take up a ton of space that isn't well utilized.
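(An aside, not from the thread: the saturation effect follows from the usual approximation for a Bloom filter's false-positive rate with m bits, k hash functions, and n inserted items. Rolling many transaction blooms up into one fixed-width block bloom grows n while m stays fixed, so p climbs quickly.)

```latex
p \approx \left(1 - e^{-kn/m}\right)^k
```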
…, cause fs::File is just a shared file handle
debris left a comment:
I think it's ready for review
```rust
fn open(&self, db_path: &Path) -> Result<Arc<BlockChainDB>, Error> {
	let key_value = Arc::new(kvdb_rocksdb::Database::open(&self.config, &db_path.to_string_lossy())?);
	let blooms_path = db_path.join("blooms");
	let trace_blooms_path = db_path.join("trace_blooms");
```
I'm not sure if this is a good location for the blooms db. I'm open to any other suggestions.
```diff
 self.importer.miner.clear();
 let db = self.db.write();
-db.restore(new_db)?;
+db.key_value().restore(new_db)?;
```
Because blooms-db is placed in the same directory, it also moves it. Maybe we should move this function out of the KeyValueDB interface to make it more descriptive.
```diff
-db.restore(new_db)?;
+db.key_value().restore(new_db)?;
```

```rust
// TODO: restore blooms properly
```
ah, I forgot to reopen the blooms db, will fix it
not yet, that's the only remaining thing :)
```rust
// Some(3) -> COL_EXTRA
// 3u8 -> ExtrasIndex::BlocksBlooms
// 0u8 -> level 0
let blooms_iterator = db.key_value()
```
from ethcore/src/blockchain/extras.rs
```rust
// Some(4) -> COL_TRACE
// 1u8 -> TraceDBIndex::BloomGroups
// 0u8 -> level 0
let trace_blooms_iterator = db.key_value()
```
from ethcore/src/trace/db.rs
sorpaas left a comment:
LGTM overall. Just not sure about some aspects regarding the database consistency issues.
```rust
});

for (number, blooms) in blooms_iterator {
	db.blooms().insert_blooms(number, blooms.iter())?;
```
insert_blooms is not atomic. If any of the top, mid, or bot file writes fails, the blooms db basically enters a corrupt state -- we will have top-level blooms saying something exists that cannot be found in mid or bot (a false negative). So I think it may be better to just panic (expect) here like we do in blockchain.rs?
I don't think it matters, even if it ends up in the corrupt state. On the next parity launch the migration will start from scratch; corrupted data will be overwritten and the database state will be consistent.
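(An aside illustrating the failure mode being discussed: a hedged sketch of the hierarchical query path, with a hypothetical helper over an ethbloom-like API. A spurious bit at a higher level only costs one extra check below it, a harmless false positive; a missing bit at mid or bot for data that actually exists hides real blocks, a false negative.)

```rust
use ethbloom::Bloom;

// The query succeeds only if every layer agrees the needle may be present,
// so a bloom missing from `mid` or `bot` makes real data unfindable.
fn may_contain(top: &Bloom, mid: &Bloom, bot: &Bloom, needle: &Bloom) -> bool {
	top.contains_bloom(needle) && mid.contains_bloom(needle) && bot.contains_bloom(needle)
}
```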
```rust
// constant forks may lead to an increased ratio of false positives in bloom filters
// since we do not rebuild top or mid level, but we should not be worried about that
// most of the time events at block n(a) occur also on block n(b) or n+1(b)
self.top.accrue_bloom::<ethbloom::BloomRef>(pos.top, bloom)?;
```
Regarding the possible blooms-db false-negative corruption: given that we only insert one bloom at a time, a simple solution may be to create a metadata file storing the current writing bloom position before attempting the insertion. Then, if we hit any error case below, the next time the program starts we would only need to rewrite that particular bloom position to fix the corruption.

This doesn't solve the disk consistency issue, though. Otherwise we would need to flush within the for loop, which may be a bad performance penalty. So not sure whether we would want this.
> Then, if we hit any error case below, the next time the program starts we would only need to rewrite that particular bloom position to fix the corruption.
That's what I did initially. Then I thought about the problem again and realised that none of this matters. The existence of blooms is not consensus critical. If one of the writes fails (or partially fails), the particular block will be reimported on the next launch of parity and the database will overwrite the corrupted data.
```rust
///
/// This database does not guarantee atomic writes.
pub struct Database {
	database: Mutex<db::Database>,
```
Any reason we use Mutex and not RwLock? I think it would be safe for multiple parties to read blooms at the same time?
Great question! There is a reason: https://users.rust-lang.org/t/how-to-handle-match-with-irrelevant-ok--/6291/15
tl;dr
Reading a file mutates the object that represents the file stream, namely by advancing its position. The very data being mutated is stored in kernel space, and simultaneous mutation won't cause the kernel any trouble because it is smart enough to synchronize accesses.
Still, two threads simultaneously reading the same file stream will break internal logic. (I guess even a single read() call on BufReader could yield a non-contiguous slice.)
That's why I made the iterator function of db::Database mut.
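(To make the argument concrete, a minimal sketch assuming a Mutex-wrapped file handle like the one quoted above; the type and helper names are hypothetical.)

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::sync::Mutex;

struct Blooms {
	// Mutex rather than RwLock: even "reads" advance the file cursor,
	// which is shared mutable state behind the handle.
	file: Mutex<File>,
}

impl Blooms {
	// Seek + read must happen as one unit under the lock; two threads
	// interleaving seeks on the same handle would corrupt each other's reads.
	fn read_at(&self, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
		let mut file = self.file.lock().unwrap();
		file.seek(SeekFrom::Start(offset))?;
		file.read_exact(buf)
	}
}
```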
Please add a ✔️ tick if you think this looks good.
```diff
 /// Returns numbers of blocks containing given bloom.
-fn blocks_with_bloom(&self, bloom: &Bloom, from_block: BlockNumber, to_block: BlockNumber) -> Vec<BlockNumber>;
+fn blocks_with_bloom<'a, B, I, II>(&self, blooms: II, from_block: BlockNumber, to_block: BlockNumber) -> Vec<BlockNumber>
+	where BloomRef<'a>: From<B>, II: IntoIterator<Item = B, IntoIter = I> + Copy, I: Iterator<Item = B>, Self: Sized;
```
I think that each trait definition deserves its own line (to make it easier to read), e.g.:

```rust
where
	BloomRef<'a>: From<B>,
	II: IntoIterator<Item = B, IntoIter = I> + Copy,
	I: Iterator<Item = B>,
	Self: Sized;
```

```rust
	.map(|b| b as BlockNumber)
	.collect()
fn blocks_with_bloom<'a, B, I, II>(&self, blooms: II, from_block: BlockNumber, to_block: BlockNumber) -> Vec<BlockNumber>
	where BloomRef<'a>: From<B>, II: IntoIterator<Item = B, IntoIter = I> + Copy, I: Iterator<Item = B> {
```
I think that each trait definition deserves its own line (to make it easier to read), e.g.:

```rust
where
	BloomRef<'a>: From<B>,
	II: IntoIterator<Item = B, IntoIter = I> + Copy,
	I: Iterator<Item = B>,
	Self: Sized;
```
```rust
match is_done {
	true => {
		db.flush().map_err(UtilError::from)?;
		// TODO: flush also blooms?
```
As I understand it, the API doesn't provide flushing, but it does flush every time new blooms are inserted?!
If so, clarify the comment; alternatively, change the API and actually flush!
that's an obsolete comment ;)
Yes, this db doesn't provide flushing, because writes are non-atomic and happen immediately when someone calls insert.
```rust
// config,
bloom_config: BloomConfig,
// tracing enabled
cache_manager: RwLock<CacheManager<H256>>,
```
Missing docs for cache_manager
ahh... it's not a public field and no change has been made here :p but I'll add a description ;p
```rust
for key in blooms_keys {
	self.note_used(CacheId::Bloom(key));
}
// TODO: replace it with database for trace blooms
```
Is this supposed to be fixed in this PR?
oh, yes, obsolete comment 😅
```rust
let numbers = chain.filter(filter);
let possibilities = filter.bloom_possibilities();
let numbers = self.db.trace_blooms()
	.filter(filter.range.start as u64, filter.range.end as u64, &possibilities).expect("TODO: blooms pr");
```
If this is fixed, change the expect message!
```rust
let blooms = rlp::decode_list::<Bloom>(&group);
(number, blooms)
});
```
This code is identical to #L38-#L52 except for the variables; consider moving it to a function instead?
it's not exactly the same; notice that the endianness is completely different. Also, the function would have 4 variables and imo would be less readable than what we have here. As this is a migration, this code will never be modified again.
Ok, makes sense then NVM!
```toml
[package]
name = "blooms-db"
version = "0.1.0"
authors = ["debris <marek.kotewicz@gmail.com>"]
```
This should be admin@parity.io?
Forgot one thing: also add a license. I guess it should be license = "GPL-3.0".
I believe it doesn't matter for subcrates :)
actually, you are right. We usually have a license field so I'll also add it :)
niklasad1 left a comment:
Overall, a really high-quality PR but some minor things to fix!
@niklasad1 @sorpaas I addressed grumbles. Can you review the pr again and mark it as accepted (if it's good)? :)
also, I merged master:
* 'master' of https://github.com/paritytech/parity: new blooms database (openethereum#8712) ethstore: retry deduplication of wallet file names until success (openethereum#8910) Update ropsten.json (openethereum#8926)
* master: new blooms database (#8712)
This pr introduces a new database for header blooms.
Why?
The only time we access blooms is when we are filtering logs. Filtering logs requires iteration over all blooms in the database, and this is very inefficient with RocksDB. As the blockchain got bigger and bigger, it took more and more time to filter all blooms starting from the genesis block.
How?
I built an increment-only, sequential database for storing blooms. It consists of three database layers:
top, mid and bot.

- the bot layer contains all header blooms
- the mid layer contains blooms created from 16 consecutive blooms of the bot layer
- the top layer contains blooms created from 16 consecutive blooms of the mid layer

Blooms at each layer are placed next to each other on disk, so it's very efficient to find (~O(1)) and iterate (~O(1)) over them. It's a huge improvement compared to the previous database implementation, where every lookup and step of iteration was probably O(log n).
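(A sketch, not the PR's code, of why lookups are ~O(1) under this layout; it assumes 256-byte blooms packed contiguously per layer, and the helper name is hypothetical.)

```rust
const BLOOM_SIZE: u64 = 256; // bytes per serialized bloom

/// Byte offsets of the (top, mid, bot) blooms covering `block`.
fn positions(block: u64) -> (u64, u64, u64) {
	let bot = block * BLOOM_SIZE;         // one bloom per block
	let mid = (block / 16) * BLOOM_SIZE;  // one bloom per 16 blocks
	let top = (block / 256) * BLOOM_SIZE; // one bloom per 256 blocks
	(top, mid, bot)
}
```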
size of a database (for 1000000 blocks)

- bot layer - 256mb (1000000 * size_of_bloom (256b))
- mid layer - 16mb (bot / 16)
- top layer - 1mb (mid / 16)

Other issues addressed
This pr also addresses "Blooms recomputation in case of a fork is very slow" (#8552) by simply not recomputing blooms at higher levels. Why? Because in case of a fork, it's likely that the same transaction on a different chain will be included in a block of the same number or in one of the following blocks. As long as the distance between the block on chain a and the block on chain b is on average significantly lower than 128, the false-positive ratio of the bloom filter should not increase. And even if the
top and mid blooms return a false positive, the underlying bottom bloom is replaced with a new one, so this will not lead to incorrect results.

https://github.com/paritytech/parity/blob/343b29866c4cc592afa9038995217b6ba99e70fa/util/blooms-db/src/db.rs#L96-L98
This pr also addresses all issues with bloom recomputation on forks. Recomputation does not happen and there is no in-memory cache, so blooms are always valid.
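(A hedged sketch of that rule, assuming an ethbloom-like Bloom type rather than the PR's actual db code: on reimport, top and mid only accrue (bitwise OR) the new bloom, while bot is replaced outright, so stale higher-level bits can only cause extra checks, never wrong results.)

```rust
use ethbloom::Bloom;

fn reinsert(top: &mut Bloom, mid: &mut Bloom, bot: &mut Bloom, new: &Bloom) {
	// higher levels accumulate: fork leftovers add false positives at worst
	top.accrue_bloom(new);
	mid.accrue_bloom(new);
	// the bottom level is authoritative and overwritten on reimport
	*bot = new.clone();
}
```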
todo list (in this pr)
- … (BlockChainDB)
- BlockChainDB should not expose &RwLock in the interface
- … bloomchain crate

benchmarks
So far I have run only a single benchmark for blooms-db. The results are promising, but I believe they do not fully show how much everything has improved.
benchmark from old stackoverflow thread
old parity:
this branch:
This pr should also reduce disk writes by roughly 70gb when doing a full sync.