trie: make fullnode children hash calculation concurrently #15131
Conversation
@fjl PTAL
Hey there! I've looked through the PR. The idea is good, but the solution less so. The main problem is that you spin up a new goroutine for every full-node child. With my new sync code, we have instances where we hash tries consisting of 20-60K nodes. I haven't run the numbers, but I can easily imagine the trie containing at least a few thousand nodes. This means a few thousand goroutines are cycled, and also a few thousand generators are allocated, which hits the GC too.

I'll try to put together a benchmark with some large trie to make a stronger case, but even in your own benchmark I think this can be seen: running on 4x cores only gives you a 28% edge, which suggests the cores are still mostly idle. It's important to investigate how much time hashing a single trie node takes. As nodes are around 500 bytes, I'd assume some more or less insignificant amount. If so, the goroutine and memory allocation overhead might be quite large compared to the actual work being done. As such, we need to do a lot more work per thread to make the concurrency worthwhile.

Given that tries mostly store SHA3 images (as the keys), they can be assumed to be evenly distributed. A possible quick-and-dirty solution would be to only ever thread out on the topmost full node. That way we still do parallel processing without spinning up too many goroutines. Of course, this may not perform the best during block processing, since the part of the trie loaded into memory may well not be balanced.

Note, I had a solution from a year or so ago which used a hasher pool: any hashing operation sent the task to the pool and processed the result, but the sync overhead was greater than the concurrency gain, especially since the buffer handling becomes more complex (like in your case). Any solution we come up with will need a much nicer benchmark suite, and it also needs to map the memory allocs in the benchmark to ensure we don't accidentally swap sha3 time for GC time.
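The "thread out only on the topmost full node" idea can be sketched independently of the geth types. The following is a minimal toy model only: the `node` struct and function names are made up for illustration, and sha256 stands in for the trie's keccak hashing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// node is a toy stand-in for a trie node: either a leaf carrying a
// blob, or a full node with up to 16 children. These are hypothetical
// types, not geth's actual ones.
type node struct {
	children [16]*node
	blob     []byte
}

// hash computes a node hash. threaded=false marks the topmost call:
// only there do we fan out one goroutine per child; every recursive
// call below runs serially with threaded=true.
func hash(n *node, threaded bool) [32]byte {
	if n.blob != nil {
		return sha256.Sum256(n.blob)
	}
	var sums [16][32]byte
	if !threaded {
		var wg sync.WaitGroup
		for i, c := range n.children {
			if c == nil {
				continue
			}
			wg.Add(1)
			go func(i int, c *node) {
				defer wg.Done()
				sums[i] = hash(c, true) // each subtree is hashed serially
			}(i, c)
		}
		wg.Wait()
	} else {
		for i, c := range n.children {
			if c != nil {
				sums[i] = hash(c, true)
			}
		}
	}
	// Combine child hashes in a fixed order so the result is deterministic.
	var all []byte
	for i := range sums {
		all = append(all, sums[i][:]...)
	}
	return sha256.Sum256(all)
}

// buildTree makes a two-level tree: a full root with 16 leaf children.
func buildTree() *node {
	root := &node{}
	for i := range root.children {
		root.children[i] = &node{blob: []byte{byte(i)}}
	}
	return root
}

func main() {
	root := buildTree()
	fmt.Printf("root hash: %x\n", hash(root, false))
}
```

Since each goroutine writes only its own array slot and the combination order is fixed, the parallel and serial paths produce identical hashes.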
Thanks for your suggestions. As you mentioned, the solution wasn't well thought out. The topmost-parallelism plan may be feasible: only when the tree is large enough do we really care about the speedup from parallelism, and the imbalance should not be an issue in a "large" tree.
I've opened a PR to fix the old benchmark. The numbers are still in your favor btw, but now the benchmark is a bit more realistic: Old: This PR: That being said, running on 8 cores, I'd expect a bit more speedup.
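On measuring: since the concern above is accidentally trading sha3 time for GC time, any benchmark should report allocations alongside wall time. This is a minimal sketch of the pattern only, hashing ~500-byte blobs (the rough node size mentioned earlier) with sha256 as a stand-in, not the actual trie benchmark:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// runBench times hashing 1000 pseudo-nodes of ~500 bytes each and
// records allocations, so GC pressure shows up next to the timings.
func runBench() testing.BenchmarkResult {
	blob := make([]byte, 500)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			for j := 0; j < 1000; j++ {
				sha256.Sum256(blob)
			}
		}
	})
}

func main() {
	res := runBench()
	// res.String() holds ns/op; res.MemString() holds B/op and allocs/op.
	fmt.Println(res.String(), res.MemString())
}
```

In a real `_test.go` file the same effect comes from `b.ReportAllocs()` inside the benchmark, or `go test -bench . -benchmem`.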
Not so happy about the performance. I'll try topmost parallelism to see if better performance can be obtained, since that approach seems more stable.
I don't really have a good suggestion, just consider my post a brain dump :D
I just "hacked" my top-level idea into your impl, and I seem to get double the performance out of it; might be worth investigating:

diff --git a/trie/hasher.go b/trie/hasher.go
index bfcf0b3..1f3ae42 100644
--- a/trie/hasher.go
+++ b/trie/hasher.go
@@ -64,7 +64,7 @@ func (h *hasher) returnCalculator(calculator *calculator) {
// hash collapses a node down into a hash node, also returning a copy of the
// original node initialized with the computed hash to replace the original one.
-func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error) {
+func (h *hasher) hash(n node, db DatabaseWriter, force bool, threaded bool) (node, node, error) {
// If we're not storing the node, just hashing, use available cached data
if hash, dirty := n.cache(); hash != nil {
if db == nil {
@@ -81,7 +81,7 @@ func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error)
}
}
// Trie not processed yet or needs storage, walk the children
- collapsed, cached, err := h.hashChildren(n, db)
+ collapsed, cached, err := h.hashChildren(n, db, threaded)
if err != nil {
return hashNode{}, n, err
}
@@ -111,7 +111,7 @@ func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error)
// hashChildren replaces the children of a node with their hashes if the encoded
// size of the child is larger than a hash, returning the collapsed node as well
// as a replacement for the original node with the child hashes cached in.
-func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, error) {
+func (h *hasher) hashChildren(original node, db DatabaseWriter, threaded bool) (node, node, error) {
var err error
switch n := original.(type) {
@@ -122,7 +122,7 @@ func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, err
cached.Key = common.CopyBytes(n.Key)
if _, ok := n.Val.(valueNode); !ok {
- collapsed.Val, cached.Val, err = h.hash(n.Val, db, false)
+ collapsed.Val, cached.Val, err = h.hash(n.Val, db, false, threaded)
if err != nil {
return original, original, err
}
@@ -138,15 +138,23 @@ func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, err
collapsed, cached := n.copy(), n.copy()
for i := 0; i < 16; i++ {
if n.Children[i] != nil {
- wg.Add(1)
- go func(i int) {
+ if !threaded {
+ wg.Add(1)
+ go func(i int) {
+ var herr error
+ collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false, true)
+ if herr != nil {
+ err = herr
+ }
+ wg.Done()
+ }(i)
+ } else {
var herr error
- collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false)
+ collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false, true)
if herr != nil {
err = herr
}
- wg.Done()
- }(i)
+ }
} else {
collapsed.Children[i] = valueNode(nil) // Ensure that nil children are encoded as empty strings.
}
diff --git a/trie/proof.go b/trie/proof.go
index 298f648..81f91b3 100644
--- a/trie/proof.go
+++ b/trie/proof.go
@@ -72,7 +72,7 @@ func (t *Trie) Prove(key []byte) []rlp.RawValue {
for i, n := range nodes {
// Don't bother checking for errors here since hasher panics
// if encoding doesn't work and we're not writing to any database.
- n, _, _ = hasher.hashChildren(n, nil)
+ n, _, _ = hasher.hashChildren(n, nil, false)
hn, _ := hasher.store(n, nil, false)
if _, ok := hn.(hashNode); ok || i == 0 {
// If the node's database encoding is a hash (or is the
diff --git a/trie/trie.go b/trie/trie.go
index c211e75..e8c8fd5 100644
--- a/trie/trie.go
+++ b/trie/trie.go
@@ -501,5 +501,5 @@ func (t *Trie) hashRoot(db DatabaseWriter) (node, node, error) {
return hashNode(emptyRoot.Bytes()), nil, nil
}
h := newHasher(t.cachegen, t.cachelimit)
- return h.hash(t.root, db, true)
+ return h.hash(t.root, db, true, false)
}

EDIT: Of course the above diff is very, very ugly code, just a PoC :)
4cf83c9 to 2af4feb
@karalabe Updated
Since this is a utility type used by the hasher, it's cleaner to have it (and its methods) defined at the top of the file. Please bubble them up. Also please add docs for the type to state what it does.
You only use this WaitGroup in a deeper nesting. Please move it inside, there's no need to declare it out here.
I'm a bit reluctant about this code construct. Seems very brittle that we have the same code duplicated, it's asking for trouble down the line when someone wants to modify it. Couldn't we somehow unify the threaded and non-threaded version?
This construct is racy, since multiple goroutines may set the same error variable.
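One race-free alternative (a sketch of the general pattern, not the PR's code; `hashChild` is a made-up stand-in for the per-child work) is to give each goroutine its own error slot and inspect the slots only after `wg.Wait()`, so no two writers ever touch the same variable:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// hashChild is a hypothetical per-child task that may fail.
func hashChild(i int) error {
	if i == 7 {
		return errors.New("child 7 failed")
	}
	return nil
}

func main() {
	var wg sync.WaitGroup
	errs := make([]error, 16) // one slot per child: no shared writes
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = hashChild(i)
		}(i)
	}
	wg.Wait()
	// Only the parent reads the slots, and only after all writers finished.
	for _, e := range errs {
		if e != nil {
			fmt.Println("first error:", e)
			break
		}
	}
}
// prints "first error: child 7 failed"
```

`wg.Wait()` establishes a happens-before edge between every goroutine's write and the parent's reads, so this passes the race detector.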
2af4feb to cfb622a
@karalabe Updated
Callbacks are not the nicest option :) A cleaner solution is to have the *sync.WaitGroup passed as the parameter. Then, if it's not nil, you can do a defer wg.Done(). It's the same thing in essence, just with a bit less boilerplate involved.
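The nil-WaitGroup pattern suggested here also unifies the threaded and non-threaded code paths into one function body. A minimal sketch (names and the squaring "work" are made up for illustration, not geth's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// hashChild runs the same body either inline or on a goroutine. If wg
// is non-nil, the caller has already done wg.Add(1) and we signal
// completion via the deferred Done; if wg is nil, we run serially.
func hashChild(i int, out []int, wg *sync.WaitGroup) {
	if wg != nil {
		defer wg.Done()
	}
	out[i] = i * i // stand-in for the real hashing work
}

func main() {
	out := make([]int, 16)
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		if i%2 == 0 { // pretend even children are worth threading out
			wg.Add(1)
			go hashChild(i, out, &wg)
		} else {
			hashChild(i, out, nil) // serial path, same function body
		}
	}
	wg.Wait()
	fmt.Println(out[3], out[4]) // prints "9 16"
}
```

The one function serves both paths, so a later change to the hashing logic cannot drift between a threaded and a non-threaded copy.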
cfb622a to 7fd6a8d
8f46b6d to d7f1cf1
d7f1cf1 to 4863cc7
Revert "trie: make fullnode children hash calculation concurrently (ethereum#15131)" (ethereum#15889). This reverts commit 0f7fbb8.
Motivation:
For a full node in the trie, the children hash calculation was previously serial. However, the hash calculations of the child nodes do not affect each other, so I made them concurrent.
Performance:
I ran BenchmarkHashBE and BenchmarkHashLE on my server machine (8 cores, 3.7GHz) for performance checking.
For parallel:
For serial:
Performance doubled when hashing a large trie.