-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize MTrie Checkpoint (regCount & regSize): -9GB alloc/op, -110 milllion allocs/op, -4GB file size #2126
Conversation
Reduce RAM use, file size, and improve speed. E.g. reduce checkpoint file size by over 4GB and RAM usage by a similar amount. Speed will also benefit from these changes (benchmarks coming.) lowestHeightTouched returned by update() is the lowest height reached during recursive update. Unlike maxDepth, lowestHeightTouched isn't affected by prune flag. It's mostly new/updated node height. It can also be height of new node created from compact leaf at a higher height. Changes include: - remove Node.regCount (8 bytes) and Node.maxDepth (2 bytes) - remove regCount (8 bytes) and maxDepth (2 bytes) from checkpoint node serialization - add MTrie.regCount (8 bytes) and MTrie.regSize (8 bytes) - add regCount (8 bytes) and regSize (b bytes) to checkpoint trie serialization - modify trie update() to return regCountDelta, regSizeDelta, and lowestHeightTouched so that NewTrieWithUpdatedRegisters() can compute regCount, regSize, and maxDepthTouched for the updated trie - fix Payload.Equals() crash bug when p is nil - fix Payload.Equals() bug when comparing nil payload with empty payload
Codecov Report
@@ Coverage Diff @@
## master #2126 +/- ##
==========================================
+ Coverage 57.50% 57.52% +0.02%
==========================================
Files 637 641 +4
Lines 37261 37884 +623
==========================================
+ Hits 21426 21794 +368
- Misses 13151 13328 +177
- Partials 2684 2762 +78
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Output variables shared outside goroutine are allocated on the heap. Use channel to transfer output variables to avoid heap allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the cleanup. The only aspect I am concerned about is the different convention introduced in this PR.
Existing conventions
- In our abstract storage model, we have registers
- The Trie extends the convention of allocated registers:
- The Trie must store allocated registers. An allocated register has a hash that is necessarily different than the default hash, because an allocated register has a non-empty value, which is part of the hash. Furthermore, also the hash of a sub-trie with any allocated registers has a non-default hash.
- The trie may store unallocated registers. We specifically designed the hash function such that it returns the default hash for an unallocated register.
Based on this convention, the Trie distinguishes registers only based on whether or not they have an empty Value
.
Now, lets look at regCount
(I'll get to regSize
later). There are 3 possible conventions that are consistent with the conventions above:
- Option 1: Theoretically, we could count all registers in the storage model (no matter whether allocated or not). That would be a constant number of 2^256. Not very interesting.
- Option 2: We only count the allocated registers. That would tell us how much registers the application layer is actively using; but not provide any information about the number of registers that are explicitly represented in the trie despite being unallocated.
- Option 3: We count all registers that explicitly represented in the trie (no matter whether they are allocated or unallocated).
For practical purposes, I would expect that Option 2 and Option 3 are close to identical. All tries that we keep in memory are pruned, so there should be no unallocated registers represented in the tree. Not sure what your intended usage is for regCount
(?). Given that there is little practical difference, my gut feeling is that option 3 would be easiest to implement consistently.
Comparison to convention introduced in this PR
- In this PR you start a different convention by working with the
Payload.Size()
. This is inconsistent with the existing convention because:- In the existing convention, the trie algorithms only destingush between allocated and unallocated registers. The
Payload.Keys
is auxiliary data which the higher-level business logic can use arbitrarily; the trie does not make any assumptions about howPayload.Keys
is used. - In your implementation, the treatment of unallocated registers suddenly changes depending on whether or not
Payload.Keys
is set. This creates dependencies / assumptions about usage patters of the application logic (more inter-dependencies -> more brittle code). Furthermore, it significantly increases the complexity of the logic, because we suddenly move from a two-state register attribute of{allocated; unallocated}
to a 4-state attribute<Payload.Keys present? Yes or No> x <Payload.Value present? Yes or No>
.
- In the existing convention, the trie algorithms only destingush between allocated and unallocated registers. The
Suggestion:
I would suggest to implement Option 3:
- If we add a new leaf,
regCount
increases by one; no matter whether or not the register is allocated. The register exists in the tree, so it takes up storage and hence is worth-while to monitor. - This convention also makes
regSize
easy to monitor, because you don't have to distinguish between allocated and unallocated registers either. As soon as we represent a register in the tree, we countpayloads[0].Size()
as its storage footprint.
ledger/complete/mtrie/trie/trie.go
Outdated
// runtime optimization: process the left child is a separate thread | ||
wg := sync.WaitGroup{} | ||
wg.Add(1) | ||
go func() { | ||
defer wg.Done() | ||
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
}() | ||
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
wg.Wait() | ||
|
||
// Since we're receiving 4 items from goroutine, use a | ||
// channel to prevent them going on the heap. | ||
type updateResult struct { | ||
child *node.Node | ||
regCountDelta int64 | ||
regSizeDelta int64 | ||
lowestHeightTouched int | ||
} | ||
results := make(chan updateResult, 1) | ||
go func(retChan chan<- updateResult) { | ||
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched} | ||
}(results) | ||
|
||
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
|
||
// Wait for results from goroutine. | ||
ret := <-results | ||
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you benchmarked the construction with the channels here? This is very performance-sensitive code. I would guess that a waitgroup is much more performant than a channel (there are multiple synchronization paradigms involved in a channel; here the implementation).
Suggestion:
I think this synchronization can be implemented as easily using WaitGroup:
- Create an instance of
updateResult
and awg:= sync.WaitGroup{}
- spawn the auxiliary go routine
- The memory model guarantees that step 1 happens before any statement within the routine
- write the return values to
updateResult
- in the main thread
wg.Wait()
for the auxiliary thread. This guarantees that the operations in the auxiliary thread happened before theWait()
returns.- Note: While Go's memory model does not talk about
WaitGroups
, they also establish a happens-before. Hence,
- Note: While Go's memory model does not talk about
In summary, I think the entire else-branch can be written as follows
// runtime optimization: process the left child is a separate thread | |
wg := sync.WaitGroup{} | |
wg.Add(1) | |
go func() { | |
defer wg.Done() | |
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | |
}() | |
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | |
wg.Wait() | |
// Since we're receiving 4 items from goroutine, use a | |
// channel to prevent them going on the heap. | |
type updateResult struct { | |
child *node.Node | |
regCountDelta int64 | |
regSizeDelta int64 | |
lowestHeightTouched int | |
} | |
results := make(chan updateResult, 1) | |
go func(retChan chan<- updateResult) { | |
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | |
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched} | |
}(results) | |
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | |
// Wait for results from goroutine. | |
ret := <-results | |
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched | |
// runtime optimization: process the left child is a separate thread | |
wg := sync.WaitGroup{} | |
wg.Add(1) | |
go func() { | |
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | |
}() | |
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | |
wg.Wait() // WaitGroup induces happens-before relationship: https://groups.google.com/g/golang-nuts/c/5oHzhzXCcmM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you benchmarked the construction with the channels here? This is very performance-sensitive code.
Yes. Last week, I ran non-micro benchmarks. Checkpoint creation benchmark using mainnet data (replays 264881 updates from WALs). Using channel
instead of WaitGroup
was faster and reduced allocs by hundreds of millions.
I also ran micro benchmarks yesterday and they confirm the suggested WaitGroup
approach can be slower than this PR by about 12%-21% while also causing extra 1-4 allocs/op depending on details (see below).
I would guess that a waitgroup is much more performant than a channel
Although WaitGroup
approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it. So using channel
is faster and uses fewer allocs/op
in this case.
Suggestion:
I think this synchronization can be implemented as easily using WaitGroup:
- Create an instance of
updateResult
and awg:= sync.WaitGroup{}
- spawn the auxiliary go routine
...
In summary, I think the entire else-branch can be written as follows
The suggested approach in the code snippet (screenshot) can be slower by around 21% and also cause an extra 4 allocs/op.
The extra allocs/op (lChild
, lRegCountDelta
, lRegSizeDelta
, lLowestHeightTouched
) is incurred for all recursive calls, even for calls that use the "if-else" branch avoiding the goroutine.
- Create an instance of
updateResult
and awg:= sync.WaitGroup{}
If the 4 values are put into updateResult
struct, then using the suggested WaitGroup
approach still causes 1 extra alloc/op and it's still slower than using channel
in this PR by around 12%.
More Details
For micro benchmarks today, I used Go 1.17.8 on linux_amd64 (Haswell Xeon) with this code:
https://gist.github.com/fxamacker/4cec13885dcf2c19d055b13f2b51289b
There are two groups of benchmarks:
- WaitGroup vs channel (signal only, no data values communicated)
- communicating values from goroutine
WaitGroup vs channel
With one goroutine, it is faster to use channel.
name old time/op new time/op delta
WaitGroup-8 471ns ± 0% 448ns ± 0% -4.79% (p=0.000 n=20+20)
name old alloc/op new alloc/op delta
WaitGroup-8 32.0B ± 0% 112.0B ± 0% +250.00% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
WaitGroup-8 2.00 ± 0% 2.00 ± 0% ~ (all equal)
Communicating values from goroutine
Communicating 4 ints using referenced variables with WaitGroup causes 4 integers to allocate on the heap, and is also slower.
> go-benchrun WaitGroupReferenceFourInts ChannelResult -benchmem -count=20
name old time/op new time/op delta
WaitGroupReferenceFourInts-8 580ns ± 0% 457ns ± 0% -21.15% (p=0.000 n=19+20)
name old alloc/op new alloc/op delta
WaitGroupReferenceFourInts-8 96.0B ± 0% 112.0B ± 0% +16.67% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
WaitGroupReferenceFourInts-8 6.00 ± 0% 2.00 ± 0% -66.67% (p=0.000 n=20+20)
Communicating 1 wrapper referenced struct with WaitGroup requires allocating struct on the heap, and is slower.
> go-benchrun WaitGroupReferenceOneStruct ChannelResult -benchmem -count=20
name old time/op new time/op delta
WaitGroupReferenceOneStruct-8 520ns ± 0% 458ns ± 0% -11.91% (p=0.000 n=19+17)
name old alloc/op new alloc/op delta
WaitGroupReferenceOneStruct-8 72.0B ± 0% 112.0B ± 0% +55.56% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
WaitGroupReferenceOneStruct-8 3.00 ± 0% 2.00 ± 0% -33.33% (p=0.000 n=20+20)
Although WaitGroup
approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it. So using channel
is faster and uses fewer allocs/op
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed benchmarking result. Makes sense. Could you add a brief TLDR to the code, so that we document this consideration, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 051ba8e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting results! I didn't expect channel
to be faster than waitGroup
in this case.
When compactLeaf from a higher height needs to be created at a lower height during update, its payload doesn't need to deep copied because we are reusing the trie internal reference and payloads of leafs are immutable.
A register is allocated if and only if it has a non-empty value. A register is unallocated if and only if its value is empty. In update(), use empty payload value to distinguish allocated vs unallocated registers to adjust allocatedRegCountDelta and allocatedRegSizeDelta.
@AlexHentschel Thanks for the detailed explanation! 😄 It cleared up a misunderstanding. I was using payload size instead of payload value size to test whether a register is allocated.
I considered Options 2 & 3 before opening this PR, and created this PR for Option 2 (only track allocated registers). I didn't pick Option 3 (track both allocated and unallocated registers in the trie) because it seemed keeping track of all registers in the trie requires keeping track of unallocated registers which might be removed by pruning. So option 3 shifts complexity from detecting "allocated vs unallocated" to detecting "whether a pruned child node was just unallocated". Please let me know if I missed anything that should be considered when choosing between Option 2 & 3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, thanks for doing it, looks good to me.
In v0, length of encoded payload value is encoded in 8 bytes. In v1, length of encoded payload value is encoded in 4 bytes. Payload decoding handles both v0 and v1. Payload is encoded according to version received as parameter. This allows gradual migration of other encoding formats using payload, such as TrieUpdate and TrieProof.
Bump version of checkpoint file to v5. Support reading v4 for compatibility. Add tests and test data to read v4.
ledger/complete/mtrie/trie/trie.go
Outdated
} | ||
|
||
// update traverses the subtree and updates the stored registers | ||
// update traverses the subtree, updates the stored registers, and returns: | ||
// * new or orignial node (n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// * new or orignial node (n) | |
// * new or original node (n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 953571a.
ledger/complete/mtrie/trie/trie.go
Outdated
if payloads[i].IsEmpty() { | ||
// Old payload is not empty while new payload is empty. | ||
// Allocated register will be unallocated. | ||
allocatedRegCountDelta = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❗
I am not sure this logic here is correct in all cases. Depending on whether the prune
parameter is set or not, we might add registers with empty Value
Consider the following sequence of events:
- a register with empty Value is added with a
NewTrieWithUpdatedRegisters
call withprune = false
flow-go/ledger/complete/mtrie/trie/trie.go
Lines 350 to 353 in ae98303
if payloads[0].IsEmpty() { // Unallocated register doesn't affect allocatedRegCountDelta and allocatedRegSizeDelta. return n, 0, 0, nodeHeight }
this will not increase theregCount
- In a second call, with
NewTrieWithUpdatedRegisters
call withprune = true
we remove the leaf. This will decreaseregCount
.
Due to the existence of the prune
flag, you can't make any assumption whether an existing leaf represents an allocated or an unallocated register, because with prune = false
, our trie will store unallocasted registers. Hence, whenever you change the value (meaning the register was already explicitly represented in the Trie before the update), you have to consider all 4 possibilities:
{Before the update, the leaf represented an allocated or unallocated register?}
x {After the update, the leaf represents an allocated or unallocated register?}
You could extract the logic into a sub-function, which would also increase readability of the code:
func computeDeltas(oldNode, newNode *node.Node) (allocatedRegCountDelta int64, allocatedRegSizeDelta int64) {
var oldNodeIsAllocated, newNodeIsAlocated int64
var oldNodeSize, newNodeSize int
if oldNode != nil && !oldNode.Payload().IsEmpty() {
oldNodeIsAllocated = 1
oldNodeSize = oldNode.Payload().Size()
}
if newNode != nil && !newNode.Payload().IsEmpty() {
newNodeIsAlocated = 1
newNodeSize = newNode.Payload().Size()
}
allocatedRegCountDelta = newNodeIsAlocated - oldNodeIsAllocated
allocatedRegSizeDelta = int64(newNodeSize - oldNodeSize)
return
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this logic here is correct in all cases. Depending on whether the prune parameter is set or not, we might add registers with empty Value
Consider the following sequence of events:
- a register with empty Value is added with a NewTrieWithUpdatedRegisters call with prune = false
...
this will not increase the regCount- In a second call, with NewTrieWithUpdatedRegisters call with prune = true we remove the leaf. This will decrease regCount.
I think the logic correctly handles the sequence of events mentioned. Because in step 2, we don't decrease the reg count when prune = true
. Pruning is not a consideration since reg count is decreased only when allocated payload is unallocated (not pruned).
To confirm, I added TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag
just now in d9573d0 and cbb8805.
TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated
register count and size for updated trie with mixed pruning flag.
It tests the following updates:
* step 1 : update empty trie with new paths and payloads
(255 allocated registers)
* step 2 : remove a payload without pruning
(254 allocated registers)
* step 3a: remove previously removed payload with pruning
(254 allocated registers)
* step 3b: update trie from step 2 with a new payload (sibling of
removed payload) with pruning (255 allocated registers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could extract the logic into a sub-function, which would also increase readability of the code:
func computeDeltas(...
I agree and extracted to two functions in 4bcc1c8
// computeAllocatedRegDeltasFromHigherHeight returns the deltas needed
// to compute the allocated reg count and reg size when a payload
// needs to be updated or unallocated at a lower height.
func computeAllocatedRegDeltasFromHigherHeight(oldPayload *ledger.Payload)
(allocatedRegCountDelta, allocatedRegSizeDelta int64)
// computeAllocatedRegDeltas returns the allocated reg count and reg size deltas
// computed from old payload and new payload.
// CAUTION: !oldPayload.Equals(newPayload)
func computeAllocatedRegDeltas(oldPayload, newPayload *ledger.Payload)
(allocatedRegCountDelta, allocatedRegSizeDelta int64)
UPDATE: added "compute" prefix to the function names
ledger/complete/mtrie/trie/trie.go
Outdated
// runtime optimization: process the left child is a separate thread | ||
wg := sync.WaitGroup{} | ||
wg.Add(1) | ||
go func() { | ||
defer wg.Done() | ||
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
}() | ||
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
wg.Wait() | ||
|
||
// Since we're receiving 4 items from goroutine, use a | ||
// channel to prevent them going on the heap. | ||
type updateResult struct { | ||
child *node.Node | ||
regCountDelta int64 | ||
regSizeDelta int64 | ||
lowestHeightTouched int | ||
} | ||
results := make(chan updateResult, 1) | ||
go func(retChan chan<- updateResult) { | ||
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched} | ||
}(results) | ||
|
||
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
|
||
// Wait for results from goroutine. | ||
ret := <-results | ||
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed benchmarking result. Makes sense. Could you add a brief TLDR to the code, so that we document this consideration, please.
Thanks for your explanation why you went with Option 2.
I think you are right. I gave it some thought and in fact it seems Option 2 would be even more complex, as it interacts with the pruning logic. Great insight. Thanks. |
TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated register count and size for updated trie with mixed pruning flag. It tests the following updates with prune flag set to true: * step 1 : update empty trie with new paths and payloads (255 allocated registers) * step 2 : remove a payload without pruning (254 allocated registers) * step 3a: remove previously removed payload with pruning (254 allocated registers) * step 3b: update trie from step 2 with a new payload (sibling of removed payload) with pruning (255 allocated registers)
- computeAllocatedRegDeltasFromHigherHeight - computeAllocatedRegDeltas
Update checkpoint file format version from v4 to v5
…ad-value-length-in-checkpoint Reduce checkpoint file size by using fewer bytes to encode length of encoded payload value
} | ||
oldPayloadSize := oldPayload.Size() | ||
allocatedRegSizeDelta -= int64(oldPayloadSize) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for preferring subtractions over direct assignment? (I guess the compiler may generate the same code in this case)
allocatedRegCountDelta = -1
and
allocatedRegSizeDelta = -int64(oldPayloadSize)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarakby Good question! No preference, it was just extracted code from a larger function I didn't bother modifying. 😄
Go 1.17 compiler produces the same binary. See code snippet comparison at godbolt.org.
Unfortunately, Compiler Explorer forgets the scrolled position of the 2 compiler output views. It's useful to see side-by-side while hovering over the source code window.
ledger/complete/mtrie/trie/trie.go
Outdated
// runtime optimization: process the left child is a separate thread | ||
wg := sync.WaitGroup{} | ||
wg.Add(1) | ||
go func() { | ||
defer wg.Done() | ||
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
}() | ||
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
wg.Wait() | ||
|
||
// Since we're receiving 4 items from goroutine, use a | ||
// channel to prevent them going on the heap. | ||
type updateResult struct { | ||
child *node.Node | ||
regCountDelta int64 | ||
regSizeDelta int64 | ||
lowestHeightTouched int | ||
} | ||
results := make(chan updateResult, 1) | ||
go func(retChan chan<- updateResult) { | ||
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune) | ||
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched} | ||
}(results) | ||
|
||
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune) | ||
|
||
// Wait for results from goroutine. | ||
ret := <-results | ||
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting results! I didn't expect channel
to be faster than waitGroup
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to see the register count and max depth removed from the node struct. Nice work!
Description
Node.regCount
andNode.maxDepth
to reduce file size and memoryCloses #1747
Closes #1748
Closes #2125
Updates #1744
Updates #1884 https://github.com/dapperlabs/flow-go/issues/6114
Big thanks to @ramtinms for opening issues #1747, #1748, and for clarifying things. 👍
Changes include:
remove Node.regCount (8 bytes) and Node.maxDepth (2 bytes)
remove regCount (8 bytes) and maxDepth (2 bytes) from checkpoint node serialization
add MTrie.regCount (8 bytes) and MTrie.regSize (8 bytes)
add regCount (8 bytes) and regSize (b bytes) to checkpoint trie serialization
modify trie update() to return regCountDelta, regSizeDelta, and lowestHeightTouched so that NewTrieWithUpdatedRegisters() can compute regCount, regSize, and maxDepthTouched for the updated trie
fix Payload.Equals() crash bug when p is nil
fix Payload.Equals() bug when comparing nil payload with empty payload
optimize MTrie update() to reduce heap allocations
Note
lowestHeightTouched returned by update() is the lowest height reached during recursive update. Unlike maxDepth, lowestHeightTouched isn't affected by prune flag. It's mostly new/updated node height. It can also be height of new node created from compact leaf at a higher height.
Preliminary Results
Unoptimized Checkpoint Creation
After the first 30 minutes and for next 11+ hours on benchnet (15-17+ hours on EN3):
Optimized Checkpoint Creation (PR #1944)
Finishes in about 15 minutes and peaks in the last minute at:
Optimized Checkpoint Creation (PR #1944 + PR #2126)
Finishes in 14 - 15 minutes and peaks in the last minute at:
New Checkpoint (Load Old + Replay 41 WALs + Create New)
PR #1944 vs PR #2126
Mainnet vs (PR #1944 + PR #2126)
Go 1.16 on benchnet (the big one)
NOTE: File system cache can make the duration fluctuate. The first run after OS reboot can take 20-60 secs longer due to cache and other reasons (e.g. not waiting long enough after OS boots).