[EN Performance] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state #2770
Conversation
Avoid creating separate ledger state during checkpointing. Based on EN2 logs (July 8, 2022), this will:
- reduce operational memory by at least 152 GB
- reduce memory allocations (TBD)
- reduce checkpoint duration by 24 mins (from 45 mins)
```
  go func() {
-     walChan <- l.wal.RecordUpdate(trieUpdate)
+     segmentNum, err := l.wal.RecordUpdate(trieUpdate)
```
After writing the update to WAL, we return the segment num that the update was written to.
If two updates are written to the same segment file, then the same segmentNum will be returned again, right?
> If two updates are written to the same segment file, then the same segmentNum will be returned again, right?

Yes.
Is it possible to return two numbers, the segmentNum and the index of the trie update included in the segment?
So that when subscribing to the SegmentTrie updates, we can double-check the order:
SegmentTrie{trie: trie1, segmentNum: 10, indexInSegment: 4}
SegmentTrie{trie: trie3, segmentNum: 10, indexInSegment: 6}
SegmentTrie{trie: trie2, segmentNum: 10, indexInSegment: 5}
If we receive the trie updates in the above order, then we know it must be inconsistent with the order in the WAL files.
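A minimal sketch of this suggestion, assuming a hypothetical `IndexInSegment` field and `inOrder` helper (neither exists in the PR), showing how a subscriber could detect out-of-order delivery:

```go
package main

import "fmt"

// SegmentTrie here is a simplified stand-in for the PR's type, extended with the
// suggested index of the trie update within its WAL segment.
type SegmentTrie struct {
	TrieName       string // placeholder for the actual trie
	SegmentNum     int
	IndexInSegment int
}

// inOrder reports whether next directly follows prev in WAL order: the next
// index within the same segment, or the first update of a later segment.
func inOrder(prev, next SegmentTrie) bool {
	if next.SegmentNum == prev.SegmentNum {
		return next.IndexInSegment == prev.IndexInSegment+1
	}
	return next.SegmentNum > prev.SegmentNum && next.IndexInSegment == 0
}

func main() {
	a := SegmentTrie{TrieName: "trie1", SegmentNum: 10, IndexInSegment: 4}
	b := SegmentTrie{TrieName: "trie3", SegmentNum: 10, IndexInSegment: 6}
	c := SegmentTrie{TrieName: "trie2", SegmentNum: 10, IndexInSegment: 5}
	fmt.Println(inOrder(a, b)) // false: index 5 is missing, so delivery is out of order
	fmt.Println(inOrder(a, c)) // true
}
```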
```
      return ledger.State(hash.DummyHash), nil, fmt.Errorf("cannot get updated trie: %w", err)
  }

+ l.trieUpdateCh <- &wal.SegmentTrie{Trie: trie, SegmentNum: walResult.segmentNum}
```
After the update has been written to the WAL and a new trie is created, we make a SegmentTrie that contains both the new trie and the segment num that contains the update, and push it to the channel so that the channel subscriber can process it.
Yes, because tries in checkpoints need to be in sync with tries created by updates in WAL segments. So we need to know the new trie and the segment num in order to reuse the trie for checkpointing.
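A self-contained sketch of that flow, with placeholder types standing in for flow-go's real ones (the actual `Ledger.Set` signature and trie construction differ): write the update to the WAL, build the new trie, then publish both the trie and its segment number for the checkpointing subscriber.

```go
package main

import "fmt"

// Placeholder types standing in for flow-go's real ones (hypothetical sketch).
type TrieUpdate struct{ Payload string }

type Trie struct{ Root string }

// SegmentTrie pairs a new trie with the WAL segment its update was written to.
type SegmentTrie struct {
	Trie       *Trie
	SegmentNum int
}

type WAL struct{ segmentNum int }

// RecordUpdate appends the update to the active segment and returns its number.
func (w *WAL) RecordUpdate(update *TrieUpdate) (int, error) { return w.segmentNum, nil }

type Ledger struct {
	wal          *WAL
	trieUpdateCh chan *SegmentTrie
}

// Set mirrors the flow described above: write to WAL, build the new trie,
// then publish both the trie and its segment number for checkpointing reuse.
func (l *Ledger) Set(update *TrieUpdate) error {
	segmentNum, err := l.wal.RecordUpdate(update)
	if err != nil {
		return fmt.Errorf("cannot record update in WAL: %w", err)
	}
	newTrie := &Trie{Root: update.Payload} // stand-in for applying the update to the forest
	l.trieUpdateCh <- &SegmentTrie{Trie: newTrie, SegmentNum: segmentNum}
	return nil
}

func main() {
	l := &Ledger{wal: &WAL{segmentNum: 10}, trieUpdateCh: make(chan *SegmentTrie, 1)}
	_ = l.Set(&TrieUpdate{Payload: "trie1"})
	st := <-l.trieUpdateCh
	fmt.Println(st.Trie.Root, st.SegmentNum) // trie1 10
}
```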
Would the Set method be called concurrently?
Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?
What I'm afraid of is this: imagine the following updates are called concurrently:
- SegmentTrie{trie: trie1, segmentNum: 10}
- SegmentTrie{trie: trie3, segmentNum: 11}
- SegmentTrie{trie: trie2, segmentNum: 10}

Then it's possible that the cache will be inconsistent if trie2 is not included in the previousSegment.
> Would the Set method be called concurrently?

Yes, I asked Maks the same question yesterday and he replied "If fork happens and first collection finishes execution at the same time. Pretty unlikely but can theoretically happen."

> Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?

Yes, you're correct. Although this POC doesn't handle `Set` in parallel, PR #2792 handles parallel execution of `Set`.

This PR #2770 is just a proof-of-concept. It was superseded by PR #2792, which supports parallel `Set` and includes the memory improvements.
```
  }

+ // Add to active segment tries cache
+ if segmentNum == c.tries.segmentNum {
```
We cache an active segment trie in memory, which contains all the trie root nodes made from the updates written to the same segment file.
When a new trie arrives with a different segment num, it means the new trie's update was written to a different segment file. In that case, we must have already seen all the trie root nodes written in the previous segment file, and no more trie nodes will be written to it. So we return the previous segment trie.
When the previous segment trie is returned, it is consistent with all the tries written in that segment file. Since it's consistent, we can reuse the previous segment trie instead of reading the tries again from the segment file, which would increase operational memory. This is the main idea of the optimization.
Yes and very well said! 👍
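A minimal sketch of that caching idea with placeholder types (the field names `tries`/`segmentNum` echo the diff, but the rest is illustrative, not the PR's code): tries are collected per active segment, and the previous segment's tries are handed back as soon as a trie for a newer segment arrives.

```go
package main

import "fmt"

// Trie is a placeholder for a ledger trie root (hypothetical sketch).
type Trie struct{ Root string }

// segmentTries collects the root tries created while one WAL segment is active.
type segmentTries struct {
	segmentNum int
	tries      []*Trie
}

// activeSegmentCache keeps only the tries for the currently active segment.
type activeSegmentCache struct {
	tries *segmentTries
}

// Add caches a trie for the active segment. When a trie for a newer segment
// arrives, the previous segment must be complete, so its tries are returned
// for checkpointing and a fresh cache is started for the new segment.
func (c *activeSegmentCache) Add(t *Trie, segmentNum int) (completed *segmentTries) {
	if c.tries == nil {
		c.tries = &segmentTries{segmentNum: segmentNum}
	}
	if segmentNum == c.tries.segmentNum {
		c.tries.tries = append(c.tries.tries, t)
		return nil
	}
	completed = c.tries
	c.tries = &segmentTries{segmentNum: segmentNum, tries: []*Trie{t}}
	return completed
}

func main() {
	c := &activeSegmentCache{}
	c.Add(&Trie{Root: "trie1"}, 10)
	c.Add(&Trie{Root: "trie2"}, 10)
	done := c.Add(&Trie{Root: "trie3"}, 11)       // segment 10 is finalized
	fmt.Println(done.segmentNum, len(done.tries)) // 10 2
}
```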
```
  c.Lock()
  defer c.Unlock()

  err := c.forest.AddTries(tries)
```
This adds all the trie nodes of a segment to the forest, and checks whether checkpointing should be triggered by comparing against the checkpoint distance config.
If checkpointing should be triggered, it returns the last X trie nodes to be included in a new checkpoint file.
Yes, exactly! 👍
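A hedged sketch of that step with simplified placeholder types (the real code adds tries to the forest via `AddTries`; the counter, distance check, and reset shown here are illustrative):

```go
package main

import "fmt"

// Trie is a placeholder for a ledger trie root (hypothetical sketch).
type Trie struct{ Root string }

// compactor accumulates finalized segment tries and decides when to checkpoint.
type compactor struct {
	tries                   []*Trie // tries added since the last checkpoint, in WAL order
	segmentsSinceCheckpoint int
	checkpointDistance      int // trigger a checkpoint every N finalized segments
}

// addSegmentTries adds one finalized segment's tries (the forest would do this
// via AddTries) and, if enough segments have accumulated, returns the tries to
// include in a new checkpoint file and resets the counter.
func (c *compactor) addSegmentTries(tries []*Trie) (toCheckpoint []*Trie) {
	c.tries = append(c.tries, tries...)
	c.segmentsSinceCheckpoint++
	if c.segmentsSinceCheckpoint < c.checkpointDistance {
		return nil
	}
	toCheckpoint = c.tries
	c.tries = nil
	c.segmentsSinceCheckpoint = 0
	return toCheckpoint
}

func main() {
	c := &compactor{checkpointDistance: 2}
	fmt.Println(len(c.addSegmentTries([]*Trie{{Root: "t1"}})))               // 0: only 1 of 2 segments so far
	fmt.Println(len(c.addSegmentTries([]*Trie{{Root: "t2"}, {Root: "t3"}}))) // 3: checkpoint these tries
}
```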
At this point, we assume that the tries are consistent with those in the WAL files.
Consistent means both the total number and the order are consistent.
I wonder if it's possible to sanity check that assumption at low cost.
How is the WAL file encoded? Is it possible to scan through a range of WAL files and read only the file header (or footer) for the number of trie nodes and their root hashes, without loading the entire file into memory?
> How is the WAL file encoded? Is it possible to scan through a range of WAL files and read only the file header (or footer) for the number of trie nodes and their root hashes, without loading the entire file into memory?

I think the WAL file uses a proprietary encoding for records. I don't think there is a quick way to scan that information without reading the entire file.

> At this point, we assume that the tries are consistent with those in the WAL files.

> I wonder if it's possible to sanity check that assumption at low cost.

I think PR #2792 enforces this assumption with communication between `Compactor` and `Ledger.Set`.

Also, `TestCompactorAccuracy` in PR #2792 passes. It creates ~30 segments and triggers checkpointing every 5 segments. The test expects checkpointed tries to match replayed tries. Replayed tries are tries updated by replaying all WAL segments (from segment 0, ignoring prior checkpoints) up to the checkpoint number. This verifies that checkpointed tries are a snapshot of segments at a segment boundary. `TestCompactorConcurrency` also tests `Set` in parallel.
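A simplified, hypothetical sketch of that verification idea (not the actual `TestCompactorAccuracy` code): tries replayed from WAL segment 0 must match the checkpointed tries in both count and order.

```go
package main

import "fmt"

// verifyCheckpointMatchesReplay sketches the verification idea described above:
// root hashes replayed from WAL segment 0 must match, in both count and order,
// the root hashes written to the checkpoint.
func verifyCheckpointMatchesReplay(checkpointRoots, replayedRoots []string) error {
	if len(checkpointRoots) != len(replayedRoots) {
		return fmt.Errorf("trie count mismatch: checkpoint %d vs replay %d",
			len(checkpointRoots), len(replayedRoots))
	}
	for i := range checkpointRoots {
		if checkpointRoots[i] != replayedRoots[i] {
			return fmt.Errorf("trie %d root mismatch: %s vs %s",
				i, checkpointRoots[i], replayedRoots[i])
		}
	}
	return nil
}

func main() {
	err := verifyCheckpointMatchesReplay(
		[]string{"root1", "root2", "root3"},
		[]string{"root1", "root2", "root3"},
	)
	fmt.Println(err) // <nil>: checkpointed tries are a snapshot at the segment boundary
}
```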
```
  go func() {
-     walChan <- l.wal.RecordUpdate(trieUpdate)
+     segmentNum, err := l.wal.RecordUpdate(trieUpdate)
```
Writing to WAL files happens concurrently, so we need a way to ensure that the order of trie updates written to the file and the order of updates pushed to trieUpdateCh are consistent. The current implementation cannot guarantee that.
> The current implementation can not guarantee that

PR #2770 is no longer the current implementation; it was superseded by PR #2792 days ago.

In PR #2792, when `Compactor` receives a `trieUpdate` from the channel, `Compactor`:
- writes the update to the WAL,
- signals to `ledger.Set` when the WAL update is completed,
- waits for trie update completion (adding the new trie to ledger state).

Because of these steps, the order of trie updates and WAL updates stays consistent (a minimal sketch of this hand-off is shown below).
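A minimal sketch of that hand-off, assuming hypothetical signalling channels (`walDone`, `trieDone`) rather than PR #2792's actual mechanism; it shows how write-to-WAL, signal-to-`Set`, and wait-for-trie keep WAL order and trie order consistent:

```go
package main

import (
	"fmt"
	"sync"
)

// trieUpdate carries the update plus two hypothetical signalling channels.
type trieUpdate struct {
	payload  string
	walDone  chan struct{} // compactor closes this once the update is in the WAL
	trieDone chan struct{} // Set closes this once the new trie exists in ledger state
}

func compactor(updates <-chan *trieUpdate, wg *sync.WaitGroup) {
	defer wg.Done()
	for u := range updates {
		fmt.Println("WAL: recorded", u.payload) // 1. write update to WAL
		close(u.walDone)                        // 2. signal ledger.Set that the WAL write completed
		<-u.trieDone                            // 3. wait for the new trie to be added to ledger state
	}
}

func set(updates chan<- *trieUpdate, payload string) {
	u := &trieUpdate{payload: payload, walDone: make(chan struct{}), trieDone: make(chan struct{})}
	updates <- u
	<-u.walDone                                  // wait until the update is durable in the WAL
	fmt.Println("Ledger: new trie for", payload) // create the new trie
	close(u.trieDone)                            // let the compactor move on to the next update
}

func main() {
	updates := make(chan *trieUpdate)
	var wg sync.WaitGroup
	wg.Add(1)
	go compactor(updates, &wg)
	set(updates, "update-1")
	set(updates, "update-2")
	close(updates)
	wg.Wait()
}
```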
Closing this because PR #2792 was approved. I kept this open just in case Maks and Ramtin discovered a blocker requiring us to reconsider using some of this approach.
Changes
Avoid creating separate ledger state during checkpointing.
Updates #2286
Big thanks to @ramtinms for taking a look and providing useful insights.
Impact
Based on EN2 logs (July 8, 2022), reusing ledger state should:
- reduce operational memory by at least 152 GB
- reduce memory allocations (TBD)
- reduce checkpoint duration by 24 mins (from 45 mins)
Context
A recent increase in transactions is causing WAL files to be created more frequently, which causes checkpoints to happen more often, increases checkpoint file size, and increases ledger state size in memory.
Earlier this year:
As of July 8, 2022:
Design
The goal is to reduce operational RAM, reduce allocations, and speed up checkpointing by not creating a separate ledger state.
To achieve these goals, this PR doesn't create a separate ledger state and:
Three new goroutines are used by this PR (a rough sketch of how they connect is shown after this list):
- `activeSegmentTrieCompactor`: receives new tries from `Ledger` and hands the cached tries for the active segment to `segmentsTrieCompactor` when the active segment is finalized (a new segment is created)
- `segmentsTrieCompactor`: collects finalized segment tries and triggers checkpointing
- `CachedCompactor`: coordinates `activeSegmentTrieCompactor` and `segmentsTrieCompactor`
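A rough, hypothetical sketch of how such a pipeline could be wired; the channel types and function signatures are illustrative, not the PR's:

```go
package main

import "fmt"

// segmentTrie pairs a trie root with the WAL segment its update landed in.
type segmentTrie struct {
	root       string
	segmentNum int
}

// activeSegmentTrieCompactor groups incoming tries by active segment and forwards
// each finalized segment's tries once a trie for a newer segment arrives.
func activeSegmentTrieCompactor(in <-chan segmentTrie, out chan<- []segmentTrie) {
	var current []segmentTrie
	currentSeg := -1
	for st := range in {
		if currentSeg != -1 && st.segmentNum != currentSeg {
			out <- current // active segment finalized: hand tries to segmentsTrieCompactor
			current = nil
		}
		currentSeg = st.segmentNum
		current = append(current, st)
	}
	if len(current) > 0 {
		out <- current
	}
	close(out)
}

// segmentsTrieCompactor collects finalized segments; a real implementation would
// trigger checkpointing once enough segments have accumulated.
func segmentsTrieCompactor(in <-chan []segmentTrie, done chan<- struct{}) {
	for seg := range in {
		fmt.Printf("segment %d finalized with %d tries\n", seg[0].segmentNum, len(seg))
	}
	close(done)
}

func main() {
	trieCh := make(chan segmentTrie)
	segCh := make(chan []segmentTrie)
	done := make(chan struct{})

	// CachedCompactor's role in this sketch: start and connect the two workers.
	go activeSegmentTrieCompactor(trieCh, segCh)
	go segmentsTrieCompactor(segCh, done)

	for _, st := range []segmentTrie{{"t1", 10}, {"t2", 10}, {"t3", 11}} {
		trieCh <- st // tries published by Ledger after each WAL write
	}
	close(trieCh)
	<-done
}
```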
Other Approaches
I will open a separate PR with a different approach. It will have looser coupling between layers and fewer edge cases to handle, but the tradeoff is the introduction of one or two locks.
TODO: