
[EN Performance] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state #2770

@fxamacker fxamacker commented Jul 8, 2022

Changes

Avoid creating separate ledger state during checkpointing.

⚠️ DO NOT REVIEW YET. This PR is a work-in-progress POC that might be replaced with a new PR. I will open another PR using an alternate approach that has looser coupling between layers and fewer edge cases to handle, but it requires locks. ⚠️

Updates #2286

Big thanks to @ramtinms for taking a look and providing useful insights.

Impact

Based on EN2 logs (July 8, 2022), reusing ledger state should:

  • reduce operational memory by at least 152GB
  • reduce memory allocations (TBD)
  • reduce checkpoint duration by 24 mins (more than half)

Context

The recent increase in transactions is causing WAL files to be created more frequently, which makes checkpoints happen more often, increases checkpoint file size, and increases ledger state size in memory.

Earlier this year:

  • checkpoint frequency: about 0-2 times per day
  • checkpoint file size: 53GB (if using today's file format)

As of July 8, 2022:

  • checkpoint frequency: about every 2 hours
  • checkpoint file size: 126GB

Design

Goal is to reduce operational RAM, reduce allocations, and speed up checkpointing by not creating a separate ledger state.

To achieve these goals, this PR doesn't create a separate ledger state and:

  • avoids blocking with communications between main ledger state and cached ledger state
  • avoids blocking with time-consuming tasks such as creating checkpoint
  • retains as few cached mtries as possible before checkpointing

Three new goroutines are used by this PR:

activeSegmentTrieCompactor:

  • receives the updated mtrie and segment number from Ledger
  • caches mtries for the active segment
  • sends batched mtries to segmentsTrieCompactor when the active segment is finalized (a new segment is created)

segmentsTrieCompactor:

  • receives batched mtries for finalized segments
  • caches mtries in the forest for finalized segments
  • launches a checkpointing goroutine when enough finalized segments have accumulated

CachedCompactor:

  • starts and stops activeSegmentTrieCompactor and segmentsTrieCompactor
  • calls observers when needed
  • deletes extra checkpoint files
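The three goroutines above can be sketched roughly as a channel pipeline. Everything below is illustrative, not the actual flow-go API: SegmentTrie here is a simplified stand-in (the real code pairs *trie.MTrie with a segment number), and the batching/trigger logic is reduced to its essentials.

```go
package main

import "fmt"

// SegmentTrie pairs a trie root with the WAL segment its update was written to.
// Trie is a string stand-in for the real *trie.MTrie.
type SegmentTrie struct {
	Trie       string
	SegmentNum int
}

// activeSegmentTrieCompactor batches tries belonging to the same active
// segment and forwards the batch once a new segment number is seen
// (i.e. the previous segment is finalized).
func activeSegmentTrieCompactor(in <-chan SegmentTrie, out chan<- []SegmentTrie) {
	var batch []SegmentTrie
	current := -1
	for st := range in {
		if current != -1 && st.SegmentNum != current {
			out <- batch // active segment finalized: flush the batch
			batch = nil
		}
		current = st.SegmentNum
		batch = append(batch, st)
	}
	if len(batch) > 0 {
		out <- batch
	}
	close(out)
}

// segmentsTrieCompactor accumulates finalized segments and triggers
// checkpointing every checkpointDistance segments.
func segmentsTrieCompactor(in <-chan []SegmentTrie, checkpointDistance int, checkpoint func(n int)) {
	finalized := 0
	for batch := range in {
		finalized++
		_ = batch // the real code adds the batch's tries to a forest here
		if finalized >= checkpointDistance {
			checkpoint(finalized)
			finalized = 0
		}
	}
}

func main() {
	updates := make(chan SegmentTrie)
	batches := make(chan []SegmentTrie)

	go activeSegmentTrieCompactor(updates, batches)
	done := make(chan struct{})
	go func() {
		segmentsTrieCompactor(batches, 2, func(n int) {
			fmt.Printf("checkpoint after %d finalized segments\n", n)
		})
		close(done)
	}()

	for _, st := range []SegmentTrie{
		{"trie1", 10}, {"trie2", 10}, {"trie3", 11}, {"trie4", 12},
	} {
		updates <- st
	}
	close(updates)
	<-done // prints: checkpoint after 2 finalized segments
}
```

In the real PR, CachedCompactor owns the lifecycle of both goroutines; here they are just wired directly for brevity.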

Other Approaches

I will open a separate PR with a different approach. It will have looser coupling between layers and fewer edge cases to handle, but the tradeoff is the introduction of one or two locks.

TODO:

  • add more tests
  • handle edge cases
  • handle errors

Avoid creating separate ledger state during checkpointing.

Based on EN2 logs (July 8, 2022), this will
- reduce operational memory by at least 152GB
- reduce memory allocations (TBD)
- reduce checkpoint duration by 24 mins (from 45 mins)
@fxamacker fxamacker self-assigned this Jul 8, 2022
@fxamacker fxamacker marked this pull request as draft July 8, 2022 16:39
@fxamacker fxamacker added the Execution Cadence Execution Team label Jul 12, 2022
@fxamacker fxamacker changed the title [Execution Node] [WIP] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state [Execution Node] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state Jul 13, 2022

go func() {
walChan <- l.wal.RecordUpdate(trieUpdate)
segmentNum, err := l.wal.RecordUpdate(trieUpdate)
Member:

After writing the update to WAL, we return the segment num that the update was written to.

If two updates are written to the same segment file, then the same segmentNum will be returned again, right?

Member Author:

If two updates are written to the same segment file, then the same segmentNum will be returned again, right?

Yes.

Member:

Is it possible to return two numbers, the segmentNum and the index of the trie update included in the segment?

So that when subscribing the SegmentTrieUpdate, we can double check the order:

SegmentTrie{trie: trie1, segmentNum: 10, indexInSegment: 4}
SegmentTrie{trie: trie3, segmentNum: 10, indexInSegment: 6}
SegmentTrie{trie: trie2, segmentNum: 10, indexInSegment: 5}

If we receive the trie updates in the above order, then we know it must be inconsistent with the order in the WAL files.
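The suggested sanity check could look something like the sketch below. Note the IndexInSegment field is the hypothetical addition being proposed here; the PR's actual SegmentTrie carries only the segment number.

```go
package main

import "fmt"

// SegmentTrie as proposed in this thread: the WAL segment number plus the
// update's index within that segment (IndexInSegment is the suggested addition).
type SegmentTrie struct {
	Trie           string
	SegmentNum     int
	IndexInSegment int
}

// inWALOrder reports whether received updates appear in the same order they
// were written to the WAL: segment numbers non-decreasing, and indexes
// strictly increasing within a segment.
func inWALOrder(tries []SegmentTrie) bool {
	for i := 1; i < len(tries); i++ {
		prev, cur := tries[i-1], tries[i]
		if cur.SegmentNum < prev.SegmentNum {
			return false
		}
		if cur.SegmentNum == prev.SegmentNum && cur.IndexInSegment <= prev.IndexInSegment {
			return false
		}
	}
	return true
}

func main() {
	// The out-of-order example from the comment above: index 6 arrives before 5.
	fmt.Println(inWALOrder([]SegmentTrie{
		{"trie1", 10, 4}, {"trie3", 10, 6}, {"trie2", 10, 5},
	})) // false

	fmt.Println(inWALOrder([]SegmentTrie{
		{"trie1", 10, 4}, {"trie2", 10, 5}, {"trie3", 10, 6},
	})) // true
}
```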

Member Author:

It's possible to include the index in PR #2792. But since PR #2792 handles concurrency and ensures the WAL update and ledger state update are in sync, maybe we don't need to include the index to detect inconsistency in #2792.

return ledger.State(hash.DummyHash), nil, fmt.Errorf("cannot get updated trie: %w", err)
}

l.trieUpdateCh <- &wal.SegmentTrie{Trie: trie, SegmentNum: walResult.segmentNum}
Member:

After the update has been written to WAL and a new trie is created, we make a SegmentTrie that contains both the new trie and the segment num that contains the update, and push it to the channel, so that the channel subscriber will process it.

Member Author:

Yes, because tries in checkpoints need to be in sync with tries created by updates in WAL segments. So we need to know both the new trie and the segment num to reuse the trie for checkpointing.

Member:

Would the Set method be called concurrently?

Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?

What I'm afraid of is the following updates being called concurrently:

  1. SegmentTrie{trie: trie1, segmentNum: 10}
  2. SegmentTrie{trie: trie3, segmentNum: 11}
  3. SegmentTrie{trie: trie2, segmentNum: 10}

Then it's possible that the cache will be inconsistent if trie2 is not included in the previousSegment

Member Author:

Would the Set method be called concurrently?

Yes, I asked Maks the same question yesterday and he replied "If fork happens and first collection finishes execution at the same time. Pretty unlikely but can theoretically happen."

Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?

Yes, you're correct. Although this POC doesn't handle Set in parallel, PR #2792 handles parallel execution of Set.

This PR #2770 is just a proof of concept. It was superseded by PR #2792, which supports parallel Set and memory improvements.

}

// Add to active segment tries cache
if segmentNum == c.tries.segmentNum {
Member:

We cache an active segment trie in memory, which contains all the trie root nodes made from the updates written to the same segment file.

When there is a new trie with a different segment num, it means the new trie is written to a different segment file. In this case, we must have seen all the trie root nodes written in the previous segment file, and no more trie nodes will be written to it. So we return the previous segment trie.

When the previous segment trie is returned, it should be consistent with all the tries written in the segment file. Since it's consistent, we can return the previous segment trie instead of reading the tries again from the segment file, which would increase operational memory. This is the main idea of the optimization.
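The caching behavior described above can be sketched as a tiny cache type. The names (segmentTries, Add) and the string stand-in for trie roots are illustrative, not the PR's actual types; the point is the flush-on-segment-change rule that the `segmentNum == c.tries.segmentNum` check implements.

```go
package main

import "fmt"

// segmentTries caches root tries created while a single WAL segment is active.
type segmentTries struct {
	segmentNum int
	tries      []string // stand-in for []*trie.MTrie
}

// Add caches trieRoot for the active segment. When trieRoot belongs to a new
// segment, the finished segment's tries are returned so they can be reused
// for checkpointing instead of being rebuilt by replaying the WAL.
func (c *segmentTries) Add(trieRoot string, segmentNum int) (finished []string, finishedSegment int) {
	if segmentNum == c.segmentNum {
		// Same segment file: keep accumulating.
		c.tries = append(c.tries, trieRoot)
		return nil, -1
	}
	// New segment seen: the previous segment is complete, hand its tries over.
	finished, finishedSegment = c.tries, c.segmentNum
	c.segmentNum = segmentNum
	c.tries = []string{trieRoot}
	return finished, finishedSegment
}

func main() {
	c := &segmentTries{segmentNum: 10}
	c.Add("trie1", 10)
	c.Add("trie2", 10)
	finished, num := c.Add("trie3", 11) // segment 10 finalized
	fmt.Println(num, finished)          // 10 [trie1 trie2]
}
```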

Member Author:

Yes and very well said! 👍

c.Lock()
defer c.Unlock()

err := c.forest.AddTries(tries)
Member:

This adds all the trie nodes of a segment to the forest, and checks whether checkpointing should be triggered by comparing against the checkpoint distance config.

If a checkpoint should be triggered, then it returns the last X trie nodes to be included in a new checkpoint file.
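A minimal sketch of that trigger logic, under the same caveats as before: forest, compactor, and addSegmentTries are simplified placeholders for the real mtrie.Forest and compactor types, and checkpointDistance stands in for the actual config value.

```go
package main

import "fmt"

// forest is a stand-in for mtrie.Forest: it just records root tries.
type forest struct{ tries []string }

func (f *forest) AddTries(ts []string) { f.tries = append(f.tries, ts...) }

// compactor tracks how many finalized segments have accumulated since the
// last checkpoint.
type compactor struct {
	forest             *forest
	segmentsSinceCkp   int
	checkpointDistance int
}

// addSegmentTries adds a finalized segment's tries to the forest and, once
// checkpointDistance segments have accumulated, returns the tries to include
// in a new checkpoint file.
func (c *compactor) addSegmentTries(ts []string) (checkpointTries []string, trigger bool) {
	c.forest.AddTries(ts)
	c.segmentsSinceCkp++
	if c.segmentsSinceCkp < c.checkpointDistance {
		return nil, false
	}
	c.segmentsSinceCkp = 0
	return c.forest.tries, true
}

func main() {
	c := &compactor{forest: &forest{}, checkpointDistance: 2}
	_, trigger := c.addSegmentTries([]string{"trie1"})
	fmt.Println(trigger) // false
	ts, trigger := c.addSegmentTries([]string{"trie2", "trie3"})
	fmt.Println(trigger, ts) // true [trie1 trie2 trie3]
}
```

In the PR itself this happens under a lock (note the c.Lock()/defer c.Unlock() in the diff above), since segment batches arrive from another goroutine.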

Member Author:

Yes, exactly! 👍

Member (@zhangchiqing, Jul 21, 2022):

At this point, we assume that the tries are consistent with those in the WAL files.

Consistent means both the total number and the order are consistent.

I wonder if it's possible to sanity-check that assumption at low cost.

How is the WAL file encoded? Is it possible to scan through a range of WAL files and only read the file header (or footer) about the number of trie nodes, and their root hash without loading the entire file into memory?

Member Author:

How is the WAL file encoded? Is it possible to scan through a range of WAL files and only read the file header (or footer) about the number of trie nodes, and their root hash without loading the entire file into memory?

I think the WAL file uses a proprietary encoding for records. I don't think there is a quick way to scan that information without reading the entire file.

At this point, we assume that the tries are consistent with those in the WAL files.
I wonder if possible that we can sanity check that assumption with low cost.

I think PR #2792 enforces this assumption with communication between the Compactor and Ledger.Set.

Also, TestCompactorAccuracy in PR #2792 passes. It creates ~30 segments and triggers checkpointing every 5 segments. The test expects checkpointed tries to match replayed tries. Replayed tries are tries updated by replaying all WAL segments (from segment 0, ignoring prior checkpoints) up to the checkpoint number. This verifies that checkpointed tries are a snapshot of segments at a segment boundary.

TestCompactorConcurrency also tests Set in parallel.


go func() {
walChan <- l.wal.RecordUpdate(trieUpdate)
segmentNum, err := l.wal.RecordUpdate(trieUpdate)
Member:

Writing to WAL files happens concurrently, so we need a way to ensure the order of trie updates written to the file is consistent with the order of updates pushed to trieUpdateCh. The current implementation cannot guarantee that.

Member Author:

The current implementation can not guarantee that

The current implementation is not PR #2770, because it was superseded by PR #2792 days ago.

In PR #2792, when Compactor receives trieUpdate from the channel, Compactor

  • writes the update to the WAL,
  • signals to ledger.Set when the WAL update is completed,
  • waits for trie update completion (adding the new trie to the ledger state)

Because of these steps, the order of trie updates and WAL updates is consistent.
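The handshake described above can be sketched as follows. All names here (trieUpdateRequest, compactorLoop, set) are illustrative stand-ins, and the sketch is simplified: the real #2792 Compactor also waits for ledger-state completion before taking the next update, which is only noted in a comment here.

```go
package main

import "fmt"

// trieUpdateRequest carries an update to the compactor plus a channel the
// compactor uses to signal that the WAL write has completed.
type trieUpdateRequest struct {
	update  string
	walDone chan struct{}
}

// compactorLoop serializes WAL writes: for each request it writes the update
// to the WAL, then signals the waiting set call. (The real code additionally
// waits for the ledger-state update before taking the next request.)
func compactorLoop(requests <-chan trieUpdateRequest, wal *[]string) {
	for req := range requests {
		*wal = append(*wal, req.update) // 1. write update to WAL
		close(req.walDone)              // 2. signal that the WAL update completed
	}
}

// set mimics Ledger.Set: it sends the update to the compactor and blocks
// until the WAL write is done before updating the in-memory ledger state.
func set(requests chan<- trieUpdateRequest, state *[]string, update string) {
	req := trieUpdateRequest{update: update, walDone: make(chan struct{})}
	requests <- req
	<-req.walDone                   // wait for the WAL write
	*state = append(*state, update) // 3. then update the ledger state
}

func main() {
	requests := make(chan trieUpdateRequest)
	var wal, state []string
	go compactorLoop(requests, &wal)

	set(requests, &state, "update1")
	set(requests, &state, "update2")
	close(requests)

	fmt.Println(wal)   // [update1 update2]
	fmt.Println(state) // [update1 update2]
}
```

Because the compactor goroutine processes requests one at a time, WAL order matches the order in which set calls complete, which is the consistency property being discussed.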

Member Author (@fxamacker):

Closing this because PR #2792 was approved. I kept this open just in case Maks and Ramtin discovered a blocker requiring us to reconsider using some of this approach.

@fxamacker fxamacker closed this Jul 26, 2022
@fxamacker fxamacker changed the title [Execution Node] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state [EN Performance] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state Aug 10, 2022
@fxamacker fxamacker deleted the fxamacker/reuse-mtrie-state-for-checkpointing branch February 2, 2024 22:37