
[EN Performance] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state #2770

@fxamacker fxamacker commented Jul 8, 2022

Changes

Avoid creating separate ledger state during checkpointing.

⚠️ DO NOT REVIEW YET. This PR is a work-in-progress POC that might be replaced with a new PR. I will open another PR using an alternate approach that has looser coupling between layers and fewer edge cases to handle, but it requires locks. ⚠️

Updates #2286

Big thanks to @ramtinms for taking a look and providing useful insights.

Impact

Based on EN2 logs (July 8, 2022), reusing ledger state should:

  • reduce operational memory by at least 152GB
  • reduce memory allocations (TBD)
  • reduce checkpoint duration by 24 mins (more than half)

Context

The recent increase in transactions is causing WAL files to be created more frequently, which makes checkpoints happen more often, increases checkpoint file size, and increases ledger state size in memory.

Earlier this year:

  • checkpoint frequency: about 0-2 times per day
  • checkpoint file size: 53GB (if using today's file format)

As of July 8, 2022:

  • checkpoint frequency: about every 2 hours
  • checkpoint file size: 126GB

Design

Goal is to reduce operational RAM, reduce allocations, and speed up checkpointing by not creating a separate ledger state.

To achieve these goals, this PR doesn't create a separate ledger state and:

  • avoids blocking with communications between main ledger state and cached ledger state
  • avoids blocking with time-consuming tasks such as creating checkpoint
  • retains as few cached mtries as possible before checkpointing

Three new goroutines are used by this PR:

activeSegmentTrieCompactor:

  • receives the updated mtrie and segment number from Ledger
  • caches mtries for the active segment
  • sends batched mtries to segmentsTrieCompactor when the active segment is finalized (a new segment is created)

segmentsTrieCompactor:

  • receives batched mtries for finalized segments
  • caches mtries in the forest for finalized segments
  • launches a checkpointing goroutine when enough finalized segments have accumulated

CachedCompactor:

  • starts and stops activeSegmentTrieCompactor and segmentsTrieCompactor
  • calls observers when needed
  • deletes extra checkpoint files
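The three goroutines above can be sketched roughly as a channel pipeline. Everything below is illustrative, not the actual flow-go API: SegmentTrie here is a simplified stand-in (the real code pairs *trie.MTrie with a segment number), and the batching/trigger logic is reduced to its essentials.

```go
package main

import "fmt"

// SegmentTrie pairs a trie root with the WAL segment its update was written to.
// Trie is a string stand-in for the real *trie.MTrie.
type SegmentTrie struct {
	Trie       string
	SegmentNum int
}

// activeSegmentTrieCompactor batches tries belonging to the same active
// segment and forwards the batch once a new segment number is seen
// (i.e. the previous segment is finalized).
func activeSegmentTrieCompactor(in <-chan SegmentTrie, out chan<- []SegmentTrie) {
	var batch []SegmentTrie
	current := -1
	for st := range in {
		if current != -1 && st.SegmentNum != current {
			out <- batch // active segment finalized: flush the batch
			batch = nil
		}
		current = st.SegmentNum
		batch = append(batch, st)
	}
	if len(batch) > 0 {
		out <- batch
	}
	close(out)
}

// segmentsTrieCompactor accumulates finalized segments and triggers
// checkpointing every checkpointDistance segments.
func segmentsTrieCompactor(in <-chan []SegmentTrie, checkpointDistance int, checkpoint func(n int)) {
	finalized := 0
	for batch := range in {
		finalized++
		_ = batch // the real code adds the batch's tries to a forest here
		if finalized >= checkpointDistance {
			checkpoint(finalized)
			finalized = 0
		}
	}
}

func main() {
	updates := make(chan SegmentTrie)
	batches := make(chan []SegmentTrie)

	go activeSegmentTrieCompactor(updates, batches)
	done := make(chan struct{})
	go func() {
		segmentsTrieCompactor(batches, 2, func(n int) {
			fmt.Printf("checkpoint after %d finalized segments\n", n)
		})
		close(done)
	}()

	for _, st := range []SegmentTrie{
		{"trie1", 10}, {"trie2", 10}, {"trie3", 11}, {"trie4", 12},
	} {
		updates <- st
	}
	close(updates)
	<-done // prints: checkpoint after 2 finalized segments
}
```

In the real PR, CachedCompactor owns the lifecycle of both goroutines; here they are just wired directly for brevity.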

Other Approaches

I will open a separate PR with a different approach. It will have looser coupling between layers and fewer edge cases to handle, but the tradeoff is the introduction of one or two locks.

TODO:

  • add more tests
  • handle edge cases
  • handle errors

Avoid creating separate ledger state during checkpointing.

Based on EN2 logs (July 8, 2022), this will
- reduce operational memory by at least 152GB
- reduce memory allocations (TBD)
- reduce checkpoint duration by 24 mins (from 45 mins)
@fxamacker fxamacker self-assigned this Jul 8, 2022
@fxamacker fxamacker marked this pull request as draft July 8, 2022 16:39
@fxamacker fxamacker added the Execution Cadence Execution Team label Jul 12, 2022
@fxamacker fxamacker changed the title [Execution Node] [WIP] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state [Execution Node] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state Jul 13, 2022

go func() {
walChan <- l.wal.RecordUpdate(trieUpdate)
segmentNum, err := l.wal.RecordUpdate(trieUpdate)
Member:

After writing the update to WAL, we return the segment num that the update was written to.

If two updates are written to the same segment file, then the same segmentNum will be returned again, right?

Member Author:

If two updates are written to the same segment file, then the same segmentNum will be returned again, right?

Yes.

Member:

Is it possible to return two numbers, the segmentNum and the index of the trie update included in the segment?

So that when subscribing the SegmentTrieUpdate, we can double check the order:

SegmentTrie{trie: trie1, segmentNum: 10, indexInSegment: 4}
SegmentTrie{trie: trie3, segmentNum: 10, indexInSegment: 6}
SegmentTrie{trie: trie2, segmentNum: 10, indexInSegment: 5}

If we receive the trie updates in the above order, then we know it must be inconsistent with the order in the WAL files.
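The suggested sanity check could look something like the sketch below. Note the IndexInSegment field is the hypothetical addition being proposed here; the PR's actual SegmentTrie carries only the segment number.

```go
package main

import "fmt"

// SegmentTrie as proposed in this thread: the WAL segment number plus the
// update's index within that segment (IndexInSegment is the suggested addition).
type SegmentTrie struct {
	Trie           string
	SegmentNum     int
	IndexInSegment int
}

// inWALOrder reports whether received updates appear in the same order they
// were written to the WAL: segment numbers non-decreasing, and indexes
// strictly increasing within a segment.
func inWALOrder(tries []SegmentTrie) bool {
	for i := 1; i < len(tries); i++ {
		prev, cur := tries[i-1], tries[i]
		if cur.SegmentNum < prev.SegmentNum {
			return false
		}
		if cur.SegmentNum == prev.SegmentNum && cur.IndexInSegment <= prev.IndexInSegment {
			return false
		}
	}
	return true
}

func main() {
	// The out-of-order example from the comment above: index 6 arrives before 5.
	fmt.Println(inWALOrder([]SegmentTrie{
		{"trie1", 10, 4}, {"trie3", 10, 6}, {"trie2", 10, 5},
	})) // false

	fmt.Println(inWALOrder([]SegmentTrie{
		{"trie1", 10, 4}, {"trie2", 10, 5}, {"trie3", 10, 6},
	})) // true
}
```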

Member Author:

It's possible to include the index in PR #2792. But since PR #2792 handles concurrency and ensures the WAL update and ledger state update are in sync, maybe we don't need to include the index to detect inconsistency in #2792.

return ledger.State(hash.DummyHash), nil, fmt.Errorf("cannot get updated trie: %w", err)
}

l.trieUpdateCh <- &wal.SegmentTrie{Trie: trie, SegmentNum: walResult.segmentNum}
Member:

After the update has been written to WAL and a new trie is created, we make a SegmentTrie that contains both the new trie and the segment num that contains the update, and push it to the channel, so that the channel subscriber will process it.

Member Author:

Yes, because tries in checkpoints need to be in sync with tries created by updates in WAL segments. So we need to know both the new trie and the segment num to reuse the trie for checkpointing.

Member:

Would the Set method be called concurrently?

Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?

What I'm afraid of is the following updates being called concurrently:

  1. SegmentTrie{trie: trie1, segmentNum: 10}
  2. SegmentTrie{trie: trie3, segmentNum: 11}
  3. SegmentTrie{trie: trie2, segmentNum: 10}

Then it's possible that the cache will be inconsistent if trie2 is not included in the previousSegment

Member Author:

Would the Set method be called concurrently?

Yes, I asked Maks the same question yesterday and he replied "If fork happens and first collection finishes execution at the same time. Pretty unlikely but can theoretically happen."

Since pushing the trie updates to the channel is concurrent, is it possible that the order of the updates we read from the channel will be different from the order in the WAL file?

Yes, you're correct. Although this POC doesn't handle Set in parallel, PR #2792 handles parallel execution of Set.

This PR #2770 is just a proof of concept. It was superseded by PR #2792, which supports parallel Set and memory improvements.

}

// Add to active segment tries cache
if segmentNum == c.tries.segmentNum {
Member:

We cache an active segment trie in memory, which contains all the trie root nodes made from the updates written to the same segment file.

When there is a new trie with a different segment num, it means the new trie is written to a different segment file. In this case, we must have seen all the trie root nodes written in the previous segment file, and no more trie nodes will be written to it. So we return the previous segment trie.

When the previous segment trie is returned, it should be consistent with all the tries written in the segment file. Since it's consistent, we can return the previous segment trie instead of reading the tries again from the segment file, which would increase operational memory. This is the main idea of the optimization.
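The caching behavior described above can be sketched as a tiny cache type. The names (segmentTries, Add) and the string stand-in for trie roots are illustrative, not the PR's actual types; the point is the flush-on-segment-change rule that the `segmentNum == c.tries.segmentNum` check implements.

```go
package main

import "fmt"

// segmentTries caches root tries created while a single WAL segment is active.
type segmentTries struct {
	segmentNum int
	tries      []string // stand-in for []*trie.MTrie
}

// Add caches trieRoot for the active segment. When trieRoot belongs to a new
// segment, the finished segment's tries are returned so they can be reused
// for checkpointing instead of being rebuilt by replaying the WAL.
func (c *segmentTries) Add(trieRoot string, segmentNum int) (finished []string, finishedSegment int) {
	if segmentNum == c.segmentNum {
		// Same segment file: keep accumulating.
		c.tries = append(c.tries, trieRoot)
		return nil, -1
	}
	// New segment seen: the previous segment is complete, hand its tries over.
	finished, finishedSegment = c.tries, c.segmentNum
	c.segmentNum = segmentNum
	c.tries = []string{trieRoot}
	return finished, finishedSegment
}

func main() {
	c := &segmentTries{segmentNum: 10}
	c.Add("trie1", 10)
	c.Add("trie2", 10)
	finished, num := c.Add("trie3", 11) // segment 10 finalized
	fmt.Println(num, finished)          // 10 [trie1 trie2]
}
```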

Member Author:

Yes and very well said! 👍

c.Lock()
defer c.Unlock()

err := c.forest.AddTries(tries)
Member:

This adds all the trie nodes of a segment to the forest, and checks whether checkpointing should be triggered by comparing against the checkpoint distance config.

If a checkpoint should be triggered, then it returns the last X trie nodes to be included in a new checkpoint file.
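A minimal sketch of that trigger logic, under the same caveats as before: forest, compactor, and addSegmentTries are simplified placeholders for the real mtrie.Forest and compactor types, and checkpointDistance stands in for the actual config value.

```go
package main

import "fmt"

// forest is a stand-in for mtrie.Forest: it just records root tries.
type forest struct{ tries []string }

func (f *forest) AddTries(ts []string) { f.tries = append(f.tries, ts...) }

// compactor tracks how many finalized segments have accumulated since the
// last checkpoint.
type compactor struct {
	forest             *forest
	segmentsSinceCkp   int
	checkpointDistance int
}

// addSegmentTries adds a finalized segment's tries to the forest and, once
// checkpointDistance segments have accumulated, returns the tries to include
// in a new checkpoint file.
func (c *compactor) addSegmentTries(ts []string) (checkpointTries []string, trigger bool) {
	c.forest.AddTries(ts)
	c.segmentsSinceCkp++
	if c.segmentsSinceCkp < c.checkpointDistance {
		return nil, false
	}
	c.segmentsSinceCkp = 0
	return c.forest.tries, true
}

func main() {
	c := &compactor{forest: &forest{}, checkpointDistance: 2}
	_, trigger := c.addSegmentTries([]string{"trie1"})
	fmt.Println(trigger) // false
	ts, trigger := c.addSegmentTries([]string{"trie2", "trie3"})
	fmt.Println(trigger, ts) // true [trie1 trie2 trie3]
}
```

In the PR itself this happens under a lock (note the c.Lock()/defer c.Unlock() in the diff above), since segment batches arrive from another goroutine.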

Member Author:

Yes, exactly! 👍

Member (@zhangchiqing, Jul 21, 2022):

At this point, we assume that the tries are consistent with those in the WAL files.

Consistent means both the total number and the order are consistent.

I wonder if it's possible to sanity-check that assumption at low cost.

How is the WAL file encoded? Is it possible to scan through a range of WAL files and only read the file header (or footer) about the number of trie nodes, and their root hash without loading the entire file into memory?

Member Author:

How is the WAL file encoded? Is it possible to scan through a range of WAL files and only read the file header (or footer) about the number of trie nodes, and their root hash without loading the entire file into memory?

I think the WAL file uses a proprietary encoding for records. I don't think there is a quick way to scan that information without reading the entire file.

At this point, we assume that the tries are consistent with those in the WAL files.
I wonder if possible that we can sanity check that assumption with low cost.

I think PR #2792 enforces this assumption with communication between the Compactor and Ledger.Set.

Also, TestCompactorAccuracy in PR #2792 passes. It creates ~30 segments and triggers checkpointing every 5 segments. The test expects checkpointed tries to match replayed tries. Replayed tries are tries updated by replaying all WAL segments (from segment 0, ignoring prior checkpoints) up to the checkpoint number. This verifies that checkpointed tries are a snapshot of segments at a segment boundary.

TestCompactorConcurrency also tests Set in parallel.


go func() {
walChan <- l.wal.RecordUpdate(trieUpdate)
segmentNum, err := l.wal.RecordUpdate(trieUpdate)
Member:

Writing to WAL files happens concurrently, so we need a way to ensure the order of trie updates written to the file is consistent with the order of updates pushed to trieUpdateCh. The current implementation cannot guarantee that.

Member Author:

The current implementation can not guarantee that

The current implementation is not PR #2770, because it was superseded by PR #2792 days ago.

In PR #2792, when Compactor receives trieUpdate from the channel, Compactor

  • writes the update to the WAL,
  • signals to ledger.Set when the WAL update is completed,
  • waits for trie update completion (adding the new trie to the ledger state)

Because of these steps, the order of trie updates and WAL updates is consistent.
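The handshake described above can be sketched as follows. All names here (trieUpdateRequest, compactorLoop, set) are illustrative stand-ins, and the sketch is simplified: the real #2792 Compactor also waits for ledger-state completion before taking the next update, which is only noted in a comment here.

```go
package main

import "fmt"

// trieUpdateRequest carries an update to the compactor plus a channel the
// compactor uses to signal that the WAL write has completed.
type trieUpdateRequest struct {
	update  string
	walDone chan struct{}
}

// compactorLoop serializes WAL writes: for each request it writes the update
// to the WAL, then signals the waiting set call. (The real code additionally
// waits for the ledger-state update before taking the next request.)
func compactorLoop(requests <-chan trieUpdateRequest, wal *[]string) {
	for req := range requests {
		*wal = append(*wal, req.update) // 1. write update to WAL
		close(req.walDone)              // 2. signal that the WAL update completed
	}
}

// set mimics Ledger.Set: it sends the update to the compactor and blocks
// until the WAL write is done before updating the in-memory ledger state.
func set(requests chan<- trieUpdateRequest, state *[]string, update string) {
	req := trieUpdateRequest{update: update, walDone: make(chan struct{})}
	requests <- req
	<-req.walDone                   // wait for the WAL write
	*state = append(*state, update) // 3. then update the ledger state
}

func main() {
	requests := make(chan trieUpdateRequest)
	var wal, state []string
	go compactorLoop(requests, &wal)

	set(requests, &state, "update1")
	set(requests, &state, "update2")
	close(requests)

	fmt.Println(wal)   // [update1 update2]
	fmt.Println(state) // [update1 update2]
}
```

Because the compactor goroutine processes requests one at a time, WAL order matches the order in which set calls complete, which is the consistency property being discussed.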

Member Author (@fxamacker):

Closing this because PR #2792 was approved. I kept this open just in case Maks and Ramtin discovered a blocker requiring us to reconsider using some of this approach.

@fxamacker fxamacker closed this Jul 26, 2022
@fxamacker fxamacker changed the title [Execution Node] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state [EN Performance] [POC] Reduce operational RAM by 152+ GB and checkpoint duration by 24 mins by reusing ledger state Aug 10, 2022
@fxamacker fxamacker deleted the fxamacker/reuse-mtrie-state-for-checkpointing branch February 2, 2024 22:37