
Optimize the Cadence 1.0 migration #5897

Merged: 27 commits merged into master on May 14, 2024
Conversation

@turbolent (Member) commented May 11, 2024

Work towards onflow/cadence#3297

The execution state extraction currently supports running migrations in the form of one or more payloads-to-payloads functions.

The Cadence 1.0 migration consists of many migration stages (usually 7). Almost all stages migrate accounts in parallel. The individual migrations operate on indexed registers (owner/key-to-value maps) rather than on "flat" lists of payloads, so each migration currently must first group the input payloads by account. In addition, most of the migrations currently work in a copy-on-write fashion: write operations do not mutate in-place, but rather produce a write set, which must be applied to the input payloads to produce new payloads.
This approach results in significant overhead.

This PR refactors the migration so that payloads are grouped by account only once, at the beginning of the migration, into a new data structure simply called "registers" (an owner/key-to-value map that is internally always grouped by account). All migrations are then performed on this data structure in-place, and final payloads are only generated at the end of the migration.
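For illustration, here is a minimal sketch of such an account-grouped register structure. The package name and `NewByAccount` match what appears in the diff below, but the internals here are simplified stand-ins, not the PR's actual implementation:

```go
package registers

import "fmt"

// ByAccount sketches an owner/key-to-value map that is always grouped
// by account, so per-account migrations can run in parallel and mutate
// their slice of the state in-place.
type ByAccount struct {
	byOwner map[string]*AccountRegisters
}

// AccountRegisters holds the registers of a single account.
type AccountRegisters struct {
	owner string
	byKey map[string][]byte
}

func NewByAccount() *ByAccount {
	return &ByAccount{byOwner: map[string]*AccountRegisters{}}
}

// Set mutates in-place instead of producing a write set that must
// later be applied to the input payloads.
func (b *ByAccount) Set(owner, key string, value []byte) {
	account, ok := b.byOwner[owner]
	if !ok {
		account = &AccountRegisters{
			owner: owner,
			byKey: map[string][]byte{},
		}
		b.byOwner[owner] = account
	}
	account.byKey[key] = value
}

// Get looks up a register value; a missing register yields nil.
func (b *ByAccount) Get(owner, key string) ([]byte, error) {
	account, ok := b.byOwner[owner]
	if !ok {
		return nil, fmt.Errorf("unknown owner %x", owner)
	}
	return account.byKey[key], nil
}
```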

Testing with the testnet-core.payloads state, the refactored migration produces the same state commitment, and reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).

To reduce the amount of changes, this PR also removes some of the migrations which are currently unused:

  • Atree register inlining migration
  • Cadence value validation
  • Deduplicate contract names migration

If needed, I can add these back; please let me know!

The PR is best viewed without whitespace changes, commit-by-commit. It is rather large, so I'm happy to schedule a review session. Most of the changes are in the tests.

3 tests of the staged contracts migration failed after the refactor to registers. I think this is because the tests were previously not passing any payloads to InitMigration, even though there are payloads for the accounts (the payloads passed to MigrateAccount). I adjusted the tests in ec4b1b7.
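For context, the account-based migrations follow an init-then-migrate shape along these lines (a rough sketch with hypothetical signatures and a stand-in Payload type; the real interface in flow-go differs):

```go
package migrations

// Payload is a stand-in type for illustration only.
type Payload struct {
	Key   []byte
	Value []byte
}

// AccountBasedMigration is a rough sketch of the init-then-migrate
// pattern referenced above; the actual interface in flow-go differs
// in its exact signatures.
type AccountBasedMigration interface {
	// InitMigration receives all payloads up front, e.g. to build
	// indexes shared across accounts. The failing tests passed
	// payloads only to MigrateAccount, never here.
	InitMigration(allPayloads []Payload, nWorkers int) error

	// MigrateAccount is invoked once per account, potentially in
	// parallel, with just that account's payloads.
	MigrateAccount(owner []byte, accountPayloads []Payload) ([]Payload, error)
}
```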

turbolent added 23 commits May 10, 2024 17:35
@turbolent turbolent requested review from a team May 11, 2024 01:20
@turbolent turbolent self-assigned this May 11, 2024
@codecov-commenter commented May 11, 2024

Codecov Report

Attention: Patch coverage is 49.67949%, with 314 lines in your changes missing coverage. Please review.

Project coverage is 55.60%. Comparing base (957437f) to head (c24a98e).
Report is 121 commits behind head on master.

Files Patch % Lines
...execution-state-extract/execution_state_extract.go 10.44% 119 Missing and 1 partial ⚠️
...d/util/ledger/migrations/storage_used_migration.go 0.00% 39 Missing ⚠️
cmd/util/ledger/migrations/cadence.go 61.53% 23 Missing and 2 partials ⚠️
...ledger/migrations/account_size_filter_migration.go 0.00% 20 Missing ⚠️
...util/ledger/migrations/cadence_values_migration.go 48.64% 18 Missing and 1 partial ⚠️
.../migrations/filter_unreferenced_slabs_migration.go 62.79% 15 Missing and 1 partial ⚠️
...til/ledger/migrations/account_storage_migration.go 0.00% 13 Missing ⚠️
...til/ledger/migrations/fix_broken_data_migration.go 65.78% 12 Missing and 1 partial ⚠️
...il/ledger/migrations/staged_contracts_migration.go 79.24% 10 Missing and 1 partial ⚠️
.../util/ledger/migrations/account_based_migration.go 87.69% 5 Missing and 3 partials ⚠️
... and 8 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5897      +/-   ##
==========================================
+ Coverage   55.49%   55.60%   +0.11%     
==========================================
  Files        1133     1127       -6     
  Lines       89496    88838     -658     
==========================================
- Hits        49662    49401     -261     
+ Misses      35062    34732     -330     
+ Partials     4772     4705      -67     
Flag Coverage Δ
unittests 55.60% <49.67%> (+0.11%) ⬆️


@SupunS (Member) left a comment:
👏

@fxamacker (Member) left a comment:

Nice! I mostly focused on non-test code.

Among the various optimizations here, eliminating the repeated payload key decoding and the allocation of new payload slices will reduce GC load. 👍

My only concern is that the optimization uses a map of all accounts, and each account has a map of its payloads. This may require more RAM than expected. Maybe we should try a full migration with this optimization on both testnet and recent mainnet data to avoid surprises.

> reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).

Amazing speedup! 🚀

@@ -24,25 +23,37 @@ func newRegisterID(owner []byte, key []byte) flow.RegisterID {

type AccountsAtreeLedger struct {
Accounts environment.Accounts
temp map[string][]byte
Member commented:

Do we need AccountsAtreeLedger.temp to store zero-address registers?

It seems AccountsAtreeLedger is used as the base storage for atree storage, and atree storage already manages zero-address registers.

@turbolent (Member Author) replied:

I added it because a test was failing. Looking into it again, it is only TestFixSlabsWithBrokenReferences that fails, and it only performs reads of temporaries, no writes.

In c24a98e I removed the support for writing temporaries, but needed to keep support for reading them without producing an error.

I'm not 100% sure this is the right place though, or whether that logic should be moved up/down the stack.
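To make the read/write asymmetry concrete, here is a toy sketch of a ledger that lets reads of zero-address (temporary) registers succeed silently while rejecting writes. The type, method names, and backing store are hypothetical stand-ins, not the PR's AccountsAtreeLedger:

```go
package ledger

import (
	"bytes"
	"fmt"
)

// The zero address marks temporary registers, which atree's own
// storage layer manages.
var zeroOwner = make([]byte, 8)

// readOnlyTempLedger is a toy stand-in (not the PR's
// AccountsAtreeLedger): reads of temporary registers succeed with a
// nil value, while writes to them are rejected.
type readOnlyTempLedger struct {
	store map[string][]byte
}

func (l *readOnlyTempLedger) GetValue(owner, key []byte) ([]byte, error) {
	if bytes.Equal(owner, zeroOwner) {
		// Only reads of temporaries occur here, so report
		// "not found" silently instead of erroring.
		return nil, nil
	}
	return l.store[string(owner)+"|"+string(key)], nil
}

func (l *readOnlyTempLedger) SetValue(owner, key, value []byte) error {
	if bytes.Equal(owner, zeroOwner) {
		return fmt.Errorf("unexpected write to temporary register %x/%x", owner, key)
	}
	l.store[string(owner)+"|"+string(key)] = value
	return nil
}
```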

}

registersByAccount := registers.NewByAccount()
g.Go(func() error {
Member commented:

Maybe this goroutine isn't necessary.

@turbolent (Member Author) replied:

As far as I understand, running the code in g.Go properly handles errors, i.e. an error returned from any function in the group cancels the other running goroutines via the group's context.
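For reference, this is the errgroup behavior being relied on; a minimal standalone example using golang.org/x/sync/errgroup:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// A failing task: its error cancels ctx for the whole group.
	g.Go(func() error {
		return errors.New("migration failed")
	})

	// A cooperating task: it stops early once the context is canceled.
	g.Go(func() error {
		<-ctx.Done()
		return ctx.Err()
	})

	// Wait returns the first non-nil error from the group.
	if err := g.Wait(); err != nil {
		fmt.Println("got:", err)
	}
}
```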

@turbolent (Member Author) commented:

I re-ran the migration on the test state for both the PR and master, and the maximum resident set size is about 12% higher: 25076957184 vs 21969027072 bytes (≈25.1 GB vs ≈22.0 GB). I think trading some memory increase for a significant time reduction is OK, especially given that reducing the downtime of the network is a priority.
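As an aside, for anyone reproducing the measurement: on Unix-like systems the maximum resident set size can also be captured in-process via getrusage (a sketch; note that the unit of Maxrss differs by OS — kilobytes on Linux, bytes on macOS):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	// ru.Maxrss is in kilobytes on Linux and in bytes on macOS.
	fmt.Println("max RSS:", ru.Maxrss)
}
```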

@j1010001 (Member) commented:

> My only concern is that the optimization uses a map of all accounts, and each account has a map of its payloads. This may require more RAM than expected. Maybe we should try a full migration with this optimization on both testnet and recent mainnet data to avoid surprises.

@fxamacker so far, all migrations I ran on both TN and MN state used < 50% of the available RAM, so we should be OK, but I will run a MN state migration to be sure.

@fxamacker (Member) commented:

> I re-ran the migration on the test state for both the PR and master, and the maximum resident set size is about 12% higher: 25076957184 vs 21969027072 bytes.

@turbolent Thanks for checking! 👍 Only 12% increase in maxRSS using the test state is amazing given the speedup! 🎉

> I think trading some memory increase for a significant time reduction is OK, especially given that reducing the downtime of the network is a priority.

Me too! 👍 I asked Jan two weeks ago on Slack if we can keep a VM with larger RAM in case migration optimizations need extra RAM to improve speed.

During just ~2 weeks last year, the payload count for a large account increased by 100+ million, so I think it's good to look for surprises by testing with higher counts of accounts, payloads, etc.

My concern is the input data reaching some larger threshold in the Go program that causes an unexpectedly high spike in maxRSS.

> @fxamacker so far, all migrations I ran on both TN and MN state used < 50% of the available RAM, so we should be OK, but I will run a MN state migration to be sure.

@j1010001 Thanks for keeping extra RAM for optimizations like this and also for running migrations on MN state! 👍

@turbolent turbolent added this pull request to the merge queue May 14, 2024
@turbolent turbolent removed this pull request from the merge queue due to a manual request May 14, 2024