
Optimize the Cadence 1.0 migration #5897

Merged: 27 commits merged into master on May 14, 2024
Conversation

@turbolent (Member) commented May 11, 2024

Work towards onflow/cadence#3297

The execution state extraction currently supports running migrations in the form of one or more payloads-to-payloads functions.

The Cadence 1.0 migration consists of many migration stages (usually 7). Almost all stages migrate accounts in parallel. The individual migrations operate on indexed registers (owner/key-to-value maps) rather than on "flat" lists of payloads, so each migration currently must first group the input payloads by account. In addition, most of the migrations currently work in a copy-on-write fashion: write operations do not mutate in-place, but rather produce a write set, which must be applied to the input payloads to produce new payloads.
This approach results in significant overhead.

This PR refactors the migration so that payloads are grouped by account only once, at the beginning of the migration, into a new data structure simply called "registers" (an owner/key-to-value map that is internally always grouped by account). All migrations are then performed on this data structure in-place, and final payloads are only generated at the end of the migration.
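For illustration, here is a minimal sketch of such an account-grouped register structure. The package name and `NewByAccount` match what appears in the diff below, but the internals here are simplified stand-ins, not the PR's actual implementation:

```go
package registers

import "fmt"

// ByAccount sketches an owner/key-to-value map that is always grouped
// by account, so per-account migrations can run in parallel and mutate
// their slice of the state in-place.
type ByAccount struct {
	byOwner map[string]*AccountRegisters
}

// AccountRegisters holds the registers of a single account.
type AccountRegisters struct {
	owner string
	byKey map[string][]byte
}

func NewByAccount() *ByAccount {
	return &ByAccount{byOwner: map[string]*AccountRegisters{}}
}

// Set mutates in-place instead of producing a write set that must
// later be applied to the input payloads.
func (b *ByAccount) Set(owner, key string, value []byte) {
	account, ok := b.byOwner[owner]
	if !ok {
		account = &AccountRegisters{
			owner: owner,
			byKey: map[string][]byte{},
		}
		b.byOwner[owner] = account
	}
	account.byKey[key] = value
}

// Get looks up a register value; a missing register yields nil.
func (b *ByAccount) Get(owner, key string) ([]byte, error) {
	account, ok := b.byOwner[owner]
	if !ok {
		return nil, fmt.Errorf("unknown owner %x", owner)
	}
	return account.byKey[key], nil
}
```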

Testing with the testnet-core.payloads state, the refactored migration produces the same state commitment, and reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).

To reduce the amount of changes, this PR also removes some of the migrations which are currently unused:

  • Atree register inlining migration
  • Cadence value validation
  • Deduplicate contract names migration

If needed, I can add these back; please let me know!

The PR is best viewed without whitespace changes, commit-by-commit. It is rather large, so I'm happy to schedule a review session. Most of the changes are in the tests.

3 tests of the staged contracts migration failed after the refactor to registers. I think this is because the tests were previously not passing any payloads to InitMigration, even though there are payloads for the accounts (the payloads passed to MigrateAccount). I adjusted the tests in ec4b1b7.
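For context, the account-based migrations follow an init-then-migrate shape along these lines (a rough sketch with hypothetical signatures and a stand-in Payload type; the real interface in flow-go differs):

```go
package migrations

// Payload is a stand-in type for illustration only.
type Payload struct {
	Key   []byte
	Value []byte
}

// AccountBasedMigration is a rough sketch of the init-then-migrate
// pattern referenced above; the actual interface in flow-go differs
// in its exact signatures.
type AccountBasedMigration interface {
	// InitMigration receives all payloads up front, e.g. to build
	// indexes shared across accounts. The failing tests passed
	// payloads only to MigrateAccount, never here.
	InitMigration(allPayloads []Payload, nWorkers int) error

	// MigrateAccount is invoked once per account, potentially in
	// parallel, with just that account's payloads.
	MigrateAccount(owner []byte, accountPayloads []Payload) ([]Payload, error)
}
```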

turbolent added 23 commits May 10, 2024 17:35
@turbolent turbolent requested review from a team May 11, 2024 01:20
@turbolent turbolent self-assigned this May 11, 2024
@codecov-commenter commented May 11, 2024

Codecov Report

Attention: Patch coverage is 49.67949%, with 314 lines in your changes missing coverage. Please review.

Project coverage is 55.60%. Comparing base (957437f) to head (c24a98e).
Report is 121 commits behind head on master.

Files Patch % Lines
...execution-state-extract/execution_state_extract.go 10.44% 119 Missing and 1 partial ⚠️
...d/util/ledger/migrations/storage_used_migration.go 0.00% 39 Missing ⚠️
cmd/util/ledger/migrations/cadence.go 61.53% 23 Missing and 2 partials ⚠️
...ledger/migrations/account_size_filter_migration.go 0.00% 20 Missing ⚠️
...util/ledger/migrations/cadence_values_migration.go 48.64% 18 Missing and 1 partial ⚠️
.../migrations/filter_unreferenced_slabs_migration.go 62.79% 15 Missing and 1 partial ⚠️
...til/ledger/migrations/account_storage_migration.go 0.00% 13 Missing ⚠️
...til/ledger/migrations/fix_broken_data_migration.go 65.78% 12 Missing and 1 partial ⚠️
...il/ledger/migrations/staged_contracts_migration.go 79.24% 10 Missing and 1 partial ⚠️
.../util/ledger/migrations/account_based_migration.go 87.69% 5 Missing and 3 partials ⚠️
... and 8 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5897      +/-   ##
==========================================
+ Coverage   55.49%   55.60%   +0.11%     
==========================================
  Files        1133     1127       -6     
  Lines       89496    88838     -658     
==========================================
- Hits        49662    49401     -261     
+ Misses      35062    34732     -330     
+ Partials     4772     4705      -67     
Flag Coverage Δ
unittests 55.60% <49.67%> (+0.11%) ⬆️


@SupunS (Member) left a comment:
👏

@fxamacker (Member) left a comment:

Nice! I mostly focused on non-test code.

Among the various optimizations here, eliminating the repeated payload key decoding and the allocation of new payload slices will reduce GC load. 👍

My only concern is that the optimization uses a map of all accounts, and each account has a map of its payloads. This may require more RAM than expected. Maybe we should try a full migration with this optimization on both testnet and recent mainnet data to avoid surprises.

> reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).

Amazing speedup! 🚀

@@ -24,25 +23,37 @@ func newRegisterID(owner []byte, key []byte) flow.RegisterID {

type AccountsAtreeLedger struct {
Accounts environment.Accounts
temp map[string][]byte
Member commented:

Do we need AccountsAtreeLedger.temp to store zero-address registers?

It seems AccountsAtreeLedger is used as the base storage for atree storage, and atree storage already manages zero-address registers.

@turbolent (Member Author) replied:

I added it because a test was failing. Looking into it again, it is only TestFixSlabsWithBrokenReferences that fails, and it only performs reads of temporaries, no writes.

In c24a98e I removed the support for writing temporaries, but needed to keep support for reading them without producing an error.

I'm not 100% sure this is the right place though, or whether that logic should be moved up/down the stack.
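To make the read/write asymmetry concrete, here is a toy sketch of a ledger that lets reads of zero-address (temporary) registers succeed silently while rejecting writes. The type, method names, and backing store are hypothetical stand-ins, not the PR's AccountsAtreeLedger:

```go
package ledger

import (
	"bytes"
	"fmt"
)

// The zero address marks temporary registers, which atree's own
// storage layer manages.
var zeroOwner = make([]byte, 8)

// readOnlyTempLedger is a toy stand-in (not the PR's
// AccountsAtreeLedger): reads of temporary registers succeed with a
// nil value, while writes to them are rejected.
type readOnlyTempLedger struct {
	store map[string][]byte
}

func (l *readOnlyTempLedger) GetValue(owner, key []byte) ([]byte, error) {
	if bytes.Equal(owner, zeroOwner) {
		// Only reads of temporaries occur here, so report
		// "not found" silently instead of erroring.
		return nil, nil
	}
	return l.store[string(owner)+"|"+string(key)], nil
}

func (l *readOnlyTempLedger) SetValue(owner, key, value []byte) error {
	if bytes.Equal(owner, zeroOwner) {
		return fmt.Errorf("unexpected write to temporary register %x/%x", owner, key)
	}
	l.store[string(owner)+"|"+string(key)] = value
	return nil
}
```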

}

registersByAccount := registers.NewByAccount()
g.Go(func() error {
Member commented:

Maybe this goroutine isn't necessary.

@turbolent (Member Author) replied:

As far as I understand, running the code in g.Go properly handles errors, i.e. an error returned from any function in the group cancels the other running goroutines via the group's context.
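For reference, this is the errgroup behavior being relied on; a minimal standalone example using golang.org/x/sync/errgroup:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// A failing task: its error cancels ctx for the whole group.
	g.Go(func() error {
		return errors.New("migration failed")
	})

	// A cooperating task: it stops early once the context is canceled.
	g.Go(func() error {
		<-ctx.Done()
		return ctx.Err()
	})

	// Wait returns the first non-nil error from the group.
	if err := g.Wait(); err != nil {
		fmt.Println("got:", err)
	}
}
```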

@turbolent (Member Author) commented:

I re-ran the migration on the test state for both the PR and master, and the maximum resident set size is about 12% higher: 25076957184 vs 21969027072 bytes (≈25.1 GB vs ≈22.0 GB). I think trading some memory increase for a significant time reduction is OK, especially given that reducing the downtime of the network is a priority.
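As an aside, for anyone reproducing the measurement: on Unix-like systems the maximum resident set size can also be captured in-process via getrusage (a sketch; note that the unit of Maxrss differs by OS — kilobytes on Linux, bytes on macOS):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	// ru.Maxrss is in kilobytes on Linux and in bytes on macOS.
	fmt.Println("max RSS:", ru.Maxrss)
}
```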

@j1010001 (Member) commented:

> My only concern is that the optimization uses a map of all accounts, and each account has a map of its payloads. This may require more RAM than expected. Maybe we should try a full migration with this optimization on both testnet and recent mainnet data to avoid surprises.

@fxamacker so far, all migrations I ran on both TN and MN state used < 50% of the available RAM, so we should be OK, but I will run a MN state migration to be sure.

@fxamacker (Member) commented:

> I re-ran the migration on the test state for both the PR and master, and the maximum resident set size is about 12% higher: 25076957184 vs 21969027072 bytes.

@turbolent Thanks for checking! 👍 Only 12% increase in maxRSS using the test state is amazing given the speedup! 🎉

> I think trading some memory increase for a significant time reduction is OK, especially given that reducing the downtime of the network is a priority.

Me too! 👍 I asked Jan two weeks ago on Slack if we can keep a VM with larger RAM in case migration optimizations need extra RAM to improve speed.

During just ~2 weeks last year, the payload count for a large account increased by 100+ million, so I think it's good to look for surprises by testing with higher counts of accounts, payloads, etc.

My concern is the input data reaching some larger threshold in the Go program that causes an unexpectedly high spike in maxRSS.

> @fxamacker so far, all migrations I ran on both TN and MN state used < 50% of the available RAM, so we should be OK, but I will run a MN state migration to be sure.

@j1010001 Thanks for keeping extra RAM for optimizations like this and also for running migrations on MN state! 👍

@turbolent turbolent added this pull request to the merge queue May 14, 2024
@turbolent turbolent removed this pull request from the merge queue due to a manual request May 14, 2024