Optimize the Cadence 1.0 migration #5897
Conversation
…ce 1.0 migration code
…ayloads, in-place
… migration from payloads to registers
we can still refactor them if needed
👏
Nice! I mostly focused on non-test code.
Among the various optimizations here, eliminating repeated payload key decoding and the allocation of new payload slices will reduce GC load. 👍
My only concern is that the optimization uses a map of all accounts and each account has a map of its payloads. This may require more RAM than expected. Maybe we should try full migration with this optimization on both testnet and recent mainnet data to avoid surprises.
reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).
Amazing speedup! 🚀
cmd/util/ledger/util/util.go (Outdated)

@@ -24,25 +23,37 @@ func newRegisterID(owner []byte, key []byte) flow.RegisterID {

type AccountsAtreeLedger struct {
	Accounts environment.Accounts
	temp     map[string][]byte
Do we need AccountsAtreeLedger.temp to store zero address registers? It seems AccountsAtreeLedger is used as base storage for atree storage, and atree storage already manages zero address registers.
I added it because a test was failing. Looking into it again, only TestFixSlabsWithBrokenReferences fails, and it only performs reads for temporaries, no writes. In c24a98e I removed the support for writing temporaries, but needed to keep support for reading them without producing an error. I'm not 100% sure this is the right place though, or whether that logic should be moved up/down the stack.
registersByAccount := registers.NewByAccount()
g.Go(func() error {
Maybe this goroutine isn't necessary.
As far as I understand, running the code in g.Go will properly handle errors, i.e. cancel the running goroutines in the group.
I re-ran the migration on the test state for both the PR and master, and the maximum resident set size is about 12% higher, 25076957184 vs 21969027072 bytes. I think trading some memory increase for a significant time reduction is OK, especially given that reducing downtime of the network is a priority.
@fxamacker So far, all migrations I ran on both TN and MN state were using < 50% of the available RAM, so we should be OK, but I will run a MN state migration to be sure.
@turbolent Thanks for checking! 👍 Only 12% increase in maxRSS using the test state is amazing given the speedup! 🎉
Me too! 👍 I asked Jan two weeks ago on Slack if we can keep a VM with larger RAM, in case migration optimizations need extra RAM to improve speed. During just ~2 weeks last year, the payload count for a large account increased by 100+ million, so I think it's good to look for surprises by testing with higher counts of accounts, payloads, etc. My concern is input data reaching some larger threshold in the Go program that causes an unexpectedly high spike in maxRSS.
@j1010001 Thanks for keeping extra RAM for optimizations like this, and also for running migrations on MN state! 👍
Work towards onflow/cadence#3297
The execution state extraction currently supports running migrations in the form of one or more payloads-to-payloads functions.
The Cadence 1.0 migration consists of many migration stages (usually 7). Almost all stages migrate accounts in parallel. The individual migrations operate on indexed registers (owner/key-to-value maps), rather than "flat" lists of payloads. Therefore, currently each migration must group the input payloads by account. In addition, most of the migrations currently migrate in a copy-on-write fashion, where write operations do not mutate in-place, but rather produce a write set, which must be applied to the input payloads to produce new payloads.
This approach results in significant overhead.
This PR refactors the whole migration so that payloads are grouped by account only once, at the beginning of the migration, into a new data structure simply called "registers" (an owner/key-to-value map, internally always grouped by account). All migrations are then performed on this data structure in-place, and the final payloads are only generated at the end of the migration.
Testing with the testnet-core.payloads state, the refactored migration produces the same state commitment, and reduces the time from 7 minutes to 1:15 minutes on my machine (M1 Pro).

To reduce the amount of changes, this PR also currently removes some of the migrations which are unused.
If needed I can add these back, please let me know!
The PR is best viewed without whitespace changes, commit-by-commit. It is rather large, so I'm happy to schedule a review session. Most of the changes are in the tests.
3 tests of the staged contracts migration failed after the refactor to registers. I think this is actually because the tests were previously not passing any payloads to InitMigration, even though there are payloads for the accounts (payloads are passed to MigrateAccount). I adjusted the tests in ec4b1b7.