Migration optimization #10048

Open
5 of 15 tasks
jennijuju opened this issue Jan 18, 2023 · 2 comments
Labels
kind/enhancement Kind: Enhancement P2 P2: Should be resolved

Comments

@jennijuju
Member

Checklist

  • This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
  • I have a specific, actionable, and well motivated improvement to propose.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Improvement Suggestion

During the Shark upgrade, full nodes and other nodes holding more chain state on machines with less RAM had trouble with the migration.

A couple of optimizations were proposed by the PL NetOps team:

  • implement memory maps
  • be able to opt out of the (pre)migration cache
  • opt in to keep the cache without batch flush

Ask:
🤔 about which optimization works best, then implement it.

@jennijuju jennijuju added P2 P2: Should be resolved and removed need/triage labels Jan 18, 2023
@travisperson
Contributor

I went back and listened to our discussion to flesh out the proposed optimizations above, and I think there is really only one concrete suggestion.

  • implement memory maps
  • opt-in to keep the cache without batch flush

These two points boil down to the same thing, as far as I was able to derive from the video notes.

We basically want to cache the state tree root produced by the migration once it completes, so that future calls to HandleStateForks can look up the migrated root and return it immediately. It would probably be best to avoid using the migration cache itself for this, as it is a very large cache and ideally would be cleaned up fairly quickly after a successful migration to reduce memory usage.

Right now we could fairly easily store the migrated state root on the migration structure directly. However, I think we should also consider persisting this value in the chain store, to avoid having to redo the migration work after a restart.

type migration struct {
	upgrade       MigrationFunc
	preMigrations []PreMigration
	cache         *nv16.MemMigrationCache
}
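A minimal sketch of the in-memory idea, not the lotus implementation. `StateRoot`, the simplified `MigrationFunc` signature, and `cachedMigration` are all hypothetical names introduced here for illustration; the real code works with CIDs and a richer migration signature.

```go
package main

import "sync"

// StateRoot stands in for the CID of a state tree root (hypothetical type).
type StateRoot string

// MigrationFunc is a simplified, hypothetical signature: it maps the
// pre-upgrade state root to the migrated one.
type MigrationFunc func(old StateRoot) (StateRoot, error)

// cachedMigration wraps a migration and remembers its result per input root.
type cachedMigration struct {
	upgrade MigrationFunc

	mu      sync.Mutex
	results map[StateRoot]StateRoot // pre-upgrade root -> migrated root
}

func newCachedMigration(f MigrationFunc) *cachedMigration {
	return &cachedMigration{upgrade: f, results: make(map[StateRoot]StateRoot)}
}

// Run returns the remembered migrated root if this input root was already
// migrated; otherwise it runs the migration once and records the result.
func (m *cachedMigration) Run(old StateRoot) (StateRoot, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if migrated, ok := m.results[old]; ok {
		return migrated, nil
	}
	migrated, err := m.upgrade(old)
	if err != nil {
		// Failures are not recorded, so a later call reruns the migration.
		return "", err
	}
	m.results[old] = migrated
	return migrated, nil
}
```

Only the resulting root is kept, which is tiny compared with the migration cache, so the large cache can still be dropped after a successful run. Holding the lock for the whole migration serializes concurrent callers, which is a deliberate simplification in this sketch.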

  • be able to opt out of the (pre)migration cache

This just sounds like opting out of the pre-migration (there is no point running the pre-migration without a cache). I thought this used to be a feature already, but I guess it has since been removed (though I can't even find when it existed). I believe there was an env DISABLE_PRE_MIGRATIONS that we should bring back.


Additionally, to make future improvements to migrations easier, there are a few extra things I think we should be doing:

  • Set up a process during releases to record and store the most recent snapshot right before each network upgrade, so we can easily rerun migrations in the future.
  • Migration guide for node operators (Add migration guide for node operators lotus-docs#483)
    • Reset the datastore from a snapshot to reduce the size of the datastore prior to migration
    • Do not restart the lotus node after pre-migration starts until after the migration is completed.
    • Identify the log lines operators should pay attention to in order to understand their progress/performance during pre-migration & migration
    • Document important configuration / env variables for migrations
  • Additional metrics for migrations
    • cache size
    • cache hit/miss
  • Identify important metrics that we would like to record from all nodes and pass this information off to the NetOps team so they can collect and share it with us.
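To make the cache size and hit/miss metrics concrete, here is a small sketch of a cache that counts its own lookups. `countingCache` and `cacheStats` are hypothetical names, and plain counters are used instead of whatever metrics library the node exports through; it is not modelled on the actual migration cache.

```go
package main

import "sync"

// cacheStats is a snapshot of the counters an operator would want exported.
type cacheStats struct {
	Hits   int
	Misses int
	Size   int
}

// countingCache is a hypothetical stand-in for a migration cache that
// tracks hits, misses and its current number of entries.
type countingCache struct {
	mu      sync.Mutex
	entries map[string]string
	hits    int
	misses  int
}

func newCountingCache() *countingCache {
	return &countingCache{entries: make(map[string]string)}
}

// Load looks up key and records whether the lookup hit or missed.
func (c *countingCache) Load(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	if ok {
		c.hits++
	} else {
		c.misses++
	}
	return v, ok
}

// Store inserts or overwrites an entry.
func (c *countingCache) Store(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = value
}

// Stats returns a consistent snapshot of the counters.
func (c *countingCache) Stats() cacheStats {
	c.mu.Lock()
	defer c.mu.Unlock()
	return cacheStats{Hits: c.hits, Misses: c.misses, Size: len(c.entries)}
}
```

A low hit rate during the final migration would suggest the pre-migration did little useful work, which is exactly the kind of signal worth collecting from node operators.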

Additionally, an env, LOTUS_MIGRATION_MAX_WORKER_COUNT, has been added (#9784) to let operators set a maximum worker count, so that lotus does not use as many workers as there are CPUs.

@arajasek
Contributor

@travisperson Thanks a lot for this detailed synthesis, and for the summary in standup today. Based on feedback we received from the various users of the nv17 migration, I think a list of changes to make in order of impact might look something like this:

  1. Storing the migration result in memory. This is easy to do, as you described above, and addresses a need of a high-priority integration partner.
  2. Provide the option to disable pre-migrations entirely for node operators that are okay with extended out-of-sync time.
  3. Provide the option to have the pre-migration result persisted, so that it (a) becomes restart-resistant, and (b) reduces memory consumption. This is a bit more work than the previous items, but likely the most impactful item. We'll need to test that the performance is acceptable when doing this (i.e. that the pre-migration is still actually useful), and that splitstore doesn't interfere with this.
  4. The migration guide.
  5. Metrics for future innovation.
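The difference between items 1 and 3 is where the result lives. A sketch of the persisted variant, assuming a hypothetical `KVStore` interface in place of the real datastore; `memStore` and `migrateWithStore` are likewise made up for this example and say nothing about how splitstore would interact with it.

```go
package main

// KVStore is a hypothetical, minimal view of a persistent datastore.
type KVStore interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

// memStore is an in-memory KVStore used only to exercise the sketch; a real
// node would back this with an on-disk datastore so it survives restarts.
type memStore map[string]string

func (s memStore) Get(key string) (string, bool) { v, ok := s[key]; return v, ok }
func (s memStore) Put(key, value string)         { s[key] = value }

// migrateWithStore returns the stored result for oldRoot if one exists;
// otherwise it runs migrate and stores the outcome before returning it.
func migrateWithStore(store KVStore, oldRoot string, migrate func(string) (string, error)) (string, error) {
	key := "migration-result/" + oldRoot
	if newRoot, ok := store.Get(key); ok {
		return newRoot, nil
	}
	newRoot, err := migrate(oldRoot)
	if err != nil {
		return "", err
	}
	store.Put(key, newRoot)
	return newRoot, nil
}
```

Because the lookup goes through the store rather than process memory, a second process started against the same store skips the migration work entirely.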
