Description
What change would you like to see?
Migration 0042 should instead be converted to start a background job. The background job would then migrate each crawl and set a version on the crawl object. Migrating a crawl means reimporting its pages with the proper data.
The background job would do something like the following:

```
for each crawl where version != 2 and not_migrating:
    if pages.find_all(filename == null) == 0:
        crawl.set(version, 2)  # already migrated
    else:
        # reimport pages for crawl
        pages.re_add_crawl_pages(crawl.id)
        crawl.set(version, 2)
```
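The loop above could be sketched in Python roughly as follows. This is an in-memory illustration only: `migrate_crawls`, `re_add_crawl_pages`, and the dict-based `crawls`/`pages` collections are hypothetical stand-ins, not the actual Browsertrix data model (which would query MongoDB):

```python
def re_add_crawl_pages(pages, crawl_id):
    """Stand-in for reimporting a crawl's pages with filename data."""
    for page in pages:
        if page["crawl_id"] == crawl_id and page["filename"] is None:
            page["filename"] = f"{crawl_id}.wacz"  # placeholder value

def migrate_crawls(crawls, pages):
    """Set version=2 on each unmigrated crawl, reimporting pages if needed."""
    for crawl in crawls:
        # Skip crawls already migrated or currently being migrated.
        if crawl.get("version") == 2 or crawl.get("migrating"):
            continue
        # If no pages are missing a filename, the crawl is effectively
        # migrated already and only needs the version bump.
        needs_reimport = any(
            p["crawl_id"] == crawl["id"] and p["filename"] is None
            for p in pages
        )
        if needs_reimport:
            re_add_crawl_pages(pages, crawl["id"])
        crawl["version"] = 2  # mark as migrated either way
```

In a real deployment this loop would run inside the background job, so the 0042 migration itself only has to launch the job and return.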
- In `/replay.json` API endpoints, the `pagesQueryUrl` and `initialResources` are included only if version == 2 for all crawls.
- New crawl objects would have version set to 2.
- Migration 0042 would start this background job.
- A new endpoint, `/jobs/migrateCrawls`, might be added to also start this job.
- The job retry endpoint can be used to retry the job if it fails?
- `/crawls/migrationNeeded` could be added to check if any crawls need migration.
- Self-deployment docs should be updated to mention this migration and how to check that it succeeds.
Context
Since migration 0042 (adding filenames and other data to pages) may potentially take a long time, and is an optimization for collections, we can instead convert the slow-running migration 0042 into a background job.
This would allow the 1.14 migration to complete quickly, while crawls are migrated in the background. Unmigrated crawls can still be added to collections; they will just load slightly slower until they are migrated.
This is a trade-off between a faster migration to 1.14 and immediately migrating existing crawls to take advantage of optimized replay.