Refactor reindex_studio management command to support large instances #235

Open
pomegranited opened this issue Sep 27, 2024 · 3 comments

pomegranited commented Sep 27, 2024

cf comment

Refactor reindex_studio to support incremental index building for large instances.

Part of: openedx/frontend-app-authoring#1334

Requirements:

  • Add a new --reset flag that will create a new index, set its parameters (distinct_attribute, filterable_attributes, etc.), and swap it to become the active index, but not actually index any content.
  • Add a new --init flag that is the same as --reset but only if no index currently exists. If an index exists, it should print a warning saying that "A rebuild of the index is required. Please run ./manage.py cms reindex_studio --experimental [--incremental]"
  • Add a new --incremental flag that will add content to the current active index (NOT creating a temporary index, adding to that, and then swapping them). If there is no current index, it should create one automatically (same as running --reset). This script should be interruptible and resumable. It should also be easy to use and not require carefully specifying which courses to include in what order.
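
The three flags could be declared roughly as follows. This is only a sketch using plain `argparse` so that it runs outside Django; in the actual management command the same `add_argument` calls would presumably live in the command's `add_arguments` method. The function name `build_parser` and the help strings are illustrative, not taken from the existing code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three proposed flags (help text paraphrases the requirements)."""
    parser = argparse.ArgumentParser(prog="reindex_studio")
    parser.add_argument(
        "--reset",
        action="store_true",
        help="Create a new, empty, configured index and make it the active one.",
    )
    parser.add_argument(
        "--init",
        action="store_true",
        help="Like --reset, but only if no index exists yet; otherwise warn.",
    )
    parser.add_argument(
        "--incremental",
        action="store_true",
        help="Add content to the active index; interruptible and resumable.",
    )
    return parser


options = build_parser().parse_args(["--incremental"])
print(options.reset, options.init, options.incremental)
```

Whether the flags should be mutually exclusive is not stated above, so the sketch leaves them independent.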

In "incremental mode", the script should:

  1. Create a list of all courses and libraries, newest first (e.g. sort by ID descending).
    • Note: we want newest first because those are the courses most likely to be searched, and the search results are currently empty. Indexing them first restores functionality to users soonest.
  2. Get a list of "indexed" courses/libraries from the database (create a new "incremental indexes completed" table that stores just the course/library IDs).
  3. Skip any already indexed courses/libraries, then work down the list from (1), indexing each one by upserting documents into the active index (not a temporary index; there won't be one).
  4. As each course/library is completely indexed, add its ID to the "incremental indexes completed" database table.
  5. When all courses/libraries have been indexed, or when --reset is used, erase all rows from the "incremental indexes completed" database table.
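
The steps above can be sketched as a small resumable loop. This is an illustration under stated assumptions, not the proposed implementation: the `completed` set stands in for the "incremental indexes completed" table, `index_one` stands in for whatever upserts one course/library into the active index, and both names are hypothetical.

```python
from typing import Callable, Iterable, Set


def incremental_reindex(
    all_keys: Iterable[str],
    completed: Set[str],
    index_one: Callable[[str], None],
) -> None:
    """Index every key not yet in `completed`, newest first.

    `completed` is a stand-in for the "incremental indexes completed" table;
    `index_one` is a stand-in for upserting one course/library.
    """
    # Step 1: newest first, approximated here by sorting IDs in descending order.
    for key in sorted(all_keys, reverse=True):
        # Steps 2-3: skip anything a previous run already finished.
        if key in completed:
            continue
        index_one(key)
        # Step 4: record completion only after the whole item is indexed, so an
        # interruption mid-item means that item is redone on the next run.
        completed.add(key)
    # Step 5: everything is indexed, so clear the progress records.
    completed.clear()


progress = {"course-3"}  # pretend an earlier run already finished course-3
done = []
incremental_reindex(["course-1", "course-2", "course-3"], progress, done.append)
print(done)      # ['course-2', 'course-1']
print(progress)  # set()
```

Because upserts are idempotent, redoing a partially indexed course after an interruption should be harmless; a real version would persist `completed` in the database rather than in memory.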

Summary:

The existing ./manage.py cms reindex_studio command will create a new search index matching the latest requirements, populate it with data from all courses/libraries in Studio, and then swap it to become the active index. This can be done anytime and works well for smaller instances; there will be no outage of search features during this time as any previously created index continues to be available until the new index is completely ready. This is not suitable for large instances (in terms of content, not users) because it may take many days for the index to complete, and if there's a problem it must start all over from scratch.

The new ./manage.py cms reindex_studio --incremental command will create a new search index matching the latest requirements only if no index exists; it does not delete an existing index. It then populates the active index with content from courses/libraries, a process that can be paused and resumed as needed. This process may take several days. During this time, Studio search will work without errors, but results will be incomplete or missing entirely. This is recommended for large instances (in terms of content).

The new ./manage.py cms reindex_studio --init command is suitable to run during initial instance setup or during an upgrade and will work on any instance type/size. It will set up an empty index that's ready for content, but won't add any content to the index. Users will have to manually run one of the two above commands to populate the index if there are any existing courses/libraries.
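
The --init behaviour amounts to a single branch, sketched below. The helper `init_index` and its parameters are hypothetical; how the command actually detects an existing index is not specified here, so the sketch takes that as a boolean input. The warning text is the one given in the requirements.

```python
from typing import Callable

REBUILD_WARNING = (
    "A rebuild of the index is required. "
    "Please run ./manage.py cms reindex_studio --experimental [--incremental]"
)


def init_index(
    index_exists: bool,
    reset_index: Callable[[], None],
    warn: Callable[[str], None] = print,
) -> bool:
    """Return True if an empty index was created, False if only a warning was issued."""
    if index_exists:
        # An index is already present: leave it alone and tell the operator what to run.
        warn(REBUILD_WARNING)
        return False
    # No index yet: same effect as --reset (create, configure, make active, no content).
    reset_index()
    return True


created = init_index(index_exists=False, reset_index=lambda: print("empty index created"))
print(created)
```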

In any case, as long as the index exists, newly changed content will be added to it as changes are made in Studio. These commands are only necessary for mass indexing of existing content.

Future

We may simplify this and only support the --incremental mode in the future. Or maybe we should just change to that now?

@bradenmacdonald
Contributor

@ormsbee @pomegranited Does the revised spec above for incremental indexing of studio content make sense?

@DanielVZ96

@bradenmacdonald @pomegranited I can't figure out if --incremental should delete the index or not before starting. The following statements from above are a bit of a contradiction to me:

The new ./manage.py cms reindex_studio --incremental command will delete any existing studio search index and create a new search index matching the latest requirements and then swap it to become the active index.

Add a new --incremental flag that will add content to the current active index (NOT creating a temporary index, adding to that, and then swapping them).

But for now I'm assuming that we shouldn't delete the active index because if we do then it is not resumable.

@bradenmacdonald
Contributor

@DanielVZ96 Sorry, you are right. I've updated it. The --incremental mode should not delete the index. In the future, it would be nice to have it only delete the index IF the index format/configuration has changed, but we currently don't have a way to track that.

Projects
Status: In Progress