This repository has been archived by the owner on Apr 6, 2023. It is now read-only.

Process Wikipedia dump for loading into a search system #303

Open
data-sync-user opened this issue Jan 7, 2022 · 0 comments

Comments

@data-sync-user
Collaborator

data-sync-user commented Jan 7, 2022

Implement a way to load the contents of one of Wikipedia’s XML site dumps into our search index.

  • Download the latest “pages” XML dump, if not already available (see the download sketch after this list).
  • Stream the contents of that file into a decompressor.
  • Stream the decompressed XML through a parser to extract documents.
  • Insert those documents into the search engine (a sketch of the decompress/parse/index steps follows the note below).
  • Package the search index into a single file for distribution.
  • Upload that file to a cloud storage location (a packaging/upload sketch follows the documentation link below).
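
A minimal sketch of the download step, assuming Python with the requests library; the enwiki URL and local filename below are illustrative and should be checked against the current dump listing:

```python
import os

import requests

# URL layout assumed from dumps.wikimedia.org; verify against the current dump listing.
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

def download_dump(url=DUMP_URL, dest="enwiki-latest-pages-articles.xml.bz2"):
    """Stream the dump to disk in chunks so it is never held in memory."""
    if os.path.exists(dest):                    # "if not already available"
        return dest
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
                out.write(chunk)
    return dest
```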

Note that the Wikipedia dumps are tens of gigabytes, and cannot be loaded into memory. They should be processed using streaming techniques so as not to overwhelm the system running the task.
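
A minimal sketch of the decompress → parse → index steps, assuming Python’s standard bz2 and xml.etree.ElementTree modules and the official elasticsearch client with its streaming_bulk helper; the index name, namespace string, and localhost URL are assumptions for illustration, not decisions recorded in this issue:

```python
import bz2
import xml.etree.ElementTree as ET

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# MediaWiki export namespace; the exact version suffix varies between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path, index="wikipedia"):
    """Yield one bulk action per <page>, never holding the whole dump in memory."""
    with bz2.open(path, "rb") as stream:                        # streaming decompression
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)                                 # keep the root so we can prune it
        for event, elem in context:
            if event == "end" and elem.tag == NS + "page":
                yield {
                    "_index": index,
                    "_source": {
                        "title": elem.findtext(NS + "title", default=""),
                        "text": elem.findtext(f"{NS}revision/{NS}text", default=""),
                    },
                }
                root.clear()                                    # drop processed pages from memory

def load(path, es_url="http://localhost:9200"):
    es = Elasticsearch(es_url)
    for ok, item in streaming_bulk(es, iter_pages(path), chunk_size=500):
        if not ok:
            print("failed to index:", item)

if __name__ == "__main__":
    load("enwiki-latest-pages-articles.xml.bz2")
```

Because streaming_bulk consumes the generator lazily, only one chunk of documents is in flight at a time, so memory stays flat regardless of dump size.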

For details about Wikipedia’s XML dumps, see the Wikimedia documentation: https://meta.wikimedia.org/wiki/Data_dumps
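
For the last two items in the list above, one possible approach (a sketch under assumptions, not a decision recorded here) is to take an Elasticsearch snapshot into a filesystem repository, tar that directory into a single artifact, and upload it to object storage. The repository path, snapshot names, bucket, and the choice of S3/boto3 are all illustrative:

```python
import tarfile

import boto3
import requests

ES = "http://localhost:9200"

def package_and_upload(repo_dir="/var/backups/es-wikipedia",
                       bucket="example-search-artifacts"):
    # Register a shared-filesystem snapshot repository; repo_dir must be listed
    # in Elasticsearch's path.repo setting for this call to be accepted.
    requests.put(
        f"{ES}/_snapshot/wikipedia_repo",
        json={"type": "fs", "settings": {"location": repo_dir}},
    ).raise_for_status()

    # Snapshot the wikipedia index and block until the snapshot completes.
    requests.put(
        f"{ES}/_snapshot/wikipedia_repo/wikipedia-latest",
        params={"wait_for_completion": "true"},
        json={"indices": "wikipedia"},
    ).raise_for_status()

    # Package the snapshot directory into a single distributable file.
    archive = "wikipedia-index.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(repo_dir, arcname="wikipedia-index")

    # Upload the archive to cloud storage (S3 via boto3, purely as an example).
    boto3.client("s3").upload_file(archive, bucket, archive)
```

A consumer could later untar the archive into a directory listed in its own path.repo and restore the snapshot through the _restore endpoint.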

┆Issue is synchronized with this Jira Task
┆epic: Set up Elastic Search to be used for local indexing

@data-sync-user data-sync-user changed the title Process Wikipedia dump into for loading into a search system Process Wikipedia dump for loading into a search system Jan 7, 2022