This repository has been archived by the owner on Apr 6, 2023. It is now read-only.

Process Wikipedia dump for loading into a search system #303

Open
data-sync-user opened this issue Jan 7, 2022 · 0 comments

Comments

@data-sync-user
Collaborator

data-sync-user commented Jan 7, 2022

Implement a way to load the contents of one of Wikipedia’s XML site dumps into our search index.

  • Download the latest “pages” XML dump, if not already available (see the download sketch after this list).
  • Stream the contents of that file into a decompressor.
  • Stream the decompressed XML through a parser to extract documents.
  • Insert those documents into the search engine (a sketch of the decompress/parse/index steps follows the note below).
  • Package the search index into a single file for distribution.
  • Upload that file to a cloud storage location (a packaging/upload sketch follows the documentation link below).
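
A minimal sketch of the download step, assuming Python with the requests library; the enwiki URL and local filename below are illustrative and should be checked against the current dump listing:

```python
import os

import requests

# URL layout assumed from dumps.wikimedia.org; verify against the current dump listing.
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

def download_dump(url=DUMP_URL, dest="enwiki-latest-pages-articles.xml.bz2"):
    """Stream the dump to disk in chunks so it is never held in memory."""
    if os.path.exists(dest):                    # "if not already available"
        return dest
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
                out.write(chunk)
    return dest
```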

Note that the Wikipedia dumps are tens of gigabytes, and cannot be loaded into memory. They should be processed using streaming techniques so as not to overwhelm the system running the task.
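
A minimal sketch of the decompress → parse → index steps, assuming Python’s standard bz2 and xml.etree.ElementTree modules and the official elasticsearch client with its streaming_bulk helper; the index name, namespace string, and localhost URL are assumptions for illustration, not decisions recorded in this issue:

```python
import bz2
import xml.etree.ElementTree as ET

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# MediaWiki export namespace; the exact version suffix varies between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path, index="wikipedia"):
    """Yield one bulk action per <page>, never holding the whole dump in memory."""
    with bz2.open(path, "rb") as stream:                        # streaming decompression
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)                                 # keep the root so we can prune it
        for event, elem in context:
            if event == "end" and elem.tag == NS + "page":
                yield {
                    "_index": index,
                    "_source": {
                        "title": elem.findtext(NS + "title", default=""),
                        "text": elem.findtext(f"{NS}revision/{NS}text", default=""),
                    },
                }
                root.clear()                                    # drop processed pages from memory

def load(path, es_url="http://localhost:9200"):
    es = Elasticsearch(es_url)
    for ok, item in streaming_bulk(es, iter_pages(path), chunk_size=500):
        if not ok:
            print("failed to index:", item)

if __name__ == "__main__":
    load("enwiki-latest-pages-articles.xml.bz2")
```

Because streaming_bulk consumes the generator lazily, only one chunk of documents is in flight at a time, so memory stays flat regardless of dump size.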

For details about Wikipedia’s XML dumps, see the Wikimedia documentation: https://meta.wikimedia.org/wiki/Data_dumps
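
For the last two items in the list above, one possible approach (a sketch under assumptions, not a decision recorded here) is to take an Elasticsearch snapshot into a filesystem repository, tar that directory into a single artifact, and upload it to object storage. The repository path, snapshot names, bucket, and the choice of S3/boto3 are all illustrative:

```python
import tarfile

import boto3
import requests

ES = "http://localhost:9200"

def package_and_upload(repo_dir="/var/backups/es-wikipedia",
                       bucket="example-search-artifacts"):
    # Register a shared-filesystem snapshot repository; repo_dir must be listed
    # in Elasticsearch's path.repo setting for this call to be accepted.
    requests.put(
        f"{ES}/_snapshot/wikipedia_repo",
        json={"type": "fs", "settings": {"location": repo_dir}},
    ).raise_for_status()

    # Snapshot the wikipedia index and block until the snapshot completes.
    requests.put(
        f"{ES}/_snapshot/wikipedia_repo/wikipedia-latest",
        params={"wait_for_completion": "true"},
        json={"indices": "wikipedia"},
    ).raise_for_status()

    # Package the snapshot directory into a single distributable file.
    archive = "wikipedia-index.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(repo_dir, arcname="wikipedia-index")

    # Upload the archive to cloud storage (S3 via boto3, purely as an example).
    boto3.client("s3").upload_file(archive, bucket, archive)
```

A consumer could later untar the archive into a directory listed in its own path.repo and restore the snapshot through the _restore endpoint.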

┆Issue is synchronized with this Jira Task
┆epic: Set up Elastic Search to be used for local indexing

@data-sync-user data-sync-user changed the title Process Wikipedia dump into for loading into a search system Process Wikipedia dump for loading into a search system Jan 7, 2022