Write a way to load the contents of one of Wikipedia’s XML site dumps into our search index.
1. Download the latest “pages” XML dump (if not already available).
2. Stream the contents of that file into a decompressor.
3. Stream the decompressed XML through a parser to extract documents.
4. Insert those documents into the search engine.
5. Package the search index into a single file for distribution.
6. Upload that file to a cloud storage location.
Note that the Wikipedia dumps are tens of gigabytes, and cannot be loaded into memory. They should be processed using streaming techniques so as not to overwhelm the system running the task.
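A minimal sketch of steps 2–4 is below. It assumes a bz2-compressed “pages” dump, an Elasticsearch cluster on localhost, and an index named `wikipedia`; the filename, host, index name, and document fields are all placeholder assumptions, not project settings. The dump is decompressed and parsed incrementally, so only one page is held in memory at a time.

```python
"""Sketch: stream a Wikipedia pages XML dump (.xml.bz2) into Elasticsearch
without loading the whole file into memory."""
import bz2
import xml.etree.ElementTree as ET

from elasticsearch import Elasticsearch, helpers

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed dump filename
INDEX_NAME = "wikipedia"                             # assumed index name


def extract_pages(path):
    """Yield (title, text) pairs one page at a time via incremental parsing."""
    with bz2.open(path, "rb") as stream:             # decompress as we read
        for _, elem in ET.iterparse(stream, events=("end",)):
            # Tag names carry the MediaWiki export namespace, so match by suffix.
            if elem.tag == "page" or elem.tag.endswith("}page"):
                title = elem.findtext(".//{*}title", default="")
                text = elem.findtext(".//{*}text", default="")
                yield title, text
                elem.clear()                          # free memory for processed pages


def actions(path):
    """Wrap each page in a bulk-index action for the helpers API."""
    for title, text in extract_pages(path):
        yield {"_index": INDEX_NAME, "_source": {"title": title, "text": text}}


if __name__ == "__main__":
    es = Elasticsearch("http://localhost:9200")       # assumed local cluster
    # streaming_bulk sends documents in batches as the generator produces them,
    # so memory use stays flat regardless of dump size.
    for ok, result in helpers.streaming_bulk(es, actions(DUMP_PATH)):
        if not ok:
            print("Failed to index a document:", result)
```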
For details about Wikipedia’s XML dumps, see their documentation: https://meta.wikimedia.org/wiki/Data_dumps
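For the last two steps (packaging the index into a single file and uploading it to cloud storage), one possible sketch is below. It assumes the built index lives in a local directory and the destination is an S3 bucket; the directory, archive name, and bucket are placeholders.

```python
"""Sketch: package a built index directory into a single archive and
upload it to cloud storage (S3 assumed)."""
import tarfile

import boto3

INDEX_DIR = "/var/lib/search-index"        # assumed location of the built index
ARCHIVE_PATH = "wikipedia-index.tar.gz"    # single distributable file
BUCKET = "example-search-artifacts"        # assumed S3 bucket name

# Create a compressed archive of the index directory.
with tarfile.open(ARCHIVE_PATH, "w:gz") as archive:
    archive.add(INDEX_DIR, arcname="wikipedia-index")

# Upload the archive; upload_file streams the file from disk in parts,
# so the archive is never read fully into memory.
s3 = boto3.client("s3")
s3.upload_file(ARCHIVE_PATH, BUCKET, ARCHIVE_PATH)
```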
┆Issue is synchronized with this Jira Task
┆epic: Set up Elastic Search to be used for local indexing