EdinburghNLP/CommonCrawlProcessing

 
 

CommonCrawlProcessing

Contents

  • download, raw, deduped: Scripts for downloading the data and for creating the .raw.xz and .deduped.xz files, respectively. They are largely based on Christian's pipeline.
  • s3: Scripts for uploading the local CommonCrawl data to AWS.
  • precc: A command-line application that automates the CommonCrawl processing pipeline. It is a wrapper around several scripts, which can also be run separately.
  • language_lists: Files which contain a list of language codes. They are used extensively in the pipeline.
  • LOCATIONS.md: Contains information on where the CommonCrawl data is located on Valhalla.
  • TODO.md: A list of things that I did not manage to finish.
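
To give a rough idea of what the step from a .raw.xz file to a .deduped.xz file involves, here is a minimal Python sketch of line-level deduplication over xz-compressed text. It is an illustration only and not the code in the deduped folder: the function name `dedupe_xz`, the example file names, and the choice of exact-match, first-occurrence deduplication with an in-memory set are all assumptions, and the actual scripts may work differently.

```python
import lzma
from pathlib import Path


def dedupe_xz(raw_path: Path, deduped_path: Path) -> int:
    """Copy unique lines from one xz-compressed text file to another.

    Hypothetical helper: keeps the first occurrence of each line and
    returns the number of lines written.
    """
    seen = set()  # assumption: the set of unique lines fits in memory
    written = 0
    with lzma.open(raw_path, "rt", encoding="utf-8") as src, \
            lzma.open(deduped_path, "wt", encoding="utf-8") as dst:
        for line in src:
            key = line.rstrip("\n")
            if key in seen:
                continue  # exact duplicate of an earlier line
            seen.add(key)
            dst.write(key + "\n")
            written += 1
    return written
```

A call might look like `dedupe_xz(Path("en.raw.xz"), Path("en.deduped.xz"))`, where both file names are placeholders. An in-memory set is only practical for small inputs; data at CommonCrawl scale would likely need hashing or an external-memory approach.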

Languages

  • Shell 75.8%
  • Python 24.2%