A collection of tools for extracting Korean and English text for LLM training from the CommonCrawl WET dataset.
Code for *Beyond Line-Level Filtering for the Pretraining Corpora of LLMs*.
Our data pipeline was run with 64 CPU vCores and 512 GB of RAM. This tool requires a system with more than 400 GB of RAM to run stably. On systems with less RAM, you may reduce the degree of parallelism, but at least 50 GB of RAM is required even in the single-process setting.
wget
gunzip
pip install -e .
# downloading fasttext language identification model
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz -O pattern_aware_filtering/utils/lid.176.ftz
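The downloaded lid.176.ftz file is fastText's language-identification model. The snippet below is a minimal sketch of how such a model is typically queried with the fasttext Python package; it is illustrative only and may differ from how the repository's code actually invokes it.

```python
# Minimal sketch: querying the downloaded language-identification model.
# Illustrative only; the repository's filtering code may wrap this differently.
import fasttext

model = fasttext.load_model("pattern_aware_filtering/utils/lid.176.ftz")

def detect_language(text: str) -> tuple[str, float]:
    # fastText's predict() rejects newlines, so collapse the text to one line first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("이 문장은 한국어입니다."))        # expected: ('ko', ...)
print(detect_language("This sentence is in English."))  # expected: ('en', ...)
```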
Our scripts consist of the following steps:

- Preparing baseline data, including downloading the CC WET files.
- Building a hashmap to obtain the count, i.e., the number of documents in a document set that contain a given text line.
- Obtaining intermediate data: a count info file that contains (i) the raw data and (ii) the count information for each text line of the raw data. This intermediate data makes it easy to extract text with Pattern-Aware Line-Level Deduplication under varying categorization thresholds (a minimal sketch follows this list).
- Filtering the data with the intermediate data.
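As a concrete illustration of the count hashmap and the count-info records described above, here is a minimal sketch. It is not the repository's implementation; hashing each line with SHA-1 and the exact record layout are assumptions made for the example.

```python
# Minimal sketch (not the repository's implementation) of the line-count hashmap
# and of the intermediate "count info" records described above.
import hashlib
from collections import Counter

def line_key(line: str) -> str:
    # Hash each normalized line so the hashmap stays compact in memory.
    return hashlib.sha1(line.strip().encode("utf-8")).hexdigest()

def build_count_hashmap(documents):
    """counts[key] = number of documents in the set that contain that line."""
    counts = Counter()
    for doc in documents:                              # doc: list of text lines
        for key in {line_key(line) for line in doc}:   # count each line once per document
            counts[key] += 1
    return counts

def attach_count_info(doc, counts):
    """Pair every raw line with its document count (the intermediate data)."""
    return [(line, counts[line_key(line)]) for line in doc]
```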
Adjust the data save paths defined in the pattern_aware_filtering/utils/constants.py file to match your environment.
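The variable names below are purely illustrative (the real names are whatever constants.py actually defines); the sketch only shows the kind of path constants you would point at your own storage.

```python
# Hypothetical example of path constants -- the actual names and values are
# defined in pattern_aware_filtering/utils/constants.py and will differ.
from pathlib import Path

DATA_ROOT = Path("/mnt/storage/cc-wet")        # where downloaded WET files are kept
INTERMEDIATE_DIR = DATA_ROOT / "count_info"    # where count-info files are written
OUTPUT_DIR = DATA_ROOT / "filtered"            # where filtered corpora end up
```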
python scripts/prepare-ccnet/download_wet_files.py
python scripts/prepare-ccnet/download_wet_paths_file.py

Because of storage constraints, we do not extract all of the English data. We first sample which shards to use and then selectively extract English data only from the sampled shards. This ensures that every experiment uses the same raw data for the training/validation split. Run the sampling script to generate the shard lists for each language:
python scripts/prepare-ccnet/sample_splits.py en
python scripts/prepare-ccnet/sample_splits.py ko
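The sketch below illustrates the idea of deterministic shard sampling, i.e., selecting the same subset of shards on every run. The actual selection logic lives in scripts/prepare-ccnet/sample_splits.py and may use a different scheme; the seed, fraction, and shard naming here are placeholders.

```python
# Illustrative only: deterministic shard sampling so every run selects the same shards.
# The actual selection logic is in scripts/prepare-ccnet/sample_splits.py.
import random

def sample_shards(shard_ids: list[str], fraction: float, seed: int = 42) -> list[str]:
    rng = random.Random(seed)                      # fixed seed -> reproducible shard list
    k = max(1, int(len(shard_ids) * fraction))
    return sorted(rng.sample(shard_ids, k))

all_shards = [f"shard-{i:04d}" for i in range(1000)]
print(sample_shards(all_shards, fraction=0.10)[:5])  # same 10% subset on every run
```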
For English, we only process 10% of the total data because of storage constraints. See the script for more detail.

bash scripts/extract_baseline_files/extract_baseline_en.sh

We extract all Korean documents from all shards. See the script for more detail.
bash scripts/extract_baseline_files/extract_baseline_ko.sh

To ensure that no process runs for longer than 12 hours (our cluster's maximum job allocation time), we employ a two-step strategy for hashing. The goal of this step is to obtain a hashmap spanning 1000 shards. See the script for details and parallelization options.

bash scripts/hashing/hashing_en.sh

The number of Korean documents ranges from 15 million to 25 million per snapshot. We first build a hashmap for each shard file and then merge all of the hash files; a minimal sketch of this merge appears after the command below.

bash scripts/hashing/hashing_ko.sh
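The following is a minimal sketch of merging per-shard hashmaps into a single global hashmap, assuming each shard's counts are serialized as a pickled Counter; the repository's actual on-disk format, directory layout, and merge code may differ.

```python
# Illustrative only: merging per-shard line-count hashmaps into one global hashmap.
# The pickled-Counter format and the hash_outputs/ directory are hypothetical.
import pickle
from collections import Counter
from pathlib import Path

def merge_shard_counts(shard_dir: Path) -> Counter:
    merged = Counter()
    for path in sorted(shard_dir.glob("*.pkl")):   # one serialized Counter per shard
        with path.open("rb") as f:
            merged.update(pickle.load(f))          # Counter.update sums the per-line counts
    return merged

global_counts = merge_shard_counts(Path("hash_outputs/ko"))
```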
Since hashmap lookup is the most memory-consuming part of the pipeline, we obtain the count information first for further analysis.

bash scripts/count_info/extract_en_count_info.sh # English
bash scripts/count_info/extract_ko_count_info.sh # Korean

The following scripts apply PLD / PTF to the English intermediate files. See the scripts for details.
bash scripts/extract-en/run_pld_ablation.sh # for setting r/y/g threshold for PLD
bash scripts/extract-en/run_ptf_ablation.sh # for acquiring 'k' for PTF; we obtain our final version of the data with this script

The following scripts apply PLD / PTF to the Korean intermediate files. See the scripts for details; an illustrative sketch of the r/y/g categorization follows at the end of this section.
bash scripts/extract-ko/run_pld_ablation.sh # for setting r/y/g threshold for PLD
bash scripts/extract-ko/run_ptf_ablation.sh # for acquiring 'k' for PTF; we obtain our final version of the data with this script
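For orientation, here is a hedged sketch of how two count thresholds could split the lines of a count-info record into three categories (called red / yellow / green, matching the r/y/g naming of the ablation scripts). The threshold values and the action taken on each category are placeholders; the actual decision rule is defined by the scripts and the paper.

```python
# Illustrative only: bucketing lines of a count-info record by their document counts.
# The thresholds and any per-bucket action are placeholders, not the repository's
# actual PLD configuration.
def categorize_lines(count_info, red_threshold=100, yellow_threshold=10):
    """count_info: list of (line, count) pairs, as in the intermediate files."""
    buckets = {"red": [], "yellow": [], "green": []}
    for line, count in count_info:
        if count >= red_threshold:        # heavily repeated across documents
            buckets["red"].append(line)
        elif count >= yellow_threshold:   # moderately repeated
            buckets["yellow"].append(line)
        else:                             # (nearly) unique to this document
            buckets["green"].append(line)
    return buckets
```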