Data curation for DAPT (Domain Adaptive Pre-Training)

ChipNeMo is a chip-design domain-adapted LLM. Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pre-training, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained on more than 20B tokens of domain-specific chip design data, including code, documents, etc., and then fine-tuned with instruction datasets drawn from design data as well as external sources. Evaluations of the resulting domain-adapted ChipNeMo model demonstrate that domain-adaptive pre-training of language models can lead to superior performance on domain-related downstream tasks compared to the base Llama 2 counterparts, without degrading generic capabilities.

Here, we share a tutorial with best practices on data curation for DAPT (domain-adaptive pre-training) for a ChipNeMo-like code generation use case.

  • In this tutorial, we will leverage chip-domain/hardware datasets from open-source GitHub repositories (./code/sources/github_repos.jsonl), wiki URLs (./code/sources/wikipedia_urls.jsonl), and academic papers (./code/sources/arxiv_urls.jsonl).

  • ./code/data contains curated data after processing

The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine or a cluster. This playbook utilizes specific tools and techniques. First, we convert all files to text format (if they are not already text), compress the files on disk, add metadata, and convert them to JSONL (./data/raw/). Then, we leverage NeMo Curator to mine high-quality text at scale from code-generation corpora, using its capabilities to extract text, identify code file types, fix Unicode errors, filter for quality with heuristics, deduplicate, and redact personal information. Finally, we provide steps to blend and shuffle the data sources for continued pre-training.
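
To make the pipeline concrete, the sketch below strings together two of the curation steps mentioned above (Unicode fixing and a simple heuristic filter). It assumes the DocumentDataset, Sequential, Modify, ScoreFilter, UnicodeReformatter, and WordCountFilter APIs found in recent NeMo Curator releases; the input/output paths and the min_words threshold are illustrative rather than the values used in main.py.

# Minimal curation sketch; the full pipeline in main.py includes more steps.
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Load JSONL records produced by the extraction step (illustrative path).
dataset = DocumentDataset.read_json("data/raw/github", add_filename=True)

curation_steps = Sequential([
    Modify(UnicodeReformatter()),                # fix Unicode errors via ftfy
    ScoreFilter(WordCountFilter(min_words=10)),  # drop very short documents
])

curated = curation_steps(dataset)
curated.to_json("data/curated", write_to_filename=True)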

Hardware Requirements

  • This playbook can run entirely on a CPU. If you have GPUs, the PII redaction step will be accelerated using them, and the GPU-only fuzzy and semantic deduplication steps become available.
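
Under the hood, running on CPU versus GPU typically comes down to which Dask cluster backs the pipeline. The snippet below is a hypothetical illustration of that choice using dask.distributed and dask-cuda; main.py may set this up differently (for example through NeMo Curator's client helpers).

# Hypothetical CPU/GPU backend selection; not the exact logic in main.py.
from dask.distributed import Client, LocalCluster

def make_client(device: str = "cpu") -> Client:
    if device == "gpu":
        # Requires dask-cuda and a RAPIDS-capable GPU; one worker per GPU.
        from dask_cuda import LocalCUDACluster
        return Client(LocalCUDACluster())
    # CPU path: a standard local Dask cluster.
    return Client(LocalCluster())

client = make_client("cpu")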

Walkthrough

We will use the datasets in the dapt-curation/code/data folder to illustrate data curation through this pipeline. Specifically, sample data is collected in the following locations (see the inspection sketch after the list):

  • ./data/raw/github (we clone GitHub repos, extract text from each file, and convert it to JSONL)
  • ./data/raw/arxiv_pdfs (we extract data from PDFs, convert it to text, and store it as JSONL files)
  • ./data/raw/wikipedia (we extract data from HTML pages, parse and convert them to text, and store them as JSON files)
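
To sanity-check these intermediate files, you can peek at a few records with the standard library. The path and the "text" field below are assumptions about what the extraction step emits, not guarantees.

# Inspect a handful of records from one raw JSONL file (illustrative path).
import json
from pathlib import Path

sample_file = next(Path("data/raw/github").glob("*.jsonl"))
with open(sample_file) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))          # document text plus metadata fields
        print(record.get("text", "")[:200])   # first characters of the text, if present
        if i == 2:
            break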

The tutorial follows the steps below:

  • Step 1: Install requirements and import libraries
  • Step 2: Download the data from online sources (GitHub repos, wiki URLs, arXiv PDFs), extract metadata and convert to JSONL
  • Step 3: Load the dataset
  • Step 4: Examine the file types and sizes (optional)
  • Step 5: Run the data curation pipeline with NeMo Curator
    • File type identification and separation
    • Document-level exact deduplication
    • Heuristic-based quality filtering (number of lines, word count, top N-grams, etc.)
    • Unicode error fixing via ftfy
    • PII redaction
    • GPU-accelerated fuzzy and semantic deduplication
  • Step 6: Save the filtered and curated data
  • Step 7: Blend datasets and shuffle (see the sketch after this list)
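
As a rough sketch of the final blending and shuffling step, the snippet below assumes the blend_datasets helper and the Shuffle module exposed by recent NeMo Curator releases; the paths, target size, and sampling weights are illustrative, not the ratios used in main.py.

# Illustrative Step 7 sketch; actual paths and mixing ratios live in main.py.
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset

github = DocumentDataset.read_json("data/curated/github", add_filename=True)
arxiv = DocumentDataset.read_json("data/curated/arxiv_pdfs", add_filename=True)
wiki = DocumentDataset.read_json("data/curated/wikipedia", add_filename=True)

# Blend the curated sources into one corpus with chosen sampling weights.
blended = nc.blend_datasets(
    target_size=20_000,                  # illustrative target document count
    datasets=[github, arxiv, wiki],
    sampling_weights=[0.6, 0.2, 0.2],    # illustrative mixing ratios
)

# Shuffle the blended documents before writing the pre-training corpus.
shuffled = nc.Shuffle(seed=42)(blended)
shuffled.to_json("data/blended_shuffled")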

Usage

After installing the NeMo Curator package, install the dependencies and run:

cd code
pip install -r requirements.txt
python main.py --device "gpu"

This will download the chip-design-related datasets and begin the data curation pipeline. Please use --device "gpu" to enable semantic and fuzzy deduplication, which require a GPU.