Data curation for DAPT (Domain Adaptive Pre-Training)

ChipNeMo is a chip-design domain-adapted LLM. Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pre-training, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained on more than 20B tokens of domain-specific chip design data, including code, documents, etc., and then fine-tuned with instruction datasets drawn from design data as well as external sources. Evaluations of the resulting domain-adapted ChipNeMo model demonstrate that domain-adaptive pre-training of language models can lead to superior performance on domain-related downstream tasks compared to the base Llama 2 counterparts, without degrading generic capabilities.

Here, we share a tutorial with best practices on data curation for DAPT (domain-adaptive pre-training) for a ChipNeMo-like code generation use case.

  • In this tutorial, we will leverage chip-domain/hardware datasets from open-source GitHub repositories (./code/sources/github_repos.jsonl), wiki URLs (./code/sources/wikipedia_urls.jsonl), and academic papers (./code/sources/arxiv_urls.jsonl).

  • ./code/data contains curated data after processing

The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine or a cluster. This playbook utilizes specific tools and techniques. First, we convert all files to text format (if they are not already text), compress the files on disk, add metadata, and convert them to JSONL (./data/raw/). Then, we leverage NeMo Curator to mine high-quality text at scale from code-generation corpora, using its capabilities to extract text, identify code file types, fix Unicode errors, filter for quality with heuristics, deduplicate, and redact personal information. Finally, we provide steps to blend and shuffle the data sources for continued pre-training.
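
To make the pipeline concrete, the sketch below strings together two of the curation steps mentioned above (Unicode fixing and a simple heuristic filter). It assumes the DocumentDataset, Sequential, Modify, ScoreFilter, UnicodeReformatter, and WordCountFilter APIs found in recent NeMo Curator releases; the input/output paths and the min_words threshold are illustrative rather than the values used in main.py.

# Minimal curation sketch; the full pipeline in main.py includes more steps.
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Load JSONL records produced by the extraction step (illustrative path).
dataset = DocumentDataset.read_json("data/raw/github", add_filename=True)

curation_steps = Sequential([
    Modify(UnicodeReformatter()),                # fix Unicode errors via ftfy
    ScoreFilter(WordCountFilter(min_words=10)),  # drop very short documents
])

curated = curation_steps(dataset)
curated.to_json("data/curated", write_to_filename=True)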

Hardware Requirements

  • This playbook can run entirely on a CPU. If you have GPUs, the PII redaction step will be accelerated using them, and the GPU-only fuzzy and semantic deduplication steps become available.
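
Under the hood, running on CPU versus GPU typically comes down to which Dask cluster backs the pipeline. The snippet below is a hypothetical illustration of that choice using dask.distributed and dask-cuda; main.py may set this up differently (for example through NeMo Curator's client helpers).

# Hypothetical CPU/GPU backend selection; not the exact logic in main.py.
from dask.distributed import Client, LocalCluster

def make_client(device: str = "cpu") -> Client:
    if device == "gpu":
        # Requires dask-cuda and a RAPIDS-capable GPU; one worker per GPU.
        from dask_cuda import LocalCUDACluster
        return Client(LocalCUDACluster())
    # CPU path: a standard local Dask cluster.
    return Client(LocalCluster())

client = make_client("cpu")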

Walkthrough

We will use the datasets in the dapt-curation/code/data folder to illustrate data curation through this pipeline. Specifically, sample data is collected in the following locations (see the inspection sketch after the list):

  • ./data/raw/github (we clone GitHub repos, extract text from each file, and convert it to JSONL)
  • ./data/raw/arxiv_pdfs (we extract data from PDFs, convert it to text, and store it as JSONL files)
  • ./data/raw/wikipedia (we extract data from HTML pages, parse and convert them to text, and store them as JSON files)
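
To sanity-check these intermediate files, you can peek at a few records with the standard library. The path and the "text" field below are assumptions about what the extraction step emits, not guarantees.

# Inspect a handful of records from one raw JSONL file (illustrative path).
import json
from pathlib import Path

sample_file = next(Path("data/raw/github").glob("*.jsonl"))
with open(sample_file) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))          # document text plus metadata fields
        print(record.get("text", "")[:200])   # first characters of the text, if present
        if i == 2:
            break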

The tutorial follows the steps below:

  • Step 1: Install requirements and import libraries
  • Step 2: Download the data from online sources (GitHub repos, wiki URLs, arXiv PDFs), extract metadata and convert to JSONL
  • Step 3: Load the dataset
  • Step 4: Examine the file types and sizes (optional)
  • Step 5: Run the data curation pipeline with NeMo Curator
    • File type identification and separation
    • Document-level exact deduplication
    • Heuristic-based quality filtering (number of lines, word count, top N-grams, etc.)
    • Unicode error fixing via ftfy
    • PII redaction
    • GPU-accelerated fuzzy and semantic deduplication
  • Step 6: Save the filtered and curated data
  • Step 7: Blend datasets and shuffle (see the sketch after this list)
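
As a rough sketch of the final blending and shuffling step, the snippet below assumes the blend_datasets helper and the Shuffle module exposed by recent NeMo Curator releases; the paths, target size, and sampling weights are illustrative, not the ratios used in main.py.

# Illustrative Step 7 sketch; actual paths and mixing ratios live in main.py.
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset

github = DocumentDataset.read_json("data/curated/github", add_filename=True)
arxiv = DocumentDataset.read_json("data/curated/arxiv_pdfs", add_filename=True)
wiki = DocumentDataset.read_json("data/curated/wikipedia", add_filename=True)

# Blend the curated sources into one corpus with chosen sampling weights.
blended = nc.blend_datasets(
    target_size=20_000,                  # illustrative target document count
    datasets=[github, arxiv, wiki],
    sampling_weights=[0.6, 0.2, 0.2],    # illustrative mixing ratios
)

# Shuffle the blended documents before writing the pre-training corpus.
shuffled = nc.Shuffle(seed=42)(blended)
shuffled.to_json("data/blended_shuffled")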

Usage

After installing the NeMo Curator package, install the dependencies and run:

cd code
pip install -r requirements.txt
python main.py --device "gpu"

This will download the chip-design-related datasets and begin the data curation pipeline. Please use --device "gpu" to enable semantic and fuzzy deduplication, which require a GPU.