Tutorials

The tutorials below demonstrate the features of NeMo Curator. They are meant to provide a coding foundation for building applications that consume the data NeMo Curator curates.

Get Started

We recommend starting with the following tutorials to become familiar with NeMo Curator's core functionality and to see what a data curation pipeline might look like:

  1. tinystories, which overviews core functionalities such as downloading, filtering, PII removal and exact deduplication.
  2. peft-curation, which overviews operations suitable for curating small-scale datasets which are used for task-specific fine-tuning.
  3. synthetic-data-hello-world, which overviews basic synthetic data generation facilities for interfacing with external models such as Nemotron-4 340B Instruct.
  4. peft-curation-with-sdg, which combines data processing operations and synthetic data generation using Nemotron-4 340B Instruct or LLaMa 3.1 405B Instruct into a single pipeline. This tutorial also demonstrates advanced functions such as reward score assignment via Nemotron-4 340B Reward, as well as semantic deduplication to remove semantically similar real or synthetic records.
  5. pretraining-data-curation, which overviews a data curation pipeline for creating LLM pretraining datasets at scale.
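
As a rough illustration of the kind of stages the tinystories tutorial covers (filtering, PII removal, and exact deduplication), the sketch below chains three small steps using only the Python standard library. It does not use NeMo Curator's API: the function names, the word-count threshold, the email-only PII pattern, and the sample documents are all hypothetical and chosen for illustration. Refer to the tutorials themselves for the actual modules.

```python
import hashlib
import re

# NOTE: every name below is hypothetical. This is not NeMo Curator's API,
# only a standard-library sketch of the stages a curation pipeline chains.

# Simplified email pattern, standing in for a real PII detector (assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def word_count_filter(docs, min_words=3):
    """Keep documents with at least `min_words` whitespace-separated words."""
    return [d for d in docs if len(d.split()) >= min_words]


def redact_emails(docs, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return [EMAIL_RE.sub(placeholder, d) for d in docs]


def exact_dedup(docs):
    """Drop documents whose text hashes identically to an earlier document."""
    seen = set()
    unique = []
    for d in docs:
        digest = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def curate(docs):
    """Run the three stages in order: filter, redact, deduplicate."""
    return exact_dedup(redact_emails(word_count_filter(docs)))


docs = [
    "Once upon a time there was a tiny robot.",
    "Too short",
    "Write to me at sam@example.com any time.",
    "Once upon a time there was a tiny robot.",
]
curated = curate(docs)
for doc in curated:
    print(doc)
```

Hashing each document and keeping only the first occurrence of each digest is one common way to do exact deduplication; running it after redaction means records that differ only in the redacted field also collapse into one.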

List of Tutorials

| Tutorial | Description | Additional Resources |
| --- | --- | --- |
| pretraining-data-curation | Demonstrates an accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
| pretraining-vietnamese-data-curation | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
| dapt-curation | Data curation sample for domain-adaptive pre-training (DAPT), focusing on ChipNeMo data curation as an example | Blog post |
| distributed_data_classification | Demonstrates data domain and data quality classification at scale in a distributed environment | |
| nemotron_340B_synthetic_datagen | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage Nemotron-4 340B Instruct for generating synthetic preference data | |
| nemo-retriever-synthetic-data-generation | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage NIM models for generating synthetic data and to perform data quality assessment on the generated data using LLM-as-judge and embedding-model-as-judge. The generated data can be used to evaluate retrieval/RAG pipelines | |
| peft-curation | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | Blog post |
| peft-curation-with-sdg | Demonstrates a pipeline to leverage external models such as Nemotron-4 340B Instruct for synthetic data generation, data quality annotation via Nemotron-4 340B Reward, as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases | Use this data to fine-tune your own model |
| single_node_tutorial | A comprehensive example to demonstrate running various NeMo Curator functionalities locally | |
| synthetic-data-hello-world | An introductory example of synthetic data generation using NeMo Curator | |
| synthetic-preference-data | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic preference data | |
| synthetic-retrieval-evaluation | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic data to evaluate retrieval pipelines | |
| tinystories | A comprehensive example of curating a small dataset to use for model pre-training | Blog post |
| zyda2-tutorial | A comprehensive tutorial on how to reproduce the Zyda2 dataset with NeMo Curator | Nvidia blog post, Zyphra blog post |