NeMo Megatron Doc updates1 (NVIDIA#4633)
* Work on NeMo Megatron OSS documentation

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* NeMo Megatron doc updates

Signed-off-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: David Mosallanezhad <[email protected]>
okuchaiev authored and Davood-M committed Aug 9, 2022
1 parent 0f2d6af commit 9a6a4ae
Showing 14 changed files with 357 additions and 6 deletions.
7 changes: 4 additions & 3 deletions docs/source/index.rst
@@ -40,10 +40,11 @@ NVIDIA NeMo User Guide
:caption: Natural Language Processing
:name: Natural Language Processing

nlp/models
nlp/megatron
nlp/api
nlp/nemo_megatron/intro
nlp/machine_translation/machine_translation
nlp/text_normalization/intro
nlp/api


.. toctree::
:maxdepth: 2
File renamed without changes.
21 changes: 21 additions & 0 deletions docs/source/nlp/nemo_megatron/batching.rst
@@ -0,0 +1,21 @@
.. _batching:

Batching
--------

Batch size is one of the first parameters to tune. For efficiency and convergence reasons, we recommend first maximizing your batch size per GPU so that GPU memory usage is maximized.

NeMo Megatron uses the following concepts.

*Micro batch size* is the number of examples per data-parallel rank. It is controlled by the ``model.micro_batch_size`` parameter.

*Global batch size* = micro_batch_size * data_parallel_size * gradient_accumulation_steps. For details on ``data_parallel_size``, see the :ref:`parallelisms` section; typically it is equal to the number of GPUs being used.
Global batch size is controlled by the ``model.global_batch_size`` parameter.


*Gradient Accumulation*

* Idea: train with large effective batch sizes and a fixed memory footprint, at the cost of additional compute.
* Do k forward and backward passes through the network with different micro-batches, without performing parameter updates in between.
* Update the parameters once after the k passes (see the sketch below).
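
A minimal sketch of how these quantities relate, using illustrative numbers (assumptions, not defaults):

.. code-block:: bash

    # Illustrative numbers only: 2 nodes x 4 GPUs with no tensor/pipeline parallelism,
    # so data_parallel_size = 8.
    MICRO_BATCH_SIZE=4
    DATA_PARALLEL_SIZE=8
    GLOBAL_BATCH_SIZE=64
    # Gradient accumulation steps implied by the formula above:
    echo $(( GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE * DATA_PARALLEL_SIZE) ))   # -> 2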

232 changes: 232 additions & 0 deletions docs/source/nlp/nemo_megatron/gpt/gpt_training.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,232 @@
GPT model training
------------------

GPT is a decoder-only Transformer model.


Quick start
^^^^^^^^^^^
The steps below demonstrate how to train a GPT-style model with NeMo.

Data download & pre-processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::
    Data download, pre-processing and tokenizer training in the example below will take ~3 hours.

**Step 1: Download data**

The step below will download Wikipedia data (around 20GB) and can take several hours.

.. code-block:: bash

    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

**Step 2: Extract raw data**

.. code-block:: bash

    pip install wikiextractor
    python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
    find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl

Now ``train_data.jsonl`` contains our training data in JSON Lines format. We are interested in the data under the ``"text"`` field.
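
As an optional sanity check, you can peek at the first record (``jq`` is installed in Step 3, Option 2 below; any JSON-aware tool works):

.. code-block:: bash

    # Show the "text" field of the first JSON line (first 300 characters).
    head -n 1 train_data.jsonl | jq .text | head -c 300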


**Step 3: Train tokenizer**

Below we consider two options for training data tokenizers: using the pre-built HuggingFace BPE tokenizer, or training and using your own Google Sentencepiece tokenizer.
Note that only the second option allows you to experiment with the vocabulary size.

*Option 1:* Using HuggingFace GPT2 tokenizer files.

With this option we simply download the pre-built vocabulary and merge files for the BPE tokenizer.

.. code-block:: bash

    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.

It comes as a dependency of NeMo, so if you have installed NeMo it should already be available.
Note that training the tokenizer model will also take some time.

.. code-block:: bash

    sudo apt install jq
    jq .text train_data.jsonl >> text_for_tokenizer.txt
    spm_train --input=text_for_tokenizer.txt \
        --model_prefix=spm_32k_wiki \
        --vocab_size=32768 \
        --character_coverage=0.9999 \
        --model_type=bpe \
        --byte_fallback=true \
        --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 \
        --split_digits true

After this is done (it will take a while), you'll have two files: ``spm_32k_wiki.model`` and ``spm_32k_wiki.vocab``, which correspond to the model and the vocabulary.
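
As an optional check (assuming the ``spm_encode`` command-line tool from the Sentencepiece package is on your path), you can encode a sample sentence into subword pieces:

.. code-block:: bash

    # Tokenize a sample sentence with the newly trained model.
    echo "NeMo Megatron is a decoder-only transformer." | \
        spm_encode --model=spm_32k_wiki.model --output_format=piece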

**Step 4: Convert training data into memory map format**

This format makes training more efficient, especially with many nodes and GPUs. This step will also tokenize the data using the tokenizer model from Step 3.

*Option 1:* Using HuggingFace GPT2 tokenizer files.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
        --input=train_data.jsonl \
        --json-keys=text \
        --tokenizer-library=megatron \
        --vocab gpt2-vocab.json \
        --dataset-impl mmap \
        --tokenizer-type GPT2BPETokenizer \
        --merge-file gpt2-merges.txt \
        --output-prefix=hfbpe_gpt_training_data \
        --append-eod \
        --workers=32

*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
        --input=train_data.jsonl \
        --json-keys=text \
        --tokenizer-library=sentencepiece \
        --tokenizer-model=spm_32k_wiki.model \
        --output-prefix=gpt_training_data \
        --append-eod \
        --workers=32
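
In either case, the expected output is an index (``.idx``) plus a binary (``.bin``) file whose common prefix is what ``model.data.data_prefix`` refers to in the training commands below (file names here assume the Sentencepiece option):

.. code-block:: bash

    # List the memory-mapped dataset files produced by the preprocessing script.
    ls gpt_training_data_text_document.*
    # expected: gpt_training_data_text_document.bin  gpt_training_data_text_document.idx
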
Train GPT-style Model
~~~~~~~~~~~~~~~~~~~~~

Once you have prepared the training data and the tokenizer, you are ready to train the model.
The configuration presented below has about 124M parameters and should fit on a single 16GB GPU when using float16.
Let's go!!!

*Option 1:* Using HuggingFace GPT2 tokenizer files.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        --config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
        --config-name=megatron_gpt_config \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        trainer.max_epochs=null \
        trainer.max_steps=300000 \
        trainer.val_check_interval=300 \
        trainer.log_every_n_steps=50 \
        trainer.limit_val_batches=50 \
        trainer.limit_test_batches=50 \
        trainer.accumulate_grad_batches=1 \
        trainer.precision=16 \
        model.micro_batch_size=6 \
        model.global_batch_size=192 \
        model.tensor_model_parallel_size=1 \
        model.pipeline_model_parallel_size=1 \
        model.max_position_embeddings=1024 \
        model.encoder_seq_length=1024 \
        model.hidden_size=768 \
        model.ffn_hidden_size=3072 \
        model.num_layers=12 \
        model.num_attention_heads=12 \
        model.init_method_std=0.021 \
        model.hidden_dropout=0.1 \
        model.layernorm_epsilon=1e-5 \
        model.tokenizer.vocab_file=gpt2-vocab.json \
        model.tokenizer.merge_file=gpt2-merges.txt \
        model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
        model.data.num_workers=2 \
        model.data.seq_length=1024 \
        model.data.splits_string=\'980,10,10\' \
        model.optim.name=fused_adam \
        model.optim.lr=6e-4 \
        model.optim.betas=[0.9,0.95] \
        model.optim.weight_decay=0.1 \
        model.optim.sched.name=CosineAnnealing \
        model.optim.sched.warmup_steps=750 \
        model.optim.sched.constant_steps=80000 \
        model.optim.sched.min_lr=6e-5 \
        exp_manager.resume_if_exists=True \
        exp_manager.resume_ignore_no_checkpoint=True \
        exp_manager.create_checkpoint_callback=True \
        exp_manager.checkpoint_callback_params.monitor=val_loss \
        exp_manager.checkpoint_callback_params.save_top_k=3 \
        exp_manager.checkpoint_callback_params.mode=min \
        exp_manager.checkpoint_callback_params.always_save_nemo=False

*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        --config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
        --config-name=megatron_gpt_config \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        trainer.max_epochs=null \
        trainer.max_steps=300000 \
        trainer.val_check_interval=300 \
        trainer.log_every_n_steps=50 \
        trainer.limit_val_batches=50 \
        trainer.limit_test_batches=50 \
        trainer.accumulate_grad_batches=1 \
        trainer.precision=16 \
        model.micro_batch_size=6 \
        model.global_batch_size=192 \
        model.tensor_model_parallel_size=1 \
        model.pipeline_model_parallel_size=1 \
        model.max_position_embeddings=1024 \
        model.encoder_seq_length=1024 \
        model.hidden_size=768 \
        model.ffn_hidden_size=3072 \
        model.num_layers=12 \
        model.num_attention_heads=12 \
        model.init_method_std=0.021 \
        model.hidden_dropout=0.1 \
        model.layernorm_epsilon=1e-5 \
        model.tokenizer.library=sentencepiece \
        model.tokenizer.model=spm_32k_wiki.model \
        model.data.data_prefix=[1.0,gpt_training_data_text_document] \
        model.data.num_workers=2 \
        model.data.seq_length=1024 \
        model.data.splits_string=\'980,10,10\' \
        model.optim.name=fused_adam \
        model.optim.lr=6e-4 \
        model.optim.betas=[0.9,0.95] \
        model.optim.weight_decay=0.1 \
        model.optim.sched.name=CosineAnnealing \
        model.optim.sched.warmup_steps=750 \
        model.optim.sched.constant_steps=80000 \
        model.optim.sched.min_lr=6e-5 \
        exp_manager.resume_if_exists=True \
        exp_manager.resume_ignore_no_checkpoint=True \
        exp_manager.create_checkpoint_callback=True \
        exp_manager.checkpoint_callback_params.monitor=val_loss \
        exp_manager.checkpoint_callback_params.save_top_k=3 \
        exp_manager.checkpoint_callback_params.mode=min \
        exp_manager.checkpoint_callback_params.always_save_nemo=False

Next, simply launch TensorBoard to monitor training like so:

.. code-block:: bash

    tensorboard --logdir nemo_experiments --bind_all

Next steps
~~~~~~~~~~

Please refer to:

* the :ref:`batching` section for batch size adjustments
* the :ref:`parallelisms` section for understanding the various types of parallelism
* the :ref:`promptlearning` section for details on prompt-tuning and p-tuning

Binary file added docs/source/nlp/nemo_megatron/images/ddp.gif
Binary file added docs/source/nlp/nemo_megatron/images/pnom.gif
Binary file added docs/source/nlp/nemo_megatron/images/pp.gif
Binary file added docs/source/nlp/nemo_megatron/images/sp.gif
Binary file added docs/source/nlp/nemo_megatron/images/tp.gif
27 changes: 27 additions & 0 deletions docs/source/nlp/nemo_megatron/intro.rst
@@ -0,0 +1,27 @@
NeMo Megatron
=============

Megatron :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research
team at NVIDIA. NeMo Megatron supports several types of models:

* GPT-style models (decoder only)
* T5/BART/UL2-style models (encoder-decoder)
* BERT-style models (encoder only)



.. note::
    NeMo Megatron has an Enterprise edition, which contains tools for data preprocessing, hyperparameter tuning, containers, scripts for various clouds, and more. With the Enterprise edition you also get deployment tools. Apply for `early access here <https://developer.nvidia.com/nemo-megatron-early-access>`_.


.. toctree::
:maxdepth: 1

mlm_migration
gpt/gpt_training
t5/t5_training
batching
parallelisms
prompt_learning


24 changes: 24 additions & 0 deletions docs/source/nlp/nemo_megatron/mlm_migration.rst
@@ -0,0 +1,24 @@
Migrating from Megatron-LM
--------------------------

NeMo Megatron and Megatron-LM share much of their underlying technology. You should be able to convert GPT model checkpoints trained with Megatron-LM into NeMo Megatron.
Example conversion script:

.. code-block:: bash

    <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
        --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \
        --nemo_file_path <path_to_output_nemo_file> \
        --model_type <megatron model type> \
        --tensor_model_parallel_size <tensor_model_parallel_size> \
        --pipeline_model_parallel_size <pipeline_model_parallel_size> \
        --gpus_per_node <gpus per node>

To resume training from a converted Megatron-LM checkpoint, make sure to set
``trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)``,
where ``lr-warmup-fraction`` and ``lr-decay-iters`` are the arguments used for the original Megatron-LM training,
so that the learning-rate scheduler follows the same curve.
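
For example, if the original Megatron-LM run used the hypothetical values ``--lr-warmup-fraction 0.01`` and ``--lr-decay-iters 300000``, you would set ``trainer.max_steps`` to ``round(0.01 * 300000 + 300000) = 303000``.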

44 changes: 44 additions & 0 deletions docs/source/nlp/nemo_megatron/parallelisms.rst
@@ -0,0 +1,44 @@
.. _parallelisms:

Parallelisms
------------

NeMo Megatron supports four types of parallelism, which can be mixed together arbitrarily; each is illustrated below.
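
As a minimal illustration of how the sizes combine (the GPU counts and parallel sizes here are assumptions, not recommendations): with 2 nodes of 8 GPUs each, setting ``model.tensor_model_parallel_size=2`` and ``model.pipeline_model_parallel_size=2`` leaves a data-parallel size of 4.

.. code-block:: bash

    # Illustrative numbers only: 2 nodes x 8 GPUs = 16 GPUs in total.
    # data_parallel_size = total_gpus / (tensor_parallel_size * pipeline_parallel_size)
    echo $(( (2 * 8) / (2 * 2) ))   # -> 4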

Distributed Data Parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: images/ddp.gif
:align: center
:alt: Distributed Data Parallel


Tensor Parallelism
^^^^^^^^^^^^^^^^^^

.. image:: images/tp.gif
:align: center
:alt: Tensor Parallel

Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/pp.gif
:align: center
:alt: Pipeline Parallel

Sequence Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/sp.gif
:align: center
:alt: Sequence Parallel

Parallelism nomenclature
^^^^^^^^^^^^^^^^^^^^^^^^

When reading and modifying NeMo Megatron code, you will encounter the following terms.

.. image:: images/pnom.gif
:align: center
:alt: Parallelism nomenclature
@@ -1,5 +1,7 @@
.. _promptlearning:

Prompt Learning
-------------
---------------

Within NeMo we refer to **p-tuning** and **prompt tuning** methods collectively as prompt learning. Both methods are parameter-efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model's full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoids the catastrophic forgetting issues often encountered when fine-tuning models.

4 changes: 2 additions & 2 deletions docs/source/starthere/intro.rst
@@ -150,8 +150,8 @@ If you chose to work with the ``main`` branch, we recommend using `NVIDIA's PyTo
stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:21.05-py3
FAQ
---
`FAQ <https://github.com/NVIDIA/NeMo/discussions>`_
---------------------------------------------------
Have a look at our `discussions board <https://github.com/NVIDIA/NeMo/discussions>`_ and feel free to post a question or start a discussion.


