NeVA Tutorial Notebook (#8217)
* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* add inference via script

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* add codeblocks to run torchrun in notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

---------

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
PannuMuthu and Pratyush Muthukumar authored Feb 16, 2024
1 parent 536f573 commit 977f139
Showing 1 changed file with 366 additions and 0 deletions.
366 changes: 366 additions & 0 deletions tutorials/multimodal/NeVA Tutorial.ipynb
@@ -0,0 +1,366 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a2225742c5996304",
"metadata": {},
"source": [
"# NeVA Training / Inference Tutorial\n",
"\n",
"### Note:\n",
"Currently, this notebook must be run in a NeMo container. An example command to launch the container:\n",
"\n",
"```\n",
"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g \\\n",
" -p 8888:8888 --ulimit memlock=-1 --ulimit \\\n",
" stack=67108864 <your_nemo_container>\n",
"```\n",
"\n",
"## Introduction\n",
"\n",
"This notebook illustrates how to train and perform inference using NeVA with the NeMo Toolkit. NeVA originates from [LLaVA](https://github.com/haotian-liu/LLaVA) (Large Language and Vision Assistant) and is a powerful multimodal image-text instruction tuned model optimized within the NeMo framework. \n",
"\n",
"\n",
"This tutorial will guide you through the following topics:\n",
"1. Training a NeVA model\n",
"2. Performing inference with the trained model\n",
"\n",
"## Datasets\n",
"\n",
"After downloading all below datasets for pretraining and instruction tuning, your dataset directory should look something similar to:\n",
"\n",
"```\n",
"LLaVA-Pretrain-LCS-558K\n",
"├── blip_laion_cc_sbu_558k.json\n",
"├── images\n",
"LLaVA-Instruct-mixture\n",
"├── llava_v1_5_mix665k.json\n",
"└── images\n",
" └── ...\n",
"```\n",
"\n",
"### Pre-Training Dataset\n",
"\n",
"The pre-training dataset is open-sourced from the LLaVA implementation and can be downloaded [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain). The dataset consists of a 558K subset of the LAION-CC-SBU dataset with BLIP captions. \n",
"\n",
"The associated images for pretraining can be downloaded via HuggingFace [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip).\n",
"\n",
"### Instruction Tuning Dataset\n",
"\n",
"The instruction tuning annotations are sourced from the LLaVA implementation and are available [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json).\n",
"\n",
"The associated images for the mixture instruction tuning annotations can be found [here](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#visual-instruction-tuning). After extracting, the data should be formatted as follows:\n",
"\n",
"```\n",
" images\n",
" ├── coco\n",
" │ └── train2017\n",
" ├── gqa\n",
" │ └── images\n",
" ├── ocr_vqa\n",
" │ └── images\n",
" ├── textvqa\n",
" │ └── train_images\n",
" └── vg\n",
" ├── VG_100K\n",
" └── VG_100K_2\n",
"```\n",
"\n",
"## Training\n",
"\n",
"\n",
"### Feature Alignment Pre-Training\n",
"\n",
"We provide a set of scripts for pre-training and fine-tuning which can be kicked off with CLI flags defining specified arguments. \n",
"\n",
"An example of a pre-training script execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3930351e",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! torchrun --nproc_per_node=4 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \\\n",
" ++cluster_type=BCP \\\n",
" trainer.precision=bf16 \\\n",
" trainer.num_nodes=1 \\\n",
" trainer.devices=4 \\\n",
" trainer.val_check_interval=1000 \\\n",
" trainer.limit_val_batches=5 \\\n",
" trainer.log_every_n_steps=1 \\\n",
" trainer.max_steps=1000 \\\n",
" model.megatron_amp_O2=True \\\n",
" model.micro_batch_size=1 \\\n",
" model.global_batch_size=2 \\\n",
" model.tensor_model_parallel_size=4 \\\n",
" model.pipeline_model_parallel_size=1 \\\n",
" model.mcore_gpt=True \\\n",
" model.transformer_engine=True \\\n",
" model.data.data_path=/path/to/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json \\\n",
" model.data.image_folder=/path/to/dataset/LLaVA-Pretrain-LCS-558K/images \\\n",
" model.tokenizer.library=sentencepiece \\\n",
" model.tokenizer.model=/path/to/tokenizer/model \\\n",
" model.encoder_seq_length=4096 \\\n",
" model.num_layers=32 \\\n",
" model.hidden_size=4096 \\\n",
" model.ffn_hidden_size=16384 \\\n",
" model.num_attention_heads=32 \\\n",
" model.normalization=layernorm1p \\\n",
" model.do_layer_norm_weight_decay=False \\\n",
" model.apply_query_key_layer_scaling=True \\\n",
" model.activation=squared-relu \\\n",
" model.headscale=False \\\n",
" model.position_embedding_type=rope \\\n",
" model.rotary_percentage=0.5 \\\n",
" model.num_query_groups=null \\\n",
" model.data.num_workers=0 \\\n",
" model.mm_cfg.llm.from_pretrained=/path/to/checkpoint \\\n",
" model.mm_cfg.llm.model_type=nvgpt \\\n",
" model.data.conv_template=nvgpt \\\n",
" model.mm_cfg.vision_encoder.from_pretrained='openai/clip-vit-large-patch14' \\\n",
" model.mm_cfg.vision_encoder.from_hf=True \\\n",
" model.data.image_token_len=256 \\\n",
" model.optim.name=\"fused_adam\" \\\n",
" exp_manager.create_checkpoint_callback=True \\\n",
" exp_manager.create_wandb_logger=False \\\n",
" exp_manager.wandb_logger_kwargs.project=neva_demo"
]
},
{
"cell_type": "markdown",
"id": "6b619e0a",
"metadata": {},
"source": [
"\n",
"\n",
"**Note**: To initialize training a model from scratch rather than from a pretrained checkpoint, you may specify `null` instead of a path in the CLI arguments.\n",
"\n",
"### Image-Language Pair Instruction Fine-Tuning\n",
"\n",
"Fine-tuning can also be run from within the container via a similar command leveraging the `neva_finetune.py` script.\n",
"\n",
"An example of an image-text pair instruction tuning script execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97963224",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! torchrun --nproc_per_node=4 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_finetune.py \\\n",
"++cluster_type=BCP \\\n",
" trainer.precision=bf16 \\\n",
" trainer.num_nodes=1 \\\n",
" trainer.devices=1 \\\n",
" trainer.val_check_interval=100 \\\n",
" trainer.limit_val_batches=50 \\\n",
" trainer.max_steps=4900 \\\n",
" model.megatron_amp_O2=True \\\n",
" model.micro_batch_size=4 \\\n",
" model.global_batch_size=32 \\\n",
" model.tensor_model_parallel_size=1 \\\n",
" model.pipeline_model_parallel_size=1 \\\n",
" model.mcore_gpt=True \\\n",
" model.transformer_engine=True \\\n",
" model.data.data_path=/path/to/dataset/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json \\\n",
" model.data.image_folder=/path/to/dataset/LLaVA-Pretrain-LCS-558K/images \\\n",
" model.tokenizer.library=megatron \\\n",
" model.tokenizer.model=/path/to/tokenizer \\\n",
" model.encoder_seq_length=4096 \\\n",
" model.num_layers=24 \\\n",
" model.hidden_size=2048 \\\n",
" model.ffn_hidden_size=5440 \\\n",
" model.num_attention_heads=16 \\\n",
" model.normalization=layernorm1p \\\n",
" model.do_layer_norm_weight_decay=False \\\n",
" model.apply_query_key_layer_scaling=True \\\n",
" model.activation=fast-swiglu \\\n",
" model.headscale=False \\\n",
" model.position_embedding_type=rope \\\n",
" model.rotary_percentage=0.5 \\\n",
" model.num_query_groups=null \\\n",
" model.data.num_workers=8 \\\n",
" model.mm_cfg.llm.from_pretrained=/path/to/checkpoint \\\n",
" model.mm_cfg.llm.model_type=nvgpt \\\n",
" exp_manager.create_checkpoint_callback=True \\\n",
" model.data.conv_template=nvgpt \\\n",
" model.mm_cfg.vision_encoder.from_pretrained='openai/clip-vit-large-patch14' \\\n",
" model.mm_cfg.vision_encoder.from_hf=True \\\n",
" model.data.image_token_len=256 \\\n",
" model.optim.name=\"fused_adam\""
]
},
{
"cell_type": "markdown",
"id": "d69e937c",
"metadata": {},
"source": [
"## Inference\n",
"\n",
"### From Pre-trained Checkpoints\n",
"\n",
"If you would like to use NeVA for inference from pre-trained checkpoint via HuggingFace, you can convert from HuggingFace to `.nemo` first.\n",
"\n",
"First, download the model checkpoint from HuggingFace [here](https://huggingface.co/liuhaotian/llava-v1.5-7b). The tokenizer (stored as `tokenizer.model` within the pretrained checkpoint) must be modified with the following commands:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d30003f",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! cd /opt/sentencepiece/src/\n",
"! protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto\n",
"! python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \\\n",
"--input_file /path/to/tokenizer.model \\\n",
"--output_file /path/to/tokenizer_neva.model \\\n",
"--is_userdefined \\\n",
"--tokens \"<extra_id_0>\" \"<extra_id_1>\" \"<extra_id_2>\" \"<extra_id_3>\" \\\n",
" \"<extra_id_4>\" \"<extra_id_5>\" \"<extra_id_6>\" \"<extra_id_7>\""
]
},
{
"cell_type": "markdown",
"id": "470c093b",
"metadata": {},
"source": [
"Finally, convert to `.nemo` via the provided script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f398c26",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! python /opt/NeMo/examples/multimodal/mllm/neva/convert_hf_llava_to_neva.py \\\n",
"--in-file /path/to/llava-v1.5-7b \\\n",
"--out-file /path/to/llava-v1.5-7b.nemo \\\n",
"--tokenizer-model /path/to/tokenizer_neva.model"
]
},
{
"cell_type": "markdown",
"id": "5235639a",
"metadata": {},
"source": [
"### Running Inference\n",
"\n",
"NeVA inference via the NeMo framework can be quickly spun up via the NeMo Launcher and a few modifications to use the default NeVA inference config file.\n",
"\n",
"Inference can be run with a similar command leveraging the provided inference script `neva_evaluation.py` within the container.\n",
"\n",
"An example of an inference script execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee0156ea",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_evaluation.py \\\n",
"tensor_model_parallel_size=1 \\\n",
"pipeline_model_parallel_size=1 \\\n",
"neva_model_file=/path/to/checkpoint \\\n",
"trainer.devices=1 \\\n",
"trainer.precision=bf16 \\\n",
"prompt_file=/path/to/prompt/file \\\n",
"inference.images_base_path=/path/to/image \\\n",
"output_file=path/for/output/file/ \\\n",
"inference.temperature=0.2 \\\n",
"inference.top_k=0 \\\n",
"inference.top_p=0.9 \\\n",
"inference.greedy=False \\\n",
"inference.add_BOS=False \\\n",
"inference.all_probs=False \\\n",
"inference.repetition_penalty=1.2 \\\n",
"inference.insert_image_token=null \\\n",
"inference.tokens_to_generate=256 \\\n",
"quantization.algorithm=awq \\\n",
"quantization.enable=False"
]
},
{
"cell_type": "markdown",
"id": "7d989385",
"metadata": {},
"source": [
"#### Running Inference via Launcher\n",
"\n",
"Inference can also be run via the NeMo Launcher, where parameters are specified in the inference config file rather than CLI arguments. To customize the default config provided in `conf/config.yaml` for NeVA inference, see below.\n",
"\n",
"##### Inference Config Setup\n",
"1. Modify `fw_inference` within `defaults` to use `neva/inference` \n",
"2. In `stages`, ensure that `fw_inference` is included\n",
"3. Within the `inference.yaml` default NeVA inference config file, ensure that the path to the `prompt` file, `neva_model_file`, and `images_base_path` within `inference` are specified.\n",
"\n",
"Once either the necessary checkpoints have been loaded or the training workflow is complete, inference can be executed within the launcher pipeline with the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68d434ff",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"! python3 main.py"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
