From 22085adba4fc425f972f43e32a20539070364afd Mon Sep 17 00:00:00 2001
From: Jan Lasek
Date: Wed, 3 Apr 2024 12:52:30 +0200
Subject: [PATCH 1/4] Resolve engine build command for int8_sq quantization

Signed-off-by: Jan Lasek
---
 docs/source/nlp/quantization.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst
index feb5881ed09d..dda6e96e994a 100644
--- a/docs/source/nlp/quantization.rst
+++ b/docs/source/nlp/quantization.rst
@@ -66,7 +66,8 @@ The TensorRT-LLM engine can be build with ``trtllm-build`` command, see `TensorR
        --output_dir engine_dir \
        --max_batch_size 8 \
        --max_input_len 2048 \
-       --max_output_len 512
+       --max_output_len 512 \
+       --strongly_typed
 
 
@@ -74,7 +75,6 @@ Known issues
 ^^^^^^^^^^^^
 * Currently in NeMo quantizing and building TensorRT-LLM engines is limited to single-node use cases.
 * Supported and tested model family is Llama2. Quantizing other model types is experimental and may not be fully supported.
-* For INT8 SmoothQuant ``quantization.algorithm=int8_sq``, the TensorRT-LLM engine cannot be build with CLI ``trtllm-build`` command -- Python API and ``tensorrt_llm.builder`` should be used instead.
 
 Please refer to the following papers for more details on quantization techniques.

From 88850ae1663103d8ed67647d4005935fb325caa8 Mon Sep 17 00:00:00 2001
From: Jan Lasek
Date: Wed, 3 Apr 2024 15:21:22 +0200
Subject: [PATCH 2/4] Fix links and typos

Signed-off-by: Jan Lasek
---
 docs/source/nlp/quantization.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst
index dda6e96e994a..e51d13568580 100644
--- a/docs/source/nlp/quantization.rst
+++ b/docs/source/nlp/quantization.rst
@@ -1,12 +1,12 @@
 .. _megatron_quantization:
 
-Quantization
+Model Quantization
 ==========================
 
-Post Training Quantization (PTQ)
+Post-Training Quantization (PTQ)
 --------------------------------
 
-PTQ enables deploying a model in a low-precision format -- FP8, INT4 or INT8 -- for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant and INT4 AWQ.
+PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 -- for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.
 
 Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.
 
@@ -14,20 +14,20 @@ In NeMo, quantization is enabled by the Nvidia AMMO library -- a unified algorit
 
 The quantization process consists of the following steps:
 
-1. Loading a model checkpoint using appropriate parallelism strategy for evaluation
+1. Loading a model checkpoint using an appropriate parallelism strategy for evaluation
 2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
-3. Producing output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
+3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
-Loading models requires using AMMO spec defined in `megatron.core.deploy.gpt.model_specs module `_. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also soon to be the part of NeMo project and ``nemo.deploy`` and ``nemo.export`` modules, see https://github.com/NVIDIA/NeMo/pull/8690.
+Loading models requires using an AMMO spec defined in `megatron.core.inference.gpt.model_specs.py `_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also soon to be the part of NeMo project and ``nemo.deploy`` and ``nemo.export`` modules, see https://github.com/NVIDIA/NeMo/pull/8743.
 
 Quantization algorithm can also be conveniently set to ``"null"`` to perform only the weights export step using default precision for TensorRT-LLM deployment. This is useful to obtain baseline performance and accuracy results for comparison.
 
 Example
 ^^^^^^^
 
-The example below shows how to quantize the Llama2 70b model into FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving using 2 GPUs specified with ``export.inference_tensor_parallel`` parameter.
+The example below shows how to quantize the Llama2 70b model into FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is designed for serving using 2 GPUs specified with the ``export.inference_tensor_parallel`` parameter.
 
-The script should be launched correctly with the number of processes equal to tensor parallelism. This is achieved with the ``mpirun`` command below.
+The script must be launched correctly with the number of processes equal to tensor parallelism. This is achieved with the ``mpirun`` command below.
 
 .. code-block:: bash
 
@@ -73,8 +73,8 @@ Known issues
 ^^^^^^^^^^^^
-* Currently in NeMo quantizing and building TensorRT-LLM engines is limited to single-node use cases.
-* Supported and tested model family is Llama2. Quantizing other model types is experimental and may not be fully supported.
+* Currently in NeMo, quantizing and building TensorRT-LLM engines is limited to single-node use cases.
+* The supported and tested model family is Llama2. Quantizing other model types is experimental and may not be fully supported.
 
 Please refer to the following papers for more details on quantization techniques.

From 1c70b6cd38290f3e75ebb01a7fbadd9acf7c100f Mon Sep 17 00:00:00 2001
From: Jan Lasek
Date: Wed, 3 Apr 2024 15:40:37 +0200
Subject: [PATCH 3/4] Add quantization docs to ToC

Signed-off-by: Jan Lasek
---
 docs/source/index.rst | 1 +
 docs/source/nlp/quantization.rst | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 5795b57682a1..8dc74ecc771d 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -60,6 +60,7 @@ For more information, browse the developer docs for your area of interest in the
    nlp/models
    nlp/machine_translation/machine_translation
    nlp/megatron_onnx_export
+   nlp/quantization
    nlp/api
 
diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst
index e51d13568580..f2fa883d0199 100644
--- a/docs/source/nlp/quantization.rst
+++ b/docs/source/nlp/quantization.rst
@@ -1,6 +1,6 @@
 .. _megatron_quantization:
 
-Model Quantization
+Quantization
 ==========================
 
 Post-Training Quantization (PTQ)
 --------------------------------
@@ -82,6 +82,8 @@ Please refer to the following papers for more details on quantization techniques
 References
 ----------
 
+`Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation, 2020 `_
+
 `FP8 Formats for Deep Learning, 2022 `_
 
 `SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022 `_

From c5e0b973c4082efd2eed031de780b36f47aff368 Mon Sep 17 00:00:00 2001
From: Jan Lasek
Date: Fri, 5 Apr 2024 11:36:05 +0200
Subject: [PATCH 4/4] Opt for using torchrun

Signed-off-by: Jan Lasek
---
 docs/source/nlp/quantization.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst
index f2fa883d0199..52d06997fc5e 100644
--- a/docs/source/nlp/quantization.rst
+++ b/docs/source/nlp/quantization.rst
@@ -27,11 +27,11 @@ Example
 ^^^^^^^
 
 The example below shows how to quantize the Llama2 70b model into FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is designed for serving using 2 GPUs specified with the ``export.inference_tensor_parallel`` parameter.
 
-The script must be launched correctly with the number of processes equal to tensor parallelism. This is achieved with the ``mpirun`` command below.
+The script must be launched correctly with the number of processes equal to tensor parallelism. This is achieved with the ``torchrun`` command below.
 
 .. code-block:: bash
 
-    mpirun -n 8 python examples/nlp/language_modeling/megatron_llama_quantization.py \
+    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_llama_quantization.py \
        model_file=llama2-70b-base-bf16.nemo \
        tensor_model_parallel_size=8 \
        pipeline_model_parallel_size=1 \