Update doc for fp8 trt-llm export (NVIDIA#10444)
* Update doc for fp8 trt-llm export

Signed-off-by: Piotr Kamiński <[email protected]>

* Apply review suggestions

Signed-off-by: Piotr Kamiński <[email protected]>

* code review 

Signed-off-by: Piotr Kamiński <[email protected]>

---------

Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
Laplasjan107 authored and Hainan Xu committed Nov 5, 2024
1 parent 5b53f1f commit 934eff2
Showing 1 changed file with 27 additions and 0 deletions.
docs/source/nlp/quantization.rst
@@ -184,6 +184,33 @@ This script will produce a quantized ``.nemo`` checkpoint at the experiment mana
It can also optionally produce an exported TensorRT-LLM engine directory or a ``.qnemo`` file that can be used for inference by setting the ``export`` parameters similar to the PTQ example.
Note that you may tweak the QAT trainer steps and learning rate if needed to achieve better model quality.

NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
--------------------------------------------------------------------------------------------------------------------

If you have an FP8-quantized checkpoint produced during pre-training or fine-tuning with Transformer Engine, you can convert it directly to an FP8 TensorRT-LLM engine using ``nemo.export``.
The API is the same as for regular ``.nemo`` and ``.qnemo`` checkpoints:

.. code-block:: python

    from nemo.export.tensorrt_llm import TensorRTLLM

    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
    trt_llm_exporter.export(
        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
        model_type="llama",
    )
    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

The export settings for quantization can be adjusted via ``trt_llm_exporter.export`` arguments:

* ``fp8_quantized: Optional[bool] = None``: manually enables/disables FP8 quantization
* ``fp8_kvcache: Optional[bool] = None``: manually enables/disables FP8 quantization for KV-cache

By default, the quantization settings are auto-detected from the NeMo checkpoint.
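
For example, a minimal sketch (reusing the ``trt_llm_exporter`` instance from the snippet above) that explicitly requests FP8 weights while keeping the KV-cache unquantized could look as follows:

.. code-block:: python

    # A sketch assuming the exporter created above; the explicit flags
    # override the settings auto-detected from the checkpoint.
    trt_llm_exporter.export(
        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
        model_type="llama",
        fp8_quantized=True,   # force FP8 weight quantization
        fp8_kvcache=False,    # keep the KV-cache in its original precision
    )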


References
----------
