diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst
index 1d016dd0c3a8..b02286bfecf0 100644
--- a/docs/source/nlp/quantization.rst
+++ b/docs/source/nlp/quantization.rst
@@ -184,6 +184,46 @@ This script will produce a quantized ``.nemo`` checkpoint at the experiment mana
 It can also optionally produce an exported TensorRT-LLM engine directory or a ``.qnemo`` file that can be used for inference by setting the ``export`` parameters similar to the PTQ example.
 Note that you may tweak the QAT trainer steps and learning rate if needed to achieve better model quality.
 
+NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
+------------------------------------------------------------------------------------------------------------------------
+
+If you have an FP8-quantized checkpoint produced during pre-training or fine-tuning with Transformer Engine, you can convert it directly to an FP8 TensorRT-LLM engine using ``nemo.export``.
+The API is the same as with regular ``.nemo`` and ``.qnemo`` checkpoints:
+
+.. code-block:: python
+
+    from nemo.export.tensorrt_llm import TensorRTLLM
+
+    # Build a TensorRT-LLM engine from the FP8 checkpoint, then run a quick smoke test.
+    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
+    trt_llm_exporter.export(
+        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
+        model_type="llama",
+    )
+    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
+
+The quantization-related export settings can be adjusted via ``trt_llm_exporter.export`` arguments:
+
+* ``fp8_quantized: Optional[bool] = None``: manually enables or disables FP8 quantization
+* ``fp8_kvcache: Optional[bool] = None``: manually enables or disables FP8 quantization of the KV cache
+
+By default, quantization settings are auto-detected from the NeMo checkpoint.
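+
+For example, here is a minimal sketch of overriding the auto-detection to force FP8 weight quantization while keeping the KV cache unquantized; it reuses the placeholder paths from the example above:
+
+.. code-block:: python
+
+    from nemo.export.tensorrt_llm import TensorRTLLM
+
+    # Illustrative override: quantize weights to FP8, keep the KV cache in its original precision.
+    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
+    trt_llm_exporter.export(
+        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
+        model_type="llama",
+        fp8_quantized=True,
+        fp8_kvcache=False,
+    )
+
 References
 ----------
 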