Update doc for fp8 trt-llm export (NVIDIA#10444)
* Update doc for fp8 trt-llm export

Signed-off-by: Piotr Kamiński <[email protected]>

* Apply review suggestions

Signed-off-by: Piotr Kamiński <[email protected]>

* code review 

Signed-off-by: Piotr Kamiński <[email protected]>

---------

Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
Laplasjan107 authored and Hainan Xu committed Nov 5, 2024
1 parent 5b53f1f commit 934eff2
Showing 1 changed file with 27 additions and 0 deletions.
docs/source/nlp/quantization.rst
@@ -184,6 +184,33 @@ This script will produce a quantized ``.nemo`` checkpoint at the experiment mana
It can also optionally produce an exported TensorRT-LLM engine directory or a ``.qnemo`` file that can be used for inference by setting the ``export`` parameters similar to the PTQ example.
Note that you may tweak the QAT trainer steps and learning rate if needed to achieve better model quality.

NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
--------------------------------------------------------------------------------------------------------------------

If you have an FP8-quantized checkpoint produced during pre-training or fine-tuning with Transformer Engine, you can convert it directly to an FP8 TensorRT-LLM engine using ``nemo.export``.
The API is the same as for regular ``.nemo`` and ``.qnemo`` checkpoints:

.. code-block:: python

    from nemo.export.tensorrt_llm import TensorRTLLM

    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
    trt_llm_exporter.export(
        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
        model_type="llama",
    )
    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

The export settings for quantization can be adjusted via ``trt_llm_exporter.export`` arguments:

* ``fp8_quantized: Optional[bool] = None``: manually enables/disables FP8 quantization
* ``fp8_kvcache: Optional[bool] = None``: manually enables/disables FP8 quantization for KV-cache

By default, the quantization settings are auto-detected from the NeMo checkpoint.
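
For example, a minimal sketch (reusing the ``trt_llm_exporter`` instance from the snippet above) that explicitly requests FP8 weights while keeping the KV-cache unquantized could look as follows:

.. code-block:: python

    # A sketch assuming the exporter created above; the explicit flags
    # override the settings auto-detected from the checkpoint.
    trt_llm_exporter.export(
        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
        model_type="llama",
        fp8_quantized=True,   # force FP8 weight quantization
        fp8_kvcache=False,    # keep the KV-cache in its original precision
    )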


References
----------
