8 changes: 8 additions & 0 deletions docs/source/en/quantization/gptq.md
@@ -122,6 +122,14 @@ model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", de

[Marlin](https://github.com/IST-DASLab/marlin) is a 4-bit-only CUDA GPTQ kernel, highly optimized for the NVIDIA A100 (Ampere) architecture, where the loading, dequantization, and execution of post-dequantized weights are highly parallelized, offering a substantial inference improvement over the original CUDA GPTQ kernel. Marlin is only available for quantized inference and does not support model quantization.

Marlin inference can be activated via the `backend` parameter of `GPTQConfig` when GPTQModel is used:

```py
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "{your_username}/opt-125m-gptq",
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, backend="marlin"),
)
```
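A quick way to confirm the Marlin-backed model loads and runs is a short generation call. This is a minimal sketch, assuming the quantized checkpoint reuses the tokenizer of the base `facebook/opt-125m` model:

```py
from transformers import AutoTokenizer

# assumption: the quantized repo shares the base OPT tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```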

## ExLlama
