How to deploy a quantized FP8 multimodal LLM? #681

Open
zhishao opened this issue Jan 22, 2025 · 1 comment

Comments


zhishao commented Jan 22, 2025

Do both the visual encoder and the language model need to be quantized to FP8?
How should the config.pbtxt be modified?
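For reference, a minimal sketch of the relevant block in the tensorrt_llm model's config.pbtxt, assuming the layout from NVIDIA's tensorrtllm_backend repository. The engine path is a placeholder and the batching mode shown is only one common choice; adapt both to the actual quantized-engine directory and deployment:

```
# Sketch only: point the tensorrt_llm Triton model at the directory
# holding the FP8-quantized engine. The path below is an assumption.
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/models/internvl2/llm_fp8_engine"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```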


zhishao commented Jan 23, 2025

When I launch an InternVL2 multimodal model, inference works normally. However, after I quantize its language-model component, the model loads successfully but fails during inference with:

```
[TensorRT-LLM][ERROR] Request embedding table data type doesn't match model weight data type.
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Request embedding table data type doesn't match model weight data type.
```
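A plausible cause, stated as an assumption rather than a confirmed diagnosis: the prompt embedding table produced by the visual encoder keeps its original dtype (e.g. float16), while the quantized engine's embedding weights were built in a different dtype (e.g. bfloat16). A minimal client-side sketch that casts the visual features before passing them as the embedding table; the hard-coded dtype and the helper name are hypothetical, not TensorRT-LLM API:

```python
import torch

# Assumption: the quantized LLM engine was built with bf16 weights. Read the
# real value from the engine's config.json instead of hard-coding it.
ENGINE_DTYPE = torch.bfloat16

def prepare_prompt_table(visual_features: torch.Tensor) -> torch.Tensor:
    """Cast the vision-encoder output so its dtype matches the LLM weights.

    The TensorRT-LLM runtime rejects a prompt embedding table whose dtype
    differs from the model weight dtype, which is exactly the error above.
    (prepare_prompt_table is a hypothetical helper, not a library function.)
    """
    return visual_features.to(ENGINE_DTYPE).contiguous()
```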
