docs/source/models/vlm.rst: 3 changes (2 additions, 1 deletion)
@@ -20,7 +20,8 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
-* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the huggingface implementation.
+* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.

We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.

Offline Batched Inference
docs/source/quantization/fp8.rst: 16 changes (10 additions, 6 deletions)
@@ -13,7 +13,7 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.

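As a quick sanity check on that bound: the largest finite E5M2 value has exponent field 30 (bias 15) and mantissa 1.75, giving 1.75 × 2^15 = 57344.
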
Quick Start with Online Dynamic Quantization
--------------------------------------
+--------------------------------------------

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.

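As a minimal sketch of the constructor route (the model name below is only an illustrative assumption):

.. code-block:: python

    from vllm import LLM

    # Dynamic FP8 quantization is requested via the constructor argument
    # described above; no calibration data is needed.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

    # generate() returns one RequestOutput per prompt.
    outputs = llm.generate("Once upon a time")
    print(outputs[0].outputs[0].text)
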
@@ -173,30 +173,34 @@ Here we detail the structure for the FP8 checkpoints.

The following must be present in the model's ``config.json``:

-.. code-block:: yaml
+.. code-block:: text

"quantization_config": {
"quant_method": "fp8",
"activation_scheme": "static" or "dynamic"
-},
+}


Each quantized layer in the state_dict will have these tensors:

-* If the config has `"activation_scheme": "static"`:
+* If the config has ``"activation_scheme": "static"``:

.. code-block:: text

model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.input_scale < F32
model.layers.0.mlp.down_proj.weight_scale < F32

-* If the config has `"activation_scheme": "dynamic"`:
+* If the config has ``"activation_scheme": "dynamic"``:

.. code-block:: text

model.layers.0.mlp.down_proj.weight < F8_E4M3
model.layers.0.mlp.down_proj.weight_scale < F32


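A checkpoint shard can be inspected for these tensors with ``safetensors``; the sketch below is only illustrative and assumes a local shard named ``model.safetensors``:

.. code-block:: python

    from safetensors import safe_open

    # List the FP8 scale tensors stored in one checkpoint shard.
    # "model.safetensors" is a placeholder for an actual shard file name.
    with safe_open("model.safetensors", framework="pt") as f:
        for name in f.keys():
            if name.endswith(("weight_scale", "input_scale", "kv_scale")):
                print(name, f.get_tensor(name).dtype)
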
Additionally, quantized checkpoints can contain `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_, specified through the ``.kv_scale`` parameter present on the Attention Module, such as:

.. code-block:: text
-model.layers.0.self_attn.kv_scale < F32
+
+model.layers.0.self_attn.kv_scale < F32