Merged
8 changes: 5 additions & 3 deletions README.md
@@ -58,16 +58,18 @@ Some of the exciting new features include:
* **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization, as well as fine-grained KV Cache quantization. Previously only per-tensor KV cache quantization was supported. Now, you can quantize KV cache with `per-head` scales and run with vLLM. Examples of more generalized attention and kv cache quantization can be found in the [experimental folder](experimental/attention).


### Supported Formats
* Activation Quantization: W8A8 (int8 and fp8), MXFP8 (experimental)
* Mixed Precision: W4A16, W8A16, MXFP8A16 (experimental), NVFP4 (W4A4 and W4A16 support)
### Supported Precisions and Types
* Activation Quantization: W8A8 (int8 and fp8), W4AFP8, Microscale (NVFP4, MXFP4, MXFP8)
* Mixed Precision: W4A16, W8A16, MXFP8A16, MXFP4A16, NVFP4A16
* Attention and KV Cache Quantization: FP8, NVFP4

### Supported Algorithms
* Simple PTQ
* GPTQ
* AWQ
* SmoothQuant
* AutoRound
* Rotation-based (SpinQuant, QuIP)

### Quantizing your model, step-by-step
