Add INT4 compressed-tensors + LoRA support #1
Merged
sheikheddy merged 22 commits into main on Nov 17, 2025
Conversation
This commit enables vLLM to support INT4 quantized models using compressed-tensors with LoRA adapters.

## Problem
LoRA injection previously assumed tensors existed directly, but compressed-tensors quantized models only expose packed buffers. Direct access to `weight.shape` would fail or return incorrect dimensions due to bit-packing.

## Solution
Implemented a multi-tiered fallback strategy for obtaining correct tensor dimensions:
1. Layer-specific attributes (`org_vocab_size`, `embedding_dim`)
2. Generic layer attributes (`input_size`, `output_size`)
3. `weight_shape` parameter (stores unpacked dims for compressed-tensors)
4. Fallback to tensor shape

## Changes
- `vllm/lora/models.py`: Fixed dummy LoRA creation to use layer attributes and `weight_shape` instead of direct shape access
- `tests/lora/test_quant_model.py`: Added INT4 compressed-tensors test case with `neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4`
- `examples/offline_inference/lora_with_quantization_inference.py`: Added compressed-tensors example

## Testing
- Added integration test with compressed-tensors INT4 model
- Follows existing patterns from AWQ/GPTQ/BitsAndBytes + LoRA support
- All modified files pass Python syntax validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: sheikheddy <sheikheddy@gmail.com>
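A minimal sketch of the four-tier lookup described above. The helper name `get_unpacked_dims` and the toy layer class are illustrative, not vLLM's actual implementation (which lives in `vllm/lora/models.py`), and the `[output_size, input_size]` ordering assumed for `weight_shape` is also an assumption:

```python
def get_unpacked_dims(layer):
    """Return (input_size, output_size) without touching packed buffers."""
    # Tier 1: layer-specific attributes (e.g. vocab embeddings).
    if hasattr(layer, "org_vocab_size") and hasattr(layer, "embedding_dim"):
        return layer.org_vocab_size, layer.embedding_dim
    # Tier 2: generic linear-layer attributes.
    if hasattr(layer, "input_size") and hasattr(layer, "output_size"):
        return layer.input_size, layer.output_size
    # Tier 3: compressed-tensors keeps the unpacked dims in weight_shape.
    ws = getattr(layer, "weight_shape", None)
    if ws is not None:
        return int(ws[1]), int(ws[0])  # assumed [output_size, input_size]
    # Tier 4: last resort -- the raw tensor shape, which is wrong for
    # bit-packed weights (input dim is divided by the pack factor).
    out_size, in_size = layer.weight.shape
    return in_size, out_size


class _PackedLayer:
    # Hypothetical compressed-tensors layer: only weight_shape is exposed.
    weight_shape = [11008, 4096]

dims = get_unpacked_dims(_PackedLayer())
```
The point of ordering the tiers this way is that the raw tensor shape is consulted only when no unpacked metadata exists, so packed buffers never leak incorrect dimensions into LoRA initialization.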
Fixes INT4 compressed-tensors + LoRA for MoE models (e.g., Kimi K2 Thinking).

## Problem
`CompressedTensorsWNA16MoEMethod` and `CompressedTensorsWNA16MarlinMoEMethod` did not set required layer attributes (`hidden_size`, `intermediate_size_per_partition`, `local_num_experts`) that the `FusedMoEWithLoRA` wrapper expects to access. This caused LoRA to fail with MoE models using compressed-tensors quantization, even though the weights were accessible.

## Solution
Added layer attribute initialization in the `create_weights()` methods of both:
- `CompressedTensorsWNA16MoEMethod`
- `CompressedTensorsWNA16MarlinMoEMethod`

These attributes are set before weight creation, matching the pattern used by other MoE methods (e.g., `CompressedTensorsW8A8Fp8MoEMethod`).

## Impact
- Enables LoRA with Kimi K2 Thinking (INT4 MoE + compressed-tensors)
- Follows existing patterns from FP8 MoE + LoRA support
- No changes to weight layout or kernel behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: sheikheddy <sheikheddy@gmail.com>
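The fix is small enough to sketch. The class name `WNA16MoEMethodSketch` and the argument list are simplified stand-ins for the real `create_weights()` signatures; only the three attribute assignments reflect the change described above:

```python
class WNA16MoEMethodSketch:
    """Illustrative stand-in for the WNA16 MoE quantization methods."""

    def create_weights(self, layer, num_experts: int, hidden_size: int,
                       intermediate_size_per_partition: int, **kwargs):
        # Set the attributes the FusedMoEWithLoRA wrapper reads, *before*
        # any packed weights are created on the layer.
        layer.hidden_size = hidden_size
        layer.intermediate_size_per_partition = intermediate_size_per_partition
        layer.local_num_experts = num_experts
        # ... packed weight creation would follow here, unchanged ...


class _Layer:
    pass

layer = _Layer()
WNA16MoEMethodSketch().create_weights(
    layer, num_experts=8, hidden_size=4096,
    intermediate_size_per_partition=1024)
```
Because the attributes are plain metadata, setting them changes nothing about the weight layout or the kernels; it only makes the layer introspectable by the LoRA wrapper.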
Fixed incorrect fallback logic for embedding layers where dimensions were reversed.

## Problem
For embedding layers with shape `[vocab_size, embedding_dim]`:
- `input_dim` should be `vocab_size` (`shape[0]`)
- `output_dim` should be `embedding_dim` (`shape[1]`)
- `embeddings_tensor_dim` should be `embedding_dim` (`shape[1]`)

Previous code had:
- `input_dim` fallback: `shape[1]` ❌ (was getting `embedding_dim` instead of `vocab_size`)
- `output_dim` fallback: `shape[0]` ❌ (was getting `vocab_size` instead of `embedding_dim`)
- `embeddings_tensor_dim`: used `input_size` instead of `output_size` ❌

## Fix
Corrected all fallback paths to use the proper dimensions for embedding layers:
- `input_dim`: `shape[0]` (`vocab_size`)
- `output_dim`: `shape[1]` (`embedding_dim`)
- `embeddings_tensor_dim`: `shape[1]` (`embedding_dim`)

Also fixed the `elif` chain to check `output_size` instead of `input_size` for `embeddings_tensor_dim`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: sheikheddy <sheikheddy@gmail.com>
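The corrected mapping can be condensed into a few lines. The function name `embedding_lora_dims` is hypothetical; it only exists to show which index feeds which dimension after the fix:

```python
def embedding_lora_dims(weight_shape):
    """For an embedding weight of shape [vocab_size, embedding_dim],
    return (input_dim, output_dim, embeddings_tensor_dim)."""
    vocab_size, embedding_dim = weight_shape
    input_dim = vocab_size               # shape[0]; was shape[1] before the fix
    output_dim = embedding_dim           # shape[1]; was shape[0] before the fix
    embeddings_tensor_dim = embedding_dim  # shape[1]; previously used input_size
    return input_dim, output_dim, embeddings_tensor_dim


# TinyLlama-like dimensions, purely for illustration.
dims = embedding_lora_dims((32000, 2048))
```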
Force-pushed from 4a746ad to 8fd7c16
Extends LoRA support to the NVFP4 (W4A4) and W4A8 MoE quantization methods.

## Problem
`CompressedTensorsW4A4MoeMethod` and `CompressedTensorsW4A8Int8MoEMethod` did not set the layer attributes required for LoRA compatibility.

## Solution
Added layer attribute initialization in `create_weights()` for both:
- `CompressedTensorsW4A4MoeMethod` (NVFP4)
- `CompressedTensorsW4A8Int8MoEMethod`

## Impact
- Enables LoRA with NVFP4-quantized MoE models
- Enables LoRA with W4A8 INT8 MoE models (CPU/ARM)
- Completes LoRA support for all compressed-tensors MoE variants

Signed-off-by: sheikheddy <sheikheddy@gmail.com>
Summary
This PR enables vLLM to support INT4 quantized models using compressed-tensors with LoRA adapters. Previously, LoRA injection assumed that tensors existed directly, but quantized models only expose packed buffers.
Problem
The LoRA dummy creation code in `vllm/lora/models.py` directly accessed `module.base_layer.weight.shape` to determine tensor dimensions. For compressed-tensors quantized models:
- Layers expose `weight_packed` (int32 packed buffers) instead of regular weight tensors
- `weight_packed` has shape `[output_size, input_size // pack_factor]` due to bit-packing

Solution
Implemented a multi-tiered fallback strategy to get correct dimensions:
1. Layer-specific attributes (`org_vocab_size`, `embedding_dim`)
2. Generic layer attributes (`input_size`, `output_size`)
3. `weight_shape` parameter (stores unpacked dimensions for compressed-tensors)
4. Fallback to the tensor shape

This approach works for all quantization methods (AWQ, GPTQ, BitsAndBytes, compressed-tensors) and all layer types.
Changes Made
1. Fixed Dummy LoRA Creation (`vllm/lora/models.py`)
   - Replaced direct `weight.shape` access with a robust fallback chain
2. Added Integration Tests (`tests/lora/test_quant_model.py`)
   - Added `neuralmagic/TinyLlama-1.1B-Chat-v1.0-INT4` to the test model list
3. Added Example Code (`examples/offline_inference/lora_with_quantization_inference.py`)

Technical Details
How LoRA Works with Quantization
LoRA operates on activations, not weights: the low-rank update `B(A·x)` is computed from the layer input and added to the base layer's output.
This is why the integration works seamlessly: LoRA never needs to touch the packed weights directly.
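A tiny numeric illustration of that data flow, using plain Python lists rather than vLLM's tensors. `lora_forward` and `matmul` are illustrative helpers following the standard LoRA formulation, not vLLM code; the key point is that the base output is consumed as-is, so the quantized weights stay read-only:

```python
def matmul(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def lora_forward(base_out, A, B, x, scale=1.0):
    # base_out = W_q(x), already computed by the quantized kernel.
    # The LoRA path only needs the activation x, never W_q itself.
    delta = matmul(B, matmul(A, x))  # B(A x): rank-r bottleneck
    return [o + scale * d for o, d in zip(base_out, delta)]

# Rank-1 example: A maps 3 -> 1, B maps 1 -> 2.
x = [1.0, 2.0, 3.0]
A = [[1.0, 0.0, 1.0]]   # 1x3
B = [[2.0], [0.0]]      # 2x1
y = lora_forward([10.0, 20.0], A, B, x)
# A x = [4.0]; B(A x) = [8.0, 0.0]; y = [18.0, 20.0]
```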
Compatibility
The fix maintains backward compatibility with the existing AWQ, GPTQ, and BitsAndBytes + LoRA paths.
Testing
Run the Integration Test
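A plausible invocation of the test added above; the `-k` selector is an assumption (adjust it to the actual test ids), only the test file path comes from this PR:

```shell
# Run the LoRA + quantized-model tests, filtering for the new INT4 case.
pytest tests/lora/test_quant_model.py -v -k "INT4"
```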
Run the Example
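The example script added by this PR is a standard offline-inference script, so running it directly should suffice (assuming a vLLM checkout and a GPU environment):

```shell
python examples/offline_inference/lora_with_quantization_inference.py
```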
Performance Characteristics
References
🤖 Generated with Claude Code