[WIP] [DSV4] Quantization Support#41276
Conversation
Code Review
This pull request updates the DeepseekV4 model implementation by adding a packed_modules_mapping for fused layers and implementing a safe initialization for scale_fmt that defaults to 'ue8m0' when the quantization configuration is missing or not a dictionary. I have no feedback to provide.
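A minimal sketch of the defaulting behavior described above, assuming the scale format is read from the Hugging Face quantization config; the function and argument names here are illustrative, not the exact vLLM code:

```python
def resolve_scale_fmt(hf_config) -> str:
    """Return the configured scale_fmt, falling back to 'ue8m0'.

    Hypothetical helper: the fallback applies when the quantization
    config is missing or is not a dictionary.
    """
    quant_cfg = getattr(hf_config, "quantization_config", None)
    if not isinstance(quant_cfg, dict):
        return "ue8m0"
    return quant_cfg.get("scale_fmt", "ue8m0")
```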
dsikka
left a comment
This pathway is likely going to see multiple updates over the next few weeks, so it would be good to add some form of smoke test.
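For example, a minimal pytest sketch along these lines could serve as that smoke test; the checkpoint name and generation settings are placeholders, assuming the vLLM offline `LLM` API:

```python
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("model_id", ["RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8"])
def test_dsv4_quantized_smoke(model_id):
    # Load the quantized checkpoint and generate a few tokens greedily.
    llm = LLM(model=model_id, tensor_parallel_size=4)
    params = SamplingParams(max_tokens=8, temperature=0.0)
    outputs = llm.generate(["The capital of France is"], params)
    # A smoke test only asserts that generation produces non-empty text.
    assert outputs and outputs[0].outputs[0].text.strip()
```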
Force-pushed from 6a81e43 to 33f36d4
Force-pushed from 1e6b8a1 to f910a73
FYI, I'm seeing a slight accuracy loss with this model. I've ruled out output_dtype as the cause in #41533, which makes me suspect the quantization of the indexer/compressor wkv weights. I'm currently updating the checkpoint to skip that quantization and will post accuracy evaluations.
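A sketch of how the checkpoint could be regenerated with LLM Compressor so those weights are left unquantized, assuming a QuantizationModifier ignore list; the module-name regexes and base model id are assumptions, not the exact values used for this checkpoint:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize Linear weights to NVFP4 while skipping the indexer / compressor
# wkv projections. The regexes below are assumed module-name patterns.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*indexer.*",      # assumed indexer projection pattern
        "re:.*kv_a_proj.*",    # assumed compressor wkv projection pattern
    ],
)

# Base model id is illustrative; NVFP4 typically needs a calibration
# dataset for global scales, omitted here for brevity.
oneshot(model="deepseek-ai/DeepSeek-V4-Flash", recipe=recipe)
```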
Force-pushed from f910a73 to f5fc438
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from f5fc438 to 322ca21
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
DeepSeek-V4-Flash-NVFP4-FP8
Model Optimizations
This model was obtained using the following LLM Compressor branch: vllm-project/llm-compressor#2647
Deployment
vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"
Evaluation
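A minimal sanity check against the served endpoint via the OpenAI-compatible API, assuming the port from the deployment command above; the prompt and client settings are illustrative and not an accuracy benchmark:

```python
from openai import OpenAI

# Port 8089 matches the serve command above; api_key is unused by the local server.
client = OpenAI(base_url="http://localhost:8089/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```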
For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack