
Fix deepseek awq v3 #3450

Merged
zhyncs merged 19 commits into main from fix-dpsk-v3-awq on Feb 12, 2025

Conversation

@hnyls2002
Collaborator

@hnyls2002 hnyls2002 commented Feb 10, 2025

python -m sglang.launch_server --model-path cognitivecomputations/DeepSeek-V3-AWQ --tp-size 8 --trust-remote --disable-mla
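
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (a minimal sketch assuming the default port 30000; adjust if --port is passed):

import requests

# Minimal smoke test for the OpenAI-compatible endpoint sglang exposes.
# Port 30000 is the default; change it if the server was launched with --port.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "cognitivecomputations/DeepSeek-V3-AWQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])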

@hnyls2002 hnyls2002 marked this pull request as draft February 10, 2025 04:43
@halexan

halexan commented Feb 10, 2025

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

@chenchunhui97

> After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

I'm giving it a try...

@Xu-Chen
Contributor

Xu-Chen commented Feb 10, 2025

We should also introduce a Triton fused MoE kernel like moe_wna16.
The AWQ Marlin kernel may only get around 10 tokens/s on 8*A100.
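
For context, "wna16" means n-bit weights with 16-bit activations: the expert weights stay int4 in memory and are dequantized group-wise inside the fused kernel. A minimal PyTorch sketch of that group-wise dequantization (illustrative only, not the actual Triton kernel):

import torch

def dequant_wna16(qweight: torch.Tensor, scales: torch.Tensor,
                  zeros: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Group-wise weight-only dequantization: w = (q - z) * s per group.

    qweight: [in_features, out_features], int4 values already unpacked to int8
    scales, zeros: [in_features // group_size, out_features]
    """
    n_groups = qweight.shape[0] // group_size
    q = qweight.view(n_groups, group_size, -1).to(torch.float16)
    s = scales.unsqueeze(1)                      # broadcast over each group
    z = zeros.unsqueeze(1).to(torch.float16)
    return ((q - z) * s).view(qweight.shape)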

@hnyls2002
Collaborator Author

> After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

Yes, this PR is exactly for that.

@hnyls2002 hnyls2002 marked this pull request as ready for review February 10, 2025 11:47
@hnyls2002 hnyls2002 changed the title Fix deepseek awq v3 [DO NOT MERGE] Fix deepseek awq v3 Feb 10, 2025
@hnyls2002 hnyls2002 changed the title [DO NOT MERGE] Fix deepseek awq v3 Fix deepseek awq v3 Feb 10, 2025
@pachinko

> > After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?
>
> Yes, this PR is exactly for that.

I still have a problem. I am running cognitivecomputations/DeepSeek-V3-AWQ:

[2025-02-11 14:42:20 TP6] Scheduler hit an exception: Traceback (most recent call last):
  File "/WORK/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_executor/model_runner.py", line 186, in __init__
    self.load_model()
  File "/WORK/sglang/python/sglang/srt/model_executor/model_runner.py", line 307, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/WORK/sglang/python/sglang/srt/models/deepseek_v2.py", line 924, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'model.layers.6.mlp.experts.w2_weight'

[2025-02-11 14:42:20] Received sigquit from a child proces. It usually means the child failed.
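
For context, load_weights remaps per-expert checkpoint tensors (e.g. experts.6.down_proj) onto the stacked FusedMoE parameters and then looks the remapped name up in params_dict; a KeyError on experts.w2_weight suggests the AWQ-quantized MoE module registers its parameters under different names (e.g. w2_qweight / w2_scales) than the unquantized mapping produces. A simplified sketch of that remapping pattern (names are illustrative, not the exact deepseek_v2.py code):

# Simplified sketch of the fused-MoE expert-weight remapping in load_weights.
# Names are illustrative; the real logic lives in sglang's deepseek_v2.py.
def load_expert_weights(model, weights, num_experts):
    params_dict = dict(model.named_parameters())
    # Map per-expert checkpoint names onto the stacked fused parameter.
    expert_mapping = [
        (f"experts.{i}.down_proj.weight", "experts.w2_weight", i)
        for i in range(num_experts)
    ]
    for name, loaded_weight in weights:
        for ckpt_name, fused_name, expert_id in expert_mapping:
            if ckpt_name not in name:
                continue
            fused = name.replace(ckpt_name, fused_name)
            # This lookup is what raises KeyError when the quantized module
            # registered its parameters under different names.
            param = params_dict[fused]
            param.weight_loader(param, loaded_weight, expert_id=expert_id)
            break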

@halexan

halexan commented Feb 11, 2025

@pachinko

What is your launch command?

@pachinko

@halexan

python3 -m sglang.launch_server \
    --model-path /home/model/DeepSeek-R1 \
    --tp 8 \
    --dist-init-addr 10.10.0.1:6000 \
    --nnodes 1 \
    --node-rank 0 \
    --trust-remote-code \
    --disable-radix-cache  \
    --disable-outlines-disk-cache \
    --host 0.0.0.0 \
    --port 40000

@halexan

halexan commented Feb 11, 2025

> We should also introduce a Triton fused MoE kernel like moe_wna16. The AWQ Marlin kernel may only get around 10 tokens/s on 8*A100.

So, does this PR still use the AWQ Marlin kernel?

@pachinko

@halexan

python3 -m sglang.launch_server \
    --model-path /home/model/DeepSeek-R1 \
    --tp 8 \
    --dist-init-addr 10.10.0.1:6000 \
    --nnodes 1 \
    --node-rank 0 \
    --trust-remote-code \
    --disable-radix-cache  \
    --disable-outlines-disk-cache \
    --host 0.0.0.0 \
    --port 40000

I replaced the config.json with the awq version.

@hnyls2002
Collaborator Author

hnyls2002 commented Feb 11, 2025

> @halexan
>
> python3 -m sglang.launch_server \
>     --model-path /home/model/DeepSeek-R1 \
>     --tp 8 \
>     --dist-init-addr 10.10.0.1:6000 \
>     --nnodes 1 \
>     --node-rank 0 \
>     --trust-remote-code \
>     --disable-radix-cache \
>     --disable-outlines-disk-cache \
>     --host 0.0.0.0 \
>     --port 40000
>
> I replaced the config.json with the awq version.

R1 and MLA are not supported for now, due to some unknown accuracy issues. You can use V3-AWQ with this command:

python -m sglang.launch_server --model-path cognitivecomputations/DeepSeek-V3-AWQ --tp-size 8 --trust-remote --disable-mla

@chenchunhui97

> After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

I succeeded in deploying the model on 8*A800 by building a Docker image from the fix-dpsk-v3-awq branch.

@Xu-Chen
Contributor

Xu-Chen commented Feb 12, 2025

Could you share some benchmarks?

@Zachary-ai-engineer

We tested V3 AWQ on the latest code and found that metrics such as TPOT (time per output token) were relatively poor. How should we solve this problem?
[benchmark screenshot]

@halexan

halexan commented Feb 12, 2025

How about the benchmarks, @chenchunhui97?

Collaborator

@zhyncs zhyncs left a comment


This fix is a bit tricky; I'll merge it first to unblock AWQ usage. Refactoring is on its way.

@zhyncs zhyncs merged commit 8616357 into main Feb 12, 2025
21 checks passed
@zhyncs zhyncs deleted the fix-dpsk-v3-awq branch February 12, 2025 14:09
@luweizheng

My launch script on 8*A800 80G. This model has been successfully deployed with vLLM with a smaller context length, but it seems vLLM does not optimize MLA well yet.

python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1-awq/DeepSeek-R1-awq --tp 8 --host 0.0.0.0 --port 11434 --trust-remote-code

Error:

File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 962, in load_weights
    w = ops.awq_dequantize(
        ^^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/vllm/_custom_ops.py", line 222, in awq_dequantize
    return torch.ops._C.awq_dequantize(qweight, scales, zeros, split_k_iters,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type Half but found BFloat16

@chenchunhui97 @zhyncs Any suggestions?
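
One likely cause: the traceback shows vLLM's awq_dequantize op expects float16 ("Half") inputs, while the R1 checkpoint is loaded in bfloat16. Two possible workarounds, assuming nothing downstream requires bfloat16: launch with --dtype float16, or round-trip the tensors through half at the call site (illustrative sketch, untested):

import torch
from vllm import _custom_ops as ops

def awq_dequantize_any_dtype(qweight, scales, qzeros, out_dtype=torch.bfloat16):
    # The custom op only accepts float16 scales, so cast down for the call
    # and cast the dequantized weight back to the model dtype afterwards.
    w = ops.awq_dequantize(qweight, scales.to(torch.float16), qzeros, 0, 0, 0)
    return w.to(out_dtype)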

@zjp-shadow zjp-shadow mentioned this pull request Feb 22, 2025