Support for DeepseekV3 680B #117
Comments
+1 |
Does ktransformers depend on HF transformers support for a model arch? If so, we are going to have to wait until HF transformers supports DeepSeek-V3, which it does not yet, and I don't see a PR from the DeepSeek team yet. |
The ktransformers backend is an old commit of llama.cpp, iirc. Edit: it is 6 months old: https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f |
Is it just me or is this not promising... I mean, I'm patient, but this means that transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation... |
I heard that it's not hard to support V3 on llama.cpp due to its resemblance to V2. |
Sadly this is not the case, in two ways. Firstly, V3 as described in the technical report is far, far more complex than V2. Secondly, V2 never actually got a native HF transformers implementation, sadly, only a stale draft PR. |
Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-gen MoE router. |
What they do is use remote code. There is no native HF transformers implementation. Feel free to find it if you think I am incorrect. If people wish to use this for enterprise deployment, letting a model repo run what is functionally arbitrary code on your servers is 100% a no-go, as it is for my startup. |
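To make the distinction concrete, here is a minimal sketch of what loading DeepSeek-V2 through HF transformers currently involves. This is illustrative only: the model ID is the public Hub repo, and actually loading the full weights is impractical on most machines.

```python
# Minimal sketch: because there is no native modeling class inside the
# transformers library for DeepSeek-V2, loading it requires executing the
# custom modeling code shipped in the Hub repo via trust_remote_code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2"  # illustrative; the full weights are enormous

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # executes the repo's custom modeling code locally
)

# With a native transformers implementation, trust_remote_code would be
# unnecessary and no repo-supplied Python would run on the host.
```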
+1 |
Supporting this seems easy; however, it requires approximately 400GB of RAM even for Q4_K_M? |
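For reference, a rough sketch of that estimate in Python. The ~671B total parameter count and ~4.85 effective bits per weight for Q4_K_M are commonly cited figures, not numbers from this thread:

```python
# Back-of-the-napkin weight-memory estimate for DeepSeek-V3 at Q4_K_M.
# Assumed figures (not from this thread): ~671B total parameters and
# roughly 4.85 effective bits per weight for a Q4_K_M quant.
total_params = 671e9        # total parameters, all MoE experts included
bits_per_weight = 4.85      # approximate effective bits/weight for Q4_K_M

weight_bytes = total_params * bits_per_weight / 8
print(f"weights alone: ~{weight_bytes / 1e9:.0f} GB")  # ~407 GB

# KV cache, activations, and runtime buffers come on top of this, which is
# why estimates land in the 400-500+ GB range depending on context length.
```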
Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math says more like 5XXGB of VRAM/RAM would be needed, given context. |
Yes, we need transformers' modeling.py. |
Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or MoE) and then, on the larger MoE, just use NVMe for some of the experts? Why do we need all experts loaded into RAM at all times instead of selecting experts as necessary? Again, just throwing an idea out there given where my understanding is at. I'd like to understand better why this would not work. |
Why would you need a smaller model for speculative decoding when you can do it via MTP? |
@Azure-Tang |
For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA. |
I managed to fit the Q3_K_M quant within 96GB VRAM + 256GB RAM. Q4_K_M is out of my reach. |
Wow. Great news. What would be the minimum specs to be able to run Q4_K_M with 96 GB VRAM? Would 512 GB of RAM plus that VRAM be enough? I have that much VRAM, but I will upgrade my RAM for that. Thanks! |
I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer. |
MTP generates two tokens, and you use the second token for speculative decoding. This is called self-speculative decoding; no need for an extra model. |
Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented. |
If you read the paper, it mentions you can get a 90% hit rate with 2-token MTP and speculative decoding. |
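To illustrate the self-speculative loop described above, here is a rough Python sketch. `forward_two` is a hypothetical helper, not part of ktransformers or the DeepSeek code: one forward pass that returns, for each of the last two positions of the input, the main LM head's next-token prediction and the MTP head's draft for the token after it.

```python
def spec_decode_mtp(prompt_ids, max_new_tokens, forward_two):
    """Self-speculative decoding with a 1-token MTP draft (sketch).

    forward_two(ids) -> ((v1, d1), (v2, d2)): one forward pass returning the
    main-head prediction and the MTP draft at the last two positions of `ids`.
    """
    ids = list(prompt_ids)
    # Prime: one ordinary pass yields a first verified token and its MTP draft.
    (_, _), (committed, draft) = forward_two(ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        # A single verification pass covers both the committed token and the
        # draft, because both are appended to the input before the pass.
        (v1, d1), (v2, d2) = forward_two(ids + [committed, draft])
        ids.append(committed)
        if v1 == draft:
            # Draft accepted: two tokens committed for one forward pass, and
            # (v2, d2) already give the next verified token and its draft.
            ids.append(draft)
            committed, draft = v2, d2
        else:
            # Draft rejected: fall back to the verified token v1 and its draft.
            committed, draft = v1, d1
    return ids[: len(prompt_ids) + max_new_tokens]
```

With acceptance rates around the 90% quoted above, most iterations commit two tokens per forward pass, which is where the speedup would come from.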
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.