Support for DeepseekV3 680B #117
Comments
+1 |
Does ktransformers depend on HF transformers support for a model arch? If so, we are going to have to wait until HF transformers supports DeepSeek-V3, which it does not yet, and I don't see a PR from the DeepSeek team yet. |
The ktransformers backend is an old commit of llama.cpp, iirc. Edit: it is 6 months old: https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f |
Is it just me or is this not promising... I mean, I'm patient, but this means that transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation... |
I heard that it's not hard to support V3 on llama.cpp due to its resemblance to V2. |
Sadly this is not the case, in two ways. Firstly, V3 as described in the technical report is far, far more complex than V2. Secondly, V2 never actually got a native HF transformers implementation, sadly, only a stale draft PR. |
Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-gen MoE router. |
What they do is use remote code. There is no native HF transformers implementation. Feel free to find it if you think I am incorrect. If people wish to use this for enterprise deployment, letting a model repo run what is functionally arbitrary code on your servers is 100% a no-go, as it is for my startup. |
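To make the distinction concrete, here is a minimal sketch of what loading DeepSeek-V2 through HF transformers currently involves. This is illustrative only: the model ID is the public Hub repo, and actually loading the full weights is impractical on most machines.

```python
# Minimal sketch: because there is no native modeling class inside the
# transformers library for DeepSeek-V2, loading it requires executing the
# custom modeling code shipped in the Hub repo via trust_remote_code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2"  # illustrative; the full weights are enormous

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # executes the repo's custom modeling code locally
)

# With a native transformers implementation, trust_remote_code would be
# unnecessary and no repo-supplied Python would run on the host.
```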
+1 |
Supporting this seems easy; however, it requires approximately 400GB of RAM even for Q4_K_M? |
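For reference, a rough sketch of that estimate in Python. The ~671B total parameter count and ~4.85 effective bits per weight for Q4_K_M are commonly cited figures, not numbers from this thread:

```python
# Back-of-the-napkin weight-memory estimate for DeepSeek-V3 at Q4_K_M.
# Assumed figures (not from this thread): ~671B total parameters and
# roughly 4.85 effective bits per weight for a Q4_K_M quant.
total_params = 671e9        # total parameters, all MoE experts included
bits_per_weight = 4.85      # approximate effective bits/weight for Q4_K_M

weight_bytes = total_params * bits_per_weight / 8
print(f"weights alone: ~{weight_bytes / 1e9:.0f} GB")  # ~407 GB

# KV cache, activations, and runtime buffers come on top of this, which is
# why estimates land in the 400-500+ GB range depending on context length.
```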
Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math says more like 5XXGB of VRAM/RAM would be needed, given context. |
Yes, we need transformers' modeling.py. |
Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or MoE) and then, on the larger MoE, just use NVMe for some of the experts? Why do we need all experts loaded into RAM at all times instead of selecting experts as necessary? Again, just throwing an idea out there given where my understanding is at. I'd like to understand better why this would not work. |
Why would you need a smaller model for speculative decoding when you can do it via MTP? |
@Azure-Tang |
For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA. |
I managed to fit the Q3_K_M quant within 96GB VRAM + 256GB RAM. Q4_K_M is out of my reach. |
Wow. Great news. What would be the minimum specs to be able to run Q4_K_M with 96 GB VRAM? Would 512 GB of RAM plus that VRAM be enough? I have that much VRAM, but I will upgrade my RAM for that. Thanks! |
I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer. |
MTP generates two tokens, and you use the second token for speculative decoding. This is called self-speculative decoding; no need for an extra model. |
Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented. |
If you read the paper, it mentions you can get a 90% hit rate with 2-token MTP and speculative decoding. |
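To illustrate the self-speculative loop described above, here is a rough Python sketch. `forward_two` is a hypothetical helper, not part of ktransformers or the DeepSeek code: one forward pass that returns, for each of the last two positions of the input, the main LM head's next-token prediction and the MTP head's draft for the token after it.

```python
def spec_decode_mtp(prompt_ids, max_new_tokens, forward_two):
    """Self-speculative decoding with a 1-token MTP draft (sketch).

    forward_two(ids) -> ((v1, d1), (v2, d2)): one forward pass returning the
    main-head prediction and the MTP draft at the last two positions of `ids`.
    """
    ids = list(prompt_ids)
    # Prime: one ordinary pass yields a first verified token and its MTP draft.
    (_, _), (committed, draft) = forward_two(ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        # A single verification pass covers both the committed token and the
        # draft, because both are appended to the input before the pass.
        (v1, d1), (v2, d2) = forward_two(ids + [committed, draft])
        ids.append(committed)
        if v1 == draft:
            # Draft accepted: two tokens committed for one forward pass, and
            # (v2, d2) already give the next verified token and its draft.
            ids.append(draft)
            committed, draft = v2, d2
        else:
            # Draft rejected: fall back to the verified token v1 and its draft.
            committed, draft = v1, d1
    return ids[: len(prompt_ids) + max_new_tokens]
```

With acceptance rates around the 90% quoted above, most iterations commit two tokens per forward pass, which is where the speedup would come from.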
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.