Support for DeepseekV3 680B #117

Open · sorasoras opened this issue Dec 25, 2024 · 23 comments
@sorasoras

https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.

@fengyang95

+1

@Nottlespike

Does ktransformers depend on HF transformers support for a model arch? If so, we'll have to wait until HF transformers supports DeepSeek-V3; it doesn't yet, and I don't see a PR from the DeepSeek team.

@TyraVex

TyraVex commented Dec 27, 2024

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

@Nottlespike

> The ktransformer backend is an old commit of llama.cpp iirc
>
> Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me, or is this not promising? I'm patient, but this means transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...

@sorasoras
Author

> The ktransformer backend is an old commit of llama.cpp iirc
>
> Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f
>
> Is it just me, or is this not promising? I'm patient, but this means transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...

I heard that it's not hard to support V3 in llama.cpp due to its resemblance to V2.

@Nottlespike

> The ktransformer backend is an old commit of llama.cpp iirc
> Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f
>
> Is it just me, or is this not promising? I'm patient, but this means transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...
>
> I heard that it's not hard to support V3 in llama.cpp due to its resemblance to V2.

Sadly this is not the case, in two ways. First, V3 as described in the technical report is far more complex than V2. Second, V2 never actually got a native HF transformers implementation, only a stale draft PR.
This is an issue that shows the problem in action: huggingface/transformers#34335
This is the stale attempt at V2 integration: huggingface/transformers#31976

@sorasoras
Author

> The ktransformer backend is an old commit of llama.cpp iirc
> Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f
>
> Is it just me, or is this not promising? I'm patient, but this means transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...
>
> I heard that it's not hard to support V3 in llama.cpp due to its resemblance to V2.
>
> Sadly this is not the case, in two ways. First, V3 as described in the technical report is far more complex than V2. Second, V2 never actually got a native HF transformers implementation, only a stale draft PR. This is an issue that shows the problem in action: huggingface/transformers#34335 This is the stale attempt at V2 integration: huggingface/transformers#31976

Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-generation MoE router.

@Nottlespike

> The ktransformer backend is an old commit of llama.cpp iirc
> Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f
>
> Is it just me, or is this not promising? I'm patient, but this means transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...
>
> I heard that it's not hard to support V3 in llama.cpp due to its resemblance to V2.
>
> Sadly this is not the case, in two ways. First, V3 as described in the technical report is far more complex than V2. Second, V2 never actually got a native HF transformers implementation, only a stale draft PR. This is an issue that shows the problem in action: huggingface/transformers#34335 This is the stale attempt at V2 integration: huggingface/transformers#31976
>
> Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-generation MoE router.

What they do is use remote code (trust_remote_code); there is no native HF transformers implementation. Feel free to find it if you think I am incorrect. For anyone who wants to use this for enterprise deployment, letting a model run what is functionally arbitrary code on your servers is a 100% no-go, as it is for my startup.
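For context on what "remote code" means here: checkpoints like this ship their own modeling files on the Hub, and transformers will only download and execute them if you explicitly opt in. A minimal sketch using the standard transformers API (the model ID is the one linked above; everything else is generic):

```python
# trust_remote_code=True tells transformers to run the repo's own modeling_*.py,
# which is exactly the "functionally arbitrary code" concern raised above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Without trust_remote_code=True this fails, because no native DeepseekV3
# architecture is registered inside the transformers library itself.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```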

@mahald

mahald commented Dec 28, 2024

+1

@lzumot

lzumot commented Dec 29, 2024

+1

@Azure-Tang
Contributor

Azure-Tang commented Dec 30, 2024

Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?

@Nottlespike

> Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?

Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math suggests more like 5XX GB of VRAM/RAM would be needed, given context.
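To put rough numbers on that napkin math (assumptions, not measurements: ~671B total parameters for DeepSeek-V3 and ~4.85 effective bits per weight for llama.cpp's Q4_K_M mixed-precision quant):

```python
# Weights-only estimate; KV cache and runtime buffers come on top.
params = 671e9            # total parameters, MoE experts included (assumption)
bits_per_weight = 4.85    # approximate effective bits/weight for Q4_K_M (assumption)

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Q4_K_M weights alone: ~{weights_gb:.0f} GB")   # ~407 GB

# Long-context KV cache plus runtime buffers can add tens to a couple hundred GB,
# which is where the "5XX GB" estimate above comes from.
```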

@Azure-Tang
Contributor

> Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?
>
> Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math suggests more like 5XX GB of VRAM/RAM would be needed, given context.

Yes, we need transformers' modeling.py.

@16x3b

16x3b commented Dec 31, 2024

Just spitballing: would it be possible to do speculative decoding with a smaller model (dense or MoE), and then on the larger MoE keep some of the experts on NVMe? Why do we need all the experts loaded into RAM at all times, instead of loading experts only as they are selected?

Again, just throwing an idea out there from where my understanding is at; I'd like to understand better why this would not work.
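To make the expert-offloading idea concrete, here is a toy sketch; the file layout, sizes, and memmap approach are hypothetical illustrations of the concept, not anything ktransformers actually does:

```python
import numpy as np

HIDDEN, FFN = 1024, 4096          # toy sizes, not DeepSeek-V3's real dimensions

def load_expert(expert_id: int) -> np.ndarray:
    # np.memmap only pages in the parts that are touched, so an unused expert
    # costs almost no RAM; chosen experts are faulted in from NVMe on demand.
    return np.memmap(f"experts/expert_{expert_id}.bin", dtype=np.float16,
                     mode="r", shape=(HIDDEN, FFN))

def moe_ffn(x: np.ndarray, router_logits: np.ndarray, top_k: int = 8) -> np.ndarray:
    # The router picks the top-k experts for this token; only those files get read.
    chosen = np.argsort(router_logits)[-top_k:]
    out = np.zeros(FFN, dtype=np.float32)
    for e in chosen:
        w = load_expert(int(e))
        out += x.astype(np.float32) @ np.asarray(w, dtype=np.float32)
    return out

# The practical problem: the router picks different experts for nearly every token,
# so decoding keeps faulting cold expert weights in from NVMe and per-token latency
# becomes disk-bound, which is why engines generally keep all experts resident.
```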

@sorasoras
Author

> Just spitballing: would it be possible to do speculative decoding with a smaller model (dense or MoE), and then on the larger MoE keep some of the experts on NVMe? Why do we need all the experts loaded into RAM at all times, instead of loading experts only as they are selected?
>
> Again, just throwing an idea out there from where my understanding is at; I'd like to understand better why this would not work.

Why would you need a smaller model for speculative decoding when you can do it via MTP?

@Nottlespike

Nottlespike commented Jan 2, 2025

@Azure-Tang
We are basically done integrating DeepSeek-V3 into llama.cpp, via a very elegant solution from @fairydreaming:
ggerganov/llama.cpp#10981 (comment)
Can you see if you can try what they did?

@Nottlespike

For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA.

@ELigoP

ELigoP commented Jan 9, 2025

> Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?

I managed to fit the Q3_K_M quant within 96 GB VRAM + 256 GB RAM. Q4_K_M is out of my reach.

@hvico

hvico commented Jan 9, 2025

> Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?
>
> I managed to fit the Q3_K_M quant within 96 GB VRAM + 256 GB RAM. Q4_K_M is out of my reach.

Wow, great news. What would be the minimum specs to run Q4_K_M with 96 GB VRAM? Would 512 GB of RAM on top of that VRAM be enough? I have that much VRAM, but I would upgrade my RAM for this. Thanks!

@16x3b

16x3b commented Jan 12, 2025

> Why would you need a smaller model for speculative decoding when you can do it via MTP?

I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low; there's no need to call the large model if the small one has a confident answer.

@sorasoras
Author

> Why would you need a smaller model for speculative decoding when you can do it via MTP?
>
> I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low; there's no need to call the large model if the small one has a confident answer.

MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no extra model is needed.
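A toy sketch of that loop, with hypothetical stand-ins for the model: `forward(ctx)` here plays both the main next-token head and the MTP head. In a real engine the previous draft is verified inside the same batched forward pass that produces the next token, which is where the extra throughput comes from; the random stand-ins below only show the accept/verify bookkeeping.

```python
import random

def forward(ctx):
    # Stand-in for one forward pass: the main head's next token plus the
    # MTP head's draft for the token after that (random here, just to run).
    rng = random.Random(hash(tuple(ctx)))
    return rng.randrange(100), rng.randrange(100)

def self_speculative_generate(ctx, max_new=16):
    produced, accepted, pending_draft = 0, 0, None
    while produced < max_new:
        nxt, draft = forward(ctx)
        if pending_draft is not None and pending_draft == nxt:
            # The previous step's MTP draft matched what the model produced,
            # so in a real engine it would have been emitted "for free".
            accepted += 1
        ctx.append(nxt)
        produced += 1
        pending_draft = draft
    print(f"MTP draft acceptance: {accepted}/{produced - 1}")
    return ctx

self_speculative_generate([1, 2, 3])
```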

@16x3b

16x3b commented Jan 14, 2025

> MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no extra model is needed.

Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

@sorasoras
Author

> MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no extra model is needed.
>
> Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

If you read the paper, it mentions you can get about a 90% hit rate with 2-token MTP and speculative decoding.
I guess we are going to have 4-token MTP in the future.
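Rough throughput implication of that hit rate, treating the ~90% figure as the acceptance rate of the single MTP draft token (an assumption from the discussion above, not a benchmark):

```python
accept = 0.90                 # assumed acceptance rate of the MTP draft token
tokens_per_pass = 1 + accept  # 1 committed token + the draft when it is accepted
print(f"~{tokens_per_pass:.2f} tokens per forward pass")  # ~1.9x decoding throughput

# A hypothetical 4-token MTP with per-draft acceptance p helps only up to the first
# rejected draft, so the expected yield per pass is roughly 1 + p + p**2 + p**3.
p = 0.90
print(f"4-token MTP estimate: ~{1 + p + p**2 + p**3:.2f} tokens per pass")
```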
