DeepSeek_v3 support by srajabos · Pull Request #1735 · huggingface/optimum-habana

srajabos · 2025-01-30T02:15:24Z

What does this PR do?

Adds DeepSeek-V3 support to Optimum Habana (OH).

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

regisss · 2025-01-30T14:28:52Z

@srajabos FYI there is an open PR to add Deepseek V3 to Transformers: huggingface/transformers#35926

We won't be able to rely on the Transformers implementation before Transformers v4.49 is released, but I thought this might be interesting to you.

srajabos · 2025-01-30T14:35:30Z

@regisss, I'll keep this as a draft until it is verified with Transformers v4.49.
Thanks for the update.

anishagartia · 2025-02-05T20:41:25Z

> @srajabos FYI there is an open PR to add Deepseek V3 to Transformers: huggingface/transformers#35926
>
> We won't be able to rely on the Transformers implementation before Transformers v4.49 is released, but I thought this might be interesting to you.

The requirements.txt for Deepseek V3 (and hence R1) says the minimum required version of Transformers is 4.46.3.
OH currently uses 4.45.2 per its requirements.txt.
It could be an easier step to enable 4.46.3 on Gaudi than to wait for 4.49. Adding the model files on top of that could then work.

srajabos · 2025-02-05T21:14:44Z

@anishagartia, we are currently adding the model files and optimizing them for Gaudi. Once we have performance data, the plan is to get it merged. Thanks for the link.

srajabos · 2025-02-12T16:51:14Z

@yao-matrix @gyou2021 @IT-Forrest - kindly review the code.

ssarkar2

[explanatory] comments are just there to help follow the HPU code; no changes are required for them. Sorry for spamming comments in this category; I thought they might be useful for future readers going through the change and for others looking to port similar models.

[clarification] comments are questions from my end. Some of these are also marked [minor] if they are minor nitpicks.

ssarkar2 · 2025-02-12T20:07:56Z

+    from habana_frameworks.torch.hpex.kernels import FusedSDPA
+except ImportError:
+    print("Not using HPU fused scaled dot-product attention kernel.")
+    FusedSDPA = None


[explanatory] Import hpu fused ops
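
For readers porting similar models, the guarded import above is the usual optional-dependency pattern: try to load the fused kernel and fall back to `None` so callers can take the non-fused path. A minimal sketch of the same idea, using a hypothetical helper `optional_attr` (not part of optimum-habana) so that it also runs on machines without the Habana software stack:

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module_name.attr_name, or None if the module or attribute is missing.

    Hypothetical helper for illustration only; the PR uses a plain try/except.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# Without the Habana stack installed this is None, and callers are expected
# to check for that before using the fused kernel.
FusedSDPA = optional_attr("habana_frameworks.torch.hpex.kernels", "FusedSDPA")
```

Call sites then guard on the result, as the later hunks do with `FusedSDPA is not None`.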

ssarkar2 · 2025-02-12T20:12:41Z

+
+    def forward(self, hidden_states):
+        if hidden_states.device.type == "hpu" and FusedRMSNorm:
+            # mixed dtypes are not good for FusedRMSNorm, both inputs need to have same dtype


[explanatory] use fused ops
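
As a reference point for what the non-fused fallback has to compute, here is a pure-Python sketch of RMSNorm over a single vector. It assumes the fused kernel computes the same quantity; the function name and the default `eps` are illustrative and not taken from the PR.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# With unit weights the output has a mean square of roughly 1.
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
```

In the real module both `x` and `weight` are tensors, which is where the same-dtype requirement mentioned in the diff comment comes in.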

ssarkar2 · 2025-02-12T20:16:07Z

+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+        # Build here to make `torch.jit.trace` work.
+        self.max_seq_len_cached = max_position_embeddings


[explanatory] Make it static (max_position_embeddings) instead of updating it based on the longest seq_len seen so far, i.e. the "seq_len > self.max_seq_len_cached" check.
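
To illustrate the static-cache idea, the sketch below builds the cos/sin tables once for every position up to `max_position_embeddings`, so nothing has to be rebuilt when a longer sequence shows up. It is plain RoPE in pure Python; any frequency scaling the real model applies is deliberately left out, and the function name and `base` default are assumptions for illustration.

```python
import math


def build_rope_cache(dim, max_position_embeddings, base=10000.0):
    """Precompute cos/sin tables for positions 0..max_position_embeddings-1."""
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    cos, sin = [], []
    for pos in range(max_position_embeddings):
        freqs = [pos * f for f in inv_freq]
        emb = freqs + freqs  # same layout as concatenating freqs with itself
        cos.append([math.cos(a) for a in emb])
        sin.append([math.sin(a) for a in emb])
    return cos, sin


# Built once; later lookups simply index by position.
cos_cached, sin_cached = build_rope_cache(dim=4, max_position_embeddings=8)
```

A fixed-size table keeps tensor shapes constant across calls, which is the property the comment about `torch.jit.trace` in the diff is after.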

ssarkar2 · 2025-02-12T20:17:26Z

+
+def apply_customized_rope(q, k, cos, sin, position_ids):
+    if q.device.type == "hpu" and FusedRoPE:
+        return FusedRoPE.apply(


[explanatory] fused hpu op

[clarification][minor] Could we call apply_customized_rope here?

ssarkar2 · 2025-02-12T20:38:08Z

+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.v_head_dim).transpose(1, 2).contiguous()
+
+    def split_kv_b_proj(self):


[clarification] this is present only in deepseek attention (v2/v3). Can we add some comment about this?

ssarkar2 · 2025-02-12T20:38:59Z

+        self.q_absorb = kv_b_proj_weight[:, : self.qk_nope_head_dim, :].unsqueeze(0).transpose(0, 1)
+        self.out_absorb = kv_b_proj_weight[:, self.qk_nope_head_dim :, :].unsqueeze(0)
+
+    def compress_kv(


[clarification] This is present only in DeepSeek attention (v2/v3). Can we add a comment about this? In the original DeepSeek code this is not a function. Is there a particular reason for turning it into one? I just want to clarify whether this is a stylistic choice or whether there is a functional reason.
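
For future readers: judging by the names `q_absorb` and `out_absorb`, this looks like the weight-absorption trick used with DeepSeek's compressed KV cache, though that reading is an assumption and not something this thread confirms. The idea is that the key up-projection can be folded into the query, so attention scores are computed directly against the compressed latent. A toy pure-Python sketch of the identity, with made-up shapes and values:

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy sizes: head dim 2, compressed-KV (latent) dim 3. Values are arbitrary.
w_uk = [[1.0, 2.0, 0.0],
        [0.0, 1.0, 3.0]]   # up-projection: latent -> key
q = [0.5, -1.0]            # query for one head
c_kv = [2.0, 1.0, 4.0]     # compressed KV entry for one token

# Naive path: decompress the key first, then take the dot product.
score_naive = dot(q, matvec(w_uk, c_kv))

# Absorbed path: fold the up-projection into the query instead.
q_absorbed = matvec(transpose(w_uk), q)
score_absorbed = dot(q_absorbed, c_kv)
```

Both paths give the same score, but the absorbed one never materializes the full key, so only the compressed latent needs to be cached.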

ssarkar2 · 2025-02-12T20:45:50Z

+                    key_states, value_states, self.layer_idx, cache_kwargs
+                )
+            # optimization
+            if use_flash_attention and FusedSDPA is not None:


[explanatory] hpu specific, similar to other modelling files in OH

ssarkar2 · 2025-02-12T20:48:53Z

+
+        past_key_values_length = 0
+        if past_key_values is not None:
+            past_key_values_length = past_key_values[0][0].shape[2]


[explanatory] hpu kv cache management, similar to other OH models
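
The hunk above reads the cached length from dimension 2 of the first layer's key tensor, which assumes a (batch, heads, seq_len, head_dim) layout. A small sketch of how that length feeds into position ids for the next step; `FakeTensor` is a stand-in for `torch.Tensor` and both helper names are hypothetical:

```python
from collections import namedtuple

# Stand-in for a tensor: only the .shape attribute is needed here.
FakeTensor = namedtuple("FakeTensor", ["shape"])


def get_past_length(past_key_values):
    """Number of tokens already in the cache, or 0 when there is no cache."""
    if past_key_values is None:
        return 0
    # Layout assumed: (batch, num_heads, seq_len, head_dim).
    return past_key_values[0][0].shape[2]


def make_position_ids(past_length, seq_length):
    """Positions of the new tokens, continuing after the cached ones."""
    return list(range(past_length, past_length + seq_length))


key = FakeTensor(shape=(1, 8, 5, 64))
value = FakeTensor(shape=(1, 8, 5, 64))
cache = ((key, value),)  # one layer, five cached tokens

past_len = get_past_length(cache)
position_ids = make_position_ids(past_len, seq_length=1)
```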

ssarkar2

LGTM

ssarkar2 · 2025-02-14T22:39:16Z

+# Maximum number of experts supported by dynamic MoE op (mixture_of_experts)
 SLICE_MAX_EXPERT = 80
+
+# import hpu fused ops


[minor] stray comment
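
Since `SLICE_MAX_EXPERT` caps how many experts one dynamic MoE call can take, a model with more experts has to be processed in slices. The thread does not show how the PR partitions them, so the following is only one possible scheme: contiguous half-open ranges of at most 80 experts, with a shorter final slice. The function name is hypothetical.

```python
SLICE_MAX_EXPERT = 80


def expert_slices(num_experts, max_per_slice=SLICE_MAX_EXPERT):
    """Return half-open (start, end) ranges that together cover all experts."""
    return [
        (start, min(start + max_per_slice, num_experts))
        for start in range(0, num_experts, max_per_slice)
    ]


# Example with 256 experts: three full slices and one partial slice.
slices = expert_slices(256)
```

The last slice being shorter than the others is the kind of boundary that is easy to get wrong when slice sizes are assumed to be uniform.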

ssarkar2 · 2025-02-14T22:40:06Z

        # Build here to make `torch.jit.trace` work.
+
+        # make it static (max_position_embeddings) instead of updating depending on
+        # longest eq_len seen till now: seq_len > self.max_seq_len_cached


[minor] eq_len -> seq_len

Thanks @ssarkar2. Let me clean those up.

Copied from transformers v4.48.2 for DeepSeek-R1 support. Delete after upgrade transformers v4.45.2 to v4.48

regisss

Nice! I left a couple of comments

HuggingFaceDocBuilderDev · 2025-02-21T08:31:55Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss

LGTM!

srajabos requested review from bhargaveede, regisss, ssarkar2 and vivekgoe as code owners January 30, 2025 02:15

srajabos marked this pull request as draft January 30, 2025 05:39

skavulya force-pushed the DeepSeek_v3 branch from 17f62bd to b2b1715 February 11, 2025 18:14

srajabos marked this pull request as ready for review February 12, 2025 16:15

ssarkar2 reviewed Feb 12, 2025

ssarkar2 approved these changes Feb 14, 2025

skavulya force-pushed the DeepSeek_v3 branch from e22cd28 to 1ded31d February 15, 2025 16:45

srajabos and others added 15 commits February 18, 2025 11:48

Resolve rebase conflicts on DeepSeek_v3

aca9778

Update __init__.py

a8fdecd

Update __init__.py

3e7a00e

Update modeling_deepseek_v3.py

8629818

Support optimized KV cache, static MOE and expert parallelism

a161a36

Commented out attention_mask assertion for the mmlu tests

da80fba

Support optimized fusedDSPA, RoPE and RMS

34f8ff7

Commented out attention_mask assertionn

937deeb

Change references to deepseekv2 to deepseekv3

e32aadd

Override load_state_dict to support deepseek-R1

37a0431

Copied from transformers v4.48.2 for DeepSeek-R1 support. Delete after upgrade transformers v4.45.2 to v4.48

Added dynamic MoE changes

7b0b9cf

Fix multicard expert parallelism for deepseekv3

33e792c

Delete duplicate tests accidentally copied

71de78c

Refactor deepseek_v3 and add clarifying comments

1336c14

Fix edge case in expert slices in DeepSeek-V3

e02c6de

skavulya force-pushed the DeepSeek_v3 branch from 1ded31d to e02c6de February 18, 2025 20:46

skavulya and others added 3 commits February 18, 2025 13:03

Add deepseekv3 to list of models supporting reuse_cache

96a3473

Style fix in modeling_utils

5e95f58

Updated the README.md for the deepseek-r1-bf16

9bb5601

libinta reviewed Feb 19, 2025

Comment thread examples/text-generation/README.md

deepvars approved these changes Feb 19, 2025

regisss reviewed Feb 19, 2025

Comment thread optimum/habana/transformers/modeling_utils.py Outdated

Comment thread tests/test_text_generation_example.py Outdated

pallavijaini0525 and others added 3 commits February 20, 2025 00:03

Updated the README.md with hostfile reference

c66d9d9

Move load_state_dict to modeling_utils_transformers

8a1a460

Removed the deepseek tests from CI, will enable when FP8 is supported

eb03fad

libinta added the run-test label (Run CI for PRs from external contributors) Feb 20, 2025

regisss approved these changes Feb 21, 2025

regisss merged commit c5a715c into huggingface:main Feb 21, 2025
