[DSV3] Offload dequantization process to DCP QuantizedHFReader (pytorch#1804)

wwwjn · githubsgi · commit 2e8585c6dbd8 · 2025-10-15T16:14:37.000-07:00
## Benchmarking

| Step | Time | Log |
| -- | -- | -- |
| to_hf() | 0.1103s | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed to_hf conversion, generated 189 keys, duration: 0.1103s |
| Split local GroupedExperts DTensor into individual experts’ weights | 0.008s per layer per matrix (total 58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed _get_local_experts_weights for layer 6, abstract_key: model.layers.{}.mlp.experts.{}.up_proj.weight, duration: 0.0082s |
| dcp.load() (thread count = 4) | 193.20s | [trainer0\|0]:[titan] 2025-10-03 17:10:58,899 - root - INFO - dcp.load with HuggingFaceStorageReader completed in 193.20 seconds |
| from_hf() | 0.48s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,378 - root - INFO - Completed from_hf conversion, processed 189 keys, duration: 0.4787s |
| Concatenate individual experts’ weights into GroupedExperts weight | 0.01s per layer per matrix (total 58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:10:59,120 - root - INFO - Completed _concatenate_expert_weights_dtensor for layer 5, abstract_key: layers.{}.moe.experts.w2, duration: 0.0142s |
| Total | 193.87s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,458 - root - INFO - Finished loading the checkpoint in 193.87 seconds. |

## End-to-End verification for 671B model

Parallelism: FSDP=32, PP=8, 1F1B, EP=32

<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 37 PM" src="https://github.com/user-attachments/assets/6d8dab00-a188-4c57-8348-02bae1d21d03" />
<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 54 PM" src="https://github.com/user-attachments/assets/a730f71b-3dc8-45e0-8d3e-b21080884f8d" />
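As a rough cross-check of the Benchmarking numbers, the sketch below recomputes how much of the reported load time is spent in `dcp.load()`, and what the per-matrix split and concatenate timings would add up to if a single rank processed all 58 MoE layers serially. That serial assumption is not stated in the logs; with PP=8 and EP=32 each rank likely handles only a subset of layers, so the aggregated figures should be read as upper bounds. All variable names are illustrative.

```python
# Timings in seconds, copied from the Benchmarking section.
DCP_LOAD = 193.20
TOTAL_REPORTED = 193.87

# Per-layer, per-matrix costs quoted in the Benchmarking section.
SPLIT_PER_MATRIX = 0.008
CONCAT_PER_MATRIX = 0.01

MOE_LAYERS = 58
MATRICES_PER_LAYER = 3
matrix_count = MOE_LAYERS * MATRICES_PER_LAYER

# Assumption: one rank handles every MoE layer serially (upper bound).
split_upper_bound = SPLIT_PER_MATRIX * matrix_count
concat_upper_bound = CONCAT_PER_MATRIX * matrix_count

# Fraction of the end-to-end load time spent inside dcp.load().
load_share = DCP_LOAD / TOTAL_REPORTED
non_load_time = TOTAL_REPORTED - DCP_LOAD

print(f"matrices to split/concatenate: {matrix_count}")
print(f"split upper bound:  {split_upper_bound:.3f}s")
print(f"concat upper bound: {concat_upper_bound:.2f}s")
print(f"dcp.load share of total: {load_share:.2%}")
print(f"time outside dcp.load:   {non_load_time:.2f}s")
```

Under these numbers `dcp.load()` dominates the checkpoint load, while the state-dict conversions and expert weight reshaping together account for well under a second per rank.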
diff --git a/torchtitan/models/deepseek_v3/__init__.py b/torchtitan/models/deepseek_v3/__init__.py
@@ -132,7 +132,7 @@
         dim=7168,
         inter_dim=18432,
         moe_inter_dim=2048,
-        n_layers=61,
+        n_layers=4,
         n_dense_layers=3,
         n_heads=128,
         moe_args=MoEArgs(