
[Bug] deepseek v3 deployment on h200 #3049

Open · zhyncs opened this issue Jan 18, 2025 · 10 comments

zhyncs (Collaborator) commented Jan 18, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

pip3 install lmdeploy==0.7.0
lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8 --backend pytorch
LICENSE-MODEL: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 13.8k/13.8k [00:00<00:00, 48.1MB/s]
.gitattributes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 4.03MB/s]
README_WEIGHTS.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3.65k/3.65k [00:00<00:00, 16.8MB/s]
LICENSE-CODE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.06k/1.06k [00:00<00:00, 4.70MB/s]
inference/configs/config_16B.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 417/417 [00:00<00:00, 1.70MB/s]
README.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 22.6k/22.6k [00:00<00:00, 12.8MB/s]
figures/benchmark.png: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 184k/184k [00:00<00:00, 1.62MB/s]
figures/niah.png: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 108k/108k [00:00<00:00, 1.32MB/s]
inference/configs/config_236B.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 455/455 [00:00<00:00, 2.02MB/s]
inference/configs/config_671B.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.26MB/s]
inference/convert.py: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3.25k/3.25k [00:00<00:00, 14.4MB/s]
inference/fp8_cast_bf16.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 3.24k/3.24k [00:00<00:00, 14.3MB/s]
inference/generate.py: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 5.39k/5.39k [00:00<00:00, 21.6MB/s]
inference/kernel.py: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.33k/4.33k [00:00<00:00, 18.2MB/s]
inference/requirements.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 66.0/66.0 [00:00<00:00, 293kB/s]
inference/model.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 17.6k/17.6k [00:00<00:00, 50.4MB/s]
modeling_deepseek.py: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 75.8k/75.8k [00:00<00:00, 684kB/s]
Fetching 185 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 185/185 [00:01<00:00, 184.46it/s]

The server is stuck here after fetching the files.
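
One way to separate download time from actual weight-loading time is to pre-fetch the checkpoint and launch the server from a local path. A minimal sketch, assuming the huggingface_hub CLI is installed; the local directory is illustrative:

pip3 install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /data/DeepSeek-V3
# Serving from the local copy removes download time from the measurement
lmdeploy serve api_server /data/DeepSeek-V3 --tp 8 --backend pytorch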

Reproduction

As mentioned above: install lmdeploy 0.7.0 and launch the api_server command shown in the bug description.

Environment

sys.platform: linux
Python: 3.10.16 (main, Dec  4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.5.1+cu124
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.20.1+cu124
LMDeploy: 0.7.0+c2f212d
transformers: 4.48.0
gradio: Not Found
fastapi: 0.115.6
pydantic: 2.10.5
triton: 3.1.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	SYS	0-95	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	SYS	0-95	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	0-95	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	0-95	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	NODE	96-191	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	NODE	96-191	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	PHB	96-191	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	NODE	96-191	1		N/A
NIC0	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS
NIC1	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS	SYS
NIC2	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	SYS	SYS	SYS	SYS	SYS
NIC3	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS	SYS
NIC4	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	NODE
NIC5	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	PHB
NIC6	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE
NIC7	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	NODE
NIC8	SYS	SYS	SYS	SYS	NODE	NODE	PHB	NODE	SYS	SYS	SYS	SYS	NODE	PHB	NODE	NODE	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_8
  NIC7: mlx5_9
  NIC8: mlx5_bond_0

Error traceback

(none provided)
zhyncs (Collaborator, Author) commented Jan 18, 2025

lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8 --backend pytorch --log-level INFO

zhyncs (Collaborator, Author) commented Jan 18, 2025

[screenshot: server console still showing the weight-loading progress]

15 minutes have passed and it still shows loading.

grimoire (Collaborator) commented

Errr, it is just desperately slow. We have not performed much optimization on weight loading yet.
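
To see how much of that time is raw disk I/O rather than lmdeploy itself, one rough check is to time a plain read of the downloaded shards. A sketch, assuming the default Hugging Face cache location:

# DeepSeek-V3 is roughly 700 GB of safetensors, so even at ~1 GB/s
# a cold read of the shards alone takes on the order of 10 minutes.
cd ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/*/
time cat *.safetensors > /dev/null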

grimoire linked a pull request Jan 20, 2025 that will close this issue
RunningLeon (Collaborator) commented

Hi @zhyncs, can you try this PR: #2886?

zhyncs (Collaborator, Author) commented Jan 20, 2025

Hi @zhyncs, can you try this PR: #2886?

I'll try it today. Thanks!

zhyncs (Collaborator, Author) commented Jan 21, 2025

I did a test, and it still takes over 20 minutes to finish loading. Is this within expectations?

RunningLeon (Collaborator) commented

I did a test, and it still takes over 20 minutes to finish loading. Is this within expectations?

Hi @zhyncs, we don't have DeepSeek-V3, so we tested on DeepSeek-V2-Chat with tp=8; loading time dropped from 15 min to 8 min.

zhyncs (Collaborator, Author) commented Jan 22, 2025

Hi @RunningLeon, I've granted @grimoire access to the H200. Could you @grimoire please help verify? Thanks!

grimoire (Collaborator) commented

Is the model placed on NFS? The uncached first-time load could be slow.
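
A rough way to check both points: confirm which filesystem the weights live on, then compare a cold read against a warm one served from the page cache. Paths are illustrative:

df -hT ~/.cache/huggingface                           # filesystem type: nfs vs. local ext4/xfs
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop the page cache (needs root)
time cat model-00001-of-*.safetensors > /dev/null     # cold read, straight from disk
time cat model-00001-of-*.safetensors > /dev/null     # warm read, from page cache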

zhyncs (Collaborator, Author) commented Jan 27, 2025

Is the model placed on NFS? The uncached first-time load could be slow.

@grimoire I think it's placed on SSD.
