I'm trying to run Mixtral (Mixtral Hermes) on two 48GB GPUs, but it seems that the sglang server is not using my second GPU.
CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000
errors out with:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
model_client = ModelRpcClient(server_args, port_args)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 448, in __init__
self.model_server.exposed_init_model(0, server_args, port_args)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 54, in exposed_init_model
self.model_runner = ModelRunner(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 229, in __init__
self.load_model()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 272, in load_model
model = model_class(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 322, in __init__
self.model = MixtralModel(config, linear_method)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 285, in __init__
[
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 286, in <listcomp>
MixtralDecoderLayer(config, i, linear_method=linear_method)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 238, in __init__
self.block_sparse_moe = MixtralMoE(config=config, linear_method=linear_method)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 99, in __init__
[
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 100, in <listcomp>
MixtralMLP(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 55, in __init__
self.w2 = ReplicatedLinear(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 108, in __init__
self.linear_weights = self.linear_method.create_weights(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 55, in create_weights
weight = Parameter(torch.empty(output_size_per_partition,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 77, in __torch_function__
return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 9.38 MiB is free. Process 2776126 has 44.33 GiB memory in use. Of the allocated memory 44.02 GiB is allocated by PyTorch, and 14.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
detoken init state: init ok
I had the same error on Modal with 2 x 80GB GPUs.
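Do I need to pass a tensor-parallel flag explicitly? As far as I understand, CUDA_VISIBLE_DEVICES only controls which devices are visible, not how the model is partitioned, and Mixtral 8x7B's ~47B parameters come to roughly 90 GB of weights at fp16, which cannot fit on a single 44 GiB card. So my guess is the launch command needs something like the following (assuming --tp-size is the right flag name for the tensor-parallel degree):
CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000 --tp-size 2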
Thanks for the support!