Mixtral OutOfMemoryError with 2 GPUs #51

@thomasgauthier

Description

I'm trying to run Mixtral (Mixtral Hermes) on two 48GB GPUs, but it seems the sglang server is not using my second GPU.

CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000

errors out with

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 448, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 54, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 229, in __init__
    self.load_model()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 272, in load_model
    model = model_class(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 322, in __init__
    self.model = MixtralModel(config, linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 285, in __init__
    [
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 286, in <listcomp>
    MixtralDecoderLayer(config, i, linear_method=linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 238, in __init__
    self.block_sparse_moe = MixtralMoE(config=config, linear_method=linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 99, in __init__
    [
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 100, in <listcomp>
    MixtralMLP(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 55, in __init__
    self.w2 = ReplicatedLinear(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 108, in __init__
    self.linear_weights = self.linear_method.create_weights(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 55, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 9.38 MiB is free. Process 2776126 has 44.33 GiB memory in use. Of the allocated memory 44.02 GiB is allocated by PyTorch, and 14.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

detoken init state: init ok
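
The error message suggests tuning the allocator, but since GPU 0 is completely full while GPU 1 stays idle, I suspect fragmentation isn't the real problem. For completeness, this is the workaround hinted at by the message that I could try (the 128 value is an arbitrary guess):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000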

I had the same error on Modal with 2 x 80GB GPUs.
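
Does the model need to be sharded across GPUs explicitly? My guess, assuming --tp-size is the tensor-parallelism flag (please correct me if the option is named differently), would be something like:

CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000 --tp-size 2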

Thanks for the support!
