Mixtral OutOfMemoryError with 2 GPUs #51

@thomasgauthier

Description

I'm trying to run Mixtral (Mixtral Hermes) on two 48GB GPUs, but it seems the sglang server is not using my second GPU.

CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000

errors out with

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 448, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_rpc.py", line 54, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 229, in __init__
    self.load_model()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/router/model_runner.py", line 272, in load_model
    model = model_class(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 322, in __init__
    self.model = MixtralModel(config, linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 285, in __init__
    [
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 286, in <listcomp>
    MixtralDecoderLayer(config, i, linear_method=linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 238, in __init__
    self.block_sparse_moe = MixtralMoE(config=config, linear_method=linear_method)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 99, in __init__
    [
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 100, in <listcomp>
    MixtralMLP(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/mixtral.py", line 55, in __init__
    self.w2 = ReplicatedLinear(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 108, in __init__
    self.linear_weights = self.linear_method.create_weights(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 55, in create_weights
    weight = Parameter(torch.empty(output_size_per_partition,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 9.38 MiB is free. Process 2776126 has 44.33 GiB memory in use. Of the allocated memory 44.02 GiB is allocated by PyTorch, and 14.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

detoken init state: init ok
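
The error message suggests tuning the allocator, but since GPU 0 is completely full while GPU 1 stays idle, I suspect fragmentation isn't the real problem. For completeness, this is the workaround hinted at by the message that I could try (the 128 value is an arbitrary guess):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000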

I had the same error on Modal with 2 x 80GB GPUs.
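
Does the model need to be sharded across GPUs explicitly? My guess, assuming --tp-size is the tensor-parallelism flag (please correct me if the option is named differently), would be something like:

CUDA_VISIBLE_DEVICES="0,1" python -m sglang.launch_server --model-path /workspace/model --port 30000 --tp-size 2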

Thanks for the support!
