
Failed to run LLM model on multiple GPUs in a single node. #4051

@harryzwh

Description

System Info

Failed to launch Qwen3-Thinking-fp8-235b-fp8 by running
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env

The log is below

WARNING 09-10 05:06:49 [init.py:2662] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized

INFO 09-10 05:06:53 [init.py:244] Automatically detected platform cuda.

INFO 09-10 05:06:54 [core.py:526] Waiting for init message from front-end.

INFO 09-10 05:06:54 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', speculative_config=None, tokenizer='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}

2025-09-10 05:06:57,844 INFO worker.py:1951 -- Started a local Ray instance.

INFO 09-10 05:07:01 [ray_utils.py:334] No current placement group found. Creating a new placement group.

WARNING 09-10 05:07:01 [ray_utils.py:341] The number of required GPUs exceeds the total number of available GPUs in the placement group.

INFO 09-10 05:07:11 [ray_utils.py:232] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check ray status and ray list nodes to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.

INFO 09-10 05:07:31 [ray_utils.py:232] Waiting for creating a placement group of specs for 30 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check ray status and ray list nodes to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.

Running ray status produces the output below; it appears the auto-started Ray cluster cannot register 4 GPUs, even though there are 8 GPUs in this node (a small visibility check is sketched after the output).

======== Autoscaler status: 2025-09-10 05:10:31.658230 ========
Node status

Active:
1 node_3a77e4b8141c8eb5cb20153bcb498e46d8a7370adb7dc9a2bd1319df
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
0.0/192.0 CPU
0.0/1.0 GPU
0B/1.78TiB memory
0B/186.26GiB object_store_memory

Total Constraints:
(no request_resources() constraints)
Total Demands:
{'node:172.18.0.2': 0.001, 'GPU': 1.0} * 1, {'GPU': 1.0} * 3 (PACK): 1+ pending placement groups
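
For reference, here is a minimal diagnostic sketch (my own, not part of Xinference or vLLM) that compares what CUDA sees in this environment with what Ray auto-detects; it only uses standard torch/ray calls:

import os
import ray
import torch

# What the CUDA runtime sees in this container
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())

# Start (or attach to) a local Ray instance, as the vLLM worker does implicitly,
# and dump the resources it registered
ray.init(ignore_reinit_error=True)
print("ray.cluster_resources() =", ray.cluster_resources())

On an 8-GPU node I would expect torch to report 8 devices, while the ray status output above shows the auto-started instance registering only 1.0 GPU.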

I also tried starting a Ray cluster with 4 GPUs by simply running 'ray start --head --num-gpus=4' and then 'ray status'; the output below indicates that 4 GPUs were added to the cluster (a placement-group check against this cluster is sketched after the output).

======== Autoscaler status: 2025-09-10 05:12:05.000345 ========
Node status

Active:
1 node_31f8b7a96a0e8c20d53025c2d0b967aaaa15654fa5ab83dce4aa4de6
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
0.0/192.0 CPU
0.0/4.0 GPU
0B/1.77TiB memory
0B/186.26GiB object_store_memory

Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
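
As a further check, here is a minimal sketch (my own diagnostic, assuming it is run in the same container right after the ray start command above) that requests the same 4 x 1-GPU PACK placement group that vLLM's ray_utils is waiting for:

import ray
from ray.util.placement_group import placement_group, placement_group_table

# Attach to the manually started head node
ray.init(address="auto")

# Same shape of request as in the vLLM log: four 1-GPU bundles, PACK strategy
# (the real log additionally pins the first bundle to node:172.18.0.2)
pg = placement_group(bundles=[{"GPU": 1.0}] * 4, strategy="PACK")
ray.get(pg.ready(), timeout=30)  # raises GetTimeoutError if it cannot be scheduled
print(placement_group_table(pg))

If this succeeds against the manually started 4-GPU cluster but the Xinference launch still hangs, the problem would seem to be in how the Ray instance is auto-started inside the vLLM worker rather than in the placement group request itself.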

Also, a small model that fits on a single GPU can be launched without any issue.

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

Package Version


accelerate 1.10.1
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aioprometheus 23.12.0
aiosignal 1.4.0
airportsdata 20250811
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.10.0
astor 0.8.1
asttokens 3.0.0
async-timeout 5.0.1
attrs 25.3.0
audioread 3.0.1
autoawq 0.2.9
av 15.1.0
bcrypt 4.3.0
beautifulsoup4 4.13.5
blake3 1.0.5
Brotli 1.1.0
cachetools 6.2.0
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
cloudpickle 3.1.1
coloredlogs 15.0.1
compressed-tensors 0.10.2
conformer 0.3.2
crcmod 1.7
cryptography 45.0.7
cuda-bindings 12.8.0
cuda-python 12.8.0
cupy-cuda12x 13.6.0
datasets 4.0.0
decorator 5.2.1
decord 0.6.0
depyf 0.18.0
diffusers 0.35.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
ecdsa 0.19.1
editdistance 0.8.1
einops 0.8.1
email-validator 2.3.0
executing 2.2.1
fastapi 0.116.1
fastapi-cli 0.0.10
fastapi-cloud-cli 0.1.5
fastrlock 0.8.3
ffmpy 0.6.1
filelock 3.19.1
flash_attn 2.7.4
flashinfer-python 0.3.1
flatbuffers 25.2.10
frozenlist 1.7.0
fsspec 2025.3.0
funasr 1.2.7
gdown 5.2.0
gguf 0.17.1
gptqmodel 4.1.0
gradio 5.44.1
gradio_client 1.12.1
groovy 0.1.2
h11 0.16.0
hf-xet 1.1.9
httpcore 1.0.9
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.34.4
humanfriendly 10.0
hydra-core 1.3.2
HyperPyYAML 1.2.2
idna 3.10
importlib_metadata 8.7.0
importlib_resources 6.5.2
inflect 7.5.0
interegular 0.3.3
ipython 9.5.0
ipython_pygments_lexers 1.1.1
jaconv 0.4.0
jamo 0.4.1
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.6
jiter 0.10.0
jmespath 0.10.0
joblib 1.5.2
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
kaldifst 1.7.17
kaldiio 2.18.1
lark 1.2.2
lazy_loader 0.4
librosa 0.11.0
lightning 2.5.5
lightning-utilities 0.15.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.12
markdown-it-py 4.0.0
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.8.4
modelscope 1.29.2
more-itertools 10.8.0
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.4
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.5
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.3.14
nvidia-cuda-cupti-cu12 12.8.57
nvidia-cuda-nvrtc-cu12 12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12 9.7.1.26
nvidia-cudnn-frontend 1.14.1
nvidia-cufft-cu12 11.3.3.41
nvidia-cufile-cu12 1.13.0.11
nvidia-curand-cu12 10.3.9.55
nvidia-cusolver-cu12 11.7.2.55
nvidia-cusparse-cu12 12.5.7.53
nvidia-cusparselt-cu12 0.6.3
nvidia-ml-py 13.580.65
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.8.61
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.55
omegaconf 2.3.0
onnxruntime-gpu 1.22.0
openai 1.90.0
opencv-python-headless 4.12.0.88
optimum 1.27.0
orjson 3.11.3
oss2 2.19.1
outlines 0.1.11
outlines_core 0.1.26
packaging 25.0
pandas 2.3.2
parso 0.8.5
partial-json-parser 0.2.1.1.post6
passlib 1.7.4
peft 0.17.1
pexpect 4.9.0
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pooch 1.8.2
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.52
propcache 0.3.2
protobuf 6.32.0
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 21.0.0
pyasn1 0.6.1
pybase64 1.4.2
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.23.0
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
pydub 0.25.1
Pygments 2.19.2
pynini 2.1.6
pynndescent 0.5.13
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-jose 3.5.0
python-json-logger 3.3.0
python-multipart 0.0.20
pytorch-lightning 2.5.5
pytorch-wpe 0.0.1
pytz 2025.2
pyworld 0.3.5
PyYAML 6.0.2
pyzmq 27.0.2
quantile-python 1.1
qwen-vl-utils 0.0.11
ray 2.49.1
referencing 0.36.2
regex 2025.9.1
requests 2.32.5
rich 14.1.0
rich-toolkit 0.15.1
rignore 0.6.4
rpds-py 0.27.1
rsa 4.9.1
ruamel.yaml 0.18.15
ruamel.yaml.clib 0.2.12
ruff 0.12.12
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.1
scipy 1.16.1
semantic-version 2.10.0
sentence-transformers 5.1.0
sentencepiece 0.2.1
sentry-sdk 2.37.0
setproctitle 1.3.7
setuptools 79.0.1
sgl-kernel 0.3.8
sglang 0.5.1.post3
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soupsieve 2.8
soxr 1.0.0
sse-starlette 3.0.2
stack-data 0.6.3
starlette 0.47.3
sympy 1.14.0
tabulate 0.9.0
tblib 3.1.0
tensorboardX 2.6.4
threadpoolctl 3.6.0
tiktoken 0.11.0
tokenizers 0.21.4
tomlkit 0.13.3
torch 2.7.0+cu128
torch-complex 0.4.4
torchao 0.13.0
torchaudio 2.7.0+cu128
torchmetrics 1.8.2
torchvision 0.22.0+cu128
tqdm 4.67.1
traitlets 5.14.3
transformers 4.53.3
triton 3.3.0
typeguard 4.4.4
typer 0.17.4
typing_extensions 4.15.0
typing-inspection 0.4.1
tzdata 2025.2
umap-learn 0.5.9.post2
urllib3 2.5.0
uv 0.8.15
uvicorn 0.35.0
uvloop 0.21.0
vllm 0.9.2
watchfiles 1.1.0
wcwidth 0.2.13
websockets 15.0.1
wetext 0.1.0
WeTextProcessing 1.0.4.1
xformers 0.0.30
xgrammar 0.1.19
xinference 1.9.1
xllamacpp 0.2.0
xoscar 0.7.16
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
zstandard 0.24.0

The command used to start Xinference

Run xinference-local -H 0.0.0.0 to start Xinference, and then run
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env
to launch the model.
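
To help isolate whether the failure is in Xinference or in vLLM/Ray, a minimal vLLM-only sketch with the same tensor-parallel settings could be run in the same container; the model path is the cache path from the log, the prompt and token limit are placeholders, and distributed_executor_backend="ray" is set only to force the same Ray executor seen in the failing run:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8",
    tensor_parallel_size=4,
    quantization="fp8",
    gpu_memory_utilization=0.9,
    max_model_len=32768,
    trust_remote_code=True,
    distributed_executor_backend="ray",  # same executor as in the failing log
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))

If this reproduces the placement-group wait, the issue sits below Xinference; if it does not, the way Xinference spawns the vLLM worker (e.g. CUDA being initialized before Ray starts, as the first warning in the log suggests) may be worth investigating.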

Reproduction

As above

Expected behavior

Able to launch LLM on multiple GPUs.
