Description
System Info / 系統信息
Failed to launch Qwen3-Thinking-fp8-235b-fp8 by running:
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env
The log is below:
WARNING 09-10 05:06:49 [__init__.py:2662] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 09-10 05:06:53 [__init__.py:244] Automatically detected platform cuda.
INFO 09-10 05:06:54 [core.py:526] Waiting for init message from front-end.
INFO 09-10 05:06:54 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', speculative_config=None, tokenizer='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
2025-09-10 05:06:57,844 INFO worker.py:1951 -- Started a local Ray instance.
INFO 09-10 05:07:01 [ray_utils.py:334] No current placement group found. Creating a new placement group.
WARNING 09-10 05:07:01 [ray_utils.py:341] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 09-10 05:07:11 [ray_utils.py:232] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-10 05:07:31 [ray_utils.py:232] Waiting for creating a placement group of specs for 30 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
Running `ray status` produces the output below; it seems the Ray cluster cannot initialize with 4 GPUs, even though there are 8 GPUs on the same node (see the check sketched after the output).
======== Autoscaler status: 2025-09-10 05:10:31.658230 ========
Node status
Active:
 1 node_3a77e4b8141c8eb5cb20153bcb498e46d8a7370adb7dc9a2bd1319df
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Total Usage:
 0.0/192.0 CPU
 0.0/1.0 GPU
 0B/1.78TiB memory
 0B/186.26GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 {'node:172.18.0.2': 0.001, 'GPU': 1.0} * 1, {'GPU': 1.0} * 3 (PACK): 1+ pending placement groups
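For reference, a minimal sketch (my own check, not part of the logs above) of what Ray auto-detects from the same environment Xinference runs in; if this reports 1 GPU rather than 8, the TP=4 placement group above can never be satisfied:
# Run inside the same container/environment as xinference-local.
# ray.init() with no address starts a local Ray instance, the same way vLLM does here.
import ray

ray.init()
print(ray.cluster_resources().get("GPU", 0))    # the ray status above suggests this shows 1.0, not 8.0
print(ray.available_resources().get("GPU", 0))
ray.shutdown()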
I also tried to start a Ray cluster with 4 GPUs by simply running 'ray start --head --num-gpus=4' and then 'ray status'; the output below indicates that 4 GPUs are added to the cluster.
======== Autoscaler status: 2025-09-10 05:12:05.000345 ========
Node status
Active:
 1 node_31f8b7a96a0e8c20d53025c2d0b967aaaa15654fa5ab83dce4aa4de6
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Total Usage:
 0.0/192.0 CPU
 0.0/4.0 GPU
 0B/1.77TiB memory
 0B/186.26GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
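A related sketch (again my own check, assuming Python can attach to the manually started cluster): after 'ray start --head --num-gpus=4', connecting to that already-running head should show the 4 GPUs reported above:
import ray

# address="auto" connects to the head started by 'ray start --head --num-gpus=4'
# instead of creating a new local instance.
ray.init(address="auto")
print(ray.cluster_resources())    # expected to include 'GPU': 4.0, matching the ray status output above
ray.shutdown()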
Also, a small model that fits on a single GPU can be launched without any issue.
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
- docker / docker
- pip install / 通过 pip install 安装
- installation from source / 从源码安装
Version info / 版本信息
Package Version
accelerate 1.10.1
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aioprometheus 23.12.0
aiosignal 1.4.0
airportsdata 20250811
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.10.0
astor 0.8.1
asttokens 3.0.0
async-timeout 5.0.1
attrs 25.3.0
audioread 3.0.1
autoawq 0.2.9
av 15.1.0
bcrypt 4.3.0
beautifulsoup4 4.13.5
blake3 1.0.5
Brotli 1.1.0
cachetools 6.2.0
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
cloudpickle 3.1.1
coloredlogs 15.0.1
compressed-tensors 0.10.2
conformer 0.3.2
crcmod 1.7
cryptography 45.0.7
cuda-bindings 12.8.0
cuda-python 12.8.0
cupy-cuda12x 13.6.0
datasets 4.0.0
decorator 5.2.1
decord 0.6.0
depyf 0.18.0
diffusers 0.35.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
ecdsa 0.19.1
editdistance 0.8.1
einops 0.8.1
email-validator 2.3.0
executing 2.2.1
fastapi 0.116.1
fastapi-cli 0.0.10
fastapi-cloud-cli 0.1.5
fastrlock 0.8.3
ffmpy 0.6.1
filelock 3.19.1
flash_attn 2.7.4
flashinfer-python 0.3.1
flatbuffers 25.2.10
frozenlist 1.7.0
fsspec 2025.3.0
funasr 1.2.7
gdown 5.2.0
gguf 0.17.1
gptqmodel 4.1.0
gradio 5.44.1
gradio_client 1.12.1
groovy 0.1.2
h11 0.16.0
hf-xet 1.1.9
httpcore 1.0.9
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.34.4
humanfriendly 10.0
hydra-core 1.3.2
HyperPyYAML 1.2.2
idna 3.10
importlib_metadata 8.7.0
importlib_resources 6.5.2
inflect 7.5.0
interegular 0.3.3
ipython 9.5.0
ipython_pygments_lexers 1.1.1
jaconv 0.4.0
jamo 0.4.1
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.6
jiter 0.10.0
jmespath 0.10.0
joblib 1.5.2
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
kaldifst 1.7.17
kaldiio 2.18.1
lark 1.2.2
lazy_loader 0.4
librosa 0.11.0
lightning 2.5.5
lightning-utilities 0.15.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.12
markdown-it-py 4.0.0
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.8.4
modelscope 1.29.2
more-itertools 10.8.0
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.4
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.5
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.3.14
nvidia-cuda-cupti-cu12 12.8.57
nvidia-cuda-nvrtc-cu12 12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12 9.7.1.26
nvidia-cudnn-frontend 1.14.1
nvidia-cufft-cu12 11.3.3.41
nvidia-cufile-cu12 1.13.0.11
nvidia-curand-cu12 10.3.9.55
nvidia-cusolver-cu12 11.7.2.55
nvidia-cusparse-cu12 12.5.7.53
nvidia-cusparselt-cu12 0.6.3
nvidia-ml-py 13.580.65
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.8.61
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.55
omegaconf 2.3.0
onnxruntime-gpu 1.22.0
openai 1.90.0
opencv-python-headless 4.12.0.88
optimum 1.27.0
orjson 3.11.3
oss2 2.19.1
outlines 0.1.11
outlines_core 0.1.26
packaging 25.0
pandas 2.3.2
parso 0.8.5
partial-json-parser 0.2.1.1.post6
passlib 1.7.4
peft 0.17.1
pexpect 4.9.0
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pooch 1.8.2
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.52
propcache 0.3.2
protobuf 6.32.0
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 21.0.0
pyasn1 0.6.1
pybase64 1.4.2
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.23.0
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
pydub 0.25.1
Pygments 2.19.2
pynini 2.1.6
pynndescent 0.5.13
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-jose 3.5.0
python-json-logger 3.3.0
python-multipart 0.0.20
pytorch-lightning 2.5.5
pytorch-wpe 0.0.1
pytz 2025.2
pyworld 0.3.5
PyYAML 6.0.2
pyzmq 27.0.2
quantile-python 1.1
qwen-vl-utils 0.0.11
ray 2.49.1
referencing 0.36.2
regex 2025.9.1
requests 2.32.5
rich 14.1.0
rich-toolkit 0.15.1
rignore 0.6.4
rpds-py 0.27.1
rsa 4.9.1
ruamel.yaml 0.18.15
ruamel.yaml.clib 0.2.12
ruff 0.12.12
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.1
scipy 1.16.1
semantic-version 2.10.0
sentence-transformers 5.1.0
sentencepiece 0.2.1
sentry-sdk 2.37.0
setproctitle 1.3.7
setuptools 79.0.1
sgl-kernel 0.3.8
sglang 0.5.1.post3
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soupsieve 2.8
soxr 1.0.0
sse-starlette 3.0.2
stack-data 0.6.3
starlette 0.47.3
sympy 1.14.0
tabulate 0.9.0
tblib 3.1.0
tensorboardX 2.6.4
threadpoolctl 3.6.0
tiktoken 0.11.0
tokenizers 0.21.4
tomlkit 0.13.3
torch 2.7.0+cu128
torch-complex 0.4.4
torchao 0.13.0
torchaudio 2.7.0+cu128
torchmetrics 1.8.2
torchvision 0.22.0+cu128
tqdm 4.67.1
traitlets 5.14.3
transformers 4.53.3
triton 3.3.0
typeguard 4.4.4
typer 0.17.4
typing_extensions 4.15.0
typing-inspection 0.4.1
tzdata 2025.2
umap-learn 0.5.9.post2
urllib3 2.5.0
uv 0.8.15
uvicorn 0.35.0
uvloop 0.21.0
vllm 0.9.2
watchfiles 1.1.0
wcwidth 0.2.13
websockets 15.0.1
wetext 0.1.0
WeTextProcessing 1.0.4.1
xformers 0.0.30
xgrammar 0.1.19
xinference 1.9.1
xllamacpp 0.2.0
xoscar 0.7.16
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
zstandard 0.24.0
The command used to start Xinference / 用以启动 xinference 的命令
xinference-local -H 0.0.0.0
to start Xinference, and then
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env
to launch the model.
Reproduction / 复现过程
As above
Expected behavior / 期待表现
Able to launch the LLM on multiple GPUs.