Description
System Info / 系統信息
Failed to launch Qwen3-Thinking-fp8-235b-fp8 by running:
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env
The log is below:
WARNING 09-10 05:06:49 [__init__.py:2662] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 09-10 05:06:53 [__init__.py:244] Automatically detected platform cuda.
INFO 09-10 05:06:54 [core.py:526] Waiting for init message from front-end.
INFO 09-10 05:06:54 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', speculative_config=None, tokenizer='/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/.xinference/cache/v2/Qwen3-Thinking-fp8-235b-fp8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
2025-09-10 05:06:57,844 INFO worker.py:1951 -- Started a local Ray instance.
INFO 09-10 05:07:01 [ray_utils.py:334] No current placement group found. Creating a new placement group.
WARNING 09-10 05:07:01 [ray_utils.py:341] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 09-10 05:07:11 [ray_utils.py:232] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-10 05:07:31 [ray_utils.py:232] Waiting for creating a placement group of specs for 30 seconds. specs=[{'GPU': 1.0, 'node:172.18.0.2': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
Running `ray status` produces the output below; it seems the Ray cluster cannot initialize with 4 GPUs, even though there are 8 GPUs on the same node (see the check sketched after the output).
======== Autoscaler status: 2025-09-10 05:10:31.658230 ========
Node status
Active:
 1 node_3a77e4b8141c8eb5cb20153bcb498e46d8a7370adb7dc9a2bd1319df
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Total Usage:
 0.0/192.0 CPU
 0.0/1.0 GPU
 0B/1.78TiB memory
 0B/186.26GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 {'node:172.18.0.2': 0.001, 'GPU': 1.0} * 1, {'GPU': 1.0} * 3 (PACK): 1+ pending placement groups
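For reference, a minimal sketch (my own check, not part of the logs above) of what Ray auto-detects from the same environment Xinference runs in; if this reports 1 GPU rather than 8, the TP=4 placement group above can never be satisfied:
# Run inside the same container/environment as xinference-local.
# ray.init() with no address starts a local Ray instance, the same way vLLM does here.
import ray

ray.init()
print(ray.cluster_resources().get("GPU", 0))    # the ray status above suggests this shows 1.0, not 8.0
print(ray.available_resources().get("GPU", 0))
ray.shutdown()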
I also tried to start a Ray cluster with 4 GPUs by simply running 'ray start --head --num-gpus=4' and then 'ray status'; the output below indicates that 4 GPUs are added to the cluster.
======== Autoscaler status: 2025-09-10 05:12:05.000345 ========
Node status
Active:
 1 node_31f8b7a96a0e8c20d53025c2d0b967aaaa15654fa5ab83dce4aa4de6
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Total Usage:
 0.0/192.0 CPU
 0.0/4.0 GPU
 0B/1.77TiB memory
 0B/186.26GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
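A related sketch (again my own check, assuming Python can attach to the manually started cluster): after 'ray start --head --num-gpus=4', connecting to that already-running head should show the 4 GPUs reported above:
import ray

# address="auto" connects to the head started by 'ray start --head --num-gpus=4'
# instead of creating a new local instance.
ray.init(address="auto")
print(ray.cluster_resources())    # expected to include 'GPU': 4.0, matching the ray status output above
ray.shutdown()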
Also, a small model that fits on a single GPU can be launched without any issue.
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
- docker / docker
- pip install / 通过 pip install 安装
- installation from source / 从源码安装
Version info / 版本信息
Package Version
accelerate 1.10.1
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aioprometheus 23.12.0
aiosignal 1.4.0
airportsdata 20250811
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.10.0
astor 0.8.1
asttokens 3.0.0
async-timeout 5.0.1
attrs 25.3.0
audioread 3.0.1
autoawq 0.2.9
av 15.1.0
bcrypt 4.3.0
beautifulsoup4 4.13.5
blake3 1.0.5
Brotli 1.1.0
cachetools 6.2.0
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
cloudpickle 3.1.1
coloredlogs 15.0.1
compressed-tensors 0.10.2
conformer 0.3.2
crcmod 1.7
cryptography 45.0.7
cuda-bindings 12.8.0
cuda-python 12.8.0
cupy-cuda12x 13.6.0
datasets 4.0.0
decorator 5.2.1
decord 0.6.0
depyf 0.18.0
diffusers 0.35.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
ecdsa 0.19.1
editdistance 0.8.1
einops 0.8.1
email-validator 2.3.0
executing 2.2.1
fastapi 0.116.1
fastapi-cli 0.0.10
fastapi-cloud-cli 0.1.5
fastrlock 0.8.3
ffmpy 0.6.1
filelock 3.19.1
flash_attn 2.7.4
flashinfer-python 0.3.1
flatbuffers 25.2.10
frozenlist 1.7.0
fsspec 2025.3.0
funasr 1.2.7
gdown 5.2.0
gguf 0.17.1
gptqmodel 4.1.0
gradio 5.44.1
gradio_client 1.12.1
groovy 0.1.2
h11 0.16.0
hf-xet 1.1.9
httpcore 1.0.9
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.34.4
humanfriendly 10.0
hydra-core 1.3.2
HyperPyYAML 1.2.2
idna 3.10
importlib_metadata 8.7.0
importlib_resources 6.5.2
inflect 7.5.0
interegular 0.3.3
ipython 9.5.0
ipython_pygments_lexers 1.1.1
jaconv 0.4.0
jamo 0.4.1
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.6
jiter 0.10.0
jmespath 0.10.0
joblib 1.5.2
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
kaldifst 1.7.17
kaldiio 2.18.1
lark 1.2.2
lazy_loader 0.4
librosa 0.11.0
lightning 2.5.5
lightning-utilities 0.15.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.12
markdown-it-py 4.0.0
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.8.4
modelscope 1.29.2
more-itertools 10.8.0
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.4
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.5
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.3.14
nvidia-cuda-cupti-cu12 12.8.57
nvidia-cuda-nvrtc-cu12 12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12 9.7.1.26
nvidia-cudnn-frontend 1.14.1
nvidia-cufft-cu12 11.3.3.41
nvidia-cufile-cu12 1.13.0.11
nvidia-curand-cu12 10.3.9.55
nvidia-cusolver-cu12 11.7.2.55
nvidia-cusparse-cu12 12.5.7.53
nvidia-cusparselt-cu12 0.6.3
nvidia-ml-py 13.580.65
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.8.61
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.55
omegaconf 2.3.0
onnxruntime-gpu 1.22.0
openai 1.90.0
opencv-python-headless 4.12.0.88
optimum 1.27.0
orjson 3.11.3
oss2 2.19.1
outlines 0.1.11
outlines_core 0.1.26
packaging 25.0
pandas 2.3.2
parso 0.8.5
partial-json-parser 0.2.1.1.post6
passlib 1.7.4
peft 0.17.1
pexpect 4.9.0
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pooch 1.8.2
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.52
propcache 0.3.2
protobuf 6.32.0
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 21.0.0
pyasn1 0.6.1
pybase64 1.4.2
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.23.0
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
pydub 0.25.1
Pygments 2.19.2
pynini 2.1.6
pynndescent 0.5.13
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-jose 3.5.0
python-json-logger 3.3.0
python-multipart 0.0.20
pytorch-lightning 2.5.5
pytorch-wpe 0.0.1
pytz 2025.2
pyworld 0.3.5
PyYAML 6.0.2
pyzmq 27.0.2
quantile-python 1.1
qwen-vl-utils 0.0.11
ray 2.49.1
referencing 0.36.2
regex 2025.9.1
requests 2.32.5
rich 14.1.0
rich-toolkit 0.15.1
rignore 0.6.4
rpds-py 0.27.1
rsa 4.9.1
ruamel.yaml 0.18.15
ruamel.yaml.clib 0.2.12
ruff 0.12.12
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.1
scipy 1.16.1
semantic-version 2.10.0
sentence-transformers 5.1.0
sentencepiece 0.2.1
sentry-sdk 2.37.0
setproctitle 1.3.7
setuptools 79.0.1
sgl-kernel 0.3.8
sglang 0.5.1.post3
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soupsieve 2.8
soxr 1.0.0
sse-starlette 3.0.2
stack-data 0.6.3
starlette 0.47.3
sympy 1.14.0
tabulate 0.9.0
tblib 3.1.0
tensorboardX 2.6.4
threadpoolctl 3.6.0
tiktoken 0.11.0
tokenizers 0.21.4
tomlkit 0.13.3
torch 2.7.0+cu128
torch-complex 0.4.4
torchao 0.13.0
torchaudio 2.7.0+cu128
torchmetrics 1.8.2
torchvision 0.22.0+cu128
tqdm 4.67.1
traitlets 5.14.3
transformers 4.53.3
triton 3.3.0
typeguard 4.4.4
typer 0.17.4
typing_extensions 4.15.0
typing-inspection 0.4.1
tzdata 2025.2
umap-learn 0.5.9.post2
urllib3 2.5.0
uv 0.8.15
uvicorn 0.35.0
uvloop 0.21.0
vllm 0.9.2
watchfiles 1.1.0
wcwidth 0.2.13
websockets 15.0.1
wetext 0.1.0
WeTextProcessing 1.0.4.1
xformers 0.0.30
xgrammar 0.1.19
xinference 1.9.1
xllamacpp 0.2.0
xoscar 0.7.16
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
zstandard 0.24.0
The command used to start Xinference / 用以启动 xinference 的命令
xinference-local -H 0.0.0.0
to start Xinference, and then
xinference launch --model-name Qwen3-Thinking --model-type LLM --model-engine vLLM --model-format fp8 --size-in-billions 235 --quantization fp8 --n-gpu auto --replica 1 --n-worker 1 --reasoning_content false --gpu_memory_utilization 0.9 --max_model_len 32768 --tensor_parallel_size 4 --enable_chunked_prefill true --disable-virtual-env
to launch the model.
Reproduction / 复现过程
As above
Expected behavior / 期待表现
Able to launch the LLM on multiple GPUs.