Your current environment
Using the ubuntu:22.04 Docker image and running inside the container:
ENV PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV VLLM_TARGET_DEVICE="cpu"
RUN pip install -v git+https://github.com/vllm-project/[email protected]
RUN pip install intel_extension_for_pytorch==2.7.0

🐛 Describe the bug
Running model casperhansen/llama-3-8b-instruct-awq with params: --model casperhansen/llama-3-8b-instruct-awq --device cpu --tensor-parallel-size 1 --pipeline-parallel-size 1 --dtype bfloat16 --max-num-seqs 256 --max-model-len 4096 --download_dir /data --host 0.0.0.0 --port 80
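For reference, a sketch of the full launch command inside the container; the python -m vllm.entrypoints.openai.api_server entrypoint is inferred from the traceback below, and the flags are exactly the ones listed above:

# sketch: entrypoint inferred from the traceback, flags copied from the report above
python3 -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-8b-instruct-awq \
  --device cpu \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --dtype bfloat16 \
  --max-num-seqs 256 \
  --max-model-len 4096 \
  --download_dir /data \
  --host 0.0.0.0 \
  --port 80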
On version 0.9.0 this returns the following error:
llm-vllm-model-server | [W527 10:31:54.634480995 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
llm-vllm-model-server | Overriding a previously registered kernel for the same operator and the same dispatch key
llm-vllm-model-server | operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
llm-vllm-model-server | registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
llm-vllm-model-server | dispatch key: AutocastCPU
llm-vllm-model-server | previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
llm-vllm-model-server | new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
llm-vllm-model-server | INFO 05-27 10:31:55 [__init__.py:243] Automatically detected platform cpu.
llm-vllm-model-server | INFO 05-27 10:31:57 [__init__.py:31] Available plugins for group vllm.general_plugins:
llm-vllm-model-server | INFO 05-27 10:31:57 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
llm-vllm-model-server | INFO 05-27 10:31:57 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
llm-vllm-model-server | INFO 05-27 10:31:58 [config.py:1903] Disabled the custom all-reduce kernel because it is not supported on current platform.
llm-vllm-model-server | WARNING 05-27 10:31:58 [utils.py:1366] argument 'device' is deprecated
llm-vllm-model-server | INFO 05-27 10:31:58 [api_server.py:1289] vLLM API server version 0.9.0
llm-vllm-model-server | INFO 05-27 10:31:58 [config.py:1903] Disabled the custom all-reduce kernel because it is not supported on current platform.
llm-vllm-model-server | INFO 05-27 10:31:58 [cli_args.py:300] non-default args: {'host': '0.0.0.0', 'port': 80, 'model': 'casperhansen/llama-3-8b-instruct-awq', 'dtype': 'bfloat16', 'max_model_len': 4096, 'download_dir': '/data', 'device': 'cpu', 'max_num_seqs': 256}
llm-vllm-model-server | WARNING 05-27 10:31:59 [_logger.py:72] Casting torch.float16 to torch.bfloat16.
llm-vllm-model-server | INFO 05-27 10:32:05 [config.py:788] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
llm-vllm-model-server | Traceback (most recent call last):
llm-vllm-model-server | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
llm-vllm-model-server | return _run_code(code, main_globals, None,
llm-vllm-model-server | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
llm-vllm-model-server | exec(code, run_globals)
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1376, in <module>
llm-vllm-model-server | uvloop.run(run_server(args))
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
llm-vllm-model-server | return loop.run_until_complete(wrapper())
llm-vllm-model-server | File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
llm-vllm-model-server | return await main
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1324, in run_server
llm-vllm-model-server | async with build_async_engine_client(args) as engine_client:
llm-vllm-model-server | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
llm-vllm-model-server | return await anext(self.gen)
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 153, in build_async_engine_client
llm-vllm-model-server | async with build_async_engine_client_from_engine_args(
llm-vllm-model-server | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
llm-vllm-model-server | return await anext(self.gen)
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client_from_engine_args
llm-vllm-model-server | vllm_config = engine_args.create_engine_config(usage_context=usage_context)
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 983, in create_engine_config
llm-vllm-model-server | model_config = self.create_model_config()
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 875, in create_model_config
llm-vllm-model-server | return ModelConfig(
llm-vllm-model-server | File "<string>", line 42, in __init__
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/config.py", line 601, in __post_init__
llm-vllm-model-server | self._verify_quantization()
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/config.py", line 866, in _verify_quantization
llm-vllm-model-server | method = get_quantization_config(name)
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/__init__.py", line 85, in get_quantization_config
llm-vllm-model-server | from vllm.model_executor.layers.quantization.quark.quark import QuarkConfig
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/quark/quark.py", line 9, in <module>
llm-vllm-model-server | from vllm.model_executor.layers.fused_moe import FusedMoE
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 6, in <module>
llm-vllm-model-server | from vllm.model_executor.layers.fused_moe.layer import (
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 240, in <module>
llm-vllm-model-server | class FusedMoEMethodBase(QuantizeMethodBase):
llm-vllm-model-server | File "/home/user/.local/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 295, in FusedMoEMethodBase
llm-vllm-model-server | ) -> FusedMoEPermuteExpertsUnpermute:
llm-vllm-model-server | NameError: name 'FusedMoEPermuteExpertsUnpermute' is not defined
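A minimal sketch that should reproduce the failure without starting the server (an assumption, not separately verified): the traceback shows the NameError is raised while vllm.model_executor.layers.fused_moe.layer is being imported, so importing that module directly inside the same container ought to hit the same error.

# sketch (assumption): import the module whose class body raises the NameError above
python3 -c "import vllm.model_executor.layers.fused_moe.layer"
# the same import chain is what the AWQ quantization check walks through:
# python3 -c "from vllm.model_executor.layers.quantization import get_quantization_config; get_quantization_config('awq')"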
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.