
Add NotImplementedError to v1 cpu runner #19527

Closed
fred2167 wants to merge 2 commits into vllm-project:main from fred2167:v1-cpu-not-implemented

Conversation

@fred2167 commented Jun 12, 2025

add v1 cpu runner not implemented error

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This is my first commit to the repo; the purpose is to familiarize myself with the codebase through a minimal change.

This PR only aims to make the error more verbose when running the example on a CPU machine, so that execution doesn't fall through to the GPU implementation:

VLLM_USE_V1=1 python examples/offline_inference/basic/basic.py        
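As an editorial aside, the change described above amounts to raising an explicit error instead of letting CPU execution fall into GPU-only code paths. A minimal sketch of that pattern is below; the class and method names are illustrative only, not the actual vLLM code:

```python
# Hypothetical sketch of the guard style this PR adds: a CPU model-runner
# method that has no CPU implementation fails loudly with a clear message
# instead of silently reusing the GPU code path.
class CPUModelRunner:
    def capture_model(self):
        # CUDA-graph capture has no CPU equivalent, so refuse explicitly.
        raise NotImplementedError(
            "capture_model is not implemented for the V1 CPU runner; "
            "CUDA graphs are a GPU-only feature.")


runner = CPUModelRunner()
try:
    runner.capture_model()
except NotImplementedError as e:
    print(f"NotImplementedError: {e}")
```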

Test Plan

Unit test and example

Test Result

pytest tests/v1/worker/test_cpu_model_runner.py
=========================================================================== warnings summary ============================================================================
../../.venv/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
  /Users/fredchan/.venv/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 1 passed, 1 warning in 12.78s =====================================================================

add v1 cpu runner not implemented error

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jun 12, 2025
@fred2167 fred2167 force-pushed the v1-cpu-not-implemented branch from 609d77c to feb00d5 on June 12, 2025 03:41
@bigPYJ1151 (Member)

Any issues when using CPU V1?

The CPU V1 model runner inherits from the GPU V1 model runner and most member functions can be reused directly. So it's not required to throw unimplemented errors.

Just checked python examples/offline_inference/basic/basic.py and it worked with the V1 engine by default.

Signed-off-by: Fred Chan <fred2167@gmail.com>
@fred2167 fred2167 force-pushed the v1-cpu-not-implemented branch from feb00d5 to 7cfc9e1 on June 12, 2025 03:54
@houseroad (Collaborator)

Yeah, I am also a bit confused, since the CPU backend also works on the macOS CPU side. :-)

@fred2167 (Author)

Any issues when using CPU V1?

I'm running on an M1 Mac, so I guess I don't have the Intel package used by the SDPA backend. Any recommendation for Mac?

The CPU V1 model runner inherits from the GPU V1 model runner and most member functions can be reused directly. So it's not required to throw unimplemented errors.

Does the Intel package also support GPU? If not, would it be better to move CPU-specific logic into the CPU model runner?

Just checked python examples/offline_inference/basic/basic.py and it worked with the V1 engine by default.

This is the output I'm getting by running:

 VLLM_USE_V1=1 python examples/offline_inference/basic/basic.py
ERROR 06-12 14:45:32 [core.py:517]   File "/Users/fredchan/code/original-vllm/vllm/attention/layer.py", line 402, in unified_attention
ERROR 06-12 14:45:32 [core.py:517]     output = self.impl.forward(self, query, key, value, kv_cache,
ERROR 06-12 14:45:32 [core.py:517]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 14:45:32 [core.py:517]   File "/Users/fredchan/code/original-vllm/vllm/attention/backends/torch_sdpa.py", line 559, in forward
ERROR 06-12 14:45:32 [core.py:517]     import intel_extension_for_pytorch.llm.modules as ipex_modules
ERROR 06-12 14:45:32 [core.py:517] ModuleNotFoundError: No module named 'intel_extension_for_pytorch'
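The traceback above comes from an unconditional import of intel_extension_for_pytorch deep inside the SDPA backend's forward pass. One way to fail earlier with a clearer message is to probe for the module up front; this is only a sketch of the idea, not vLLM's actual import handling:

```python
# Probe for intel_extension_for_pytorch without importing it, so a missing
# dependency can be reported before model execution starts.
import importlib.util


def ipex_available() -> bool:
    """Return True if intel_extension_for_pytorch is importable."""
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None


if not ipex_available():
    print("intel_extension_for_pytorch not found: "
          "V1 CPU chunked prefill is unavailable on this platform")
```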

@fred2167 (Author)

Yeah, I am also a bit confused, since the CPU backend also works on the macOS CPU side. :-)

Interesting, are you able to run the example on a Mac?

@bigPYJ1151 (Member)

@fred2167 The V1 engine requires chunked-prefill support, which is not yet supported on macOS. For CPU, only x86 supports this, via the intel_extension_for_pytorch package. By default the V0 engine will be used on macOS.

The CUDA backend will not use the SDPA backend; it is only used by the CPU backend.
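As an editorial aside, the platform constraint described above (V1 CPU support requires x86 plus intel_extension_for_pytorch, so Apple-silicon Macs fall back to V0) can be sketched as a small check. The helper below is hypothetical, not vLLM's real engine-selection code:

```python
# Hypothetical gate: V1 CPU chunked prefill needs an x86 machine with
# intel_extension_for_pytorch installed; otherwise fall back to V0.
import importlib.util
import platform


def supports_v1_cpu() -> bool:
    is_x86 = platform.machine().lower() in ("x86_64", "amd64")
    has_ipex = importlib.util.find_spec(
        "intel_extension_for_pytorch") is not None
    return is_x86 and has_ipex


engine = "V1" if supports_v1_cpu() else "V0"
print(f"Selected engine: {engine}")
```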

@houseroad (Collaborator)

https://gist.github.com/houseroad/9fc43ba08c192c7c91914f2f1af539fb — I tried something like this yesterday :-)

@fred2167 (Author)

@fred2167 The V1 engine requires chunked-prefill support, which is not yet supported on macOS. For CPU, only x86 supports this, via the intel_extension_for_pytorch package. By default the V0 engine will be used on macOS.

The CUDA backend will not use the SDPA backend; it is only used by the CPU backend.

Makes sense. I don't think it's a small lift for V1 to support Mac. It may be worth updating the docs to reflect this, given someone had a similar issue on Slack.

@fred2167 (Author)

https://gist.github.com/houseroad/9fc43ba08c192c7c91914f2f1af539fb — I tried something like this yesterday :-)

Is this on the V0 or V1 engine?

@houseroad (Collaborator)

V0 works. V1 failed.

@houseroad (Collaborator)

I thought my run was on V1, but it was actually on V0. So this PR may make sense.

@hmellor (Member) commented Mar 4, 2026

The underlying issue was addressed by #19121

@hmellor hmellor closed this Mar 4, 2026
@mergify mergify bot added the cpu Related to CPU backends label Mar 4, 2026