
[Core]: Remove V1 engine proxy as no longer required#23762

Closed
hickeyma wants to merge 5 commits into vllm-project:main from hickeyma:core/remove-v1-engine-class-proxy

Conversation

@hickeyma
Contributor

@hickeyma hickeyma commented Aug 27, 2025

As V1 is now the default engine, there is no longer a need to set a class proxy that switches the V0 engine names to the V1 engine.

Related #18571
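For context, the proxy being removed is a module-level conditional alias keyed on an environment variable. A minimal, self-contained sketch of the pattern; the names (`V0Engine`, `V1Engine`, `LLMEngine`, `resolve_engine`) are illustrative stand-ins, not the actual vLLM source:

```python
# Sketch of a module-level class proxy keyed on an env var.
# All names here are hypothetical, not vLLM's actual classes.
import os


class V0Engine:
    """Stand-in for the legacy V0 engine class."""


class V1Engine:
    """Stand-in for the V1 engine class."""


def resolve_engine(env_value: str) -> type:
    """Return the class the public name should bind to."""
    return V1Engine if env_value == "1" else V0Engine


# The proxy: the public name rebinds to V1 unless VLLM_USE_V1=0.
LLMEngine = resolve_engine(os.environ.get("VLLM_USE_V1", "1"))
```

Removing the proxy means the public name is bound to the V1 class unconditionally, with no env-var branch at import time.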

Local testing is shown below.

$ cat simple_test_vllm.py

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    llm = LLM(model=model)

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()

(vllm) $ python simple_test_vllm.py

INFO 08-27 16:06:36 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:06:37 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:06:56 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:06:56 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:07:00 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instru
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.80it/s, est. speed input: 63.69 toks/s, output: 156.78 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'

$ VLLM_USE_V1=1 python simple_test_vllm.py 

INFO 08-27 16:10:25 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:10:26 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:10:40 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:10:40 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:10:42 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct'
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.79it/s, est. speed input: 63.64 toks/s, output: 156.66 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'

$ VLLM_USE_V1=0 python simple_test_vllm.py 

INFO 08-27 16:21:22 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:21:23 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:21:36 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:21:36 [__init__.py:1786] Using max model len 131072
WARNING 08-27 16:21:36 [arg_utils.py:1578] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
INFO 08-27 16:21:39 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-27 16:21:39 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct'
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.48it/s, est. speed input: 61.64 toks/s, output: 151.74 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Rob Thomas and I am a highly experienced and skilled Acoustic Guitarist. I'
Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The president is both the head of state'
Prompt: 'The capital of France is', Generated text: ' a popular destination for tourists. It is a city with rich history, art,'
Prompt: 'The future of AI is', Generated text: ' being shaped by the work of researchers like Victor Zhang, a computer science Ph.D'
$ python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"

INFO 08-28 08:41:57 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 08:41:58 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
(EngineCore_0 pid=259025) INFO 08-28 08:42:17 [core.py:75] Initializing a V1 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', 
[...]
INFO:     127.0.0.1:50376 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'
$ VLLM_USE_V1=1 python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"
INFO 08-28 09:02:53 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 09:02:54 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
(EngineCore_0 pid=259753) INFO 08-28 09:03:13 [core.py:75] Initializing a V1 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct',
[...]
INFO:     127.0.0.1:36952 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'
$ VLLM_USE_V1=0 python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"

INFO 08-28 09:06:25 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 09:06:26 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
INFO 08-28 09:06:41 [llm_engine.py:223] Initializing a V0 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, 
[...]
INFO:     127.0.0.1:36952 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'

As V1 is now the default engine, there is no longer a need
to set a class proxy from the V0 engine names to the V1 engine.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hickeyma hickeyma changed the title [Core]: Remove v1 engine proxy as no longer required [Core]: Remove V1 engine proxy as no longer required Aug 27, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes the module-level class proxies that conditionally switched to the V1 engine. This is a good cleanup step that aligns with making the V1 engine the default. My review includes suggestions to remove other now-redundant conditional logic related to the VLLM_USE_V1 flag to fully complete this refactoring and improve code maintainability.

@robertgshaw2-redhat
Collaborator

This is not ready yet. Users can still opt into V0, so this is still needed. Once V0 is completely removed, we can get rid of this.

@hickeyma
Contributor Author

hickeyma commented Aug 27, 2025

This is not ready yet. Users can still opt into V0, so this is still needed. Once V0 is completely removed, we can get rid of this.

@robertgshaw2-redhat Ok, but can't users still use V0, as shown in the output in the description (VLLM_USE_V1=0)? Or what use case do you mean?

@robertgshaw2-redhat
Collaborator

What is inside simple_test_vllm.py? This override is for users directly importing AsyncLLM.
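The direct-import case being discussed can be illustrated with a small, self-contained sketch of why a module-level alias matters: `from module import Name` resolves `Name` as an attribute of the module at import time, so removing the alias breaks that import. The module and class names here (`engine_module`, `AsyncEngine`, `AsyncEngineV1`) are hypothetical stand-ins, not vLLM's actual paths:

```python
# Sketch of why the proxy matters for code that imports the engine
# class directly. Names are hypothetical, not vLLM's.
import sys
import types

engine_module = types.ModuleType("engine_module")


class AsyncEngineV1:
    """Stand-in for the V1 async engine class."""


# With the proxy in place, the legacy public name resolves to V1:
engine_module.AsyncEngine = AsyncEngineV1
sys.modules["engine_module"] = engine_module

from engine_module import AsyncEngine  # works while the alias exists
assert AsyncEngine is AsyncEngineV1

# Once the alias is deleted, the same direct import fails:
del engine_module.AsyncEngine
try:
    from engine_module import AsyncEngine
except ImportError:
    print("direct import breaks without the proxy")
```

This is the scenario the proxy protects: user code importing the engine class by its legacy name keeps working as long as the module-level binding exists.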

@hickeyma
Contributor Author

hickeyma commented Aug 28, 2025

What is inside simple_test_vllm.py? This override is for users directly importing AsyncLLM.

@robertgshaw2-redhat I updated the testing info in the description (showing the simple test from the quickstart, which is synchronous).

I also tested with https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py, which connects to the vLLM API server. This tests using AsyncLLM, if I understand it correctly?

@hickeyma
Contributor Author

hickeyma commented Oct 1, 2025

Closing as superseded by #25025

@hickeyma hickeyma closed this Oct 1, 2025
