
[Core]: Remove V1 engine proxy as no longer required#23762

Closed
hickeyma wants to merge 5 commits into vllm-project:main from hickeyma:core/remove-v1-engine-class-proxy

Conversation

@hickeyma
Contributor

@hickeyma hickeyma commented Aug 27, 2025

As V1 is now the default engine, there is no longer a need to set a class proxy that switches the V0 engine names to the V1 engine.

Related #18571
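For context, the proxy being removed is a module-level conditional alias keyed on an environment variable. A minimal, self-contained sketch of the pattern; the names (`V0Engine`, `V1Engine`, `LLMEngine`, `resolve_engine`) are illustrative stand-ins, not the actual vLLM source:

```python
# Sketch of a module-level class proxy keyed on an env var.
# All names here are hypothetical, not vLLM's actual classes.
import os


class V0Engine:
    """Stand-in for the legacy V0 engine class."""


class V1Engine:
    """Stand-in for the V1 engine class."""


def resolve_engine(env_value: str) -> type:
    """Return the class the public name should bind to."""
    return V1Engine if env_value == "1" else V0Engine


# The proxy: the public name rebinds to V1 unless VLLM_USE_V1=0.
LLMEngine = resolve_engine(os.environ.get("VLLM_USE_V1", "1"))
```

Removing the proxy means the public name is bound to the V1 class unconditionally, with no env-var branch at import time.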

Local testing is shown below.

$ cat simple_test_vllm.py

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    llm = LLM(model=model)

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()

(vllm) $ python simple_test_vllm.py

INFO 08-27 16:06:36 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:06:37 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:06:56 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:06:56 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:07:00 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instru
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.80it/s, est. speed input: 63.69 toks/s, output: 156.78 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'

$ VLLM_USE_V1=1 python simple_test_vllm.py 

INFO 08-27 16:10:25 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:10:26 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:10:40 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:10:40 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:10:42 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct'
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.79it/s, est. speed input: 63.64 toks/s, output: 156.66 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'

$ VLLM_USE_V1=0 python simple_test_vllm.py 

INFO 08-27 16:21:22 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:21:23 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:21:36 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:21:36 [__init__.py:1786] Using max model len 131072
WARNING 08-27 16:21:36 [arg_utils.py:1578] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
INFO 08-27 16:21:39 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-27 16:21:39 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct'
[...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.48it/s, est. speed input: 61.64 toks/s, output: 151.74 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Rob Thomas and I am a highly experienced and skilled Acoustic Guitarist. I'
Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The president is both the head of state'
Prompt: 'The capital of France is', Generated text: ' a popular destination for tourists. It is a city with rich history, art,'
Prompt: 'The future of AI is', Generated text: ' being shaped by the work of researchers like Victor Zhang, a computer science Ph.D'
$ python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"

INFO 08-28 08:41:57 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 08:41:58 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
(EngineCore_0 pid=259025) INFO 08-28 08:42:17 [core.py:75] Initializing a V1 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', 
[...]
INFO:     127.0.0.1:50376 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'
$ VLLM_USE_V1=1 python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"
INFO 08-28 09:02:53 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 09:02:54 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
(EngineCore_0 pid=259753) INFO 08-28 09:03:13 [core.py:75] Initializing a V1 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct',
[...]
INFO:     127.0.0.1:36952 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'
$ VLLM_USE_V1=0 python -m vllm.entrypoints.api_server --model "meta-llama/Meta-Llama-3.1-8B-Instruct"

INFO 08-28 09:06:25 [__init__.py:241] Automatically detected platform cuda.
INFO 08-28 09:06:26 [api_server.py:122] vLLM API server version 0.10.1rc2.dev310+g4f35be10a
[...]
INFO 08-28 09:06:41 [llm_engine.py:223] Initializing a V0 LLM engine (v0.10.1rc2.dev310+g4f35be10a) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, 
[...]
INFO:     127.0.0.1:36952 - "POST /generate HTTP/1.1" 200 OK
$ python https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py

Prompt: 'San Francisco is a'

Beam candidate 0: 'San Francisco is a top tourist destination, and for good reason. The city is known for its iconic'

As V1 is now the default engine, there is no longer a need
to set a class proxy from the V0 engine names to the V1 engine.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hickeyma hickeyma changed the title [Core]: Remove v1 engine proxy as no longer required [Core]: Remove V1 engine proxy as no longer required Aug 27, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes the module-level class proxies that conditionally switched to the V1 engine. This is a good cleanup step that aligns with making the V1 engine the default. My review includes suggestions to remove other now-redundant conditional logic related to the VLLM_USE_V1 flag to fully complete this refactoring and improve code maintainability.

@robertgshaw2-redhat
Collaborator

This is not ready yet. Users can still opt into V0, so this is still needed. Once V0 is completely removed, we can get rid of this.

@hickeyma
Contributor Author

hickeyma commented Aug 27, 2025

This is not ready yet. Users can still opt into V0, so this is still needed. Once V0 is completely removed, we can get rid of this.

@robertgshaw2-redhat Ok, but can't users still use V0, as shown in the output in the description (VLLM_USE_V1=0)? Or what use case do you mean?

@robertgshaw2-redhat
Collaborator

What is inside simple_test_vllm.py? This override is for users directly importing AsyncLLM.
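The direct-import case being discussed can be illustrated with a small, self-contained sketch of why a module-level alias matters: `from module import Name` resolves `Name` as an attribute of the module at import time, so removing the alias breaks that import. The module and class names here (`engine_module`, `AsyncEngine`, `AsyncEngineV1`) are hypothetical stand-ins, not vLLM's actual paths:

```python
# Sketch of why the proxy matters for code that imports the engine
# class directly. Names are hypothetical, not vLLM's.
import sys
import types

engine_module = types.ModuleType("engine_module")


class AsyncEngineV1:
    """Stand-in for the V1 async engine class."""


# With the proxy in place, the legacy public name resolves to V1:
engine_module.AsyncEngine = AsyncEngineV1
sys.modules["engine_module"] = engine_module

from engine_module import AsyncEngine  # works while the alias exists
assert AsyncEngine is AsyncEngineV1

# Once the alias is deleted, the same direct import fails:
del engine_module.AsyncEngine
try:
    from engine_module import AsyncEngine
except ImportError:
    print("direct import breaks without the proxy")
```

This is the scenario the proxy protects: user code importing the engine class by its legacy name keeps working as long as the module-level binding exists.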

@hickeyma
Contributor Author

hickeyma commented Aug 28, 2025

What is inside simple_test_vllm.py? This override is for users directly importing AsyncLLM.

@robertgshaw2-redhat I updated the testing info in the description (showing the simple test from the quickstart, which is synchronous).

I also tested with https://github.com/vllm-project/vllm/blob/main/examples/online_serving/api_client.py, which connects to the vLLM API server. This tests using AsyncLLM, if I understand it correctly?

@hickeyma
Contributor Author

hickeyma commented Oct 1, 2025

Closing as superseded by #25025

@hickeyma hickeyma closed this Oct 1, 2025
