[Core]: Remove V1 engine proxy as no longer required#23762
hickeyma wants to merge 5 commits into vllm-project:main
Conversation
As V1 is now the default engine, there is no longer a need for the class proxy that redirects the V0 engine to V1. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Code Review
This pull request removes the module-level class proxies that conditionally switched to the V1 engine. This is a good cleanup step that aligns with making the V1 engine the default. My review includes suggestions to remove other now-redundant conditional logic related to the VLLM_USE_V1 flag to fully complete this refactoring and improve code maintainability.
This is not ready yet. Users can still opt into V0, so this is still needed. Once V0 is completely removed, we can get rid of this.
@robertgshaw2-redhat Ok, but can't users still use V0, as shown in the output in the description (`VLLM_USE_V1=0`)?
What is inside `simple_test_vllm.py`?
@robertgshaw2-redhat I updated the testing info in the description (showing the simple test from the quickstart, which is synchronous). I also tested with
Closing as superseded by #25025
As V1 is now the default engine, there is no longer a need for the class proxy that redirects the V0 engine to V1.
Related #18571
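For context, the proxy being removed follows a common module-level pattern: a public class name is conditionally rebound based on an environment variable. A minimal sketch of that pattern, using hypothetical stand-in class names (the real vLLM V0 and V1 engine classes live in `vllm.engine` and `vllm.v1.engine` respectively):

```python
import os

# Hypothetical stand-ins for the V0 and V1 engine classes.
class V0LLMEngine:
    engine_version = "V0"

class V1LLMEngine:
    engine_version = "V1"

# Module-level proxy of the kind this PR removes: the public name
# LLMEngine is rebound to the V1 class unless the user opts out via
# VLLM_USE_V1=0. With V1 as the default, the indirection is no longer
# needed and the public name can simply be the V1 class itself.
if os.environ.get("VLLM_USE_V1", "1") == "1":
    LLMEngine = V1LLMEngine
else:
    LLMEngine = V0LLMEngine

print(LLMEngine.engine_version)
```

Removing the proxy then amounts to deleting the conditional and exporting the V1 class directly.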
Local testing shown below.
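`simple_test_vllm.py` is not included in the PR; judging from the prompts in the log output and the non-default args (`model`, `disable_log_stats`), it is presumably the offline-inference example from the quickstart, roughly as below. Running it requires a GPU and an installed vLLM:

```python
from vllm import LLM, SamplingParams

# The four prompts that appear in the log output below.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Matches the non-default args reported in the logs.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", disable_log_stats=True)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```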
```
$ python simple_test_vllm.py
INFO 08-27 16:06:36 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:06:37 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:06:56 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:06:56 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:07:00 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231141) INFO 08-27 16:07:06 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instru [...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.80it/s, est. speed input: 63.69 toks/s, output: 156.78 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'
```

```
$ VLLM_USE_V1=1 python simple_test_vllm.py
INFO 08-27 16:10:25 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:10:26 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:10:40 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:10:40 [__init__.py:1786] Using max model len 131072
INFO 08-27 16:10:42 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=231435) INFO 08-27 16:10:45 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct' [...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.79it/s, est. speed input: 63.64 toks/s, output: 156.66 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a devoted animal lover and owner of Angels of Mercy. I'
Prompt: 'The president of the United States is', Generated text: ' trying to convince the American public that he is not a threat to global security.\n'
Prompt: 'The capital of France is', Generated text: ' Paris, a city that has captured the hearts of millions of tourists from all over'
Prompt: 'The future of AI is', Generated text: ' exciting, but also raises questions about the role of humans in this new world.'
```

```
$ VLLM_USE_V1=0 python simple_test_vllm.py
INFO 08-27 16:21:22 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 16:21:23 [utils.py:326] non-default args: {'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'disable_log_stats': True}
INFO 08-27 16:21:36 [__init__.py:742] Resolved architecture: LlamaForCausalLM
INFO 08-27 16:21:36 [__init__.py:1786] Using max model len 131072
WARNING 08-27 16:21:36 [arg_utils.py:1578] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
INFO 08-27 16:21:39 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-27 16:21:39 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1rc2.dev249+g1fdc73241.d20250827) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct' [...]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.48it/s, est. speed input: 61.64 toks/s, output: 151.74 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Rob Thomas and I am a highly experienced and skilled Acoustic Guitarist. I'
Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The president is both the head of state'
Prompt: 'The capital of France is', Generated text: ' a popular destination for tourists. It is a city with rich history, art,'
Prompt: 'The future of AI is', Generated text: ' being shaped by the work of researchers like Victor Zhang, a computer science Ph.D'
```