[Minor] Add more detailed explanation on `quantization` argument by WoosukKwon · Pull Request #2145 · vllm-project/vllm

WoosukKwon · 2023-12-17T04:39:10Z

No description provided.

zhuohan123

LGTM! Thanks for the fix!

### What this PR does / why we need it?

Support MTP with:

- [x] V0 Scheduler
- [x] TorchAir
- [x] Single DP
- [x] Multi DP
- [x] Disaggregate PD

Known issues:

- [ ] The V1 Scheduler (chunked prefill) is not supported yet; support is planned within a few weeks.
- [ ] vLLM v0.10.0 does not currently support metrics with `DP > 1`. As a workaround, comment out lines 171-175 in `vllm/vllm/v1/metrics/loggers.py`:

```
if (len(self.engine_indexes) > 1
        and vllm_config.speculative_config is not None):
    raise NotImplementedError("Prometheus metrics with Spec Decoding "
                              "with >1 EngineCore per AsyncLLM is not "
                              "supported yet.")
```

To start an online server with torchair enabled, here is an example:

```
python -m vllm.entrypoints.openai.api_server \
    --model="/weights/DeepSeek-R1_w8a8/" \
    --trust-remote-code \
    --max-model-len 40000 \
    --tensor-parallel-size 4 \
    --data_parallel_size 4 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching \
    --enable_expert_parallel \
    --served-model-name deepseekr1 \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --quantization ascend \
    --host 0.0.0.0 \
    --port 1234 \
    --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \
    --gpu_memory_utilization 0.9
```

Offline example with torchair enabled:

```
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=16, temperature=0)

# Create an LLM.
llm = LLM(
    model="/home/data/DeepSeek-R1_w8a8/",
    tensor_parallel_size=16,
    max_num_seqs=16,
    gpu_memory_utilization=0.9,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    max_model_len=2000,
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "graph_batch_sizes": [16],
            "enable_multistream_shared_expert": False,
        },
        "ascend_scheduler_config": {
            "enabled": True
        },
        # "expert_tensor_parallel_size": 16,
    },
)

# Generate texts from the prompts.
# llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.0
- vLLM main: vllm-project@302962e

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
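The `--speculative-config` and `--additional-config` flags in the server command take JSON strings, which are easy to mistype by hand. As a minimal sketch (not part of the PR), the same values can be built as Python dicts and serialized. The helper name `to_cli_json` is hypothetical, and the keys are copied from the command in this message without being checked against vLLM.

```python
import json
import shlex

# Values copied from the online-server example; they are not validated
# against vLLM or vllm-ascend here.
speculative_config = {"num_speculative_tokens": 1, "method": "deepseek_mtp"}
additional_config = {
    "ascend_scheduler_config": {"enabled": True, "enable_chunked_prefill": False},
    "torchair_graph_config": {"enabled": True, "graph_batch_sizes": [16]},
    "enable_weight_nz_layout": True,
}


def to_cli_json(config: dict) -> str:
    """Serialize a dict to compact JSON and shell-quote it (hypothetical helper)."""
    return shlex.quote(json.dumps(config, separators=(",", ":")))


# Print the flag/value pairs so they can be pasted into a shell command.
print("--speculative-config", to_cli_json(speculative_config))
print("--additional-config", to_cli_json(additional_config))
```

Serializing with `json.dumps` turns Python's `True`/`False` into JSON `true`/`false`, and `shlex.quote` wraps the result in single quotes so the shell passes it through as one argument.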

More detailed explanation on quantization

5d36071

WoosukKwon requested a review from zhuohan123 December 17, 2023 04:39

Merge branch 'main' into fix-quant-doc

9a5c20a

WoosukKwon changed the title ~~Add more detailed explanation on quantization~~ [Minor] Add more detailed explanation on quantization Dec 17, 2023

WoosukKwon changed the title ~~[Minor] Add more detailed explanation on quantization~~ [Minor] Add more detailed explanation on quantization argument Dec 17, 2023

WoosukKwon changed the title ~~[Minor] Add more detailed explanation on quantization argument~~ [Minor] Add more detailed explanation on `quantization` argument Dec 17, 2023

zhuohan123 approved these changes Dec 17, 2023

WoosukKwon merged commit 30fb095 into main Dec 17, 2023

WoosukKwon deleted the fix-quant-doc branch December 17, 2023 09:56

xjpang pushed a commit to xjpang/vllm that referenced this pull request Dec 18, 2023

[Minor] Add more detailed explanation on quantization argument (vllm-project#2145)

57a42e7

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

[Minor] Add more detailed explanation on quantization argument (vllm-project#2145)

de74579

c0de128 mentioned this pull request Dec 30, 2025

[Bugfix][Hardware][AMD] Fix hardcoded device in MLA sparse attention #31176

Closed

2 tasks

WoosukKwon merged 2 commits into main from fix-quant-doc

2 participants