forked from vllm-project/vllm
Qwen2.5 omni #1269
Closed
Switched execution of versioned branches to _next and added logs redirection to file.
Fixed test logs redirection
Adjusted method of extracting synapse build id for release branches
…naAI#1040)

This PR implements HPU support for pipeline parallelism. Tested accuracy and it's the same as TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:
`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* with pipeline parallelism, max_num_seqs acts as the microbatch size for a single virtual_engine; for a bigger batch_size we fall into a very specific corner case and get a flat_pa error -> set batch_size to approximately the batch size that you would use in TP, divided by pp_size
* delayed sampling is not yet compatible with pipeline parallelism
* the virtual_engine ID is passed to HPUGraph, which results in pp_size times as many graphs

Signed-off-by: jmaksymczuk <[email protected]>
Co-authored-by: Rafal Litka <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
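The batch-size guidance in the known issues can be sketched as a small helper. This is only an illustration: the function name `pp_max_num_seqs` is hypothetical, and rounding down with a floor of 1 is an assumption, since the commit message only says "approximately".

```python
def pp_max_num_seqs(tp_batch_size: int, pp_size: int) -> int:
    """Suggest a --max-num-seqs value for pipeline parallelism.

    Follows the rule of thumb from the commit message: take the batch
    size you would use with tensor parallelism and divide it by pp_size.
    Rounding down (and never going below 1) is an assumption.
    """
    if tp_batch_size < 1 or pp_size < 1:
        raise ValueError("tp_batch_size and pp_size must be positive")
    return max(1, tp_batch_size // pp_size)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the PR.
    print(pp_max_num_seqs(tp_batch_size=1024, pp_size=4))
```

The result would be passed as `--max-num-seqs` together with `--pipeline-parallel-size`.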
…aAI#1028) Cherry-pick of HabanaAI#1023 Co-authored-by: Michał Kuligowski <[email protected]>
…#1038) Cherry-pick of HabanaAI#921 Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]>
…HabanaAI#1059) Same PR as [1020](HabanaAI#1020) but for 1.21
…naAI#1067) Co-authored-by: Iryna Boiko <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]>
migrated from a PR to habana_main: HabanaAI#1014

For Best performance, this PR is recommended to run with INC: [[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}

QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```

**test acc of G2**:

**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed as INC updates.)

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.9453|±  |0.0142|

----------

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1

Status: runnable with Deepseek-R1.

Accuracy check: for block fp8 weight => garbage output
accuracy check for BF16 weight => looks good.

test scripts:

```
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true'
os.environ['PT_HPU_WEIGHT_SHARING']='0'
#os.environ['HABANA_LOGS']="vllm_inc_debug"
#os.environ["LOG_LEVEL_ALL"]="3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
#os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
#os.environ["LOGLEVEL"] = "DEBUG"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)
    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"
    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              #kv_cache_dtype="fp8_inc",
              seed=2024)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
Same PR as HabanaAI#996. Just for v1.21.0_next branch.
…banaAI#1048) The make_attn_bias in hpu_model_runner doesn't cover setting the mask for non-causal embedding models, and the vertical mask-off is also not set when merged prefill is enabled.
… multiple cards (HabanaAI#1100) - Add `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` to speed up the warmup stage. - Fix the `dist.barrier` issue for single card cc @xuechendi @thuang6 --------- Signed-off-by: Yi Liu <[email protected]> Co-authored-by: Yi Liu <[email protected]>
…erence (HabanaAI#1103) Original PR HabanaAI#897 Co-authored-by: Jiafan Wang <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]>
…AI#1093) Added workflow that allows codeowners and testowners to skip "Summarize Test Results" check in PRs Cherry-pick of 3483068b69fea1070995ffc14f6a3fbd5721f4f2
…odal tests (HabanaAI#1042) (HabanaAI#1104) Original PR HabanaAI#1042
…HabanaAI#1076) (HabanaAI#1105) Original PR HabanaAI#1076 Signed-off-by: Artur Fierka <[email protected]>
Previously the code only checked whether quant_config was in use and then chose VllmMixtureOfExpertsOpFP8 as the op, whose only difference is that it assumes block quant when measuring scales. Now this happens only when Fp8MoEMethod is used as quant_method. Kwargs in the moe_op call had to be disabled because the FP8 and unquantized ops have different APIs. --------- Signed-off-by: kwisniewski98 <[email protected]>
It's the full list of changes in documentation prepared for the vLLM 1.21 release. --------- Signed-off-by: Artur Fierka <[email protected]> Co-authored-by: Bartosz Kuncer <[email protected]> Co-authored-by: Bartosz Kuncer <[email protected]> Co-authored-by: Mohit Deopujari <[email protected]> Co-authored-by: Artur Fierka <[email protected]> Co-authored-by: AnetaKaczynska <[email protected]>
Fix logging in multidevice scenario (currently all workers log into dir
'0', with this change each worker logs to '{n}' directory)
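A minimal sketch of the per-worker directory layout described above, assuming the worker index is available as an integer. The helper name `worker_log_dir` and the `base_dir` argument are hypothetical and are not taken from the actual change.

```python
import os


def worker_log_dir(base_dir: str, worker_index: int) -> str:
    """Return a log directory named after the worker index.

    Before the fix every worker effectively used index 0; here each
    worker gets its own '{n}' subdirectory under base_dir.
    """
    if worker_index < 0:
        raise ValueError("worker_index must be non-negative")
    return os.path.join(base_dir, str(worker_index))


if __name__ == "__main__":
    # Hypothetical base directory and worker count, for illustration only.
    for n in range(4):
        print(worker_log_dir("logs", n))
```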
Bump vllm-hpu-extension hash
Cherry-pick of HabanaAI#1086 Signed-off-by: Michal Adamczyk <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]>
Creates the 1.21.0 version of the UBI Dockerfile for use with Red Hat OpenShift AI.
Fix the llama 3.2 11b/90b accuracy issue that was caused by is_causal being set to False.
Fix: https://jira.habana-labs.com/browse/SW-226779 Signed-off-by: Chendi Xue <[email protected]>
Reviewed Gaudi README. --------- Co-authored-by: PatrykWo <[email protected]> Co-authored-by: PatW <[email protected]>
Co-authored-by: Bartosz Kuncer <[email protected]>
It's the last change to readme before 1.21 release.
Final touch in the table of supported models.
Can you provide example code showing how to run the omni model? Either in the commit message or as a standalone script file under the examples folder.
Official PR: vllm-project#15130

example:
python examples/offline_inference/audio_language.py --model-type qwen2_5_omni
python examples/offline_inference/vision_language.py --modality image --model-type qwen2_5_omni
python examples/offline_inference/vision_language.py --modality video --model-type qwen2_5_omni

Signed-off-by: Chen, Wenbin <[email protected]>
PR to Habana_main: HabanaAI#1109
Pad W and H so that W/H don't need to be aligned to 112 Signed-off-by: Chen, Wenbin <[email protected]>
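The idea of padding W and H instead of requiring aligned inputs can be sketched as follows. The value 112 comes from the commit message; the helper names, the right/bottom zero padding, and the plain nested-list image are assumptions made for illustration and do not reflect how the PR implements it.

```python
ALIGN = 112  # alignment value mentioned in the commit message


def padded_size(width: int, height: int, align: int = ALIGN) -> tuple[int, int]:
    """Round width and height up to the next multiple of `align`."""
    pad_w = -(-width // align) * align   # ceiling division, then scale back
    pad_h = -(-height // align) * align
    return pad_w, pad_h


def pad_image(pixels: list[list[int]], align: int = ALIGN, fill: int = 0) -> list[list[int]]:
    """Pad a row-major 2D image on the right and bottom with `fill`."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    pad_w, pad_h = padded_size(width, height, align)
    # Extend each existing row to the padded width.
    rows = [row + [fill] * (pad_w - width) for row in pixels]
    # Append fully padded rows until the padded height is reached.
    rows += [[fill] * pad_w for _ in range(pad_h - height)]
    return rows


if __name__ == "__main__":
    print(padded_size(225, 100))
```

Already aligned sizes are returned unchanged, so inputs that met the old restriction are not affected by this kind of padding.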
6061766 to
71b0079
Compare
I updated the PR comment and commit message.
Signed-off-by: Chen, Wenbin <[email protected]>
Signed-off-by: Chen, Wenbin <[email protected]>
1. Add Qwen2.5-Omni thinker: adapted from vllm-project#15130
2. Optimize Qwen multi-modal processing: adapted from #1109
3. Port the optimization to Omni
4. Optimize the W and H restriction.