Qwen2.5 omni #1269

wenbinc-Bin · 2025-05-19T06:12:55Z

1. Add the Qwen2.5-Omni thinker: adapted from vllm-project#15130
2. Optimize Qwen multi-modal processing: adapted from #1109
3. Port the optimization to Omni
4. Optimize the W and H restriction

example:

# Process audio inputs
python examples/offline_inference/audio_language.py --model-type qwen2_5_omni

# Process image inputs
python examples/offline_inference/vision_language.py --modality image --model-type qwen2_5_omni

# Process video inputs (WIP)
python examples/offline_inference/vision_language.py --modality video --model-type qwen2_5_omni

…naAI#1040)

This PR implements HPU support for pipeline parallelism. Accuracy was tested and matches TP accuracy on:

- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:

* For pipeline parallelism, max_num_seqs acts as a microbatch for a single virtual_engine, so a bigger batch_size falls into a very specific corner case and produces a flat_pa error. Set batch_size to approximately the batch size you would use in TP, divided by pp_size.
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.

Signed-off-by: jmaksymczuk <[email protected]>
Co-authored-by: Rafal Litka <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
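The batch-size guidance in the first known issue can be sketched as simple arithmetic. This is only an illustration: the helper name `pp_max_num_seqs` and the TP batch size of 1536 are hypothetical and do not come from the PR.

```python
def pp_max_num_seqs(tp_batch_size: int, pp_size: int) -> int:
    """Rough per-virtual-engine batch size for pipeline parallelism.

    Hypothetical helper: it applies the rule of thumb from the known
    issues (the TP batch size divided by pp_size) and never returns
    less than 1.
    """
    if tp_batch_size < 1 or pp_size < 1:
        raise ValueError("tp_batch_size and pp_size must be positive")
    return max(1, tp_batch_size // pp_size)


# Illustrative numbers only: a TP batch size of 1536 split over 4 stages.
print(pp_max_num_seqs(1536, 4))
```

The result would be passed as `--max-num-seqs` together with `--pipeline-parallel-size`, as in the serve command above.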

migrated from a PR to habana_main: HabanaAI#1014

For Best performance, this PR is recommended to run with INC: [[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}

QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```

**test acc of G2**:

**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed as INC updates.)

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.9453|±  |0.0142|

----------

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1

Status: runnable with Deepseek-R1.

Accuracy check:

- for block fp8 weight => garbage output
- for BF16 weight => looks good.

test scripts:

```
from vllm import LLM, SamplingParams

import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true'
os.environ['PT_HPU_WEIGHT_SHARING']='0'
#os.environ['HABANA_LOGS']="vllm_inc_debug"
#os.environ["LOG_LEVEL_ALL"]="3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
#os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
#os.environ["LOGLEVEL"] = "DEBUG"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)
    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"

    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              #kv_cache_dtype="fp8_inc",
              seed=2024)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
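The `blocklist.names` entries in the quantization config above look like regular expressions. The sketch below shows how the `mlp\\.gate\\b` entry would behave if names are matched with Python's `re.search`. That matching rule is an assumption, not something the PR states, and the module names are made up for illustration.

```python
import json
import re

# The JSON text "mlp\\.gate\\b" decodes to the regex mlp\.gate\b.
gate_pattern = json.loads(r'"mlp\\.gate\\b"')
BLOCKLIST_NAMES = ["lm_head", gate_pattern, "block2batch_matmul"]


def is_blocklisted(module_name: str, patterns=BLOCKLIST_NAMES) -> bool:
    """Return True if any blocklist pattern is found in the module name.

    Assumption: patterns are applied with re.search; the real INC
    matching logic is not shown in this PR.
    """
    return any(re.search(p, module_name) for p in patterns)


# Hypothetical module names, used only to show the word-boundary effect.
for name in ("model.layers.3.mlp.gate",
             "model.layers.3.mlp.gate_proj",
             "lm_head"):
    print(name, is_blocklisted(name))
```

Under that assumption, the trailing `\b` keeps the router gate (`mlp.gate`) out of quantization while still allowing names such as `mlp.gate_proj` to be quantized, because `_` is a word character and so there is no word boundary after `gate`.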

@xuechendi

… multiple cards (HabanaAI#1100)

- Add `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` to speed up the warmup stage.
- Fix the `dist.barrier` issue on a single card.

cc @xuechendi @thuang6

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

Previously, the code only checked whether quant_config was in use and then chose VllmMixtureOfExpertsOpFP8 as the op, whose only difference is that it assumes block quantization when measuring scales. With this change, that op is chosen only when Fp8MoEMethod is the quant_method. Kwargs in the moe_op call had to be disabled because the FP8 and unquantized ops have different APIs.

---------

Signed-off-by: kwisniewski98 <[email protected]>
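The selection change described here can be sketched roughly as follows. The classes are empty stand-ins, not the real vLLM or vllm-hpu-extension types; `UnquantizedMethod`, the `"VllmMixtureOfExpertsOp"` label for the unquantized op, and both function names are hypothetical.

```python
class Fp8MoEMethod:
    """Stand-in for the FP8 MoE quantization method named in the commit."""


class UnquantizedMethod:
    """Hypothetical stand-in for any non-FP8 MoE method."""


def select_moe_op_before(quant_config, quant_method) -> str:
    # Old behaviour as described: any quant_config selects the FP8 op.
    if quant_config is not None:
        return "VllmMixtureOfExpertsOpFP8"
    return "VllmMixtureOfExpertsOp"


def select_moe_op_after(quant_config, quant_method) -> str:
    # New behaviour as described: only Fp8MoEMethod selects the FP8 op.
    if isinstance(quant_method, Fp8MoEMethod):
        return "VllmMixtureOfExpertsOpFP8"
    return "VllmMixtureOfExpertsOp"


# A quant_config is present, but the MoE layers themselves are unquantized.
config = {"mode": "QUANTIZE"}
print(select_moe_op_before(config, UnquantizedMethod()))
print(select_moe_op_after(config, UnquantizedMethod()))
```

The case that differs is a model with a quant_config whose MoE layers are not quantized with Fp8MoEMethod: the old check would pick the block-quant FP8 op, while the new check falls back to the unquantized op.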

It's the full list of changes in documentation prepared for the vLLM 1.21 release.

---------

Signed-off-by: Artur Fierka <[email protected]>
Co-authored-by: Bartosz Kuncer <[email protected]>
Co-authored-by: Bartosz Kuncer <[email protected]>
Co-authored-by: Mohit Deopujari <[email protected]>
Co-authored-by: Artur Fierka <[email protected]>
Co-authored-by: AnetaKaczynska <[email protected]>

czhu15 · 2025-05-19T06:18:45Z

Can you provide example code showing how to run the Omni model, either in the commit message or as a standalone script file under the examples folder?

Official PR: vllm-project#15130

example:
python examples/offline_inference/audio_language.py --model-type qwen2_5_omni
python examples/offline_inference/vision_language.py --modality image --model-type qwen2_5_omni
python examples/offline_inference/vision_language.py --modality video --model-type qwen2_5_omni

Signed-off-by: Chen, Wenbin <[email protected]>

wenbinc-Bin · 2025-05-19T06:51:45Z

Can you provide example code showing how to run the Omni model, either in the commit message or as a standalone script file under the examples folder?

I updated the PR comment and commit message.

bmyrcha and others added 30 commits April 8, 2025 09:51

[SW-224648] Redirect test logs to file (HabanaAI#1017)

5dbefd6

Switched execution of versioned branches to _next and added logs redirection to file.

[SW-224648] Fix test logs redirection (HabanaAI#1027)

ff61f89

Fixed test logs redirection

[SW-225233] Adjust method of getting synapse_build (HabanaAI#1045)

b92af9c

Adjusted method of extracting synapse build id for release branches

[1.21 cherry-pick] Fix async callback ordering (HabanaAI#1023) (HabanaAI#1028)

ed47e1e

Cherry-pick of HabanaAI#1023

Co-authored-by: Michał Kuligowski <[email protected]>

[1.21 cherry-pick] Make lazy mode autodetection more robust (HabanaAI#1038)

9a06a89

Cherry-pick of HabanaAI#921

Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>

APC - Remove prompt attn with context and use existing implementation (HabanaAI#1059)

035db32

Same PR as [1020](HabanaAI#1020) but for 1.21

Cherry pick exponential bucketing integration from HabanaAI#642 (HabanaAI#1067)

b576015

Co-authored-by: Iryna Boiko <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>

Modify RobertaEmbedding forward as custom op method (HabanaAI#1049)

4445dca

Same PR as HabanaAI#996. Just for v1.21.0_next branch.

Fix embedding model accuracy issue when merged prefill is enabled (HabanaAI#1048)

b3c3a2f

The make_attn_bias in hpu_model_runner doesn't cover the mask setup for non-causal embedding models, and the vertical mask offset is not set when merged prefill is enabled.

[1.21.0 cherry-pick] Synchronize vLLM flags to support cross-node inference (HabanaAI#1103)

5d30a8f

Original PR HabanaAI#897

Co-authored-by: Jiafan Wang <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>

[SW-225980] Allow to skip pytest for non-code related changes (HabanaAI#1093)

c46e620

Added workflow that allows codeowners and testowners to skip "Summarize Test Results" check in PRs

Cherry-pick of 3483068b69fea1070995ffc14f6a3fbd5721f4f2

[1.21.0 cherry-pick] Set VLLM_T_COMPILE_FULLGRAPH=False in CI multi-modal tests (HabanaAI#1042) (HabanaAI#1104)

b2955df

Original PR HabanaAI#1042

[1.21.0 cherry-pick] Enable APC pre-merge tests to compile test suite (HabanaAI#1076) (HabanaAI#1105)

377d0f9

Original PR HabanaAI#1076

Signed-off-by: Artur Fierka <[email protected]>

Update hpu_worker.py (HabanaAI#943)

1ee6b61

Fix logging in multidevice scenario (currently all workers log into dir '0', with this change each worker logs to '{n}' directory)
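A minimal sketch of the per-worker directory idea, assuming each worker knows its own index `n`. The function name, the `base_dir` argument and the way the index is obtained are hypothetical; the actual change in hpu_worker.py is not shown on this page.

```python
from pathlib import Path


def worker_log_dir(base_dir: str, worker_index: int) -> Path:
    """Return a log directory unique to one worker.

    Hypothetical helper: before the fix every worker effectively used
    index 0; here each worker passes its own index instead.
    """
    if worker_index < 0:
        raise ValueError("worker_index must be non-negative")
    return Path(base_dir) / str(worker_index)


# Four workers now get four distinct directories instead of all sharing '0'.
for n in range(4):
    print(worker_log_dir("logs", n))
```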

Update requirements-hpu.txt (HabanaAI#1123)

beaeec5

Bump vllm-hpu-extension hash

[1.21 cherry-pick] Restore fsdpa calibration (HabanaAI#1087)

d285a39

Cherry-pick of HabanaAI#1086 Signed-off-by: Michal Adamczyk <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]>

Update CODEOWNERS (HabanaAI#1139)

91a143a

Michalkuligowski patch update workflows (HabanaAI#1019)

da859c0

Add in Dockerfile.hpu.ubi (HabanaAI#1118)

bd508fa

Creates the 1.21.0 version of the UBI Dockerfile for use with Red Hat OpenShift AI.

Fix the llama3.2-11b/90b accuracy drop issue. (HabanaAI#1175)

765b0c8

Fix the Llama 3.2 11b/90b accuracy issue caused by is_causal being set to False.

[SW-226779]Fix attribute not found issue (HabanaAI#1160)

d0754d6

Fix: https://jira.habana-labs.com/browse/SW-226779 Signed-off-by: Chendi Xue <[email protected]>

Update README_GAUDI.md 1.21.0 (HabanaAI#1196)

7461f4a

Reviewed Gaudi README. --------- Co-authored-by: PatrykWo <[email protected]> Co-authored-by: PatW <[email protected]>

Update links and tags for 1.21.0 release (HabanaAI#1204)

e7b5689

 Co-authored-by: Bartosz Kuncer <[email protected]>

Removed OS specification from requirements list (HabanaAI#1221)

b208380

It's the last change to readme before 1.21 release.

Final update of models 1.21. (HabanaAI#1231)

0275ce4

Final touch in the table of supported models.

wenbinc-Bin requested review from afierka-intel, kzawora-intel, madamczyk-intel, mgawarkiewicz-intel, michalkuligowski and vivekgoe as code owners May 19, 2025 06:12

wenbinc-Bin and others added 4 commits May 19, 2025 06:47

Optimize qwen2.5vl phase2

719a4ef

PR to Habana_main: HabanaAI#1109

Porting to Qwen2.5-Omni

703fd42

Qwen2.5VL/Omni: Pad W and H

71b0079

Pad W and H so that W/H don't need to be aligned to 112 Signed-off-by: Chen, Wenbin <[email protected]>
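The padding idea can be sketched as rounding each dimension up to the next multiple of 112. This is an illustration only: the helper names are hypothetical, and how the PR pads (which side, what fill value, at which stage) is not shown here.

```python
ALIGNMENT = 112  # alignment value taken from the commit message


def pad_to_multiple(size: int, multiple: int = ALIGNMENT) -> int:
    """Round size up to the nearest multiple (ceiling division)."""
    if size < 1 or multiple < 1:
        raise ValueError("size and multiple must be positive")
    return -(-size // multiple) * multiple


def padded_shape(width: int, height: int) -> tuple[int, int]:
    """Return the (W, H) an input would have after padding both sides up."""
    return pad_to_multiple(width), pad_to_multiple(height)


# A 300x224 input: the width needs padding, the height is already aligned.
print(padded_shape(300, 224))
```

With padding applied internally, callers no longer have to resize inputs so that W and H are exact multiples of 112.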

wenbinc-Bin force-pushed the qwen-omni-1.21.0 branch from 6061766 to 71b0079 Compare May 19, 2025 06:48

wenbinc-Bin added 2 commits May 19, 2025 09:19

Fix iteration bug introduced by transformers

bbfdce4

Signed-off-by: Chen, Wenbin <[email protected]>

Fix bug that multi-modal model fails on eager mode

6884a94

Signed-off-by: Chen, Wenbin <[email protected]>

czhu15 force-pushed the aice/v1.21.0 branch from 0275ce4 to 90f1ba8 Compare May 21, 2025 03:43

czhu15 requested review from jikunshang, mswiniarsk and xuechendi as code owners May 21, 2025 03:43

czhu15 force-pushed the aice/v1.21.0 branch from 90f1ba8 to 2c6ebf7 Compare May 22, 2025 02:19

wenbinc-Bin closed this May 22, 2025

19 participants
