forked from vllm-project/vllm
Pull upstream changes from Apr 16 to July 22 and fix resulting issues + upgrade torch requirement to 2.7.0 #172
Merged
Conversation
Commits included in this pull:

- Signed-off-by: ilmarkov <[email protected]> Co-authored-by: ilmarkov <[email protected]>
- Signed-off-by: Trevor Morris <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]>
- Signed-off-by: Jee Jee Li <[email protected]>
- …break not-cuda-alike devices (vllm-project#20822) Signed-off-by: jiang1.li <[email protected]>
- …llm-project#20682) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
- …#20541) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]>
- …models (vllm-project#20637) Signed-off-by: NickLucche <[email protected]>
- …ct#20854) Signed-off-by: Isotr0py <[email protected]>
- …llm-project#20834) Signed-off-by: Linkun <[email protected]>
- Signed-off-by: yewentao256 <[email protected]>
- Signed-off-by: rzou <[email protected]>
- …m-project#20790) Signed-off-by: Boyuan Feng <[email protected]>
- Signed-off-by: Max de Bayser <[email protected]>
- Signed-off-by: Zhiyu Cheng <[email protected]>
- …oject#20694" (vllm-project#20853) Signed-off-by: mgoin <[email protected]>
- …llm-project#20702) Signed-off-by: Congcong Chen <[email protected]>
- …vllm-project#20843) Signed-off-by: Alex-Brooks <[email protected]>
- Signed-off-by: reidliu41 <[email protected]>
- Signed-off-by: mgoin <[email protected]>
- …vllm-project#20841) Signed-off-by: yewentao256 <[email protected]>
- Signed-off-by: ElizaWszola <[email protected]>
- Signed-off-by: yewentao256 <[email protected]>
- Signed-off-by: NickLucche <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
- Signed-off-by: Qiliang Cui <[email protected]>
- Signed-off-by: thechaos16 <[email protected]>
- …20857) Signed-off-by: Wang Siyuan <[email protected]> Signed-off-by: Wang Siyuan <[email protected]>
- Signed-off-by: liuchenlong <[email protected]> Co-authored-by: liuchenlong <[email protected]>
- …hat()` with `model_impl=transformers` (vllm-project#21353) Signed-off-by: ariG23498 <[email protected]>
- …1375) Signed-off-by: DarkLight1337 <[email protected]>
- …e readme to mention required max_model_len for llama8b-n150 Signed-off-by: Salar <[email protected]>
- …te torch requirement to 2.7 Signed-off-by: Salar <[email protected]>
- …ll_upstream_july22 Signed-off-by: Salar <[email protected]>
- …ll_upstream_july22_2 Signed-off-by: Salar <[email protected]>
- Signed-off-by: Salar <[email protected]>
- Signed-off-by: Salar <[email protected]>
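One of the commits above bumps the torch requirement to 2.7. A constraint like that can also be verified at startup; below is a minimal standard-library sketch (the helper name and the comparison logic are illustrative, not part of the PR, and a real project would normally use `packaging.version` instead):

```python
def meets_min_version(installed: str, required: str = "2.7.0") -> bool:
    """Compare dotted version strings numerically, ignoring local suffixes
    like '+cu121' (illustrative helper, not vLLM code)."""
    def parse(v: str) -> tuple:
        # Drop any local-version suffix, then compare components as integers
        # so that "2.10.0" correctly sorts above "2.7.0".
        return tuple(int(part) for part in v.split("+")[0].split("."))
    return parse(installed) >= parse(required)
```

String comparison alone would get this wrong ("2.10.0" < "2.7.0" lexically), which is why the components are compared as integers.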
ppetrovicTT reviewed, Sep 12, 2025
Signed-off-by: Salar <[email protected]>
ppetrovicTT approved these changes, Sep 12, 2025
ppetrovicTT left a comment:
Yep, it's hard to review. I trust you :)
This reverts commit 8b7f1d3.
github-merge-queue bot pushed a commit to tenstorrent/tt-metal that referenced this pull request, Sep 15, 2025:
…uly22 upstream changes - removed legacy input processors and refactored for multi-modal models (#28406)

### Ticket
[N/A](#27285)

### Problem description
Legacy input mappers/processors were removed from vLLM V0 (vllm-project/vllm#15686, vllm-project/vllm#10114). These changes are required to maintain compatibility of existing integrated models after pulling upstream changes in tenstorrent/vllm#172.

### What's changed
- Removed legacy vLLM input processors from Llama3, Gemma3, and Qwen2.5-VL
- Defined new multi-modal input processor classes for Llama3.2-11B-Vision (`MllamaMultiModalProcessor`) and Gemma3 / Qwen2.5-VL (`MultiModalProcessor`), and added support for multi-modal limits for each
- Moved the max seq len assertion for Llama8B to model initialization; `--max_model_len` must be set on the vLLM side for any model that supports less than the default max context length
- Fixed a bug where the `create_multimodal_model` import was removed for Llama3.2-11B-Vision and broke the model (from 87b758d)

### Checklist
- [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI with demo tests passes (if applicable)
- [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable)
- [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable)
- [x] (For models and ops writers) [Single-card demo tests](https://github.com/tenstorrent/tt-metal/actions/workflows/single-card-demo-tests.yaml) CI passes (if applicable). See [recommended dev flow](https://github.com/tenstorrent/tt-metal/blob/main/models/docs/MODEL_ADD.md#a-recommended-dev-flow-on-github-for-adding-new-models).
- [x] [Galaxy quick](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-quick.yaml) CI passes (if applicable)
- [x] [Galaxy demo tests, for Llama](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml) CI passes (if applicable, because of current Llama work)
- [x] (For runtime and ops writers) [T3000 unit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-unit-tests.yaml) CI passes (if applicable, since this is run on push to main)
- [x] (For models and ops writers) [T3000 demo tests](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml) CI passes (if applicable, since this is required for release)
- [x] New/Existing tests provide coverage for changes

vLLM nightly tests - https://github.com/tenstorrent/tt-metal/actions/runs/17680447236

Signed-off-by: Salar <[email protected]>
Co-authored-by: Igor Djuric <[email protected]>
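The point above about moving the max seq len assertion to model initialization can be sketched as a fail-fast guard. This is a hypothetical illustration, not the actual TT backend code; the function name and the device limit constant are assumptions:

```python
# Hypothetical sketch: validate vLLM's --max_model_len at model init time
# rather than later at request time, so misconfiguration fails fast.

ASSUMED_DEVICE_MAX_CONTEXT = 65536  # illustrative limit, not the real llama8b-n150 value


def check_max_model_len(max_model_len, supported_max: int = ASSUMED_DEVICE_MAX_CONTEXT) -> int:
    """Raise at initialization if max_model_len is unset or exceeds the device limit."""
    if max_model_len is None or max_model_len > supported_max:
        raise ValueError(
            f"--max_model_len must be set to at most {supported_max} "
            f"for this model; got {max_model_len}"
        )
    return max_model_len
```

Performing this check once at initialization, instead of asserting per request, surfaces an unsupported `--max_model_len` immediately with an actionable error message.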
This was referenced Sep 15, 2025
The same commit was subsequently pushed, referencing this pull request, by dimitri-tenstorrent and DorsaRoh (tenstorrent/tt-metal, Sep 15, 2025), subinleeTT (Sep 17, 2025), ign-febin (ign-saurav/tt-metal, Sep 22, 2025), yugi957 (Sep 23, 2025), and mdjuricTT and ricozhu-TT (Sep 26, 2025).
vLLM nightly tests - https://github.com/tenstorrent/tt-metal/actions/runs/17680447236
FYI @ppetrovicTT, @rdraskicTT: added you as optional reviewers (I realize this is hard to review; the main changes to the TT backend are those mentioned above).