Missing updates for Llama4 on main #940
Conversation
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
Pull request overview
This PR ports Llama4-specific fixes from PRs #881, #862, and #884 to the main branch, focusing on improvements to attention scaling and chunked attention layer handling.
Changes:
- Updated the `_get_attn_scale_for_hpu` implementation to remove the closure dependency and match the actual attention scale calculation (sketched below)
- Refactored chunked attention layer detection into a standalone function and changed the signature of `apply_model_specific_patches` to accept `model_runner` instead of `model`
- Consolidated model-specific patches by removing the duplicate `maybe_set_chunked_attention_layers` method from the class
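For illustration, here is a minimal sketch of what removing a closure dependency from the attention scale helper might look like. The names (`make_attn_scale_fn`, `get_attn_scale_for_hpu`) and the temperature-tuning formula are assumptions for this example, not the PR's actual code:

```python
import math


def make_attn_scale_fn(floor_scale: float, attn_scale: float):
    """Before: a factory returning a closure that captures its config."""

    def _get_attn_scale(position: int) -> float:
        # The captured floor_scale/attn_scale live in the closure,
        # which makes the helper harder to test and patch.
        return math.log(math.floor((position + 1) / floor_scale) + 1) * attn_scale + 1.0

    return _get_attn_scale


def get_attn_scale_for_hpu(position: int, floor_scale: float,
                           attn_scale: float) -> float:
    """After: a standalone function with all inputs passed explicitly."""
    return math.log(math.floor((position + 1) / floor_scale) + 1) * attn_scale + 1.0
```

Passing the scale parameters explicitly keeps the HPU path's calculation in lockstep with the model's own formula and avoids stale captured state.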
```python
# add explicit warning
pass
```
The comment 'add explicit warning' suggests that an exception handler should log a warning, but the current implementation silently ignores exceptions. Consider adding a proper warning message using a logger to help with debugging when chunked attention setup fails.
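A minimal sketch of what such a handler could look like, using the standard library logger; the `_set_chunked_attention_layers` helper name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def maybe_set_chunked_attention_layers(model_runner) -> None:
    """Best-effort chunked attention setup that logs instead of swallowing errors."""
    try:
        _set_chunked_attention_layers(model_runner)  # hypothetical helper
    except Exception as exc:
        # Surface the failure so a broken chunked-attention setup is
        # visible during debugging, then fall back to the default path.
        logger.warning(
            "Chunked attention layer setup failed, continuing without it: %s",
            exc)
```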
Signed-off-by: Luca Calabria <luca.calabria@intel.com>
✅ CI Passed. All checks passed successfully against the following vllm commit:
Added missing Llama4 fixes from #881, #862, and #884 on the main branch.