Cherry-pick commits #1927
Closed
apsonawane wants to merge 7 commits into
This pull request updates the logic for handling the `block_size` attribute in QMoE (Quantized Mixture of Experts) model building and quantization. The changes ensure that block-wise quantization is only used when explicitly specified, defaulting to tensor-level quantization otherwise. The most important changes are:

**Quantization logic updates:**

* In `make_qmoe_weights`, block-wise quantization is now only used if `int4_block_size` is explicitly present in `extra_options`; otherwise, tensor-level quantization is used by default. The `block_size` attribute in `moe_attrs` is set accordingly.

**Operator construction improvements:**

* In `make_qmoe_op`, the `block_size` attribute is only included in the operator's attributes if it was explicitly set in `moe_attrs`, preventing unnecessary or default values from being passed.
* The direct passing of `block_size` as a parameter to `make_node` is removed; it is now only included via `extra_kwargs` when appropriate.
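The decision described above can be sketched roughly as follows. This is a minimal illustration, not the builder's actual code: the helper names `resolve_qmoe_block_size` and `qmoe_node_kwargs` are hypothetical, and it assumes that "not explicitly set" is represented by the `block_size` key being absent from `moe_attrs`.

```python
def resolve_qmoe_block_size(extra_options: dict, moe_attrs: dict) -> None:
    """Record block_size only when int4_block_size was given explicitly (hypothetical helper)."""
    if "int4_block_size" in extra_options:
        # Block-wise quantization was explicitly requested.
        moe_attrs["block_size"] = int(extra_options["int4_block_size"])
    else:
        # Assumption: tensor-level quantization is represented by an absent key.
        moe_attrs.pop("block_size", None)


def qmoe_node_kwargs(moe_attrs: dict) -> dict:
    """Build the optional attributes that would be forwarded to the node (hypothetical helper)."""
    extra_kwargs = {}
    if "block_size" in moe_attrs:
        # Only forward block_size when it was explicitly set.
        extra_kwargs["block_size"] = moe_attrs["block_size"]
    return extra_kwargs
```

With this shape, a build that never mentions `int4_block_size` produces no `block_size` attribute at all, rather than a default value.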
…on (#1900)

This PR adds support for model caching to improve load time on the second and later runs. It allows `genai_config.json` to specify a `cache_dir` provider option (e.g. `openvino_cache`). When set, it instructs the OpenVINO EP to create a cache directory, placed within the model folder unless an absolute path is specified. OpenVINO saves compiled blobs to this directory and reads them back on successive runs, which drastically reduces model load time after the first run.

This PR also moves all OpenVINO-specific provider option handling into an `OpenVINO_AppendProviderOptions` API, provided by `openvino/interface`. The changes in `model.cpp` mainly remove the OpenVINO-specific handling that was grouped with the other EPs, since that is now handled within `OpenVINO_AppendProviderOptions`.
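The path rule for `cache_dir` (relative values land inside the model folder, absolute values are used as given) can be illustrated with a short sketch. The function name `resolve_cache_dir` is hypothetical and the real handling lives in the C++ `OpenVINO_AppendProviderOptions` code; this only mirrors the behavior described in the text.

```python
from pathlib import Path


def resolve_cache_dir(model_dir: str, cache_dir: str) -> Path:
    """Return the directory where compiled blobs would be cached (hypothetical helper)."""
    path = Path(cache_dir)
    if path.is_absolute():
        # An absolute path is used as-is.
        return path
    # A relative path is placed inside the model folder.
    return Path(model_dir) / path
```

For example, a `cache_dir` of `openvino_cache` for a model stored in `my_model` would resolve to `my_model/openvino_cache` under this rule.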
Update extensions commit to latest
There was an issue with CPU and CUDA export for gpt-oss
This pull request updates the handling of the Olive quantization format in `quantized_model.py` to match the latest specification and improve code clarity. The main changes include correcting how in/out features are computed for Olive quantized layers, documenting the Olive format, and updating repacking logic for compatibility with ONNX Runtime (ORT).

**Olive quantization format support and documentation:**

* Updated computation of `in_features` and `out_features` for Olive quantized layers to match the new format, which packs weights along the last dimension (`qweight` is now `(out_features, packed_in_features)`), and adjusted all relevant projections in self-attention and MLP modules. [[1]](diffhunk://#diff-8c2caf775960974ce923934b24e069fae5b819a0fa972976363ab8689f996c23L557-R560) [[2]](diffhunk://#diff-8c2caf775960974ce923934b24e069fae5b819a0fa972976363ab8689f996c23L658-R684)
* Added a docstring to the `OliveModel` class explaining the Olive quantization format for weights, scales, and zero points.

**Repacking and compatibility improvements:**

* Implemented a new `repack` method for Olive quantized modules to reshape tensors for ONNX Runtime (ORT) compatibility, including reshaping `qweight`, flattening `scales`, and flattening `qzeros`.
* Added placeholder methods `handle_qzeros` and `unpack` for Olive format to clarify that no offset or unpacking is required.
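As a rough illustration of how feature sizes follow from a `qweight` of shape `(out_features, packed_in_features)`, the sketch below recovers both values from the shape. The function name `olive_features` is hypothetical, and the 8-bit storage element is an assumption made for the example; the source does not state the packing width.

```python
def olive_features(qweight_shape: tuple, bits: int, storage_bits: int = 8) -> tuple:
    """Return (in_features, out_features) for a last-dimension-packed qweight (hypothetical helper)."""
    out_features, packed_in_features = qweight_shape
    # Assumption: each storage element holds storage_bits // bits quantized values.
    values_per_element = storage_bits // bits
    in_features = packed_in_features * values_per_element
    return in_features, out_features
```

Under these assumptions, a 4-bit `qweight` of shape `(4096, 2048)` corresponds to a layer with 4096 input features and 4096 output features.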
Add CUDA and CPU architecture support for the Qwen-2.5-VL and Fara-7B models. Validated that the NPU model also works with this change.
…1921)

Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX

This change disables CUDA graph for the following models when targeting TRT-RTX:

- Phi3MiniLongRoPEModel
- Phi3SmallLongRoPEModel
- Phi3MoELongRoPEModel
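One way to picture this rule is a lookup keyed on the model class name. This is only a sketch: the function `cuda_graph_enabled`, the constant name, and the `"trt-rtx"` provider string are hypothetical and are not taken from the actual change.

```python
# Model classes listed in the change description above.
TRT_RTX_NO_CUDA_GRAPH = {
    "Phi3MiniLongRoPEModel",
    "Phi3SmallLongRoPEModel",
    "Phi3MoELongRoPEModel",
}


def cuda_graph_enabled(model_class: str, execution_provider: str, requested: bool = True) -> bool:
    """Decide whether CUDA graph stays enabled (hypothetical helper)."""
    # Assumption: "trt-rtx" is the identifier used for the TRT-RTX target.
    if execution_provider == "trt-rtx" and model_class in TRT_RTX_NO_CUDA_GRAPH:
        return False
    return requested
```

Other models, and the same models on other execution providers, keep whatever setting was requested.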
Cherry-pick #1 for 0.11.5 release