Cherry-pick commits by apsonawane · Pull Request #1927 · microsoft/onnxruntime-genai

apsonawane · 2025-12-19T22:46:42Z

Cherry-pick #1 for 0.11.5 release

This pull request updates the logic for handling the `block_size` attribute in QMoE (Quantized Mixture of Experts) model building and quantization. The changes ensure that block-wise quantization is only used when explicitly specified, defaulting to tensor-level quantization otherwise. The most important changes are:

**Quantization logic updates:**

* In `make_qmoe_weights`, block-wise quantization is now only used if `int4_block_size` is explicitly present in `extra_options`; otherwise, tensor-level quantization is used by default. The `block_size` attribute in `moe_attrs` is set accordingly.

**Operator construction improvements:**

* In `make_qmoe_op`, the `block_size` attribute is only included in the operator's attributes if it was explicitly set in `moe_attrs`, preventing unnecessary or default values from being passed.
* The direct passing of `block_size` as a parameter to `make_node` is removed; it is now only included via `extra_kwargs` when appropriate.
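The opt-in behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `resolve_qmoe_block_size` and `qmoe_extra_kwargs` are hypothetical, and the sketch assumes that "tensor-level quantization" is signalled simply by leaving `block_size` out of `moe_attrs`.

```python
from typing import Any, Dict


def resolve_qmoe_block_size(extra_options: Dict[str, Any],
                            moe_attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of moe_attrs with block_size set only when requested.

    Hypothetical helper: block-wise quantization is opted into by the
    presence of `int4_block_size` in extra_options.
    """
    attrs = dict(moe_attrs)
    if "int4_block_size" in extra_options:
        # Block-wise quantization was explicitly requested.
        attrs["block_size"] = int(extra_options["int4_block_size"])
    else:
        # Assumption: tensor-level quantization means no block_size at all.
        attrs.pop("block_size", None)
    return attrs


def qmoe_extra_kwargs(moe_attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Build the optional keyword arguments for the QMoE node.

    block_size is forwarded only if it was explicitly set in moe_attrs.
    """
    extra_kwargs: Dict[str, Any] = {}
    if "block_size" in moe_attrs:
        extra_kwargs["block_size"] = moe_attrs["block_size"]
    return extra_kwargs
```

With this shape, a caller that never sets `int4_block_size` produces a node without any `block_size` attribute, rather than one carrying a default value.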

…on (#1900)

This PR adds support for model caching to improve load time on the second and subsequent runs. It allows `genai_config.json` to specify a `cache_dir` provider option (e.g. `openvino_cache`). When set, it instructs the OpenVINO EP to create a cache directory, placed inside the model folder unless an absolute path is specified. OpenVINO saves compiled blobs to this directory and reads them back on successive runs, which drastically reduces model load time after the first run.

This PR also moves all OpenVINO-specific provider option handling into an `OpenVINO_AppendProviderOptions` API, provided by `openvino/interface`. The changes in `model.cpp` mainly remove the OpenVINO-specific handling that was grouped with the other EPs, since that is now handled within `OpenVINO_AppendProviderOptions`.
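The path rule for `cache_dir` (relative to the model folder unless absolute) can be illustrated with a short sketch. The function name `resolve_cache_dir` is hypothetical and the real handling lives in the C++ OpenVINO interface; this only shows the resolution rule as described.

```python
from pathlib import Path


def resolve_cache_dir(model_dir: str, cache_dir: str) -> Path:
    """Resolve a `cache_dir` provider option against the model folder.

    Hypothetical helper: an absolute cache_dir is used as-is, anything
    else is treated as a subdirectory of the model folder.
    """
    path = Path(cache_dir)
    if path.is_absolute():
        return path
    return Path(model_dir) / path
```

For example, `resolve_cache_dir("/models/phi", "openvino_cache")` yields a directory inside the model folder, while an absolute `cache_dir` is returned unchanged.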

Update extensions commit to latest

There was an issue with CPU and CUDA export for gpt-oss.

This pull request updates the handling of the Olive quantization format in `quantized_model.py` to match the latest specification and improve code clarity. The main changes include correcting how in/out features are computed for Olive quantized layers, documenting the Olive format, and updating repacking logic for compatibility with ONNX Runtime (ORT).

**Olive quantization format support and documentation:**

* Updated computation of `in_features` and `out_features` for Olive quantized layers to match the new format, which packs weights along the last dimension (`qweight` is now `(out_features, packed_in_features)`), and adjusted all relevant projections in self-attention and MLP modules.
* Added a docstring to the `OliveModel` class explaining the Olive quantization format for weights, scales, and zero points.

**Repacking and compatibility improvements:**

* Implemented a new `repack` method for Olive quantized modules to reshape tensors for ONNX Runtime (ORT) compatibility, including reshaping `qweight`, flattening `scales`, and flattening `qzeros`.
* Added placeholder methods `handle_qzeros` and `unpack` for Olive format to clarify that no offset or unpacking is required.
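To make the `(out_features, packed_in_features)` layout concrete, here is a minimal sketch of recovering `in_features` and `out_features` from a packed `qweight` shape. The function name and the `bits` and `container_bits` parameters are assumptions for illustration (for instance, 4-bit values stored two per 8-bit element); the actual packing width used by Olive is not stated in this PR description.

```python
from typing import Tuple


def olive_linear_features(qweight_shape: Tuple[int, int],
                          bits: int = 4,
                          container_bits: int = 8) -> Tuple[int, int]:
    """Return (in_features, out_features) for a packed Olive qweight.

    Assumes qweight has shape (out_features, packed_in_features) and that
    each stored element holds container_bits // bits quantized values
    along the last dimension. Both defaults are illustrative assumptions.
    """
    if container_bits % bits != 0:
        raise ValueError("container_bits must be a multiple of bits")
    out_features, packed_in_features = qweight_shape
    values_per_element = container_bits // bits
    in_features = packed_in_features * values_per_element
    return in_features, out_features
```

Under these assumptions, a `qweight` of shape `(4096, 1536)` corresponds to a linear layer with 3072 input features and 4096 output features.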

Add CUDA and CPU architecture support for the Qwen-2.5-VL and Fara-7B models. Validated that the NPU model also works with this change.

…1921)

Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX

This change disables CUDA graph for the following models when targeting TRT-RTX:

- Phi3MiniLongRoPEModel
- Phi3SmallLongRoPEModel
- Phi3MoELongRoPEModel
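A rule like this can be expressed as a simple lookup. The sketch below is illustrative only: the function name `allow_cuda_graph` and the provider string `"trt-rtx"` are hypothetical placeholders, not identifiers taken from the repository; only the three model class names come from the change description.

```python
# Model classes listed in the change description as containing IF nodes.
LONGROPE_MODELS_WITH_IF_NODES = frozenset({
    "Phi3MiniLongRoPEModel",
    "Phi3SmallLongRoPEModel",
    "Phi3MoELongRoPEModel",
})


def allow_cuda_graph(model_class_name: str, execution_provider: str) -> bool:
    """Return False when CUDA graph should be disabled for this model.

    Hypothetical helper: "trt-rtx" stands in for however the builder
    identifies the TRT-RTX target.
    """
    if (execution_provider == "trt-rtx"
            and model_class_name in LONGROPE_MODELS_WITH_IF_NODES):
        return False
    return True
```

Keeping the affected classes in one set makes the exception easy to audit and to extend if other models with IF nodes turn out to need the same treatment.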

apsonawane and others added 7 commits December 19, 2025 22:42

Update extensions commit (#1914)

d210131


Fix gpt-oss export (#1915)

3bf7e2b


Add support for CUDA and CPU arch for Qwen-2.5-VL and Fara-7B (#1919)

04543f0


kunal-vaishnavi closed this Jan 22, 2026

kunal-vaishnavi deleted the asonawane/rel-0.11.5 branch January 22, 2026 11:16

apsonawane wants to merge 7 commits into rel-0.11.5 from asonawane/rel-0.11.5

5 participants
