Cherry-pick commits #1927

Closed
apsonawane wants to merge 7 commits into rel-0.11.5 from asonawane/rel-0.11.5

@apsonawane
Contributor

Cherry-pick #1 for 0.11.5 release

apsonawane and others added 7 commits December 19, 2025 22:42
This pull request updates the logic for handling the `block_size`
attribute in QMoE (Quantized Mixture of Experts) model building and
quantization. The changes ensure that block-wise quantization is only
used when explicitly specified, defaulting to tensor-level quantization
otherwise. The most important changes are:

**Quantization logic updates:**

* In `make_qmoe_weights`, block-wise quantization is now only used if
`int4_block_size` is explicitly present in `extra_options`; otherwise,
tensor-level quantization is used by default. The `block_size` attribute
in `moe_attrs` is set accordingly.

**Operator construction improvements:**

* In `make_qmoe_op`, the `block_size` attribute is only included in the
operator's attributes if it was explicitly set in `moe_attrs`,
preventing unnecessary or default values from being passed.
* The direct passing of `block_size` as a parameter to `make_node` is
removed; it is now only included via `extra_kwargs` when appropriate.
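The rule described above can be sketched as follows. This is only an illustration, not the code from the PR: the helper names `build_moe_attrs` and `qmoe_extra_kwargs` are hypothetical, and the sketch assumes that tensor-level quantization is represented by leaving `block_size` out of `moe_attrs` entirely.

```python
def build_moe_attrs(extra_options):
    """Hypothetical helper: derive QMoE attributes from user options.

    Assumption: tensor-level quantization is modelled by omitting
    "block_size" rather than by storing a sentinel value.
    """
    moe_attrs = {}
    if "int4_block_size" in extra_options:
        # Block-wise quantization only when the user asked for it explicitly.
        moe_attrs["block_size"] = int(extra_options["int4_block_size"])
    return moe_attrs


def qmoe_extra_kwargs(moe_attrs):
    """Hypothetical helper: collect optional attributes for the QMoE node."""
    extra_kwargs = {}
    if "block_size" in moe_attrs:
        # Forward block_size only if it was explicitly set upstream.
        extra_kwargs["block_size"] = moe_attrs["block_size"]
    return extra_kwargs


if __name__ == "__main__":
    print(qmoe_extra_kwargs(build_moe_attrs({"int4_block_size": "32"})))
    print(qmoe_extra_kwargs(build_moe_attrs({})))
```

Keeping the attribute optional means the operator never receives a default `block_size` that the user did not request.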
…on (#1900)

This PR adds support for model caching to improve load time on the
second and later runs.

It allows `genai_config.json` to specify a `cache_dir` provider option
(e.g. `openvino_cache`). When set, it instructs the OpenVINO EP to
create a cache directory. A relative path is created inside the model
folder; an absolute path is used as given. OpenVINO saves compiled blobs
to this directory and reads them back on successive runs, which
drastically reduces model load time after the first run.
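As a rough illustration of the path rule described above (relative paths land inside the model folder, absolute paths are used unchanged), here is a minimal sketch. The function name `resolve_cache_dir` is hypothetical and is not part of the OpenVINO EP or of `OpenVINO_AppendProviderOptions`.

```python
from pathlib import Path


def resolve_cache_dir(model_dir, cache_dir):
    """Hypothetical helper mirroring the described cache_dir behaviour."""
    cache_path = Path(cache_dir)
    if cache_path.is_absolute():
        # An absolute path is taken as given.
        return cache_path
    # A relative path is placed inside the model folder.
    return Path(model_dir) / cache_path


if __name__ == "__main__":
    print(resolve_cache_dir("/models/phi", "openvino_cache"))
    print(resolve_cache_dir("/models/phi", "/var/cache/ov"))
```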

This PR moves all OpenVINO-specific provider option handling into a
`OpenVINO_AppendProviderOptions` API, provided by `openvino/interface`.
The changes in `model.cpp` are mainly to remove all OpenVINO-specific
handling that was grouped with that of the other EPs, as that's now
handled within `OpenVINO_AppendProviderOptions`.
Update extensions commit to latest
There was an issue with CPU and CUDA export for gpt-oss
This pull request updates the handling of the Olive quantization format
in `quantized_model.py` to match the latest specification and improve
code clarity. The main changes include correcting how in/out features
are computed for Olive quantized layers, documenting the Olive format,
and updating repacking logic for compatibility with ONNX Runtime (ORT).

**Olive quantization format support and documentation:**

* Updated computation of `in_features` and `out_features` for Olive
quantized layers to match the new format, which packs weights along the
last dimension (`qweight` is now `(out_features, packed_in_features)`),
and adjusted all relevant projections in self-attention and MLP modules.
* Added a docstring to the `OliveModel` class explaining the Olive
quantization format for weights, scales, and zero points.
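To make the shape arithmetic concrete, the sketch below recovers `in_features` and `out_features` from a `qweight` shape of `(out_features, packed_in_features)`. The function name is hypothetical, and the sketch assumes that quantized values are packed into 8-bit storage elements; the PR text does not state the container width.

```python
def olive_linear_features(qweight_shape, bits=4, container_bits=8):
    """Hypothetical helper: return (in_features, out_features).

    Assumption: each storage element of `container_bits` bits holds
    container_bits // bits quantized values along the last dimension.
    """
    out_features, packed_in_features = qweight_shape
    values_per_element = container_bits // bits
    in_features = packed_in_features * values_per_element
    return in_features, out_features


if __name__ == "__main__":
    # A 4-bit layer stored as (4096, 1536) unpacks to 3072 input features.
    print(olive_linear_features((4096, 1536)))
```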

**Repacking and compatibility improvements:**

* Implemented a new `repack` method for Olive quantized modules to
reshape tensors for ONNX Runtime (ORT) compatibility, including
reshaping `qweight`, flattening `scales`, and flattening `qzeros`.
* Added placeholder methods `handle_qzeros` and `unpack` for Olive
format to clarify that no offset or unpacking is required.
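The reshaping steps listed above might look roughly like the following. This is a sketch under stated assumptions, not the `repack` method from the PR: it assumes uint8 packing, a known `block_size`, and a target weight layout of `(out_features, n_blocks, blob_size)` as used by ONNX Runtime's `MatMulNBits` operator. The function name `repack_for_ort` is hypothetical.

```python
import numpy as np


def repack_for_ort(qweight, scales, qzeros, bits=4, block_size=32):
    """Hypothetical sketch: reshape Olive tensors into an ORT-style layout."""
    out_features, packed_in_features = qweight.shape
    # Bytes needed to store one block of quantized values (assumes uint8).
    blob_size = block_size * bits // 8
    if packed_in_features % blob_size != 0:
        raise ValueError("packed_in_features must be a multiple of blob_size")
    n_blocks = packed_in_features // blob_size
    return {
        # Split the packed input dimension into per-block blobs.
        "qweight": qweight.reshape(out_features, n_blocks, blob_size),
        # Scales and zero points become flat 1-D tensors.
        "scales": scales.reshape(-1),
        "qzeros": qzeros.reshape(-1),
    }


if __name__ == "__main__":
    out = repack_for_ort(
        np.zeros((4, 32), dtype=np.uint8),
        np.ones((4, 2), dtype=np.float32),
        np.zeros((4, 1), dtype=np.uint8),
    )
    print({name: tensor.shape for name, tensor in out.items()})
```

Only views are reshaped here; no values are offset or unpacked, which matches the note that `handle_qzeros` and `unpack` are no-ops for this format.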
Add CUDA and CPU architecture support for the Qwen-2.5-VL and Fara-7B models
Validated that the NPU model also works with this change
…1921)

Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX

This change disables CUDA graph for the following
models when targeting TRT-RTX:
- Phi3MiniLongRoPEModel
- Phi3SmallLongRoPEModel
- Phi3MoELongRoPEModel
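The gating described above can be sketched as a simple lookup. The function name and the provider string `"NvTensorRtRtx"` are assumptions made for illustration; only the three model class names come from the commit message.

```python
# Model classes listed in the commit message.
LONGROPE_MODELS_WITH_IF_NODES = frozenset({
    "Phi3MiniLongRoPEModel",
    "Phi3SmallLongRoPEModel",
    "Phi3MoELongRoPEModel",
})

# Assumption: identifier used for the TRT-RTX execution provider.
TRT_RTX_PROVIDER = "NvTensorRtRtx"


def cuda_graph_enabled(model_class_name, provider, requested=True):
    """Hypothetical helper: decide whether CUDA graph stays enabled."""
    if provider == TRT_RTX_PROVIDER and model_class_name in LONGROPE_MODELS_WITH_IF_NODES:
        # CUDA graph is disabled for these models on TRT-RTX.
        return False
    return requested


if __name__ == "__main__":
    print(cuda_graph_enabled("Phi3MiniLongRoPEModel", "NvTensorRtRtx"))
    print(cuda_graph_enabled("Phi3MiniLongRoPEModel", "cuda"))
```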
@kunal-vaishnavi kunal-vaishnavi deleted the asonawane/rel-0.11.5 branch January 22, 2026 11:16
5 participants