src/python/py/models/README.md (13 additions, 0 deletions)

@@ -20,6 +20,7 @@ This folder contains the model builder for quickly creating optimized and quanti
 - [Exclude Language Modeling Head](#exclude-language-modeling-head)
 - [Include Last Hidden States Output](#include-last-hidden-states-output)
 - [Enable Shared Embeddings](#enable-shared-embeddings)
+- [Disable QKV Projections Fusion](#disable-qkv-projections-fusion)
 - [Enable CUDA Graph](#enable-cuda-graph)
 - [Use 8 Bits Quantization in QMoE](#use-8-bits-quantization-in-qmoe)
 - [Use QDQ Pattern for Quantization](#use-qdq-pattern-for-quantization)
@@ -253,6 +254,18 @@ python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_fold
 python3 builder.py -m model_name -o path_to_output_folder -p fp16 -e cuda --extra_options shared_embeddings=true
 ```
 
+#### Disable QKV Projections Fusion
+
+This scenario is for when you want to keep the Q/K/V projections in the attention layer separate instead of fusing them into a single packed MatMul operation.
+
+```
+# From wheel:
+python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
+
+# From source:
+python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
+```
+
 #### Enable CUDA Graph
 
 This scenario is for when you want to enable CUDA graph for your ONNX model.
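To make the README addition concrete, here is a minimal NumPy sketch (an illustration, not part of the PR) of what fusing the Q/K/V projections into one packed MatMul means. It assumes equal Q/K/V widths for simplicity; GQA models may use narrower K/V projections.

```python
import numpy as np

hidden = 8
x = np.random.rand(2, hidden).astype(np.float32)  # [batch, hidden]
w_q, w_k, w_v = (np.random.rand(hidden, hidden).astype(np.float32) for _ in range(3))

# Separate projections: three MatMuls (what disable_qkv_fusion=true keeps)
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused projection: one packed MatMul against the concatenated weight
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)  # [hidden, 3 * hidden]
qkv = x @ w_qkv
q_f, k_f, v_f = np.split(qkv, 3, axis=1)

# Slicing the packed output reproduces the separate projections
assert np.allclose(q, q_f) and np.allclose(k, k_f) and np.allclose(v, v_f)
```

With `disable_qkv_fusion=true`, the builder emits the three separate MatMuls instead of the single packed one.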
src/python/py/models/builder.py (1 addition, 0 deletions)

@@ -61,6 +61,7 @@ def check_extra_options(kv_pairs, execution_provider):
"use_cuda_bf16",
"shared_embeddings",
"hf_remote",
"disable_qkv_fusion",
]
for key in bools:
if key in kv_pairs:
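For context on why the new key is registered in a `bools` list: values passed via `--extra_options` arrive as strings, so `"true"`/`"false"` must be coerced to real booleans before the builder can branch on them. The sketch below is hypothetical (`coerce_bools` is not the repository's function, and the actual loop body is not shown in this hunk), but it illustrates the kind of handling a check like this typically performs.

```python
def coerce_bools(kv_pairs, bools):
    """Coerce registered boolean extra_options from strings to bools in place."""
    for key in bools:
        if key in kv_pairs:
            value = str(kv_pairs[key]).lower()
            if value not in ("true", "false"):
                raise ValueError(f"{key} must be 'true' or 'false', got {kv_pairs[key]!r}")
            kv_pairs[key] = value == "true"  # replace the string with a real bool

opts = {"disable_qkv_fusion": "true"}
coerce_bools(opts, ["disable_qkv_fusion"])
assert opts["disable_qkv_fusion"] is True
```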
src/python/py/models/builders/base.py (3 additions, 0 deletions)

@@ -490,11 +490,14 @@ def make_attention_init(self):

         # Some EPs don't support packed Q/K/V for GQA yet
         # Packed MatMul with LoRA/QLoRA is not currently supported
+        # use_packed_matmul can be overridden by the upstream quantization choice
+        # (e.g., when q_proj, k_proj, v_proj have different quantization settings)
         self.attention_attrs["use_packed_matmul"] = (
             self.ep not in ["dml"]
             and not self.matmul_attrs["use_lora"]
             and not self.attention_attrs["q_norm"]
             and not self.attention_attrs["k_norm"]
+            and not self.extra_options.get("disable_qkv_fusion", False)
         )
 
         # Some EPs don't support fusing rotary embeddings inside GQA yet
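Restated as a standalone function (an illustration, not the PR's code), the packing decision above reads as four vetoes, with the new flag as the last one:

```python
def use_packed_matmul(ep, use_lora, q_norm, k_norm, extra_options):
    """True only when nothing rules out packing Q/K/V into one MatMul."""
    return (
        ep not in ["dml"]              # some EPs (here DML) don't support packed Q/K/V for GQA yet
        and not use_lora               # packed MatMul with LoRA/QLoRA is unsupported
        and not q_norm and not k_norm  # models with Q/K norms also opt out (per the diff above)
        and not extra_options.get("disable_qkv_fusion", False)  # new user opt-out
    )

# The new flag vetoes fusion even when every other condition allows it
assert use_packed_matmul("cuda", False, False, False, {}) is True
assert use_packed_matmul("cuda", False, False, False, {"disable_qkv_fusion": True}) is False
```

Routing the user flag through the same expression as the EP and LoRA checks keeps a single source of truth for the packing decision, so downstream attention construction only ever consults `attention_attrs["use_packed_matmul"]`.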