
Fix QMoE blockwise quantization support for TRT-RTX execution provider #1926

Merged
kunal-vaishnavi merged 10 commits into microsoft:main from anujj:gpt_oss_trt_rtx on Jan 22, 2026

@anujj
Contributor

@anujj anujj commented Dec 19, 2025

Add QMoE and BF16 support for TRT-RTX execution provider

  • Enable blockwise quantization for TRT-RTX/NvTensorRtRtx EPs
  • Add gpt_oss_swiglu_fusion option for separate gate/up weights
  • Add int4_qdq_block_size for MatMul quantization block size
  • Add BF16 precision support for TRT-RTX
  • Keep padding in QMoE weights for proper alignment
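For readers unfamiliar with the term, blockwise quantization stores one scale per fixed-size block of weights instead of one per row or tensor. The sketch below is a minimal pure-Python illustration of symmetric int4 blockwise quantization with the padding kept in the output. It is not the code from this PR: the function names, the symmetric scheme, and the zero-padding strategy are all assumptions made for illustration.

```python
def quantize_blockwise_int4(row, block_size):
    """Symmetric per-block int4 quantization of one weight row (illustrative only)."""
    # Pad with zeros so the length is a multiple of block_size.
    # The padding is kept in the output rather than trimmed afterwards.
    pad = (-len(row)) % block_size
    padded = list(row) + [0.0] * pad

    quantized, scales = [], []
    for start in range(0, len(padded), block_size):
        block = padded[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # One scale per block; the largest magnitude in the block maps to 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed int4 range [-8, 7].
        quantized.extend(max(-8, min(7, round(v / scale))) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size):
    """Recover approximate float values using the scale of each element's block."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

A smaller block size gives more scales and usually lower quantization error, at the cost of more metadata per weight.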

@anujj anujj marked this pull request as draft December 19, 2025 13:35
@anujj
Contributor Author

anujj commented Jan 6, 2026

@kunal-vaishnavi @baijumeswani for review

@anujj anujj marked this pull request as ready for review January 6, 2026 08:37
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py
Comment thread src/python/py/models/builders/gptoss.py Outdated
Comment thread src/python/py/models/builder.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
@anskumar01

#1861 broke the model builder for the TRT-RTX EP in cases where we use int4_block_size in an Olive recipe. We need the fix for that.

- Remove bfloat16 scale conversion workaround (ORT 1.24 supports natively)

- Fix zero_points: skip for TRT-RTX, always include for other EPs

- Remove NvTensorRtRtx from internal EP checks (use 'trt-rtx' only)

- Simplify make_qmoe_weights() to use int4_qmoe_block_size for all supported EPs (trt-rtx defaults to 128, cpu/webgpu default to 0)
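The zero_points rule and the 'trt-rtx'-only check from this commit message can be sketched as follows. This is a hedged illustration, not the builder's actual code: `EP_ALIASES`, `normalize_ep`, and `qmoe_quant_inputs` are hypothetical names, and normalizing the `NvTensorRtRtx` spelling up front is an assumption about how the internal checks can get away with testing 'trt-rtx' only.

```python
# Hypothetical alias table: the commit says internal checks use 'trt-rtx' only,
# so other spellings are assumed to be normalized once, before any check runs.
EP_ALIASES = {"NvTensorRtRtx": "trt-rtx"}


def normalize_ep(ep):
    """Map a known alias to its canonical EP name; leave other names unchanged."""
    return EP_ALIASES.get(ep, ep)


def qmoe_quant_inputs(ep, weights, scales, zero_points):
    """Return the quantization inputs to pass for one expert weight tensor."""
    inputs = [weights, scales]
    # zero_points: skipped for TRT-RTX, always included for other EPs.
    if normalize_ep(ep) != "trt-rtx":
        inputs.append(zero_points)
    return inputs
```

Under these assumptions, `qmoe_quant_inputs("cpu", w, s, zp)` carries three entries while the TRT-RTX call carries two.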
@anujj
Contributor Author

anujj commented Jan 15, 2026

@kunal-vaishnavi: I have addressed the issues. Could you please have a look?

Comment thread src/python/py/models/builder.py Outdated
Comment thread src/python/py/models/builders/base.py
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py
@thiagocrepaldi

@thpereir FYI

…aults

- Rename int4_qmoe_block_size to qmoe_block_size (op supports int4 and int8)

- Add CUDA to supported blockwise quantization EPs

- Change default qmoe_block_size: 128 (trt-rtx), 32 (others)

- Remove bfloat16 workarounds (ORT 1.24 supports natively)

- Rename quant_attrs key from 'block_size' to 'qmoe_block_size'
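The defaults and the key rename described in this commit message could look roughly like the sketch below. The helper names and the shape of `extra_options` are assumptions for illustration. Only the option name `qmoe_block_size`, the defaults (128 for trt-rtx, 32 otherwise), and the `quant_attrs` key come from the commit message.

```python
def resolve_qmoe_block_size(ep, extra_options):
    """Pick the QMoE block size: an explicit option wins, else a per-EP default."""
    if "qmoe_block_size" in extra_options:
        return int(extra_options["qmoe_block_size"])
    # Defaults stated in the commit message: 128 for trt-rtx, 32 for other EPs.
    return 128 if ep == "trt-rtx" else 32


def make_quant_attrs(ep, extra_options):
    """Build quantization attributes using the renamed 'qmoe_block_size' key."""
    return {"qmoe_block_size": resolve_qmoe_block_size(ep, extra_options)}
```

The sketch deliberately does not encode which EPs support blockwise quantization, since the thread is not conclusive on that point for CUDA.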
@anujj
Contributor Author

anujj commented Jan 20, 2026

Addressed the comments in the latest commit

@thpereir
Contributor

Still going over the PR and reviewing it

CUDA does not yet support block-wise quantization for QMoE
@anujj
Contributor Author

anujj commented Jan 21, 2026

@kunal-vaishnavi Addressed all comments.

Contributor

@thpereir thpereir left a comment

Code lgtm. Also ran a quick gpt-oss regression and everything is working as expected

@thiagocrepaldi

thiagocrepaldi commented Jan 21, 2026 via email

@kunal-vaishnavi kunal-vaishnavi merged commit 23b0026 into microsoft:main Jan 22, 2026
15 checks passed
5 participants