forked from microsoft/onnxruntime
Sync ort main 19 7 25 #749
Merged
Conversation
Co-authored-by: Yulong Wang <[email protected]>
### Description
This PR adds three new options for the TRT execution provider:
- trt_load_user_initializer
- trt_external_data_bytestream
- trt_external_data_bytestream_size

These options leverage new TRT 10.13 APIs to give the user more control over how weights are loaded in the ONNX parser. When `trt_load_user_initializer` is set to true, the EP owns the weights instead of serializing them into the ModelProto, which avoids the overhead of serializing large weights. When `trt_external_data_bytestream` / `trt_external_data_bytestream_size` are provided, the refitEngine() function can read weights for the refitter directly from this bytestream. Also fixes graph_proto_serializer to keep information about external weights.

Signed-off-by: Kevin Chen <[email protected]>
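A minimal Python sketch of how the boolean option above could be passed as a TensorRT EP provider option; the model path is a placeholder, and the bytestream pointer/size options are geared toward the C/C++ API so they are omitted here:

```python
import onnxruntime as ort

# Assumption: trt_load_user_initializer is accepted as a TensorRT EP provider
# option, as described above; "model.onnx" is a hypothetical model path.
trt_options = {
    "trt_load_user_initializer": True,  # EP keeps ownership of the weights instead
                                        # of serializing them into the ModelProto
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options), "CPUExecutionProvider"],
)
```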
### Description
Changes from win-onnxruntime to onnxruntime. Fix the build break.

Co-authored-by: Yulong Wang <[email protected]>
…oft#25425) Enable free dimension override even when the graph optimization level is 0. If the optimization level is above 0, the override will be applied during level 1 optimization.
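A minimal sketch of the resulting behavior using the Python API ("batch" is a hypothetical free dimension name): the override is honored even though graph optimizations are disabled.

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Optimization level 0: the free dimension override is now applied here as well.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
so.add_free_dimension_override_by_name("batch", 1)  # pin the symbolic dim to 1

session = ort.InferenceSession("model.onnx", sess_options=so)  # placeholder model
```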
### Description
Update Flash Attention to support softmax sink in GQA. Changes:
- [x] Update flash attention to support head_sink
- [x] Add test_gqa.py to test cuda, and remove test_gqa_cuda.py

Note that the sink is treated as scaled, while the elements of the QK GEMMs are not scaled. The sink value needs no scaling or softcap; it joins the softmax together with the scaled or soft-capped values. There are two ways to add the sink to softmax:
* One way is to [patch normalize_softmax_lse](https://github.com/microsoft/onnxruntime/blob/1cf1aa786f6e7f7e6abd6fba8b8aea2e7a43092c/onnxruntime/contrib_ops/cuda/bert/flash_attention/softmax.h#L143-L178) to use the sink when updating max and sum. Pro: the change is confined to one function. Con: the logic is a little complex, since row_max is unscaled while row_sum is scaled.
* Another way is to change softmax_rescale_o to handle the sink directly in the first block of online softmax, using the unscaled sink value. This is a robust way to keep the core algorithm consistent. Con: it requires changes in multiple places and is a little harder to combine with softcap.

This PR uses the first approach for easy integration. Note: the memory efficient attention change will be in a separate PR.

### Motivation and Context
microsoft#25269
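An illustrative NumPy sketch of the sink math described above (not the CUDA kernel): the unscaled sink logit joins the softmax denominator alongside the already-scaled QK scores, but contributes no value row.

```python
import numpy as np

def softmax_with_sink(scaled_scores: np.ndarray, sink: float) -> np.ndarray:
    """scaled_scores: attention logits after QK scaling (and optional softcap);
    sink: unscaled per-head sink logit that joins the softmax directly."""
    row_max = np.maximum(scaled_scores.max(axis=-1, keepdims=True), sink)
    exp_scores = np.exp(scaled_scores - row_max)
    denom = exp_scores.sum(axis=-1, keepdims=True) + np.exp(sink - row_max)
    # The sink absorbs probability mass but does not attend to any value vector.
    return exp_scores / denom

probs = softmax_with_sink(np.array([[0.5, 1.0, -0.2]]), sink=0.0)
assert probs.sum() < 1.0  # some probability mass goes to the sink
```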
…crosoft#25344) ### Description
Add a QNN-EP-aware Reshape handler to optimize the Transpose-Reshape-Transpose pattern, falling back to the default handler when not applicable. Test: add a new UT folder under QNN for Transpose optimization.

### Motivation and Context
One of the MEP models performed poorly due to the Transpose optimization after Layout Transform, which inserted lots of redundant Transposes. Since directly modifying the algorithm could have a large impact, introduce an EP-aware Reshape handler instead to contain the risk.
This reverts commit 3a5f75b. qnn-runtime is now available in Maven: https://repo1.maven.org/maven2/com/qualcomm/qti/qnn-runtime/2.36.1/
### Description
Internal CI upgraded to 2025.2.

Co-authored-by: jatinwadhwa921 <[email protected]>
### Description
- Add QDQ scale propagation pass
- Enable dynamic path for NPU when enable_causallm is enabled
- Allow zero-element tensors to get set
- Fix the model copies and redefinitions for CPU fallback
- Add support for 2025.2 and enable the SimplifiedLayerNormalization op

Co-authored-by: Javier Martinez <[email protected]>
Co-authored-by: Ryan Metcalfe <[email protected]>
Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: Preetha Veeramalai <[email protected]>
Co-authored-by: n1harika <[email protected]>
### Description
To support writing an Execution Provider (EP) using the new EP ABI introduced in microsoft#24887, this PR adds value info for EP Context nodes to prevent shape inference errors during `Graph::Resolve`.

### Motivation and Context
When creating a new EP Context node whose input is the output of another EP Context node, Graph::Resolve fails to set the type for the new node's arguments. This is because EP Context nodes do not have a TypeAndShapeInferenceFunction defined, as shown here: https://github.com/microsoft/onnxruntime/blob/5fdd4e4f2a2b6705a9a49a378a3b3496805067ee/onnxruntime/core/graph/contrib_ops/contrib_defs.cc#L3289-L3337 As a result, an exception is thrown during shape inference: https://github.com/microsoft/onnxruntime/blob/5fdd4e4f2a2b6705a9a49a378a3b3496805067ee/onnxruntime/core/graph/graph.cc#L2964

Specifically:
- EP Context nodes lack a TypeAndShapeInferenceFunction, so onnx_inferred_type is unavailable.
- existing_type is nullptr due to the logic in: https://github.com/microsoft/onnxruntime/blob/9de58ac7a3d18d6ae7f7ae502b3f91361067f1b5/onnxruntime/core/session/ep_plugin_provider_interfaces.cc#L279 https://github.com/microsoft/onnxruntime/blob/9de58ac7a3d18d6ae7f7ae502b3f91361067f1b5/onnxruntime/core/session/ep_plugin_provider_interfaces.cc#L285

### Implementation
This PR adds type information to EP Context nodes on a best-effort basis, ensuring that Graph::Resolve can proceed without errors even when type inference is not explicitly defined.
… to process EPContext node for ep_cache_context with bytes stream (microsoft#25389) The existing ReadOpAttr & CreateOpAttr APIs for the string type always assume there is a '\0' at the end. This prevents EPs from embedding/reading a context binary byte buffer in the EPContext node's ep_cache_context attribute. Update the custom op API ReadOpAttr for the string type to avoid adding a '\0' at the end, and update the CreateOpAttr API to construct the string with the given length. Keep the string-type processing as it is for now.
) Bumps [on-headers](https://github.com/jshttp/on-headers) from 1.0.2 to 1.1.0 and [compression](https://github.com/expressjs/compression) from 1.8.0 to 1.8.1. These dependencies needed to be updated together: on-headers 1.1.0 fixes [CVE-2025-7339](https://www.cve.org/CVERecord?id=CVE-2025-7339) ([GHSA-76c9-3jph-rj3q](https://github.com/jshttp/on-headers/security/advisories/GHSA-76c9-3jph-rj3q)), and compression 1.8.1 picks up on-headers@~1.1.0. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ts/transformers-test (microsoft#25429) Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.52.1. The release notes cover the 4.51.x patch releases (flex attention and Llama 4 fixes, GLM-4 support) and the 4.51.0/4.52.x releases that add Llama 4, Phi4-Multimodal, DeepSeek-v3, and Qwen3. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
add webgpu support for GatherBlockQuantized
### Description
Plugin EP data transfer and Stream support. Add the ability for a plugin EP to provide an IDataTransfer implementation and an OrtSyncStream implementation to do async data copy outside of an inference session. Example usage added for the CUDA EP.

Caveat: support for providing the OrtSyncStream from the data copy to Session.Run will be a follow-up PR. For the CUDA EP we can pass the native cudaStream_t from the OrtSyncStream used for the data copy to Run via CUDA EP provider options.
…soft#25446) ### Description
Set compute capability only on Turing arch.

### Motivation and Context
Setting the native compute capability was causing a performance regression. @gaugarg-nv @ishwar-raut1 @ankan-ban
…V EP Unit Tests (microsoft#25323) ### Description
Remove the fast_gelu operator from the base model created in the NV TRT RTX EP unit tests.

### Motivation and Context
The operator was added to the model to partition it into subgraphs that can be assigned to the NV TRT RTX EP and the CUDA EP, which supports fast_gelu. But the CUDA EP is not built when building ORT with the NV TRT RTX EP, so the unit tests fail with an unsupported-op error. @ishwar-raut1 @ankan-ban
### Description
The error is:
```
2025-07-17 11:21:36.861835596 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running main_graph_11957213504832792607_0 node. Name:'CANNExecutionProvider_main_graph_11957213504832792607_0_0' Status Message: ~/code/onnxruntime/onnxruntime/core/framework/op_kernel.cc:83 virtual OrtValue* onnxruntime::OpKernelContext::OutputMLValue(int, const onnxruntime::TensorShape&) status.IsOK() was false. tensor.cc:57 CalculateTensorStorageSize Tensor shape.Size() must be >= 0
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running main_graph_11957213504832792607_0 node. Name:'CANNExecutionProvider_main_graph_11957213504832792607_0_0' Status Message: ~/code/onnxruntime/onnxruntime/core/framework/op_kernel.cc:83 virtual OrtValue* onnxruntime::OpKernelContext::OutputMLValue(int, const onnxruntime::TensorShape&) status.IsOK() was false. tensor.cc:57 CalculateTensorStorageSize Tensor shape.Size() must be >= 0
```
sfatimar approved these changes Jul 19, 2025
Description
sync main