Cherry-pick for 1.17.3 #20013

YUNQIUGUO · 2024-03-21T18:43:14Z

Description

Web PRs are not included yet.

Motivation and Context

### Description

The Windows memory-map code casts `mapped_offset` to DWORD directly, so the offset is truncated if it is larger than 2^32-1. We need to set `dwFileOffsetHigh` for this case.

### Motivation and Context

The bug was found from #19450.
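As a rough illustration of the truncation described above, the sketch below splits a 64-bit offset into the high and low 32-bit halves that the Win32 file-mapping calls take as `dwFileOffsetHigh` and `dwFileOffsetLow`. The helper names `split_file_offset` and `truncated_offset` are hypothetical and do not come from the ONNX Runtime source; this only models the arithmetic, not the actual fix.

```python
from typing import Tuple

DWORD_MASK = 0xFFFFFFFF  # a Windows DWORD is 32 bits wide


def split_file_offset(mapped_offset: int) -> Tuple[int, int]:
    """Split a 64-bit file offset into (high, low) 32-bit halves.

    Hypothetical helper: models the values that would be passed as
    dwFileOffsetHigh and dwFileOffsetLow.
    """
    if not 0 <= mapped_offset < 2**64:
        raise ValueError("offset must fit in an unsigned 64-bit integer")
    high = (mapped_offset >> 32) & DWORD_MASK
    low = mapped_offset & DWORD_MASK
    return high, low


def truncated_offset(mapped_offset: int) -> int:
    """Model a direct cast to DWORD: only the low 32 bits survive."""
    return mapped_offset & DWORD_MASK


offset = 5 * 2**30  # 5 GiB, larger than 2^32 - 1
high, low = split_file_offset(offset)
print(high, low)                 # both halves together keep the full offset
print(truncated_offset(offset))  # the bare cast silently drops the high half
```

For any offset below 2^32 the high half is zero, which is presumably why the direct cast went unnoticed with smaller files.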

Answers issue #19640. More details are in the issue; basically, I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets.

### Description

Make the Linux logic consistent with Windows.

### Motivation and Context

onnxruntime_lite_custom_op.h is in the Windows zip package but not in the Linux zip package.

https://github.com/microsoft/onnxruntime/blob/acbfc29f272b5578145e7600bc42342e116ffbc2/tools/ci_build/github/azure-pipelines/templates/c-api-artifacts-package-and-publish-steps-windows.yml#L67

Co-authored-by: Your Name

Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.

- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.

Added a test to cover this configuration.

#19845) fix: "UserWarning: Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only."

### Description

Include Windows 11 in the version check. With this change, you will no longer see the warning "Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only."

### Motivation and Context

On Windows 11, the warning says that only Windows 10 and above are supported, which is somewhat strange.
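A minimal sketch of the kind of check described above, assuming the release string is compared against a list of accepted values. The names `SUPPORTED_WINDOWS_RELEASES` and `check_windows_release` are hypothetical, and the accepted list is an assumption for illustration rather than the one used by ONNX Runtime.

```python
import platform
import warnings

# Assumption for illustration: the accepted release strings.
SUPPORTED_WINDOWS_RELEASES = ("10", "11")


def check_windows_release(release: str) -> bool:
    """Return True if the release is accepted; otherwise warn and return False."""
    if release not in SUPPORTED_WINDOWS_RELEASES:
        warnings.warn(
            f"Unsupported Windows version ({release}). "
            "ONNX Runtime supports Windows 10 and above, only."
        )
        return False
    return True


# Only consult the real platform when actually running on Windows.
if platform.system() == "Windows":
    check_windows_release(platform.release())
```

With "11" missing from the accepted list, a Windows 11 machine would fall into the warning branch even though the message itself says Windows 10 and above are supported, which matches the symptom reported in the PR.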

Fix build break caused by image update. TensorRT isn't expected to pass all ONNX node tests.

Fixed all CUDA NHWC Pooling operations which were broken and enabled the NHWC CUDA pooling tests. Disabled all pooling tests which are not supported by the CUDA EP. Ensure parity between CUDA NHWC / NCHW and work towards 100% tests enabled for the CUDA EP / CUDA NHWC EP.

---------

Co-authored-by: Tianlei Wu

YUNQIUGUO · 2024-03-21T18:43:46Z

Had to address a couple of merge conflicts, so please have a look at the current changes and help make sure they're correct.

Craigacp · 2024-03-21T18:50:11Z

Is 1.17.3 a release for all APIs or just the web one? If it's for all APIs could we get #19942 merged in? It fixes a regression in 1.17 from 1.16 which we've hit in practice.

YUNQIUGUO · 2024-03-21T19:13:29Z

> Is 1.17.3 a release for all APIs or just the web one? If it's for all APIs could we get #19942 merged in? It fixes a regression in 1.17 from 1.16 which we've hit in practice.

Sure - mind clarifying which packages this change applies to?

Craigacp · 2024-03-21T19:18:22Z

It's a fix to the CPU provider implementation of the SplitToSequence operator. So it's still useful for the web API, but our use is via Python or Java.

YUNQIUGUO · 2024-03-21T20:41:24Z

@tianleiwu @mtavenrath Looks like PR #19889 introduces a bunch of build failures on the release branch. Mind sharing whether there are any dependency PRs that I need to take? Failure error message: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1331111&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=38e4b5a4-9041-5949-6670-6ace666c420b&l=5163

### Description

Previously, GQA incorrectly enforced that the rotary cos and sin caches have a sequence length equal to the present sequence length. Now it enforces that the cache length be greater than or equal to the present sequence length, since the cache should be of max_sequence_length to match the Rotary Embedding op.

### Motivation and Context

Fixes an issue with fusing Rotary Embedding and GQA for certain models which prefer this optimization.
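The relaxed constraint can be sketched as a simple shape check. This is only an illustration of the comparison described above; `check_rotary_cache_length` and its parameter names are hypothetical and are not taken from the GQA kernel.

```python
def check_rotary_cache_length(cache_seq_len: int, present_seq_len: int) -> None:
    """Accept any cos/sin cache that covers at least the present sequence.

    Hypothetical helper. The earlier behavior corresponds to requiring
    cache_seq_len == present_seq_len, which rejects a cache sized for
    max_sequence_length.
    """
    if cache_seq_len < present_seq_len:
        raise ValueError(
            f"rotary cache length {cache_seq_len} is shorter than "
            f"present sequence length {present_seq_len}"
        )


# A cache sized for the maximum sequence length passes the relaxed check.
check_rotary_cache_length(cache_seq_len=4096, present_seq_len=128)
```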

### Description

GQA Rotary Dimension 1 incorrectly assumed to be based on head size.

### Motivation and Context

This change should enable us to run phi-2 with GQA and Rotary Embedding fused.

### Description

This PR updates the replacement of MultiHeadAttention (MHA) with GroupQueryAttention (GQA). It is related to the changes in [this PR](#18906).

### Motivation and Context

The updated replacement of MHA with GQA includes the following fusion changes.

- Apply sliding window within GQA
- Fuse the rotary embeddings within GQA
- Fuse the 3 MatMuls into 1 packed MatMul if possible
- Fuse the 3 Adds into 1 packed Add if possible
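To illustrate why three MatMuls can be fused into one packed MatMul, the sketch below concatenates three weight matrices along the column axis and shows that slicing the packed result recovers the three separate products. It is a pure-Python toy with made-up 2x2 weights of equal width (an assumption; in GQA the K and V projections are typically narrower than Q), and the helpers `matmul` and `pack_columns` are hypothetical, not part of the fusion script.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pack_columns(*weights):
    """Concatenate matrices with the same number of rows along the column axis."""
    return [sum((list(w[i]) for w in weights), []) for i in range(len(weights[0]))]


# Toy input and projection weights (illustrative values only).
x = [[1, 2], [3, 4]]
wq = [[1, 0], [0, 1]]
wk = [[2, 0], [0, 2]]
wv = [[0, 1], [1, 0]]

# One packed MatMul instead of three separate ones.
packed = matmul(x, pack_columns(wq, wk, wv))

# Slice the packed output back into Q, K and V.
q = [row[0:2] for row in packed]
k = [row[2:4] for row in packed]
v = [row[4:6] for row in packed]

print(q == matmul(x, wq), k == matmul(x, wk), v == matmul(x, wv))
```

The same column-wise concatenation applies to the three bias Adds, since adding a packed bias vector to the packed output is equivalent to adding each bias to its own slice.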

tianleiwu · 2024-03-22T00:24:25Z

> @tianleiwu @mtavenrath looks like this pr #19889 introduces a bunch of build failures on the release branch. mind sharing is there any dependency prs that I need to take or so? https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1331111&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=38e4b5a4-9041-5949-6670-6ace666c420b&l=5163 failure error message.

There is no need to take dependency PRs, which might bring more changes. I added a commit to manually fix the build errors.

YUNQIUGUO · 2024-03-22T00:38:22Z

> > @tianleiwu @mtavenrath looks like this pr #19889 introduces a bunch of build failures on the release branch. mind sharing is there any dependency prs that I need to take or so? https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1331111&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=38e4b5a4-9041-5949-6670-6ace666c420b&l=5163 failure error message.
>
> There is no need to take dependency PRs, which might bring more changes. I added a commit to manually fix the build errors.

cool, thanks!

…osoft/onnxruntime into yguo/cherry-pick-for-1.17.3

…oad TRT binaries in every build (#19919)

### Description

Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build. Now all the other build jobs are already doing this. This is the only one left. Similar to #19909

### Motivation and Context

As a follow up of #19118

…ad TRT binaries in every build (#19909)

### Description

Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build. Now all the other build jobs are already doing this. This is the only one left.

### Motivation and Context

As a follow up of #19118

### Description

Add support for packed QKV input and rotary embedding with sm<80 using the memory-efficient attention kernel.

### Motivation and Context

Allows lower-end GPUs to run GQA with packed QKV input and rotary embedding.

### Description

This PR adds a benchmarking script to measure end-to-end performance and saves the results in a CSV file.

### Motivation and Context

With this PR, end-to-end performance can be easily measured for many large-language models such as LLaMA-2. The performance numbers for LLaMA-2 are located [here](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama).

### Description

This PR removes early stopping from the end-to-end LLaMA-2 benchmark script.

### Motivation and Context

This allows models to always generate the requested number of new tokens.

edgchen1

Inclusion of #19858 looks good.

The `CreateTensorRTCustomOpDomainList()` function is not thread-safe due to its static variables, `created_custom_op_list` and `custom_op_domain`. This PR ensures synchronization using a mutex. See issue #20089.

YUNQIUGUO · 2024-03-27T17:03:59Z

The cherry-pick process should be complete at this point. @microsoft/onnxruntime-es, please take a look and help approve, thanks.

YUNQIUGUO · 2024-03-28T23:20:00Z

Looking at the required CI testing, I think there has been a DML test failing on the Windows GPU CI pipeline: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1335887&view=logs&j=4b364901-e881-5aa9-a2fc-f7beb70b2ce1&s=e4766ded-b3c3-5183-8adb-9072bf8f96b3&t=b32ab5b7-b815-50a4-8a25-e0aac0264208&l=44468

Does anyone know whether this is a known issue or an intermittent error?

I checked, and it has failed a couple of times in the history of the CI runs.

snnn · 2024-03-28T23:27:54Z

You need this: https://github.com/microsoft/onnxruntime/pull/20073/files

### Description

1. The change in build.py fixes a DML exception (https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=10&_a=summary).
2. The change in requirements.txt fixes an exception in the Python packaging pipeline: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430433&view=results

### Motivation and Context

---------

Co-authored-by: Yi Zhang

YUNQIUGUO · 2024-03-28T23:35:21Z

> You need this: https://github.com/microsoft/onnxruntime/pull/20073/files

Thanks, added the fix commit.

yufenglee and others added 7 commits March 21, 2024 11:12

Use CMake's find package for CUDA libs (#19673)

6772107

skip onnx node_tests for tensorrt ep (#19880)

a13e5d5

YUNQIUGUO requested a review from a team as a code owner March 21, 2024 18:43

update version number 1.17.3 + a commit for update marker

0941cc7

Craigacp and others added 5 commits March 21, 2024 13:43

String Tensor SplitToSequence fix (#19942)

674c359

fix gqa rotary dim 1 (#19874)

42ab62c

fix build error

0eeeadc

rachguo and others added 7 commits March 21, 2024 18:28

Merge branch 'rel-1.17.3' into yguo/cherry-pick-for-1.17.3

38b8c67

Merge branch 'yguo/cherry-pick-for-1.17.3' of https://github.com/micr…

fe0c113

…osoft/onnxruntime into yguo/cherry-pick-for-1.17.3

Remove early stopping from LLaMA end-to-end benchmarking (#20033)

c9ebded

YUNQIUGUO requested review from kunal-vaishnavi and snnn March 25, 2024 17:52

YUNQIUGUO requested review from tianleiwu, edgchen1, jywu-msft and aciddelgado March 25, 2024 17:53

edgchen1 previously approved these changes Mar 25, 2024

tianleiwu previously approved these changes Mar 27, 2024

[TensorRT EP] Fix concurrency issue for TRT custom op list (#20093)

ed4edfe

YUNQIUGUO dismissed stale reviews from tianleiwu and edgchen1 via ed4edfe March 27, 2024 17:00

snnn previously approved these changes Mar 28, 2024

tianleiwu previously approved these changes Mar 28, 2024

kunal-vaishnavi previously approved these changes Mar 28, 2024

YUNQIUGUO dismissed stale reviews from kunal-vaishnavi, tianleiwu, and snnn via 5ac3b6f March 28, 2024 23:34

snnn approved these changes Mar 29, 2024

tianleiwu approved these changes Mar 29, 2024

YUNQIUGUO merged commit 046d06f into rel-1.17.3 Mar 29, 2024
102 of 108 checks passed

YUNQIUGUO deleted the yguo/cherry-pick-for-1.17.3 branch March 29, 2024 20:10
