
[ROCm] fix: obtain AMD GPU memory info through rocm_smi library #21190

Open · wants to merge 4 commits into main

Conversation

hann-wang

Description

Previously, ROCMExecutionProvider used hipMemGetInfo to obtain the total and available memory sizes. However, that API has been broken since ROCm 5.7. This PR uses the rocm_smi library instead of hipMemGetInfo.

Motivation and Context

The hipMemGetInfo API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to errors such as:

HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);

MIOpen has a brute-force fix for this (https://github.com/ROCm/MIOpen/blob/911e67189592c311374940493f2099f3abced60d/src/hip/handlehip.cpp#L72). Instead of hard-coding the available memory to 16 GB, I suppose we could obtain the memory info through the rocm_smi library, as done in this PR.

@hann-wang
Author

@hann-wang please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree [company="AMD"]

@hann-wang
Author

@microsoft-github-policy-service agree company="your company"

@microsoft-github-policy-service agree company="AMD Inc."

@hann-wang hann-wang changed the title fix: obtain memory info through rocm_smi library fix: obtain AMD GPU memory info through rocm_smi library Jun 27, 2024
@hann-wang hann-wang changed the title fix: obtain AMD GPU memory info through rocm_smi library [ROCm] fix: obtain AMD GPU memory info through rocm_smi library Jun 27, 2024
@tianleiwu tianleiwu requested a review from cloudhan June 27, 2024 22:55
@hann-wang
Author

hann-wang commented Jun 28, 2024 via email

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Pipelines were unable to run due to time out waiting for the pull request to finish merging.

(2 similar comments)

@tianleiwu
Contributor

@hann-wang, the python format pipeline failed. Please fix it by running lintrunner at the repository root, like:

pip install -r requirements-lintrunner.txt
pip install lintrunner
lintrunner init
lintrunner -a

@tianleiwu
Contributor

/azp run orttraining-amd-gpu-ci-pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@hann-wang
Author

@hann-wang, the python format pipeline failed. Please fix it by running lintrunner at the root like

pip install -r requirements-lintrunner.txt
pip install lintrunner
lintrunner init
lintrunner -a

got it, thank you!

Commit 9058961

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

(1 similar comment)
