Describe the issue
I have an ONNX model exported with opset 11 and 17 (I have tried both) using fp32, and I find the CPU and CUDA forward passes yield grossly different outputs. The CUDA inference appears to be nonsense, whereas the CPU output is consistent with the PyTorch model from which it was exported.
This is using onnxruntime-gpu 1.22.0.
I have multiple ONNX models, but the problem is unique to this one. It is a UNet trained to perform keypoint detection: it produces Gaussian heatmaps that I then take an argmax of to find a coordinate (a rough sketch of that post-processing is below).
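For context, this is roughly what the post-processing looks like; the heatmap shape and function name are illustrative assumptions, not the actual model interface:

import numpy as np

def heatmaps_to_keypoints(heatmaps):
    # heatmaps: (num_keypoints, H, W) Gaussian heatmaps -> (num_keypoints, 2) (x, y) coordinates
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((x, y))
    return np.array(coords)

Because the coordinates come from an argmax, even moderate corruption of the heatmaps makes the predicted keypoints useless.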
I'm happy to provide the models and example inference data if helpful to the dev team, or perform further testing. I can't share the models openly, however.
To reproduce
An FP32 inference pass using this model:
import onnxruntime as ort

so = ort.SessionOptions()  # default options, graph optimizations enabled
providers = ['CUDAExecutionProvider']
gpu_session = ort.InferenceSession(AMPLAX2SAX_MODEL_PATH, so, providers=providers)
yields nonsense results that are grossly different from those produced when the model is run with the CPUExecutionProvider.
However, the following configuration DOES give output identical to the CPU provider:
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL  # disable all graph optimizations
providers = ['CUDAExecutionProvider']
gpu_session = ort.InferenceSession(AMPLAX2SAX_MODEL_PATH, so, providers=providers)
The error is reintroduced by using ort.GraphOptimizationLevel.ORT_ENABLE_BASIC.
The outputs are not just subtly different; they differ hugely.
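This is roughly how I compare the two providers; the input shape here is a placeholder for my real inference data, and AMPLAX2SAX_MODEL_PATH is my local model path:

import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 1, 256, 256).astype(np.float32)  # dummy input, shape is an assumption

cpu_session = ort.InferenceSession(AMPLAX2SAX_MODEL_PATH, providers=['CPUExecutionProvider'])

so = ort.SessionOptions()
# so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL  # uncommenting this makes CUDA match CPU
gpu_session = ort.InferenceSession(AMPLAX2SAX_MODEL_PATH, so, providers=['CUDAExecutionProvider'])

input_name = cpu_session.get_inputs()[0].name
cpu_out = cpu_session.run(None, {input_name: x})[0]
gpu_out = gpu_session.run(None, {input_name: x})[0]

# With default optimizations the difference is huge, not a small numeric drift
print("max abs diff:", np.abs(cpu_out - gpu_out).max())

If it helps narrow things down, I can also set so.optimized_model_filepath to dump the optimized graph and compare it against the original model.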
Urgency
I can temporarily disable the optimizations, so this is not urgent.
Platform
Windows
OS Version
Windows 11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.22.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
12.4