Inference failure of TensorRT 10.0.x.x when running my internal model on GPU(T4, A100) #4066
Comments
Can you provide the full log with --verbose?
@lix19937 Can I send this to you via email? I prefer not to expose my model publicly. The --verbose option also reveals too much information.
@kimdwkimdw OK, my email is [email protected]
@lix19937 I've sent it gzipped.
@kimdwkimdw I didn't receive your e-mail.
Please check your spam inbox. I sent the mail via Gmail.
Can you upload it to Google Drive?
OK, I uploaded the log file to Google Drive and shared it with your email.
Any updates?
@kimdwkimdw So sorry! Currently I have no environment; can you upload the full log to Google Drive?
@lix19937 Let me know your Google address for Google Drive. Is it the same as your email address?
This is the same kind of issue as #3292. TensorRT 10.x has significant errors. cc. @zerollzeng @ttyio
Sent the log to [email protected]
@lix19937 I've sent it to [email protected]
From your logs, there are no relevant errors or warnings. Maybe you can use polygraphy, for example polygraphy run model.onnx --trt --onnxrt --input-shapes source:[2,160000] wav_lens:[2,1], to check which layer first produces the big NaN/diff, check whether there is a BN right after a conv, etc. You can also check the weights' max-min range. On the other hand, you can try the latest version.
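A minimal sketch of the layer-wise comparison being suggested here, assuming Polygraphy is installed; the model file name is a placeholder and the input shapes are the ones quoted above:

```bash
# Mark every intermediate tensor as an output in both TensorRT and ONNX Runtime,
# so the first layer that produces NaNs or a large diff can be located.
polygraphy run model.onnx --trt --onnxrt \
    --input-shapes source:[2,160000] wav_lens:[2,1] \
    --trt-outputs mark all --onnx-outputs mark all \
    --atol 1e-3 --rtol 1e-3
```

Note that marking all tensors as outputs can change which fusions TensorRT applies, so this is a diagnostic aid rather than a faithful reproduction of the deployed engine.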
Thank you for the suggestion, but I have already tried using polygraphy. I did not encounter this issue with TensorRT 8.5.3, including container versions 23.02 and 23.03, where there were no NaN values. However, starting from TensorRT 8.6.1.6, the errors have become more pronounced, and with all versions of TensorRT 10 (e.g., 10.3.0.26, 10.2.0, 10.1.0), the model's errors seem to overflow dramatically.
In my opinion, from 8.6 on, TensorRT added more features, like the builder optimization level (the default optimization level is 3; valid values are integers from 0 up to the maximum level 5) and more fusions for LLM-style layers (like MHA, LN). For normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers. The CUDA-X library dependencies also changed. You can try to add …
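As a hedged illustration of these suggestions (file names and the optimization level are placeholders, not the reporter's actual settings), the engine build could be driven through trtexec like this:

```bash
# Build with an explicit builder optimization level (available since TRT 8.6).
# --noTF32 keeps the FP32 build in true FP32 so TF32 rounding is ruled out.
trtexec --onnx=model_opset17.onnx --builderOptimizationLevel=3 --noTF32 --verbose

# Same model built in FP16 to see whether mixed precision changes the result.
trtexec --onnx=model_opset17.onnx --builderOptimizationLevel=3 --fp16 --verbose
```

Here model_opset17.onnx stands for a re-export of the model at ONNX opset 17 or later, so that LayerNormalization is kept as a function op rather than lowered to primitive ops.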
I have already tried the suggestions, including setting the builder optimization level and exporting with a newer opset. Despite these adjustments, including experimenting with different precision combinations (fp16, bf16, tf32, and fp32 with --noTF32), the output from polygraphy continues to show significant degradation.
This indicates a severe accuracy issue. Features that were functional in TensorRT 8.5 no longer work correctly in the later versions.
Can you provide the result of the polygraphy comparison of TRT with onnxrt under TRT 8.5?
Here are the two versions of the polygraphy comparison of TRT with onnxrt, for TRT 8.5 and TensorRT 10.3.0.26:
trt8.5 with fp32
trt10.x with fp32
@kimdwkimdw Can you send the model to [email protected]? I can file an internal bug for this.
@moraxu I've shared the Google Drive link with that email address, and I've also shared it with the address mentioned earlier.
@kimdwkimdw Can you please also try the latest TRT 10.5? There are some known accuracy issues that have been fixed in 10.5.
Currently, all of our tests are being conducted using the official TensorRT containers (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt), but it appears that version 10.5 is not yet included in the 24.09 release. Once an update is available, I will test with version 10.5.
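As a quick way to confirm which TensorRT build ships in a given NGC image (the 24.09 tag here is only an example), one can print the bundled Python package version:

```bash
# Print the TensorRT version included in the NGC TensorRT container.
docker run --rm nvcr.io/nvidia/tensorrt:24.09-py3 \
    python3 -c "import tensorrt; print(tensorrt.__version__)"
```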
Description
After updating to TensorRT 10.0.1.6, we expected the previously reported issue to be resolved. Unfortunately, not only does the issue persist, but the model’s outputs have deteriorated even further. Specifically, all output values are now nan, making it impossible to use our models. This issue affects both fp16 and fp32 precision settings, rendering the model completely non-functional.
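A small sketch of how the NaN outputs can be confirmed with Polygraphy (the model path is a placeholder; the input shapes are the ones quoted in the comments above); --validate flags NaN/Inf values in the outputs:

```bash
# FP32 engine: check TensorRT outputs for NaN/Inf.
polygraphy run model.onnx --trt --validate \
    --input-shapes source:[2,160000] wav_lens:[2,1]

# FP16 engine: the report says this precision is affected as well.
polygraphy run model.onnx --trt --fp16 --validate \
    --input-shapes source:[2,160000] wav_lens:[2,1]
```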
#3292
Environment
TensorRT Version: all versions of 10.0.x.x (NGC Containers 24.05 to 24.07)
NVIDIA GPU: T4, A100
NVIDIA Driver Version: 550.90.07
CUDA Version: 12.4
CUDNN Version: x
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): NGC Containers 23.03 and 24.07 (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt)
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): fail on polygraphy
related to #3292