Inference failure of TensorRT 10.0.x.x when running my internal model on GPU(T4, A100) #4066

kimdwkimdw · 2024-08-09T04:37:40Z

Description

After updating to TensorRT 10.0.1.6, we expected the previously reported issue to be resolved. Unfortunately, not only does the issue persist, but the model’s outputs have deteriorated even further. Specifically, all output values are now nan, making it impossible to use our models. This issue affects both fp16 and fp32 precision settings, rendering the model completely non-functional.

#3292

Environment

TensorRT Version: All versions of 10.0.x.x (NGC Container 24.05~24.07).

NVIDIA GPU: T4, A100

NVIDIA Driver Version: 550.90.07

CUDA Version: 12.4

CUDNN Version: x

Operating System:

Container (if so, version): NGC Container from 23.03 and 24.07.
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

fail on polygraphy

related to #3292

lix19937 · 2024-08-09T05:41:22Z

Can you provide the full log with trtexec --verbose ?

kimdwkimdw · 2024-08-09T07:14:40Z

@lix19937 Can I send this to you via email? I prefer not to expose my model publicly. The --verbose option also reveals too much information.

lix19937 · 2024-08-09T10:15:50Z

@kimdwkimdw ok, my email is [email protected]

kimdwkimdw · 2024-08-09T10:26:02Z

@lix19937 I've sent it as a gzipped file.

lix19937 · 2024-08-12T00:48:33Z

@kimdwkimdw I didn't receive your e-mail.

kimdwkimdw · 2024-08-12T00:51:50Z

Please check your spam mail inbox.

I've sent mail via Gmail.

lix19937 · 2024-08-21T11:15:48Z

Can you upload it to Google Drive?

kimdwkimdw · 2024-08-23T04:14:23Z

OK, I uploaded the log file to Google Drive and shared it with your email address.

kimdwkimdw · 2024-09-11T09:10:57Z

Any updates?

lix19937 · 2024-09-11T10:27:37Z

@kimdwkimdw So sorry! I currently have no environment available. Can you upload the full log from trtexec --verbose 2>&1 | tee full_log to Google Drive and share it with me? I will analyze it as soon as possible.
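For anyone collecting the log from a script instead of a shell, here is a minimal Python sketch of the same `2>&1 | tee full_log` idea. The helper name `run_and_tee` is made up for illustration, and the trtexec arguments in the usage note are placeholders, not the reporter's actual command.

```python
import subprocess
import sys


def run_and_tee(argv, log_path):
    """Run argv, echo its merged stdout/stderr to the console, and save it to log_path.

    Returns the exit code of the process.
    """
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # show the line on the console
            log.write(line)         # and keep a copy in the log file
        return proc.wait()
```

A call might look like `run_and_tee(["trtexec", "--onnx=model.onnx", "--verbose"], "full_log")`, assuming trtexec is on the PATH.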

kimdwkimdw · 2024-09-11T15:52:35Z

@lix19937 Let me know your Google address for Google Drive. Your email address [email protected] doesn't seem to work.

kimdwkimdw · 2024-09-11T16:03:39Z

This is the same kind of issue as #3292.

TensorRT 10.x has significant errors.

cc. @zerollzeng @ttyio

lix19937 · 2024-09-12T04:33:50Z

Please send the log to [email protected]

kimdwkimdw · 2024-09-12T07:49:55Z

@lix19937 I've sent it to [email protected]

lix19937 · 2024-09-12T10:11:54Z

@kimdwkimdw

> but the model’s outputs have deteriorated even further. Specifically, all output values are now nan, making it impossible to use our models.

Your logs show no relevant errors or warnings. You could use polygraphy, for example:

polygraphy run model.onnx --trt --onnxrt --input-shapes source:[2,160000] wav_lens:[2,1]

to find the layer where the nan or large difference first appears, and check, for example, whether a BN follows a conv. You can also check the min-max range of the weights.

Alternatively, you can try the latest version.
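The command above compares only the final outputs. To see where the nan first shows up, polygraphy can also mark every tensor as an output in both runners and check for NaN/Inf. The sketch below only assembles such a command line; the helper name is made up, and the model path and input shapes are the ones from the command above. One caveat: marking all tensors as outputs can change which layers TensorRT fuses, so the failure may look different or disappear.

```python
import shlex


def build_layerwise_compare_cmd(model_path, input_shapes):
    """Build a `polygraphy run` argv that compares every intermediate tensor."""
    argv = [
        "polygraphy", "run", model_path,
        "--trt", "--onnxrt",
        "--trt-outputs", "mark", "all",    # expose every TensorRT tensor
        "--onnx-outputs", "mark", "all",   # expose every ONNX tensor
        "--validate",                      # check outputs for NaN/Inf
        "--fail-fast",                     # stop at the first mismatch
        "--input-shapes",
    ]
    # Format each shape as name:[d0,d1,...], the form used in the thread.
    argv += [
        f"{name}:[{','.join(str(d) for d in dims)}]"
        for name, dims in input_shapes.items()
    ]
    return argv


cmd = build_layerwise_compare_cmd(
    "model.onnx", {"source": (2, 160000), "wav_lens": (2, 1)}
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run`.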

kimdwkimdw · 2024-09-12T10:16:56Z

@lix19937

Thank you for the suggestion, but I have already tried using polygraphy along with other methods. My question goes back to the root of the issue: why do nan values in the relative difference output from polygraphy start appearing when using TensorRT version 10.x?

I did not encounter this issue with TensorRT 8.5.3, including versions like 23.02 and 23.03, where there were no nan values. However, starting from TensorRT 8.6.1.6, the errors have become more pronounced, and with all versions of TensorRT 10 (e.g., 10.3.0.26, 10.2.0, 10.1.0), the model’s errors seem to overflow dramatically.

lix19937 · 2024-09-12T10:33:54Z

In my opinion, from 8.6 onward TensorRT added more features, such as the builder optimization level (the default is 3; valid values are integers from 0 to the maximum level, 5) and fusion of LLM layers (such as MHA and LN). For normalization layers, particularly when deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy with function ops is superior to the corresponding implementation with primitive ops for normalization layers. TensorRT also moved away from its CUDA-X library dependencies.

@kimdwkimdw You can try adding --builderOptimizationLevel=5 --noTF32 and adjusting the size passed to --memPoolSize.
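As a rough sketch of how those flags fit together, the snippet below assembles a trtexec command line. The helper name, the model path, and the 4096 MiB workspace size are placeholders; check `trtexec --help` on your version for the exact `--memPoolSize` syntax.

```python
def build_trtexec_cmd(onnx_path, workspace_mib, extra_flags=()):
    """Build a trtexec argv with the suggested accuracy-related flags."""
    argv = [
        "trtexec",
        f"--onnx={onnx_path}",
        "--builderOptimizationLevel=5",              # highest level; default is 3
        "--noTF32",                                  # disable TF32 for fp32 layers
        f"--memPoolSize=workspace:{workspace_mib}",  # workspace pool size (assumed MiB)
        "--verbose",
    ]
    argv.extend(extra_flags)  # e.g. ("--fp16",) to test another precision
    return argv


cmd = build_trtexec_cmd("model.onnx", workspace_mib=4096)
print(" ".join(cmd))
```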

kimdwkimdw · 2024-09-19T04:11:31Z

I have already tried the suggestions, including setting --builderOptimizationLevel=5 and --noTF32, and updating the ONNX opset from 15 to 17. Currently, proper export for opset 18 is not supported in PyTorch (Ref: PyTorch ONNX Export).

Despite these adjustments, including experimenting with different precision combinations (fp16, bf16, tf32, fp32 with --noTF32), the output from polygraphy continues to show significant degradation as follows:

[I]         Error Metrics: x
[I]             Minimum Required Tolerance: elemwise error | [abs=11.761] OR [rel=1393.1] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=1.6023, std-dev=1.614, var=2.605, median=1.0086, min=0 at (1, 234, 1208), max=11.761 at (0, 14, 619), avg-magnitude=1.6023, p90=4.0612, p95=4.0612, p99=6.56
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 1.18) |    8723143 | ########################################
                    (1.18, 2.35) |    3338792 | ###############
                    (2.35, 3.53) |    1736355 | #######
                    (3.53, 4.7 ) |    1026219 | ####
                    (4.7 , 5.88) |     718419 | ###
                    (5.88, 7.06) |     308960 | #
                    (7.06, 8.23) |      65386 | 
                    (8.23, 9.41) |      16262 | 
                    (9.41, 10.6) |       2064 | 
                    (10.6, 11.8) |        400 | 
[I]             Relative Difference | Stats: mean=0.1035, std-dev=0.84023, var=0.70599, median=0.068093, min=0 at (1, 234, 1208), max=1393.1 at (0, 107, 0), avg-magnitude=0.1035, p90=0.24082, p95=0.24082, p99=0.40837
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 139     ) |   15935951 | ########################################
                    (139     , 279     ) |         36 | 
                    (279     , 418     ) |          3 | 
                    (418     , 557     ) |          4 | 
                    (557     , 697     ) |          2 | 
                    (697     , 836     ) |          0 | 
                    (836     , 975     ) |          1 | 
                    (975     , 1.11e+03) |          0 | 
                    (1.11e+03, 1.25e+03) |          0 | 
                    (1.25e+03, 1.39e+03) |          3 |

This indicates a severe accuracy issue. Features that were functional in TensorRT 8.5 no longer work correctly in the later versions.

@lix19937
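A note on reading these numbers: a very large maximum relative difference can come from a few elements whose reference value is close to zero, so the absolute-difference statistics are often the more telling signal. The sketch below shows the effect with made-up values. It assumes relative difference is the absolute difference divided by the magnitude of the reference output; which output polygraphy normalizes by is not confirmed here.

```python
import math


def diff_stats(reference, candidate):
    """Return (max_abs_diff, max_rel_diff) for two equal-length sequences.

    Relative difference is abs diff / |reference| (an assumption, see above).
    Returns (nan, nan) as soon as any difference is nan.
    """
    max_abs = 0.0
    max_rel = 0.0
    for ref, cand in zip(reference, candidate):
        abs_diff = abs(ref - cand)
        if math.isnan(abs_diff):
            return math.nan, math.nan
        max_abs = max(max_abs, abs_diff)
        if ref != 0.0:
            max_rel = max(max_rel, abs_diff / abs(ref))
    return max_abs, max_rel


# Hypothetical values: the second reference element is close to zero.
reference = [-15.0, -0.001, -20.0]
candidate = [-14.0, -1.0, -20.0]
max_abs, max_rel = diff_stats(reference, candidate)
print(max_abs, max_rel)
```

Here the largest absolute error is about 1, yet the near-zero reference element pushes the maximum relative error to roughly 999.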

lix19937 · 2024-09-20T09:01:57Z

Can you provide the polygraphy comparison of TRT against ONNX Runtime on TRT 8.5?

kimdwkimdw · 2024-09-24T07:27:53Z

Here are the polygraphy comparisons of TRT against ONNX Runtime for TensorRT 8.5 and TensorRT 10.3.0.26.

trt8.5 with fp32

[I] trt-runner-N0-08/14/24-04:33:32    
    ---- Inference Input(s) ----
    {source [dtype=float32, shape=(1, 160000)],
     wav_lens [dtype=float32, shape=(1, 1)]}
[I] trt-runner-N0-08/14/24-04:33:32    
    ---- Inference Output(s) ----
    {x [dtype=float32, shape=(1, 999, 2000)]}
[I] trt-runner-N0-08/14/24-04:33:32     | Completed 1 iteration(s) in 45.12 ms | Average inference time: 45.12 ms.
[I] Accuracy Comparison | onnxrt-runner-N0-08/14/24-04:33:32 vs. trt-runner-N0-08/14/24-04:33:32
[I]     Comparing Output: 'x' (dtype=float32, shape=(1, 999, 2000)) with 'x' (dtype=float32, shape=(1, 999, 2000))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         onnxrt-runner-N0-08/14/24-04:33:32: x | Stats: mean=-15.392, std-dev=3.0801, var=9.487, median=-15.933, min=-27.929 at (0, 489, 2), max=-0.0010415 at (0, 0, 4), avg-magnitude=15.392
[I]             ---- Histogram ----
                Bin Range            |  Num Elems | Visualization
                (-27.9   , -25.1   ) |       1126 | 
                (-25.1   , -22.3   ) |       5617 | 
                (-22.3   , -19.6   ) |     107791 | ######
                (-19.6   , -16.8   ) |     679836 | ########################################
                (-16.8   , -14     ) |     585686 | ##################################
                (-14     , -11.2   ) |     384150 | ######################
                (-11.2   , -8.38   ) |     209469 | ############
                (-8.38   , -5.59   ) |      22838 | #
                (-5.59   , -2.79   ) |        479 | 
                (-2.79   , -0.00104) |       1008 | 
[I]         trt-runner-N0-08/14/24-04:33:32: x | Stats: mean=-15.392, std-dev=3.0801, var=9.487, median=-15.933, min=-27.929 at (0, 489, 2), max=-0.0010418 at (0, 0, 4), avg-magnitude=15.392
[I]             ---- Histogram ----
                Bin Range            |  Num Elems | Visualization
                (-27.9   , -25.1   ) |       1126 | 
                (-25.1   , -22.3   ) |       5617 | 
                (-22.3   , -19.6   ) |     107793 | ######
                (-19.6   , -16.8   ) |     679836 | ########################################
                (-16.8   , -14     ) |     585686 | ##################################
                (-14     , -11.2   ) |     384146 | ######################
                (-11.2   , -8.38   ) |     209471 | ############
                (-8.38   , -5.59   ) |      22838 | #
                (-5.59   , -2.79   ) |        479 | 
                (-2.79   , -0.00104) |       1008 | 
[I]         Error Metrics: x
[I]             Minimum Required Tolerance: elemwise error | [abs=0.00096989] OR [rel=0.00058633] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=2.6239e-05, std-dev=3.3076e-05, var=1.094e-09, median=1.7166e-05, min=0 at (0, 1, 150), max=0.00096989 at (0, 489, 4), avg-magnitude=2.6239e-05
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 9.7e-05 ) |    1945723 | ########################################
                    (9.7e-05 , 0.000194) |      42371 | 
                    (0.000194, 0.000291) |       4457 | 
                    (0.000291, 0.000388) |       2794 | 
                    (0.000388, 0.000485) |       1888 | 
                    (0.000485, 0.000582) |        648 | 
                    (0.000582, 0.000679) |         71 | 
                    (0.000679, 0.000776) |         31 | 
                    (0.000776, 0.000873) |         14 | 
                    (0.000873, 0.00097 ) |          3 | 
[I]             Relative Difference | Stats: mean=1.8316e-06, std-dev=2.9922e-06, var=8.9532e-12, median=1.1515e-06, min=0 at (0, 1, 150), max=0.00058633 at (0, 493, 0), avg-magnitude=1.8316e-06
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 5.86e-05) |    1997412 | ########################################
                    (5.86e-05, 0.000117) |        477 | 
                    (0.000117, 0.000176) |         91 | 
                    (0.000176, 0.000235) |         14 | 
                    (0.000235, 0.000293) |          2 | 
                    (0.000293, 0.000352) |          2 | 
                    (0.000352, 0.00041 ) |          0 | 
                    (0.00041 , 0.000469) |          1 | 
                    (0.000469, 0.000528) |          0 | 
                    (0.000528, 0.000586) |          1 | 
[E]         FAILED | Output: 'x' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['x']

trt10.x with fp32

[I] trt-runner-N0-08/14/24-04:04:19    
    ---- Inference Input(s) ----
    {source [dtype=float32, shape=(1, 160000)],
     wav_lens [dtype=float32, shape=(1, 1)]}
[I] trt-runner-N0-08/14/24-04:04:19    
    ---- Inference Output(s) ----
    {x [dtype=float32, shape=(1, 999, 2000)]}
[I] trt-runner-N0-08/14/24-04:04:19     | Completed 1 iteration(s) in 39.22 ms | Average inference time: 39.22 ms.
[I] Accuracy Comparison | onnxrt-runner-N0-08/14/24-04:04:19 vs. trt-runner-N0-08/14/24-04:04:19
[I]     Comparing Output: 'x' (dtype=float32, shape=(1, 999, 2000)) with 'x' (dtype=float32, shape=(1, 999, 2000))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         onnxrt-runner-N0-08/14/24-04:04:19: x | Stats: mean=-15.392, std-dev=3.0801, var=9.487, median=-15.933, min=-27.929 at (0, 489, 2), max=-0.0010415 at (0, 0, 4), avg-magnitude=15.392
[I]             ---- Histogram ----
                Bin Range            |  Num Elems | Visualization
                (-27.9   , -25.1   ) |       1126 | 
                (-25.1   , -22.3   ) |       5617 | 
                (-22.3   , -19.6   ) |     107791 | ######
                (-19.6   , -16.8   ) |     679836 | ########################################
                (-16.8   , -14     ) |     585686 | ##################################
                (-14     , -11.2   ) |     384150 | ######################
                (-11.2   , -8.38   ) |     209469 | ############
                (-8.38   , -5.59   ) |      22838 | #
                (-5.59   , -2.79   ) |        479 | 
                (-2.79   , -0.00104) |       1008 | 
[I]         trt-runner-N0-08/14/24-04:04:19: x | Stats: mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0, 0), max=nan at (0, 0, 0), avg-magnitude=nan
[I]             None
[I]         Error Metrics: x
[I]             Minimum Required Tolerance: elemwise error | [abs=nan] OR [rel=nan] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0, 0), max=nan at (0, 0, 0), avg-magnitude=nan
[I]                 
[I]             Relative Difference | Stats: mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0, 0), max=nan at (0, 0, 0), avg-magnitude=nan
[I]                 
[E]         FAILED | Output: 'x' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['x']
[E] Accuracy Summary | onnxrt-runner-N0-08/14/24-04:04:19 vs. trt-runner-N0-08/14/24-04:04:19 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 32.87
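When comparing many container versions, it can help to scan saved polygraphy logs automatically for runners whose output statistics are nan. The sketch below is one way to do that. It assumes the "Stats: mean=..." line layout shown in the logs above and strips terminal color codes first; the function name and the abbreviated sample text are illustrative.

```python
import re

# Terminal color codes such as ESC[38;5;9m or ESC[0m (ESC may be lost in pastes).
ANSI_ESCAPE = re.compile(r"\x1b?\[[0-9;]*m")
# Per-runner stats lines, e.g. "[I]   <runner>: <output> | Stats: mean=..., ...".
STATS_LINE = re.compile(
    r"\[I\][ \t]+(?P<runner>\S+): (?P<output>\S+) \| Stats: mean=(?P<mean>[^,]+),"
)


def runners_with_nan(log_text):
    """Return (runner, output) pairs whose reported mean is nan."""
    clean = ANSI_ESCAPE.sub("", log_text)
    flagged = []
    for match in STATS_LINE.finditer(clean):
        if match.group("mean").strip().lower() == "nan":
            flagged.append((match.group("runner"), match.group("output")))
    return flagged


# Abbreviated sample in the same layout as the log above.
sample = (
    "[I]         onnxrt-runner-N0-08/14/24-04:04:19: x | Stats: mean=-15.392, std-dev=3.0801\n"
    "[I]         trt-runner-N0-08/14/24-04:04:19: x | Stats: mean=nan, std-dev=nan\n"
    "\x1b[38;5;9m[E]     FAILED | Mismatched outputs: ['x']\x1b[0m\n"
)
print(runners_with_nan(sample))
```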

moraxu · 2024-09-27T22:59:17Z

@kimdwkimdw Can you send the model to [email protected]? I can file an internal bug for this.

kimdwkimdw · 2024-10-16T02:16:08Z

@moraxu I've shared a Google Drive link with [email protected].
Check the 'raw_model' folder. It contains the model file 2024.model.onnx.

I also shared it with [email protected] last year.

yuanyao-nv · 2024-10-16T18:35:35Z

@kimdwkimdw Can you please also try the latest TRT 10.5? There are some known accuracy issues that have been fixed in 10.5.

kimdwkimdw · 2024-10-17T04:56:05Z

@yuanyao-nv

Currently, all of our tests are being conducted using the following official TensorRT container, but it appears that version 10.5 is not yet included in the 24.09 release:
TensorRT 10.4.0.26 in NGC 24.09 - https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt/tags

Once an update is available, I will test with version 10.5.

moraxu added the triaged and Accuracy labels Sep 16, 2024
