
Accuracy loss of TensorRT 8.6 when running INT8 Quantized Resnet18 on GPU A4000 #4079

Open
YixuanSeanZhou opened this issue Aug 14, 2024 · 11 comments

@YixuanSeanZhou

Description

When performing ResNet18 PTQ using TensorRT ModelOpt, I encountered the following issue when compiling the model with TRT.

First off, I started with a pretrained resnet18 from torchvision. I replaced the last fully connected layer to fit my dataset (for example, CIFAR-10). I also replaced all the skip connections (the "+") with an ElementwiseAdd layer and defined its quantization myself (code attached at the end). The reason I did this is to facilitate the Q/DQ fusion so that every layer can run in INT8.
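
For illustration, such a wrapper can look roughly like the following (this is only a sketch, not the exact module attached at the end; it uses torch's built-in fake-quant op rather than ModelOpt's quantizer classes, and the scale/zero-point values are placeholders that would normally come from calibration):

import torch
import torch.nn as nn

class ElementwiseAdd(nn.Module):
    # Replaces the "+" in the ResNet skip connection so a quantizer can sit on its inputs.
    def forward(self, x, y):
        # Placeholder scale/zero-point; real values come from calibration.
        x = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
        y = torch.fake_quantize_per_tensor_affine(y, 0.1, 0, -128, 127)
        return x + y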

Then, when compiling the exported ONNX model with TRT, I found that the TRT outputs differ significantly from both the fake-Q/DQ model in Python and the fake-Q/DQ ONNX model run with ONNX Runtime (np.allclose with 1e-3 as the threshold fails). Comparing the TRT and native outputs, the classification results disagree on ~2.3% of the samples.

I discussed this with the TRT ModelOpt team in this issue, and they suggested filing a bug report here.

Environment

TensorRT Version: 8.6.1

NVIDIA GPU: A4000

NVIDIA Driver Version: 535.183.01

CUDA Version: 12.2

Python Version (if applicable): 3.10.1

PyTorch Version (if applicable): 2.4.0+cu124

Relevant Files

Model link: You can download the onnx model and the TRT engine here: https://file.io/GnuiEMNeebQ1

Steps To Reproduce

Run the TRT engine using the Python API and the ONNX model on the CIFAR-10 test set using the following data loader, and compare the results.

import torch
from torchvision import datasets

testset = datasets.CIFAR10(root='./data', download=True, train=False, transform=transform)  # same preprocessing as calibration
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)
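
A scripted cross-check can also be done with Polygraphy's Python API (a sketch assuming standard Polygraphy usage; "model_qdq.onnx" is a placeholder for the exported Q/DQ model, and Comparator.run feeds random inputs unless a data_loader is supplied):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator, CompareFunc

# Build an INT8 engine from the Q/DQ ONNX model and compare its outputs against ONNX Runtime.
build_engine = EngineFromNetwork(NetworkFromOnnxPath("model_qdq.onnx"), config=CreateConfig(int8=True))
runners = [TrtRunner(build_engine), OnnxrtRunner(SessionFromOnnx("model_qdq.onnx"))]

results = Comparator.run(runners)
Comparator.compare_accuracy(results, compare_func=CompareFunc.simple(atol=1e-3))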

Have you tried the latest release?: I haven't tried TRT 10, and we don't plan to upgrade in the short term. I was under the impression that 8.6 should be okay.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
Yes. Running the ONNX model with ONNX Runtime shows ~1% disagreement with the native model.

Appendix

Visualizing the TRT engine, it looks exactly as I expected, with everything fused into INT8 kernels.
[Attached image: trt_engine_0, a visualization of the TRT engine]

@YixuanSeanZhou changed the title from "Accuracy loss of TensorRT 8.6 when running Quantized Resnet18 on GPU A4000" to "Accuracy loss of TensorRT 8.6 when running INT8 Quantized Resnet18 on GPU A4000" on Aug 14, 2024
@lix19937

Did you start from a pretrained model for QAT?

If yes, does the FP32 (un-quantized) model also show inconsistent results between TRT and ONNX Runtime?

Have you tried a different version of TRT?

Have you tried other calibration methods?

@YixuanSeanZhou
Author

YixuanSeanZhou commented Aug 21, 2024

@lix19937 Thank you very much for helping out!

Did you start from a pretrained model for QAT?

Yes. To be specific, what I did was PTQ using TRT ModelOpt.

does the FP32 (un-quantized) model also show inconsistent results between TRT and ONNX Runtime?

No. The FP32 model's outputs, when converted to TRT, were almost identical to the native PyTorch model's. I didn't verify ONNX Runtime for this.

Have you tried a different version of TRT?

Unfortunately this is not easy to do right now on my side. Are you expecting this to be fixed in TRT 10? I was under the impression that Q/DQ quantization has been supported since before 8.6, so it shouldn't be a version issue, but correct me if I am wrong.

Have you tried other calibration methods?

Do you mean the implicit calibration methods within TRT? If so, I am not using those; the ONNX model I provide to TRT already contains the Q/DQ nodes. If you mean other calibration methods in TRT ModelOpt, I tried both the smoothing (SmoothQuant) and the default (MinMax) calibration methods, and they both show the same regression.
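
For reference, the ModelOpt PTQ call roughly follows the documented flow below (a minimal sketch; calib_loader is a placeholder for my calibration data loader):

import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run calibration batches through the model to collect activation ranges.
    for images, _ in calib_loader:
        model(images.cuda())

# mtq.INT8_DEFAULT_CFG uses MinMax calibration; mtq.INT8_SMOOTHQUANT_CFG enables SmoothQuant.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)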

Thanks and looking forward to your responses!

@akhilg-nv
Collaborator

akhilg-nv commented Aug 30, 2024

Following up from the discussion on your issue in TRT ModelOpt, it is possible the accuracy degradation comes from the fusions performed with INT8 convolution. Could you try removing Q/DQ for all convolutional layers (not just the first one) and compare the accuracy with torch? You should be able to do this by modifying the config as you've done before, or by using a filter function for the convolutional layers in your torch model. It is possible your specific application may work better for now without quantizing conv layers, though this will also help us in understanding & investigating the root cause of the accuracy discrepancy you are seeing.

import re
import modelopt.torch.quantization as mtq

def filter_func(name):
    # Match the quantizer names of the conv layers you want to skip ("etc" stands in for the rest).
    pattern = re.compile(
        r".*(conv_in|conv_out|conv_shortcut|etc).*"
    )
    return pattern.match(name) is not None

# and/or apply filter_func when creating your quantization config; see demo/Diffusion/utils_modelopt.py for an example.
mtq.disable_quantizer(model, filter_func)

@akhilg-nv added the triaged (Issue has been triaged by maintainers), Accuracy, and Quantization: PTQ labels on Aug 30, 2024
@YixuanSeanZhou
Author

YixuanSeanZhou commented Sep 12, 2024

Got it, thanks for the follow-up @akhilg-nv. Sorry for the delay; I was away last week and this week. I can try the experiment you suggested next week.

To clarify: if we skip quantizing the conv layers, the only quantized layer left is the last one, the fully connected classification layer. Is that okay?

It is possible your specific application may work better for now without quantizing conv layers,

Unfortunately that won't work for us; the goal of quantizing those layers is to accelerate model inference latency. However, I will certainly run the experiment to see whether the regression goes away.

@YixuanSeanZhou
Author

Hi @akhilg-nv, I disabled quantization for all the Conv layers. What remains quantized is:

  1. maxpool and average pool
  2. skip connections (adds)
  3. final layer that does the classification

I found that the TRT outputs still differ from the native outputs (the classification distribution) by >= 1e-3. I also noticed that for every example where the native model makes a correct prediction, the TRT model also makes a correct prediction. In other words, although the distributions differ, the predictions are equal after taking the argmax.

However, on the examples where both versions produce a wrong output, the predicted classes differ (for example, the correct label is 0, the native model gives 1, and TRT can give 2).

Does this mean there is something wrong with how TRT resolves Q/DQ in general? I would really expect the TRT model's output to be close to the native model's output, pointwise. Otherwise, if I want to perform a segmentation task, for example, I won't be able to take an argmax on the predicted pixels.
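
A comparison along these lines can be expressed as follows (a sketch; trt_probs and ref_probs are placeholder arrays holding the stacked per-sample output distributions of the TRT engine and the native model):

import numpy as np

# trt_probs, ref_probs: (N, 10) arrays of per-sample class distributions.
max_abs_diff = np.max(np.abs(trt_probs - ref_probs))
top1_agreement = np.mean(np.argmax(trt_probs, axis=1) == np.argmax(ref_probs, axis=1))
print(f"max |diff| = {max_abs_diff:.4f}, top-1 agreement = {top1_agreement:.4%}")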

@akhilg-nv
Collaborator

akhilg-nv commented Sep 18, 2024

Thanks for running this experiment @YixuanSeanZhou. To confirm: you're seeing that the partially quantized model run with ONNX Runtime matches the accuracy of the quantized torch model, but the TRT model's accuracy differs even if you only quantize the pooling, residual connections, and fully connected layer? This is a bit strange, since IIRC your previous bisection experiment revealed accuracy differences after convolution. Some suggestions to look into:

  1. Could you verify what precision the batch normalization layers after convolution are running in? It is possible there is a fusion that is introducing an accuracy discrepancy.
  2. Perhaps try bisecting the TRT model with fp32 convolutions to see where accuracy discrepancies start appearing? (See the sketch after this list.)
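
For example, one way to bisect is to disable conv quantization one stage at a time with the same filter mechanism as above, then re-export and rebuild the engine (a sketch; the layer-name pattern is illustrative and would need to match your model's quantizer names):

import re
import modelopt.torch.quantization as mtq

# Keep layer1 convs quantized; disable quantizers for convs in layers 2-4, then re-export and test.
def bisect_filter(name):
    return re.search(r"layer[234]\..*conv", name) is not None

mtq.disable_quantizer(model, bisect_filter)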

We will investigate this as well on our end, thanks for the detailed information.

@YixuanSeanZhou
Author

Hi @akhilg-nv, thanks for the response. I didn't check the diff between ONNX and native for this partially quantized model. I will double-check that later and get back to you.

Correct, this is surprising to me as well. I thought it was the convolutions that caused the issue, but maybe there is more going on under the hood.

Also, to rule out batch norm as the issue: in this model I actually chose the most basic resnet18, which doesn't even have batchnorm in it.

Bisecting requires more effort; I can hopefully get to it next week!

To provide some data points for your investigation, here are the two ONNX models I have, quantized and unquantized: https://file.io/nA8lR0aZHmXz. I hope this helps in investigating what could potentially be going wrong in TRT. (Or maybe you can figure out what user mistake I could be making!)

Thanks again!

@akhilg-nv
Collaborator

akhilg-nv commented Sep 19, 2024

Hi @YixuanSeanZhou, it's worth noting that there is some expected error after quantizing, so a difference of 1e-3 with no change in positive classification may be expected. Double-checking with torch/ORT will help confirm whether the partially quantized model has significant error or not.

You mention your model does not have BN layers, could you share which model you use from torchvision? Also, the link for the ONNX file seems to have expired, could you re-upload it?

@YixuanSeanZhou
Author

YixuanSeanZhou commented Sep 20, 2024

Hi @akhilg-nv, thanks for taking a look!

Could you try this link: https://file.io/3M5bla347qa0? It seems to be working for me when I try to re-download. I apologize that the previous link didn't work.

The torchvision model I used is the basic resnet18 with the last layer replaced to output 10 classes.

    from torchvision import models
    from torch import nn

    resnet18 = models.resnet18(pretrained=True)
    resnet18.fc = nn.Linear(resnet18.fc.in_features, 10)

it's worth noting that there is some expected error after quantizing, so a difference of 1e-3 with no change in positive classification may be expected.

I think the diff can be worse than that. For the wrong classifications, the two models don't align, and I think the elementwise diff in the distribution can sometimes be pretty large (on the order of 1e-1). Note that 1e-1 is quite significant, as this is a distribution over 10 classes (a softmax over a tensor of size 10).

If we have such a regression, how are we going to apply this approach to other tasks, e.g. segmentation?

Double-checking with torch/ORT will help confirm whether the partially quantized model has significant error or not.

I will try to find time to do it and get back to you.

@akhilg-nv
Collaborator

Hi @YixuanSeanZhou, the new link also doesn't work - I get the following error: "The transfer you requested has been deleted." Perhaps you could try sharing the ONNX model a different way?

Regarding the resnet architecture, I am a bit confused why you say there is no batch norm, since I don't see you removing it in your sample code. Below I've pasted a snippet:

>>> import torchvision
>>> resnet18 = torchvision.models.resnet18(pretrained=True)
>>> resnet18
...
  (layer4): Sequential(
...
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)

@YixuanSeanZhou
Author

Hi @akhilg-nv,

I am terribly sorry for the confusion... I didn't re-check the model architecture. I looked at the ONNX graph, saw there was no BatchNorm, and falsely assumed I had picked a model with batchnorm removed. I think it's just fused into the Conv layers. If it is indeed fused into the Conv layers, then batchnorm is effectively running in FP32. As the attached screenshot shows, the conv layers (and the ops in between) are running in FP32.

[Attached screenshot: graph visualization showing the Conv layers (and the ops in between) running in FP32]
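
As a sanity check, the BN-into-Conv folding can be reproduced directly in PyTorch with the built-in fusion helper (a small sketch, comparing the fused conv against running conv1 and bn1 separately):

import torch
from torch.nn.utils.fusion import fuse_conv_bn_eval
from torchvision import models

m = models.resnet18(pretrained=True).eval()
fused = fuse_conv_bn_eval(m.conv1, m.bn1)  # folds the BN statistics into the conv weights/bias
x = torch.randn(1, 3, 224, 224)
print(torch.allclose(fused(x), m.bn1(m.conv1(x)), atol=1e-4))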

Regarding the ONNX models, can you try this Google Drive link: https://drive.google.com/file/d/1AGHoPgYIRg3dt0ZJz7yVOgTnTR6hPGMw/view?usp=sharing? Thanks again!
