
Why does INT8 quantization occupy more GPU memory than float16 (TensorRT quantization)? #69

Open
nameli0722 opened this issue May 29, 2023 · 8 comments

@nameli0722

Please describe your problem in English if possible; it will be helpful to more people.
Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:
1.
2.

Screenshots
If applicable, add screenshots to help explain your problem.

System environment (please complete the following information):

  • Device:
  • OS:
  • Driver version:
  • CUDA version:
  • TensorRT version:
  • Others:

CMake output

Running output

@nameli0722
Author

I did the quantization with tiny-tensorrt.

@zerollzeng
Owner

It's expected: the INT8 quantization process requires FP32 inference to compute the scales.
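
For context, a minimal sketch of what that calibration step looks like, assuming the standard TensorRT 8.x C++ builder API rather than tiny-tensorrt's wrapper; `builder`, `network`, and `calibrator` are placeholders the caller supplies:

```cpp
// Minimal sketch, assuming the standard TensorRT 8.x C++ API (not the
// tiny-tensorrt wrapper).
#include <NvInfer.h>
#include <memory>

nvinfer1::IHostMemory* buildInt8Engine(nvinfer1::IBuilder& builder,
                                       nvinfer1::INetworkDefinition& network,
                                       nvinfer1::IInt8Calibrator* calibrator) {
    std::unique_ptr<nvinfer1::IBuilderConfig> config{builder.createBuilderConfig()};
    // Enabling INT8 makes the builder run the calibrator's batches through
    // the network at higher precision to collect activation ranges (the
    // "scales"); this extra calibration pass is where additional memory goes.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    config->setInt8Calibrator(calibrator);
    // Allow FP16 fallback for layers that have no INT8 implementation.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    return builder.buildSerializedNetwork(network, *config);
}
```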

@nameli0722
Author

I don't understand what you mean. I used INT8 quantization with a calibration set, and the inference results are correct, but the GPU memory usage is even larger than with float16.

@nameli0722
Author

@zerollzeng Thank you very much!

@QiangZhangCV

Hello, could you please provide the GPU usage and inference speed with INT8 and FP16?

@nameli0722
Author

> Hello, could you please provide the GPU usage and inference speed with INT8 and FP16?

thank you!

  • original .pt (PyTorch) model: GPU usage 5099 MB, inference time 1.7 s
  • tiny-tensorrt float16: GPU usage 3993 MB, inference time 0.4 s
  • tiny-tensorrt int8: GPU usage 4509 MB, inference time 0.4 s

All results are correct.
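
For anyone comparing numbers like these, here is a hedged sketch of one way to sample device memory from inside the process, using only the CUDA runtime API (not from this issue). Note that `cudaMemGetInfo` reports device-wide usage, so it also counts other processes, unlike nvidia-smi's per-process column:

```cpp
// Sketch: sample device-wide GPU memory with the CUDA runtime API.
// Caveat: cudaMemGetInfo reports the whole device, so the figure also
// includes memory held by other processes on the same GPU.
#include <cuda_runtime.h>
#include <cstdio>

void reportGpuMemoryMiB() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("GPU memory in use: %.0f MiB\n",
                (totalBytes - freeBytes) / (1024.0 * 1024.0));
}
```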

@zerollzeng
Owner

How about building the engine first and then loading it? I think that can save some memory.

Anyway, I'll try to improve this.
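
A minimal sketch of that build-once/load-later flow, assuming the standard TensorRT 8.x C++ runtime API rather than tiny-tensorrt's interface:

```cpp
// Minimal sketch, assuming the TensorRT 8.x C++ API: deserialize a
// prebuilt engine file in the inference process, so the builder (and
// the INT8 calibration pass) never run, and never allocate memory,
// at deploy time.
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

nvinfer1::ICudaEngine* loadEngine(const std::string& path,
                                  nvinfer1::ILogger& logger) {
    // Read the serialized engine produced by an earlier build step.
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    // Deserialize: only runtime structures are allocated, no builder.
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```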

@nameli0722
Author

> How about building the engine first and then loading it? I think that can save some memory.
>
> Anyway, I'll try to improve this.

./tinyexec --onnx /data/sdb/manager/RX0249_liming/coronary_model/onnx_model/unet.onnx --mode 2 --batch_size 1 --save_engine /data/sdb/manager/RX0249_liming/coronary_model/tiny_trt_model/float16_int8_calib/unet.trt --int8 --calibrate_data /data/sdb/manager/RX0249_liming/calib_data/tinyrt_data/

thank you!
