[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448

Open
3 tasks done
schrodingho opened this issue Jan 15, 2025 · 4 comments
Labels
performance Performance related topics support_request

Comments

@schrodingho

schrodingho commented Jan 15, 2025

OpenVINO Version

Master Branch

Operating System

Windows System

Device used for inference

dGPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

https://github.com/autonomousvision/unimatch

Model quantization

No

Target Platform

OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0

Performance issue description

I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.

To reduce latency, I replaced the PyTorch call F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True) with a decomposed version (from this implementation). In benchmarks, this change reduced the average latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or can you suggest other optimizations, such as a custom OpenCL kernel for GridSample? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model and opt_unimatch for the modified one).
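
For readers who cannot follow the link, here is a minimal sketch of what a decomposed bilinear grid_sample can look like. It is not the implementation linked above, and the function name grid_sample_bilinear_zeros is hypothetical. It assumes 4D inputs, mode="bilinear", padding_mode="zeros", and align_corners=True, and it replaces the single operator with floor, clamp, and gather operations. Whether this particular decomposition is faster on a given GPU has not been measured here.

```python
import torch
import torch.nn.functional as F


def grid_sample_bilinear_zeros(inp: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Bilinear sampling with zero padding and align_corners=True.

    inp:  [N, C, H, W]
    grid: [N, H_out, W_out, 2], last dim is (x, y) in [-1, 1]
    """
    n, c, h, w = inp.shape
    _, h_out, w_out, _ = grid.shape
    p = h_out * w_out

    # align_corners=True: -1 maps to pixel 0 and +1 maps to pixel (size - 1).
    x = (grid[..., 0] + 1) * 0.5 * (w - 1)
    y = (grid[..., 1] + 1) * 0.5 * (h - 1)

    # Integer coordinates of the four neighbouring pixels.
    x0 = torch.floor(x)
    y0 = torch.floor(y)
    x1 = x0 + 1
    y1 = y0 + 1

    # Bilinear weights along each axis.
    wx1 = x - x0
    wx0 = 1 - wx1
    wy1 = y - y0
    wy0 = 1 - wy1

    flat = inp.reshape(n, c, h * w)

    def corner(ix: torch.Tensor, iy: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        # Out-of-range corners contribute zero (padding_mode="zeros").
        inside = (ix >= 0) & (ix <= w - 1) & (iy >= 0) & (iy <= h - 1)
        ix_c = ix.clamp(0, w - 1).long()
        iy_c = iy.clamp(0, h - 1).long()
        idx = (iy_c * w + ix_c).reshape(n, 1, p).expand(n, c, p)
        vals = torch.gather(flat, 2, idx)
        wgt = (weight * inside.to(inp.dtype)).reshape(n, 1, p)
        return vals * wgt

    out = (
        corner(x0, y0, wx0 * wy0)
        + corner(x1, y0, wx1 * wy0)
        + corner(x0, y1, wx0 * wy1)
        + corner(x1, y1, wx1 * wy1)
    )
    return out.reshape(n, c, h_out, w_out)
```

On random inputs the result can be compared against F.grid_sample with the same arguments using torch.allclose before swapping it into a model.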

ori_unimatch:
Image

benchmark_app -m ori_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 75.63 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 3059.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 460.74 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to 
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            131 iterations
[ INFO ] Duration:         60207.90 ms
[ INFO ] Latency:
[ INFO ]    Median:        458.77 ms
[ INFO ]    Average:       458.70 ms
[ INFO ]    Min:           452.05 ms
[ INFO ]    Max:           465.72 ms
[ INFO ] Throughput:   4.35 FPS

opt_unimatch:
Image

benchmark_app -m opt_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 80.84 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 8530.97 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 242.54 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            278 iterations
[ INFO ] Duration:         60109.22 ms
[ INFO ] Latency:
[ INFO ]    Median:        215.37 ms
[ INFO ]    Average:       215.41 ms
[ INFO ]    Min:           205.85 ms
[ INFO ]    Max:           229.31 ms
[ INFO ] Throughput:   9.25 FPS
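
The two logs above can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the logs. The formula iterations × batch / duration is an assumption about how the throughput figure is derived, but it reproduces both reported values.

```python
# Numbers copied from the two benchmark_app logs above.
BATCH = 2
RUNS = {
    "ori_unimatch": {"iterations": 131, "duration_ms": 60207.90, "avg_latency_ms": 458.70},
    "opt_unimatch": {"iterations": 278, "duration_ms": 60109.22, "avg_latency_ms": 215.41},
}


def throughput_fps(iterations: int, duration_ms: float, batch: int) -> float:
    """Frames per second, assuming each iteration processes `batch` frames."""
    return iterations * batch / (duration_ms / 1000.0)


def latency_speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the second run is in terms of average latency."""
    return before_ms / after_ms


for name, run in RUNS.items():
    fps = throughput_fps(run["iterations"], run["duration_ms"], BATCH)
    print(f"{name}: {fps:.2f} FPS")

speedup = latency_speedup(
    RUNS["ori_unimatch"]["avg_latency_ms"], RUNS["opt_unimatch"]["avg_latency_ms"]
)
print(f"latency speedup: {speedup:.2f}x")
```

The decomposed model is therefore roughly 2.1 times faster, which corresponds to a latency reduction of about 53%.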

Step-by-step reproduction

  1. Clone the Unimatch repository.
  2. Download the pretrained model GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
  3. Follow the script gmflow_demo.sh in Scripts to run the model:
python main_flow.py \
--inference_dir demo/flow-davis \
--resume pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth \
--output_path output/gmflow-scale2-regrefine6-davis \
--padding_factor 16 \
--upsample_factor 4 \
--num_scales 2 \
--attn_splits_list 2 8 \
--corr_radius_list -1 4 \
--prop_radius_list -1 1 \
--reg_refine \
--num_reg_refine 2
  4. Add the OpenVINO conversion code to the script, then convert and save the model:
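from pathlib import Path

import openvino as ov
import torch

# model_without_ddp is the model object already created by the script.
# Conversion runs on the CPU; the target device is chosen later in benchmark_app.
ov_opt_device = "cpu"
model_without_ddp = model_without_ddp.to(ov_opt_device)

FIG_H = 320
FIG_W = 576

# Dummy inputs with the same shape that is benchmarked: [2, 3, 320, 576].
dummy_input1 = torch.randn(2, 3, FIG_H, FIG_W)
dummy_input2 = torch.randn(2, 3, FIG_H, FIG_W)

example_inputs = (
    dummy_input1,
    dummy_input2,
)
inputs = {
    "img0": dummy_input1,
    "img1": dummy_input2,
}
# (name, static shape) pairs for the converted model's inputs.
input_info = [(name, list(inp.shape)) for name, inp in inputs.items()]
UNIMATCH_OV_PATH = Path("opt_unimatch.xml")
model_without_ddp.eval()

with torch.no_grad():
    ov_model = ov.convert_model(model_without_ddp, input=input_info, example_input=example_inputs)
    # Store the weights in FP16 to match the f16 inference precision used below.
    ov.save_model(ov_model, UNIMATCH_OV_PATH, compress_to_fp16=True)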
  5. Use benchmark_app to profile it:
benchmark_app -m %converted_model%.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
  6. Replace F.grid_sample in /unimatch/matching.py with this implementation, then repeat steps 4 and 5.
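
To see which layer types dominate the detailed_counters report written by benchmark_app, the per-layer times can be summed by type. The sketch below is an illustration only: the semicolon delimiter and the column names layerType and realTime (ms) are assumptions about the report layout, so check them against the header of the actual file. The function name time_by_layer_type is hypothetical.

```python
import csv
import io
from collections import defaultdict


def time_by_layer_type(csv_text, type_col="layerType", time_col="realTime (ms)", delimiter=";"):
    """Return (layer_type, total_ms, share_percent) tuples, slowest type first.

    The default column names and delimiter are assumptions about the report
    format; pass different values if the file header differs.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        layer_type = row.get(type_col)
        if not layer_type:
            continue
        try:
            totals[layer_type] += float(row.get(time_col))
        except (TypeError, ValueError):
            # Skip summary rows or rows without a numeric time.
            continue
    grand_total = sum(totals.values())
    rows = [
        (layer_type, ms, 100.0 * ms / grand_total if grand_total else 0.0)
        for layer_type, ms in totals.items()
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Reading the report file into a string and passing it to time_by_layer_type gives a quick view of how much of the total time each layer type, such as GridSample, accounts for.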

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
@schrodingho added the performance and support_request labels Jan 15, 2025
@dnkurek
Contributor

dnkurek commented Jan 15, 2025

Hi, do you also have the same issue with the iGPU or CPU in your system?

It could be that the grid_sample kernel was simply not optimized at all, since you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.

@schrodingho
Author

Hi, I just ran benchmarks on the iGPU (UHD 770) and the CPU (i9-13900K). The iGPU has the same issue (grid_sample_ref is slow):

ori_unimatch
Image

opt_unimatch
Image

The CPU does not seem to have this issue (the original model is faster):

ori_unimatch
Image

opt_unimatch
Image

@dnkurek
Contributor

dnkurek commented Jan 16, 2025

Yeah, so it looks like grid_sample_ref needs to be optimized, perhaps by adding a grid_sample_opt version...

@mlukasze
Contributor

mlukasze commented Jan 24, 2025

ref ticket: CVS-161002

hey @schrodingho
we've checked a few things, and tracing fails: "attn_splits_list" should not be None, but when we try to trace the model you suggested, attn_splits_list is not set.
Could you share how exactly the PyTorch model was created before it was passed to convert_model(), or provide a working script?
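
How the author prepared the model is not stated in this thread. One possible way to avoid a None attn_splits_list during tracing is to wrap the model so that only the two images are traced inputs and the remaining arguments are fixed. The sketch below assumes the model's forward accepts keyword arguments named after the command-line flags in the reproduction steps; only attn_splits_list is confirmed by the error described above. FixedKwargsWrapper and DEMO_KWARGS are hypothetical names.

```python
import torch
from torch import nn


class FixedKwargsWrapper(nn.Module):
    """Expose forward(img0, img1) and pass fixed keyword arguments through."""

    def __init__(self, model: nn.Module, **fixed_kwargs):
        super().__init__()
        self.model = model
        self.fixed_kwargs = fixed_kwargs

    def forward(self, img0: torch.Tensor, img1: torch.Tensor):
        return self.model(img0, img1, **self.fixed_kwargs)


# Values taken from the demo command in the reproduction steps.
# The keyword names are assumed to match the model's forward signature.
DEMO_KWARGS = dict(
    attn_splits_list=[2, 8],
    corr_radius_list=[-1, 4],
    prop_radius_list=[-1, 1],
    num_reg_refine=2,
)
```

A wrapped model created as FixedKwargsWrapper(model, **DEMO_KWARGS) could then be passed to convert_model() in place of the bare model, provided the wrapped forward returns tensors that the converter can handle.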

4 participants