Improve whisper inference speed #684

@IgorSwat

Description

Profiling the Whisper-tiny model (encoder and decoder) revealed a significant inference slowdown: certain operators are not delegated to the XNNPACK backend during the export stage and instead run on portable kernels.
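For context, a minimal sketch of the lowering step where delegation happens, assuming the standard ExecuTorch `to_edge_transform_and_lower` flow with `XnnpackPartitioner` (the helper name `lower_to_xnnpack` is illustrative, not from this issue):

```python
def lower_to_xnnpack(model, example_inputs):
    """Hypothetical helper: export a module and lower supported ops
    to the XNNPACK delegate.

    Ops the partitioner cannot claim fall back to portable kernels
    and show up in profiles as OPERATOR_CALL rather than DELEGATE_CALL.
    """
    # Imports are kept local so this sketch stays importable even
    # where torch/executorch are not installed.
    import torch
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )
    from executorch.exir import to_edge_transform_and_lower

    exported = torch.export.export(model.eval(), example_inputs)
    edge = to_edge_transform_and_lower(
        exported, partitioner=[XnnpackPartitioner()]
    )
    return edge.to_executorch()
```

Anything the partitioner declines at this stage is what later appears as a non-delegated operator in the runtime profile.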

The non-delegated operators account for roughly two-thirds of inference time in the decoder module and around 40% in the encoder module. The decoder's profiling results are shown below; OPERATOR_CALL aggregates all non-delegated operators, while DELEGATE_CALL aggregates all delegated ones:

| Op Name | Total Time (ms) | Share (%) | Calls | Delegated | Delegated (%) |
|---|---:|---:|---:|---:|---:|
| Method::execute | 138.49 | 100.00% | 1 | 0 | 0.00% |
| OPERATOR_CALL | 90.902 | 65.64% | 286 | 0 | 0.00% |
| native_call_mm.out | 70.643 | 51.01% | 1 | 0 | 0.00% |
| DELEGATE_CALL | 47.446 | 34.26% | 89 | 89 | 100.00% |
| Fully Connected (NC, F32) GEMM #1 | 29.328 | 21.18% | 40 | 40 | 100.00% |
| Batch Matrix Multiply (NC, F32) GEMM #1 | 8.133 | 5.87% | 16 | 16 | 100.00% |
| Transpose (ND, X32) #1 | 6.457 | 4.66% | 41 | 41 | 100.00% |
| native_call_where.self_out | 5.47 | 3.95% | 9 | 0 | 0.00% |
| native_call_eq.Scalar_out | 3.23 | 2.33% | 8 | 0 | 0.00% |
| native_call_expand_copy.out | 2.635 | 1.90% | 36 | 0 | 0.00% |
| native_call_gelu.out | 2.438 | 1.76% | 4 | 0 | 0.00% |
| Softmax (NC, F32) #1 | 1.318 | 0.95% | 8 | 8 | 100.00% |
| native_call_index.Tensor_out | 1.1 | 0.79% | 1 | 0 | 0.00% |
| native_call_slice_copy.Tensor_out | 0.838 | 0.61% | 24 | 0 | 0.00% |
| native_call_clone.out | 0.815 | 0.59% | 49 | 0 | 0.00% |
| native_call_full_like.out | 0.644 | 0.47% | 8 | 0 | 0.00% |
| native_call_view_copy.out | 0.618 | 0.45% | 1 | 0 | 0.00% |
| native_call_native_layer_norm.out | 0.598 | 0.43% | 13 | 0 | 0.00% |
| Transpose (ND, X32) #2 | 0.541 | 0.39% | 16 | 16 | 100.00% |
| Add (ND) #1 | 0.514 | 0.37% | 17 | 17 | 100.00% |
| native_call_mul.Scalar_out | 0.438 | 0.32% | 16 | 0 | 0.00% |
| native_call_any.out | 0.329 | 0.24% | 8 | 0 | 0.00% |
| native_call_logical_not.out | 0.317 | 0.23% | 16 | 0 | 0.00% |
| native_call_gt.Tensor_out | 0.204 | 0.15% | 1 | 0 | 0.00% |
| native_call_unsqueeze_copy.out | 0.092 | 0.07% | 11 | 0 | 0.00% |
| native_call_sub.out | 0.092 | 0.07% | 1 | 0 | 0.00% |
| native_call__to_dim_order_copy.out | 0.07 | 0.05% | 1 | 0 | 0.00% |
| native_call_ge.Scalar_out | 0.049 | 0.04% | 1 | 0 | 0.00% |
| native_call_embedding.out | 0.036 | 0.03% | 1 | 0 | 0.00% |
| Multiply (ND) #1 | 0.032 | 0.02% | 1 | 1 | 100.00% |
| native_call_arange.start_out | 0.016 | 0.01% | 4 | 0 | 0.00% |
| native_call_full.out | 0.009 | 0.01% | 1 | 0 | 0.00% |
| native_call_repeat.out | 0.008 | 0.01% | 1 | 0 | 0.00% |
| native_call_scalar_tensor.out | 0.004 | 0.00% | 1 | 0 | 0.00% |
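The aggregate shares above can be reproduced directly from the reported timings; this small check uses only the three aggregate numbers copied from the table:

```python
# Sanity-check the aggregated shares from the decoder profile (times in ms).
TOTAL_MS = 138.49          # Method::execute
NON_DELEGATED_MS = 90.902  # OPERATOR_CALL (portable/ATen kernels)
DELEGATED_MS = 47.446      # DELEGATE_CALL (XNNPACK)

def share(part_ms, total_ms=TOTAL_MS):
    """Return a time share as a percentage, rounded to 2 decimals."""
    return round(part_ms / total_ms * 100, 2)

print(share(NON_DELEGATED_MS))  # 65.64 -> roughly two-thirds of decode time
print(share(DELEGATED_MS))      # 34.26
```

Note that a single un-delegated `mm.out` call already accounts for 51% of the total, so delegating the matmul path alone would recover most of the gap.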

Labels: improvement, model