Improve whisper inference speed #684

@IgorSwat

Description

Profiling the Whisper-tiny model (encoder and decoder) revealed a significant inference slowdown: certain operators are not delegated to the XNNPACK backend during the export stage and instead run on portable kernels.
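For context, a minimal sketch of the lowering step where delegation happens, assuming the standard ExecuTorch `to_edge_transform_and_lower` flow with `XnnpackPartitioner` (the helper name `lower_to_xnnpack` is illustrative, not from this issue):

```python
def lower_to_xnnpack(model, example_inputs):
    """Hypothetical helper: export a module and lower supported ops
    to the XNNPACK delegate.

    Ops the partitioner cannot claim fall back to portable kernels
    and show up in profiles as OPERATOR_CALL rather than DELEGATE_CALL.
    """
    # Imports are kept local so this sketch stays importable even
    # where torch/executorch are not installed.
    import torch
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )
    from executorch.exir import to_edge_transform_and_lower

    exported = torch.export.export(model.eval(), example_inputs)
    edge = to_edge_transform_and_lower(
        exported, partitioner=[XnnpackPartitioner()]
    )
    return edge.to_executorch()
```

Anything the partitioner declines at this stage is what later appears as a non-delegated operator in the runtime profile.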

The non-delegated operators account for roughly two-thirds of inference time in the decoder module and around 40% in the encoder module. The decoder's profiling results are shown below; OPERATOR_CALL aggregates all non-delegated operators, while DELEGATE_CALL aggregates all delegated ones:

| Op Name | Total Time (ms) | Share (%) | Calls | Delegated | Delegated (%) |
|---|---:|---:|---:|---:|---:|
| Method::execute | 138.49 | 100.00% | 1 | 0 | 0.00% |
| OPERATOR_CALL | 90.902 | 65.64% | 286 | 0 | 0.00% |
| native_call_mm.out | 70.643 | 51.01% | 1 | 0 | 0.00% |
| DELEGATE_CALL | 47.446 | 34.26% | 89 | 89 | 100.00% |
| Fully Connected (NC, F32) GEMM #1 | 29.328 | 21.18% | 40 | 40 | 100.00% |
| Batch Matrix Multiply (NC, F32) GEMM #1 | 8.133 | 5.87% | 16 | 16 | 100.00% |
| Transpose (ND, X32) #1 | 6.457 | 4.66% | 41 | 41 | 100.00% |
| native_call_where.self_out | 5.47 | 3.95% | 9 | 0 | 0.00% |
| native_call_eq.Scalar_out | 3.23 | 2.33% | 8 | 0 | 0.00% |
| native_call_expand_copy.out | 2.635 | 1.90% | 36 | 0 | 0.00% |
| native_call_gelu.out | 2.438 | 1.76% | 4 | 0 | 0.00% |
| Softmax (NC, F32) #1 | 1.318 | 0.95% | 8 | 8 | 100.00% |
| native_call_index.Tensor_out | 1.1 | 0.79% | 1 | 0 | 0.00% |
| native_call_slice_copy.Tensor_out | 0.838 | 0.61% | 24 | 0 | 0.00% |
| native_call_clone.out | 0.815 | 0.59% | 49 | 0 | 0.00% |
| native_call_full_like.out | 0.644 | 0.47% | 8 | 0 | 0.00% |
| native_call_view_copy.out | 0.618 | 0.45% | 1 | 0 | 0.00% |
| native_call_native_layer_norm.out | 0.598 | 0.43% | 13 | 0 | 0.00% |
| Transpose (ND, X32) #2 | 0.541 | 0.39% | 16 | 16 | 100.00% |
| Add (ND) #1 | 0.514 | 0.37% | 17 | 17 | 100.00% |
| native_call_mul.Scalar_out | 0.438 | 0.32% | 16 | 0 | 0.00% |
| native_call_any.out | 0.329 | 0.24% | 8 | 0 | 0.00% |
| native_call_logical_not.out | 0.317 | 0.23% | 16 | 0 | 0.00% |
| native_call_gt.Tensor_out | 0.204 | 0.15% | 1 | 0 | 0.00% |
| native_call_unsqueeze_copy.out | 0.092 | 0.07% | 11 | 0 | 0.00% |
| native_call_sub.out | 0.092 | 0.07% | 1 | 0 | 0.00% |
| native_call__to_dim_order_copy.out | 0.07 | 0.05% | 1 | 0 | 0.00% |
| native_call_ge.Scalar_out | 0.049 | 0.04% | 1 | 0 | 0.00% |
| native_call_embedding.out | 0.036 | 0.03% | 1 | 0 | 0.00% |
| Multiply (ND) #1 | 0.032 | 0.02% | 1 | 1 | 100.00% |
| native_call_arange.start_out | 0.016 | 0.01% | 4 | 0 | 0.00% |
| native_call_full.out | 0.009 | 0.01% | 1 | 0 | 0.00% |
| native_call_repeat.out | 0.008 | 0.01% | 1 | 0 | 0.00% |
| native_call_scalar_tensor.out | 0.004 | 0.00% | 1 | 0 | 0.00% |
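The aggregate shares above can be reproduced directly from the reported timings; this small check uses only the three aggregate numbers copied from the table:

```python
# Sanity-check the aggregated shares from the decoder profile (times in ms).
TOTAL_MS = 138.49          # Method::execute
NON_DELEGATED_MS = 90.902  # OPERATOR_CALL (portable/ATen kernels)
DELEGATED_MS = 47.446      # DELEGATE_CALL (XNNPACK)

def share(part_ms, total_ms=TOTAL_MS):
    """Return a time share as a percentage, rounded to 2 decimals."""
    return round(part_ms / total_ms * 100, 2)

print(share(NON_DELEGATED_MS))  # 65.64 -> roughly two-thirds of decode time
print(share(DELEGATED_MS))      # 34.26
```

Note that a single un-delegated `mm.out` call already accounts for 51% of the total, so delegating the matmul path alone would recover most of the gap.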

Labels: improvement, model