
Conversation

@Aleksei-grovety
Contributor

The NPU has a restriction that weights must be constant, so the matrix multiplication operation was expressed using split, elementwise multiply, reduce sum, and concatenate operations.
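A minimal NumPy sketch (illustrative only, not the PR's actual Relay legalization) of the decomposition: the second operand is split into columns, each column is multiplied elementwise with the input, reduced over the contraction axis, and the per-column results are concatenated.

```python
import numpy as np

# Operands: A is the activation, B plays the role of non-constant "weights".
A = np.random.rand(1, 16).astype("float32")   # shape (1, K)
B = np.random.rand(16, 8).astype("float32")   # shape (K, N)

# Split B into its N columns.
columns = np.split(B, B.shape[1], axis=1)     # N arrays of shape (K, 1)

# For each column: elementwise multiply with A, then reduce-sum over K.
partial = [np.sum(A * col.T, axis=1, keepdims=True) for col in columns]

# Concatenate the per-column results into the (1, N) output.
out = np.concatenate(partial, axis=1)

assert np.allclose(out, A @ B, atol=1e-5)
```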

cc @lhutton1, @ekalda, @leandron

github-actions bot requested a review from lhutton1 on September 19, 2023 at 14:07
@Aleksei-grovety
Contributor Author

@tvm-bot rerun

Contributor

@lhutton1 left a comment


Thanks for the PR @Aleksei-grovety, it's an interesting idea. I'm curious how the performance compares to the CPU: is there a point at which falling back to the CPU becomes a more attractive option as the size of the matmul increases?

@Aleksei-grovety
Contributor Author

For a single matrix multiplication operation the performance decreases: for the 1x16@16x8 case on the high-performance (HP) subsystem we get 39.793 μs on the NPU versus 31.163 μs on the CPU, and the dependence on size appears to be linear, since for the 1x256@256x32 case we get 406.993 μs on the NPU versus 323.863 μs on the CPU. But in this situation, when there is only a matmul operation, the problem is that the reshape and concatenate operations are offloaded to the CPU. If we change the model input to a 1-D array and use split and reshape operations (in this case all operations are offloaded to the NPU), then for the 1x16@16x8 case we get 34.128 μs on the NPU (still worse than the CPU), but for the 1x256@256x32 case we get 111.208 μs on the NPU (almost three times better than the CPU). I think it's worth adding an option to enable offloading of the matmul operation to the NPU.
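A hypothetical NumPy sketch of the single 1-D input layout mentioned above (shapes and names are illustrative, and this shows only the data packing, not the TVM graph): both operands are flattened into one model input, and split/reshape inside the model recover them.

```python
import numpy as np

K, N = 16, 8
A = np.random.rand(1, K).astype("float32")
B = np.random.rand(K, N).astype("float32")

# Pack both operands into a single flat 1-D model input.
flat = np.concatenate([A.reshape(-1), B.reshape(-1)])

# Inside the model: split the flat input and reshape back to the operand shapes.
a_flat, b_flat = np.split(flat, [K])
A2 = a_flat.reshape(1, K)
B2 = b_flat.reshape(K, N)

assert np.allclose(A2 @ B2, A @ B)
```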

Contributor

@ekalda left a comment


Thanks @Aleksei-grovety, I think this is a really good improvement to the microNPU support! I suppose in the case where the model contains only a matmul, the reshape and concatenate are not offloaded because there is no consumer to inline them into? I'd expect this to be a special-case problem, since most real networks have other operators after the matmul.

@Aleksei-grovety
Contributor Author

I suppose in the case where the model contains only a matmul, the reshape and concatenate are not offloaded because there is no consumer to inline them into?

This is due to the model having two inputs and the preprocess_ext_io pass. Before that pass, all operations are offloaded to the NPU; the pass then adds the reshape and concatenate operations. These operations are not added to the composites because MergeComposite runs earlier.

Contributor

@ekalda left a comment


Thanks @Aleksei-grovety, LGTM!

@ekalda merged commit fdfd16c into apache:main on Oct 13, 2023
