
Kleidi 4b blockwise gemv prototype #997

Merged: 19 commits merged into main on Oct 11, 2024
Conversation

digantdesai (Contributor) commented Oct 2, 2024:

This integrates a couple of Neon dotprod Kleidi kernels with the TorchAO GEMM lower-level interface.

The op-level wiring is not part of this PR.

All tests pass for both kernels at a 1e-4 tolerance :)
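For context, a minimal sketch of the kind of element-wise tolerance check this implies (illustrative helper, not the PR's actual test harness):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: compare kernel output against a reference GEMM result
// element-wise with an absolute tolerance of 1e-4.
bool allclose(const std::vector<float>& actual,
              const std::vector<float>& expected,
              float atol = 1e-4f) {
  assert(actual.size() == expected.size());
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (std::fabs(actual[i] - expected[i]) > atol) {
      return false;
    }
  }
  return true;
}
```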

pytorch-bot commented Oct 2, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/997

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f6e22fb with merge base 7038f8b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 2, 2024
@digantdesai force-pushed the kleidi_prototype_gemv branch 2 times, most recently from 343f7d4 to fc0cd6d on October 8, 2024 01:53
@digantdesai marked this pull request as ready for review on October 8, 2024 03:48
@digantdesai force-pushed the kleidi_prototype_gemv branch from 61694eb to a3a49c6 on October 8, 2024 03:51
FetchContent_MakeAvailable(kleidiai)

# Disabled by default. Force enable if we are on a suitable system.
# TODO: Introduce ISA specific flags for i8mm.
Contributor:

Can you leave it disabled by default until we benchmark it against the existing kernel in torchchat? I want to make sure we don't regress torchchat perf.

Contributor (author):

This doesn't wire it up at the op level, we enable it only for armv8, and we only have dotprod kernels, so this should be OK. Before we add i8mm kernels we have to fix the CMake and also the op-level wiring.
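A minimal sketch of the compile-time gating being described, using the standard Arm ACLE feature macros (the macro name TORCHAO_KLEIDI_DOTPROD_AVAILABLE is hypothetical, not from the PR):

```cpp
// Only consider the Kleidi dotprod path on aarch64 builds that advertise the
// dotprod extension; i8mm kernels would additionally need an
// __ARM_FEATURE_MATMUL_INT8 check plus matching compiler flags in CMake.
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#define TORCHAO_KLEIDI_DOTPROD_AVAILABLE 1
#else
#define TORCHAO_KLEIDI_DOTPROD_AVAILABLE 0
#endif
```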

@@ -0,0 +1,124 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:

This file is very similar to the 1x4x32 one above. Do you think it's possible to reuse some code? Same comment for the next file.

Contributor (author):

yes! I want to lean on you c++ experts 😅

Contributor:

If you want to do this as a follow-up that's also OK, but I do agree that it can probably be structured differently, e.g. get_ukernel can be factored out to take the type of the kernel as an arg.
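A rough sketch of that suggestion, with a stand-in Ukernel struct and a hypothetical KernelVariant enum (the real Kleidi function-pointer table has many more entries):

```cpp
#include <cstddef>

// Stand-in for the Kleidi matmul ukernel function-pointer table used in this
// PR (the real one also carries get_kr, get_sr, run_matmul, and so on).
struct Ukernel {
  std::size_t (*get_m_step)();
  std::size_t (*get_n_step)();
};

// Hypothetical: the "type of the kernel" passed as an argument.
enum class KernelVariant {
  neon_dotprod_1x4x32,
  neon_dotprod_1x8x32,
};

// One get_ukernel() taking the variant as an arg, instead of a near-identical
// copy per header.
inline Ukernel get_ukernel(KernelVariant variant) {
  switch (variant) {
    case KernelVariant::neon_dotprod_1x4x32:
      return Ukernel{/* designated initializers for the 1x4x32 kernel */};
    case KernelVariant::neon_dotprod_1x8x32:
    default:
      return Ukernel{/* designated initializers for the 1x8x32 kernel */};
  }
}
```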

// #ifdef TORCHAO_ENABLE_KLEIDI
// TODO: Wire up the compile definition for TORCHAO_ENABLE_KLEIDI

template <int weight_nbit, bool has_weight_zeros, bool has_bias, bool has_clamp>
Contributor:

Is this templating needed for Kleidi?

Contributor (author):

Will remove weight_nbit and has_weight_zeros.

has_bias is something we will need. And I'll add new tests for has_clamp :P
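A sketch of what the slimmed-down signature might look like once weight_nbit and has_weight_zeros are removed (names and parameter list are illustrative, not the PR's actual wrapper):

```cpp
#include <limits>

// Illustrative only: with weight_nbit fixed at 4 and no weight zero points,
// only bias and clamp handling remain as compile-time flags.
template <bool has_bias, bool has_clamp>
void kernel_1x8x32(
    float* output, int m, int n, int k, int group_size,
    const void* packed_weight_data, const void* packed_activation_data,
    float clamp_min, float clamp_max) {
  // Without clamping, drive the kernel with the full float range.
  if constexpr (!has_clamp) {
    clamp_min = std::numeric_limits<float>::lowest();
    clamp_max = std::numeric_limits<float>::max();
  }
  // ... call the Kleidi ukernel's run_matmul with the packed buffers and the
  // resolved clamp bounds; for the Kleidi path, bias (if has_bias) is folded
  // in during weight packing rather than applied here ...
  (void)output; (void)m; (void)n; (void)k; (void)group_size;
  (void)packed_weight_data; (void)packed_activation_data;
  (void)clamp_min; (void)clamp_max;
}
```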

#endif // defined(__aarch64__) || defined(__ARM_NEON)

#include <torchao/experimental/ops/linear_8bit_act_xbit_weight/linear_8bit_act_xbit_weight.h>
#include <torchao/experimental/kernels/cpu/aarch64/kleidi/kai_matmul_clamp_f32_qai8dxp_qsi4c32p.h>
Contributor:

protect with TORCHAO_ENABLE_KLEIDI

Contributor (author):

I guess I can drop the op level completely.

Contributor (author):

dropped.

@@ -8,9 +8,11 @@

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <torchao/experimental/kernels/cpu/aarch64/linear/linear.h>
#include <torchao/experimental/kernels/cpu/aarch64/kleidi/pack.h>
Contributor:

protect with TORCHAO_ENABLE_KLEIDI
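A minimal sketch of the requested guard, assuming TORCHAO_ENABLE_KLEIDI eventually gets wired up as a compile definition (the TODO earlier in the thread notes it is not yet):

```cpp
#if defined(__aarch64__) || defined(__ARM_NEON)
#include <torchao/experimental/kernels/cpu/aarch64/linear/linear.h>
#if defined(TORCHAO_ENABLE_KLEIDI)
#include <torchao/experimental/kernels/cpu/aarch64/kleidi/pack.h>
#endif // TORCHAO_ENABLE_KLEIDI
#endif // defined(__aarch64__) || defined(__ARM_NEON)
```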

@digantdesai force-pushed the kleidi_prototype_gemv branch 2 times, most recently from 7a766df to b1e6f8e on October 10, 2024 15:19
kimishpatel (Contributor) left a comment:

Mine are mostly nits at this point

torchao/experimental/build_torchao_ops.sh (resolved)
# KleidiAI is an open-source library that provides optimized
# performance-critical routines, also known as micro-kernels, for artificial
# intelligence (AI) workloads tailored for Arm® CPUs.
FetchContent_Declare(kleidiai
Contributor:

Why add this as a build-time dependency instead of a third-party lib? Wait, I guess it's on GitLab?

Contributor (author):

Do you mean as opposed to a git submodule? Just to keep it simple for now.


#include <torchao/experimental/kernels/cpu/aarch64/kleidi/kai_matmul_clamp_f32_qai8dxp_qsi4c32p.h>

namespace torchao::kernels::cpu::aarch64::kleidi {
Contributor:

What is the C++ standard requirement for using a namespace like this? Just confirm that it is at least C++17.

Contributor (author):

CMake dictates we can assume C++17.
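For reference, the nested namespace definition used here is indeed a C++17 feature; under C++14 it would need the expanded form:

```cpp
// C++17 nested namespace definition (as written in this PR):
namespace torchao::kernels::cpu::aarch64::kleidi {
} // namespace torchao::kernels::cpu::aarch64::kleidi

// Pre-C++17 equivalent:
namespace torchao { namespace kernels { namespace cpu {
namespace aarch64 { namespace kleidi {
}}}}} // namespace torchao::kernels::cpu::aarch64::kleidi
```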

namespace neon_dotprod_1x8x32 {
const Ukernel get_ukernel() {
return Ukernel{
.get_m_step =
Contributor:

Also what are m/n step?

namespace torchao::kernels::cpu::aarch64::kleidi {
namespace kai_matmul_clamp_f32_qai8dxp_qsi4c32p {
namespace neon_dotprod_1x8x32 {
const Ukernel get_ukernel() {
Contributor:

For future: I presume you will have to parameterize this for different kernels?

Also, would it make sense to structure this in a way that this function moves to Kleidi?

Contributor (author):

Need to think some more to support (1) AoT/runtime weight packing and (2) per-CPU uArch-based ukernel selection. That logic would dictate how this interface looks. So I did something minimal here for the "prototype", but I agree we can improve.
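A rough sketch of the per-uArch selection half of that, with a stand-in Ukernel struct; the feature check here is a placeholder (real code would detect dotprod at runtime, e.g. via getauxval on Linux, rather than rely on compile-time macros alone):

```cpp
#include <cstddef>

// Stand-in for the Kleidi ukernel function-pointer table (see earlier sketch).
struct Ukernel {
  std::size_t (*get_m_step)();
  std::size_t (*get_n_step)();
};

// Placeholder feature check: true only if this translation unit was compiled
// with dotprod enabled; a real implementation would query the OS at runtime.
inline bool cpu_has_dotprod() {
#if defined(__ARM_FEATURE_DOTPROD)
  return true;
#else
  return false;
#endif
}

// Hypothetical selector: prefer the Neon dotprod kernel when available,
// otherwise fall back to a non-Kleidi kernel. An i8mm branch could be added
// later once those kernels land.
inline Ukernel select_ukernel() {
  if (cpu_has_dotprod()) {
    return Ukernel{/* neon_dotprod_1x8x32 entries */};
  }
  return Ukernel{/* non-Kleidi fallback entries */};
}
```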

int k,
int group_size,
const void* weight_data,
const void* activation_data,
Contributor:

Is this packed activation and packed weight? If so, maybe worth naming them as such.

clamp_max);
}

size_t get_alignement() {
Contributor:

unused

Contributor:

It's part of the high-level op interface. FYI, @digantdesai, a landing bootcamper diff [D63873383] renamed things to preferred alignment to address a BE/EE backlog task. So make sure you rebase and retest before landing.

Contributor (author):

Oh, this is for the op level, which can come in later diffs.

ukernel.get_kr(),
ukernel.get_sr(),
group_size,
kai_datatype::kai_dt_bf16);
Contributor:

why bf16?

Contributor:

I think the Kleidi kernel keeps scales as bf16 to save space.

Contributor (author):

yeah we asked Kleidi to do this :p
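For context, bf16 keeps only the top 16 bits of an fp32 value, so storing per-group scales as bf16 halves their footprint at a small precision cost. A minimal round-to-nearest-even conversion sketch (not Kleidi's actual packing code; NaN handling omitted):

```cpp
#include <cstdint>
#include <cstring>

// Convert an fp32 scale to bf16 by rounding to the upper 16 bits of the
// IEEE-754 bit pattern (round-to-nearest-even).
inline uint16_t f32_to_bf16(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits += 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>(bits >> 16);
}
```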

Comment on lines +17 to +19
#include <torchao/experimental/kernels/cpu/aarch64/kleidi/kai_matmul_clamp_f32_qai8dxp1x8_qsi4c32p4x8_1x4x32_neon_dotprod.h>
#include <torchao/experimental/kernels/cpu/aarch64/kleidi/kai_matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod.h>
Contributor:

so this should be behind TORCHAO_ENABLE_KLEIDI?

size_t n_groups = n * k / group_size;
auto weight_scales_bf16 = std::vector<uint16_t>(n_groups, 0);
for (size_t i = 0; i < n_groups; i++) {
assert(weight_zeros[i] == 0);
Contributor:

Maybe assert weight_zeros is a nullptr or all of its entries are zero?

Contributor (author):

can it be a nullptr?

Contributor:

Use nullptr to mean "no weight zeros", rather than creating a buffer of zeros.

Contributor (author):

Checking this only if it's not nullptr; otherwise it's unused.
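A sketch of the agreed behavior (the name and signature are illustrative): a null weight_zeros means "no zero points", and a non-null one must be all zeros for the symmetric Kleidi path:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative only: the Kleidi 4-bit kernels assume symmetric quantization,
// so zero points are either absent (nullptr) or all zero.
inline void check_weight_zeros(const int8_t* weight_zeros, std::size_t n_groups) {
  if (weight_zeros == nullptr) {
    return;  // nullptr means "no weight zeros"
  }
  for (std::size_t i = 0; i < n_groups; ++i) {
    assert(weight_zeros[i] == 0);
  }
}
```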

const float* bias,
float clamp_min,
float clamp_max) {
(void)bias; // unused - needs API fixing
Contributor:

That or it could be added in this wrapper after the ukernel.run_matmul call.

Contributor (author):

Not sure I follow, can you elaborate? Kleidi wants the bias in weight packing, not here.
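For reference, the alternative the reviewer describes would look roughly like this, applied to the output tile after the ukernel.run_matmul call (placeholder names; the PR instead passes bias to Kleidi's weight packing):

```cpp
#include <cstddef>

// Hypothetical post-matmul bias add: output is an m x n row-major tile and
// bias holds one value per output column (n of them).
inline void add_bias(float* output, const float* bias, int m, int n) {
  if (bias == nullptr) {
    return;  // no bias requested
  }
  for (int row = 0; row < m; ++row) {
    for (int col = 0; col < n; ++col) {
      output[row * n + col] += bias[col];
    }
  }
}
```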

@digantdesai force-pushed the kleidi_prototype_gemv branch from b42fd69 to e68a9e2 on October 10, 2024 18:53
@facebook-github-bot: @digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai force-pushed the kleidi_prototype_gemv branch from e68a9e2 to f9a68f9 on October 10, 2024 20:30
@facebook-github-bot: @digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai force-pushed the kleidi_prototype_gemv branch from f9a68f9 to f6e22fb on October 11, 2024 00:06
@facebook-github-bot: @digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot merged commit db72dd1 into main on Oct 11, 2024 (18 of 19 checks passed)
jainapurva pushed a commit that referenced this pull request Oct 15, 2024
Differential Revision: D64194844

Pull Request resolved: #997
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
As I was looking through the documentation, in `/docs/Models.md` I noticed one relative link, `docs/GGUF.md`, has a typo; it should be `GGUF.md`, so I changed it.
Labels: CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Projects: none
4 participants