Squashed commit of the following:

commit df29a4c
Author: ddavis-2015 <[email protected]>
Date:   Wed Oct 23 03:47:27 2024 -0700

    header file cleanup.

commit 7dc34a9
Author: ddavis-2015 <[email protected]>
Date:   Mon Oct 21 17:16:24 2024 -0700

    update to latest Cadence decompression code.

commit b43c16c
Author: ddavis-2015 <[email protected]>
Date:   Mon Oct 21 05:05:15 2024 -0700

    fix code style errors.

commit 2d825e3
Author: ddavis-2015 <[email protected]>
Date:   Mon Oct 21 04:09:39 2024 -0700

    fix code style errors.
    fix BUILD file style errors.

commit 459569a
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 20 19:17:43 2024 -0700

    use kernel optimizer level -O3 and -LNO:simd for Xtensa HIFI5

commit 122db20
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 20 19:07:19 2024 -0700

    add compression build/test to bazel default test script

commit d96b614
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 20 18:56:31 2024 -0700

    fix CI code style errors
    fix CI BUILD file style errors

commit 821dfdf
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 20 12:19:34 2024 -0700

    Cleanup header file usage.
    Add decompression code to Bazel BUILD files

commit 4a02b22
Author: ddavis-2015 <[email protected]>
Date:   Sat Oct 19 13:31:11 2024 -0700

    cleanup

commit 3d765e6
Author: ddavis-2015 <[email protected]>
Date:   Fri Oct 18 18:01:21 2024 -0700

    Squashed commit of the following:

    commit eaee851
    Author: ddavis-2015 <[email protected]>
    Date:   Fri Oct 18 17:48:48 2024 -0700

        Squashed commit of the following:

        commit 4894265
        Author: ddavis-2015 <[email protected]>
        Date:   Fri Oct 18 17:48:05 2024 -0700

            pre-merge empty commit

        commit a110e41
        Author: ddavis-2015 <[email protected]>
        Date:   Fri Oct 18 16:17:13 2024 -0700

            fix C++ bitwidth 6 & 7 decompression

        commit efedcc2
        Author: ddavis-2015 <[email protected]>
        Date:   Fri Oct 18 10:18:50 2024 -0700

            working decompression unit test

        commit 81ecf2e
        Author: ddavis-2015 <[email protected]>
        Date:   Thu Oct 17 18:17:06 2024 -0700

            decompression unit test improvements

        commit b318421
        Author: ddavis-2015 <[email protected]>
        Date:   Wed Oct 16 17:34:09 2024 -0700

            add decompression unit test

        commit 9bb2b63
        Author: ddavis-2015 <[email protected]>
        Date:   Sun Oct 13 18:34:01 2024 -0700

            cleanup

        commit 77bb05d
        Author: ddavis-2015 <[email protected]>
        Date:   Sun Oct 13 18:29:33 2024 -0700

            align compressed tensor data as per schema

        commit ad2b1c3
        Author: ddavis-2015 <[email protected]>
        Date:   Sat Oct 12 22:35:54 2024 -0700

            reduce HIFI5 decompression code size

        commit 99c6e35
        Author: ddavis-2015 <[email protected]>
        Date:   Fri Oct 11 14:02:58 2024 -0700

            revert to original Cadence bit width 4 code

        commit 2388549
        Author: ddavis-2015 <[email protected]>
        Date:   Thu Oct 10 17:50:29 2024 -0700

            refactor decompression code into reference and platform specific
            Apply some Xtensa acceleration code changes

        commit b84853c
        Author: ddavis-2015 <[email protected]>
        Date:   Tue Oct 8 16:08:55 2024 -0700

            testing

    commit c107f42
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 17 14:31:03 2024 -0500

        refactor: move misplaced TF_LITE_REMOVE_VIRTUAL_DELETEs to private:

        Move several TF_LITE_REMOVE_VIRTUAL_DELETE declarations that are
        wrongly in a public section of their classes. To have the intended
        effect, as documented in t/l/m/compatibility.h, these must be in a
        private section.
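
        A minimal sketch of the intended placement, assuming the usual
        definition of the macro in t/l/m/compatibility.h (the class name is
        illustrative, not one of the classes touched by this commit):

        #include "tensorflow/lite/micro/compatibility.h"

        class ExampleHandler {
         public:
          virtual ~ExampleHandler() = default;

         private:
          // Private placement, as t/l/m/compatibility.h requires; callers
          // outside the class then cannot reach the no-op operator delete.
          TF_LITE_REMOVE_VIRTUAL_DELETE
        };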

    commit 7b3a2bd
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 17 12:36:46 2024 -0500

        build(bazel): always build with TF_LITE_STATIC_MEMORY

        Add TF_LITE_STATIC_MEMORY to the defines set globally for TFLM builds in
        Bazel. TFLM always builds with this set in Make, and it appears to have
        been an oversight that it wasn't set during Bazel builds. Not having it
        set in Bazel caused some unit tests to pass under Bazel that failed
        under Make.

        At the same time, add -fno-exceptions. This flag is also always set in
        Make builds. Without it, setting TF_LITE_STATIC_MEMORY breaks the build.
        TF_LITE_STATIC_MEMORY triggers TF_LITE_REMOVE_VIRTUAL_DELETE in
        t/l/m/compatibility.h, which makes operator delete private in certain
        classes. When exceptions are enabled, a placement new with those classes
        is allowed to throw an exception, and operator delete is implicitly
        called during the unwind. The build breaks because operator delete can't
        be called if it's private. Disabling exceptions eliminates the unwind
        code that calls operator delete implicitly, and thus the build succeeds.

        In any case, -fno-exceptions should have been used in Bazel builds,
        matching the flags used in Make and the no-exceptions design requirement
        of the TFLM project.
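
        A compact illustration of the failure mode described above, with an
        illustrative class standing in for one whose
        TF_LITE_REMOVE_VIRTUAL_DELETE expands to a private operator delete
        (a sketch of the reasoning, not TFLM code):

        #include <new>

        class Widget {
         public:
          Widget() {}  // not noexcept, so construction may formally throw

         private:
          void operator delete(void*) {}  // what the macro adds under TF_LITE_STATIC_MEMORY
        };

        Widget* Emplace(void* arena) {
          // As described above: with exceptions enabled, the unwind path of
          // this placement new needs the (private, hence inaccessible)
          // operator delete and the build fails; with -fno-exceptions no
          // unwind cleanup is emitted and the expression compiles.
          return new (arena) Widget();
        }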

    commit 1eb4e0d
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 17 11:05:45 2024 -0500

        feat(python): don't check .sparsity in interpreter

        Remove the check for sparse tensors in the Python interpreter wrapper.
        This fixes a broken build when TF_LITE_STATIC_MEMORY is set, which
        should always be the case in TFLM. TfLiteTensor objects don't have a
        .sparsity member when TF_LITE_STATIC_MEMORY is set.

        This prepares for an upcoming commit setting TF_LITE_STATIC_MEMORY
        during Bazel builds. This hasn't caused build failures in Make builds,
        which have always set TF_LITE_STATIC_MEMORY, because Make builds don't
        build the Python interpreter wrapper.
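
        Roughly what this avoids, sketched with stand-in types (the real
        TfLiteTensor definitions live in the TFLite headers and differ in
        detail):

        // Under TF_LITE_STATIC_MEMORY the tensor struct omits the sparsity
        // member, so wrapper code reading tensor->sparsity cannot compile.
        struct FakeSparsity {};

        struct FakeTensor {
        #ifndef TF_LITE_STATIC_MEMORY
          const FakeSparsity* sparsity;  // present only in the full struct
        #endif
          // ... other members elided ...
        };

        bool HasSparsity(const FakeTensor* tensor) {
        #ifndef TF_LITE_STATIC_MEMORY
          return tensor->sparsity != nullptr;  // the kind of check removed here
        #else
          return false;  // TFLM: no sparse tensors, nothing to check
        #endif
        }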

    commit 7217095
    Author: Ryan Kuester <[email protected]>
    Date:   Wed Oct 16 14:03:25 2024 -0500

        fix(memory_arena_threshold): with TF_LITE_STATIC_MEMORY

        Fix the broken build due to redefinition of the threshold when
        TF_LITE_STATIC_MEMORY is set. Apparently this case isn't triggered in
        any Bazel test, only in Make.

        Simplify the threshold specification by only depending on whether
        compression is enabled and not also on whether TF_LITE_STATIC_MEMORY is
        in use.
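
        The shape of the simplified specification, with placeholder names and
        values (USE_TFLM_COMPRESSION is the define visible elsewhere in this
        change; the test's actual constants are its own measured numbers):

        #include <cstddef>

        // Placeholders standing in for the measured allocation sizes.
        constexpr std::size_t kMeasuredWithCompression = 1;
        constexpr std::size_t kMeasuredWithoutCompression = 1;

        // One axis of variation only: whether compression is compiled in.
        // There is no longer a second set of #ifdefs on TF_LITE_STATIC_MEMORY
        // that could redefine the threshold.
        #ifdef USE_TFLM_COMPRESSION
        constexpr std::size_t kExpectedPersistentSize = kMeasuredWithCompression;
        #else
        constexpr std::size_t kExpectedPersistentSize = kMeasuredWithoutCompression;
        #endif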

    commit 8e4e55e
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 12:38:03 2024 -0500

        build(bazel): disable codegen when building --//:with_compression

        The codegen prototype code is not compatible with the changes which
        implement model compression made to the core TFLM components. For now,
        disable codegen targets when building with compression enabled.

    commit 884a234
    Author: Ryan Kuester <[email protected]>
    Date:   Tue Oct 15 18:31:01 2024 -0500

        build(bazel): compile in compression when --//:with_compression

        Conditionally compile in support for compressed tensors when the option
        --//:with_compression is given.

    commit a1d459b
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 12:28:39 2024 -0500

        build(bazel): add --//with_compression build setting

        Add a --//with_compression user-defined build setting and a
        corresponding configuration setting.

    commit 4edc564
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 12:24:53 2024 -0500

        build(bazel): fix compression-related dependencies of micro_allocator

    commit a52f97f
    Author: Ryan Kuester <[email protected]>
    Date:   Tue Oct 15 17:28:09 2024 -0500

        build(bazel): replace cc_* with tflm_cc_* in remaining TFLM code

        Replace cc_* targets remaining in TFLM code with tflm_cc_* targets.
        These are targets which did not formerly use the common copts. Avoid
        changing imported TFLite code, if for no other reason than to avoid
        merge conflicts during the automatic sync with upstream TFLite.

    commit a6368f4
    Author: Ryan Kuester <[email protected]>
    Date:   Fri Oct 11 16:08:34 2024 -0500

        build(bazel): introduce tflm_cc_* macros, refactoring away micro_copts

        Remove micro_copts() by replacing every cc_* target that used
        them with a tflm_cc_* equivalent, and setting those common copts in one
        place, inside the tflm_cc_* macro.

        This is the first of several commits introducing tflm_cc_* macros in
        place of cc_binary, cc_library, and cc_test. Motivated by the upcoming
        need to support conditional compilation, the objective is to centralize
        build configuration rather than requiring (and remembering) that each
        cc_* target in the project add the same common attributes such as
        compiler options and select()ed #defines.

        Alternatives such as setting global options on the command line or in
        .bazelrc, even if simplified with a --config option, fail to preserve
        flags and hooks for configuration in the case where TFLM is used as an
        external repository by an application project. Nor is it easy in that
        case for individual targets to override an otherwise global setting.

    commit 1518422
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 23:56:49 2024 -0500

        chore: remove obsolete ci/temp_patches

        Remove ci/temp_patches, which was obsoleted in 23f608f once it
        was no longer used by the sync script. It should have been
        deleted then.

        Remove it not only to clean up dead code, but because it contains
        a reference to `micro_copts`, which is about to be refactored
        away, and we don't want to leave stray references to it in the
        tree.

    commit 18ef080
    Author: Ryan Kuester <[email protected]>
    Date:   Tue Oct 8 17:58:12 2024 -0500

        refactor: use metadata_saved.h instead of metadata_generated.h

        Use the generated file metadata_saved.h instead of metadata_generated.h
        for the reasons explained in t/l/m/compression/BUILD:metadata_saved.
        Delete metadata_generated.h from the source tree as it is not
        maintained.

    commit 5a02e30
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 13:46:46 2024 -0500

        test(memory_arena_threshold): adjust expected value with compression

        Fix a test failure by setting a different expected value for the
        persistent buffer allocation when compression is configured in. The
        allocation was allowed to vary by 3%; however, compression adds ~10%.
        Set the expected value to the measured value for that configuration.

    commit 01bc582
    Author: Ryan Kuester <[email protected]>
    Date:   Thu Oct 10 13:35:10 2024 -0500

        test(memory_arena_threshold): don't expect exact allocation values

        Remove the check that allocation sizes exactly match expected values.
        This check immediately followed a check that sizes are within a certain
        percentage, which seems to be the true intent of the test, and thus
        rendered that percentage check pointless.

    commit e0aae77
    Merge: e328029 e86d97b
    Author: Ryan Kuester <[email protected]>
    Date:   Wed Oct 16 13:39:56 2024 -0500

        Merge branch 'main' into compress-testing

    commit e328029
    Author: Ryan Kuester <[email protected]>
    Date:   Mon Oct 7 12:52:23 2024 -0500

        build(bazel): fix dependencies in work-in-progress compression code

        In the Bazel build, add dependencies needed by the code added to
        t/l/m:micro_context for decompression. The Bazel build with or without
        compression was broken without this.

    commit e86d97b
    Author: RJ Ascani <[email protected]>
    Date:   Mon Oct 7 10:36:26 2024 -0700

        Replace rascani with suleshahid on OWNERS (tensorflow#2715)

        BUG=none

    commit b773428
    Author: Ryan Kuester <[email protected]>
    Date:   Fri Oct 4 09:59:10 2024 -0500

        feat(compression): add work-in-progress compression and viewer tools

    commit f6bd486
    Merge: 487c17a e3f6dc1
    Author: Ryan Kuester <[email protected]>
    Date:   Fri Oct 4 09:36:24 2024 -0500

        Merge branch 'main' into compress-prerelease

    commit e3f6dc1
    Author: David Davis <[email protected]>
    Date:   Thu Oct 3 10:45:00 2024 -0700

        Compression documentation (tensorflow#2711)

        @tensorflow/micro

        Add documentation describing some compression/decompression internals and makefile build procedures.

        bug=tensorflow#2710

    commit b3967a9
    Author: Ryan Kuester <[email protected]>
    Date:   Wed Oct 2 13:36:01 2024 -0500

        style: add .style.yapf to control yapf styling of Python code (tensorflow#2709)

        Add a .style.yapf file so yapf can be used to style Python code without
        passing the project's style via command line option. Remove the
        corresponding patch to pigweed's call to yapf, used by CI, and instead
        let it too rely on .style.yapf. Remove the developer documentation's
        instruction to use the command line option.

        BUG=description

    commit d249577
    Author: Ryan Kuester <[email protected]>
    Date:   Tue Oct 1 16:16:45 2024 -0500

        build(codegen): suppress noise in console output (tensorflow#2708)

        Add a --quiet option to the code_generator binary so that when it's used
        within the build system, it doesn't print unexpected, distracting noise
        to the console. Generally, compiler or generator commands don't print
        output unless there's an error.

        BUG=description

commit 4894265
Author: ddavis-2015 <[email protected]>
Date:   Fri Oct 18 17:48:05 2024 -0700

    pre-merge empty commit

commit a110e41
Author: ddavis-2015 <[email protected]>
Date:   Fri Oct 18 16:17:13 2024 -0700

    fix C++ bitwidth 6 & 7 decompression

commit efedcc2
Author: ddavis-2015 <[email protected]>
Date:   Fri Oct 18 10:18:50 2024 -0700

    working decompression unit test

commit 81ecf2e
Author: ddavis-2015 <[email protected]>
Date:   Thu Oct 17 18:17:06 2024 -0700

    decompression unit test improvements

commit b318421
Author: ddavis-2015 <[email protected]>
Date:   Wed Oct 16 17:34:09 2024 -0700

    add decompression unit test

commit 9bb2b63
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 13 18:34:01 2024 -0700

    cleanup

commit 77bb05d
Author: ddavis-2015 <[email protected]>
Date:   Sun Oct 13 18:29:33 2024 -0700

    align compressed tensor data as per schema

commit ad2b1c3
Author: ddavis-2015 <[email protected]>
Date:   Sat Oct 12 22:35:54 2024 -0700

    reduce HIFI5 decompression code size

commit 99c6e35
Author: ddavis-2015 <[email protected]>
Date:   Fri Oct 11 14:02:58 2024 -0700

    revert to original Cadence bit width 4 code

commit 2388549
Author: ddavis-2015 <[email protected]>
Date:   Thu Oct 10 17:50:29 2024 -0700

    refactor decompression code into reference and platform specific
    Apply some Xtensa acceleration code changes

commit b84853c
Author: ddavis-2015 <[email protected]>
Date:   Tue Oct 8 16:08:55 2024 -0700

    testing
ddavis-2015 committed Oct 23, 2024
1 parent 052a6b8 commit b7bc438
Showing 4 changed files with 255 additions and 53 deletions.
4 changes: 0 additions & 4 deletions tensorflow/lite/micro/compression/BUILD
@@ -1,7 +1,3 @@
load("//tensorflow/lite/micro:build_def.bzl",
"tflm_cc_library",
"tflm_cc_test",
)
load(
"//tensorflow/lite/micro:build_def.bzl",
"tflm_cc_library",
5 changes: 5 additions & 0 deletions tensorflow/lite/micro/kernels/decompress.h
@@ -13,6 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#ifndef TENSORFLOW_LITE_MICRO_MICRO_KERNELS_DECOMPRESS_H_
#define TENSORFLOW_LITE_MICRO_MICRO_KERNELS_DECOMPRESS_H_

#include <cstdint>

#include "tensorflow/lite/micro/compression.h"
@@ -82,3 +85,5 @@ struct DecompressionState {
#endif // USE_TFLM_COMPRESSION

} // namespace tflite

#endif // TENSORFLOW_LITE_MICRO_MICRO_KERNELS_DECOMPRESS_H_
295 changes: 250 additions & 45 deletions tensorflow/lite/micro/kernels/xtensa/decompress.cc
@@ -43,16 +43,15 @@ struct DecompressionStateXtensa : DecompressionState {
: DecompressionState(other) {}

void DecompressToBufferWidth4_Xtensa(int8_t* buffer);
void DecompressToBufferWidth4_Xtensa_Old(int8_t* buffer);
void DecompressToBufferWidth3_Xtensa(int8_t* buffer);
void DecompressToBufferWidth2_Xtensa(int8_t* buffer);

void DecompressToBufferWidthAnyInt8_Xtensa(int8_t* buffer);
void DecompressToBufferWidthAnyInt16_Xtensa(int16_t* buffer);
void DecompressToBufferWidthAnyInt32_Xtensa(int32_t* buffer);
void DecompressToBufferWidthAnyInt64_Xtensa(int64_t* buffer);
};

// TODO(ddavis-2015): unaligned/stride code has error, method not currently
// used.
void DecompressionStateXtensa::DecompressToBufferWidth4_Xtensa(int8_t* buffer) {
ScopedMicroProfiler scoped_profiler(__func__, micro_profiler_);

@@ -76,6 +75,8 @@ void DecompressionStateXtensa::DecompressToBufferWidth4_Xtensa(int8_t* buffer) {

const uint8_t* __restrict value_table_t = value_table;

ae_valignx2 align_store = AE_ZALIGN128();

for (size_t i = 0; i < num_channels_; i++) {
value_table_t = value_table;
ae_valignx2 align_vtab = AE_LA128_PP(value_table_t);
@@ -84,7 +85,6 @@ void DecompressionStateXtensa::DecompressToBufferWidth4_Xtensa(int8_t* buffer) {
AE_DSEL8X8(d_value_0, d_value_1, d_value_0_t, d_value_1_t,
d_shuffle_value_t);

ae_valignx2 align_store = AE_ZALIGN128();
ae_valign align_load = AE_LA64_PP(pIn_tmp);

for (j = 0; j < elements_per_channel_t_by_4; j++) {
Expand All @@ -95,57 +95,257 @@ void DecompressionStateXtensa::DecompressToBufferWidth4_Xtensa(int8_t* buffer) {
}

value_table += stride;

ae_valignx2 align_index = AE_LA128_PP(pIn_tmp);
AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
(elements_per_channel_t_rem >>
1)); /* Loading 48 bits for decoding 16 weight values */
AE_DSEL8X8(d_out1, d_out2, d_value_0, d_value_1, d_index);
AE_DSEL8X8(d_out1, d_out2, d_out1, d_out2, d_shuffle_t);
AE_SAV8X8X2_XP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp,
elements_per_channel_t_rem);
AE_SA128POS_FP(align_store, (ae_int8x16*)p_out_tmp);
if (elements_per_channel_t_rem) {
ae_valignx2 align_index = AE_LA128_PP(pIn_tmp);
AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
(elements_per_channel_t_rem >>
1)); /* Loading 48 bits for decoding 16 weight values */
AE_DSEL8X8(d_out1, d_out2, d_value_0, d_value_1, d_index);
AE_DSEL8X8(d_out1, d_out2, d_out1, d_out2, d_shuffle_t);
AE_SAV8X8X2_XP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp,
elements_per_channel_t_rem);
}
}
AE_SA128POS_FP(align_store, (ae_int8x16*)p_out_tmp);
}

void DecompressionStateXtensa::DecompressToBufferWidth4_Xtensa_Old(
int8_t* buffer) {
void DecompressionStateXtensa::DecompressToBufferWidth3_Xtensa(int8_t* buffer) {
ScopedMicroProfiler scoped_profiler(__func__, micro_profiler_);

char shuffle_pattern_1[8] = {0x08, 0x19, 0x2A, 0x3B, 0x4C, 0x5D, 0x6E, 0x7F};
ae_int8x8 d_shuffle_t = *(ae_int8x8*)&shuffle_pattern_1[0];
int i, j;
ae_int8* __restrict p_out_tmp = (ae_int8*)buffer;
ae_int8x8* pIn_tmp = (ae_int8x8*)compressed_indices_;
const uint8_t* __restrict value_table =
static_cast<const uint8_t*>(comp_data_.data.lut_data->value_table);

const uint8_t* __restrict value_table_t = value_table;

char shuffle_pattern_2[8] = {0xFB, 0x73, 0xEA, 0x62, 0xD9, 0x51, 0xC8, 0x40};
ae_int8x8 d_d_shuffle_t2 = *(ae_int8x8*)&shuffle_pattern_2[0];
int num_channels_t = num_channels_;
const size_t stride = comp_data_.data.lut_data->value_table_channel_stride;

int elements_per_channel_t_by_4 = elements_per_channel_ >> 4;
int elements_per_channel_t_rem = elements_per_channel_ & 0xF;

ae_int8x8 d_index, d_dummy;
ae_int8x8 d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11;
ae_int8x8 d_out1, d_out2;
ae_int8x8 d_value_0, d_value_1;
ae_int8x8 d_index;

int elements_per_channel_t = elements_per_channel_;
int num_channels_t = num_channels_;
ae_int8x8* __restrict pIn_tmp = (ae_int8x8*)compressed_indices_;
ae_int8* __restrict p_out_tmp = (ae_int8*)buffer;
ae_valignx2 align_index = AE_LA128_PP(pIn_tmp);

const size_t stride = comp_data_.data.lut_data->value_table_channel_stride;
ae_int8x8 d_shuffle_value_t = AE_MOVINT8X8_FROMINT64(0x08192A3B4C5D6E7FLL);
ae_int8x8 d_shuffle_t1 = AE_MOVINT8X8_FROMINT64(0x0F00050C00020000LL);
ae_int8x8 d_shuffle_t2 = AE_MOVINT8X8_FROMINT64(0x000E00040B000100LL);
ae_int8x8 d_shuffle_t3 = AE_MOVINT8X8_FROMINT64(0x0F060D040C030A01LL);
ae_int8x8 d_shuffle_t = AE_MOVINT8X8_FROMINT64(0xFB73EA62D951C840LL);

ae_valignx2 align_store = AE_ZALIGN128();

for (i = 0; i < num_channels_t; i++) {
ae_int8x8 d_value_0 = AE_MOVINT8X8_FROMINT64(AE_ZERO());
ae_int8x8 d_value_1 = AE_MOVINT8X8_FROMINT64(AE_ZERO());

value_table_t = value_table;

ae_valign align_vtab = AE_LA64_PP(value_table_t);
AE_LA8X8_IP(d_value_0, align_vtab, (ae_int8x8*)value_table_t);
AE_DSEL8X8(d_value_0, d_value_1, d_value_0, d_value_1, d_shuffle_value_t);

for (j = 0; j < elements_per_channel_t_by_4; j++) {
AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
6); /* Loading 48 bits for decoding 16 weight values */

d1 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 1));
d2 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 2));
d3 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 3));
d4 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 4));

d1 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d1), 0x7007007007000000LL));
d2 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d2), 0x0700700700700000LL));
d3 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d3), 0x0070070070070000LL));
d4 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d4), 0x0007007007007000LL));

d5 = d1 | d2;
d6 = d3 | d4;

d7 = AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d5), 4));
d8 = AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d6), 4));

d9 = AE_SEL8X8(d5, d7, d_shuffle_t1);
d10 = AE_SEL8X8(d6, d8, d_shuffle_t2);
d11 = AE_SEL8X8(d9, d10, d_shuffle_t3);

AE_DSEL8X8(d_out1, d_out2, d_value_0, d_value_1, d11);
AE_DSEL8X8(d_out1, d_out2, d_out1, d_out2, d_shuffle_t);

AE_SA8X8X2_IP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp);
}
if (elements_per_channel_t_rem) {
AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
3); /* Loading 48 bits for decoding 16 weight values */

d1 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 1));
d2 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 2));
d3 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 3));
d4 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 4));

d1 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d1), 0x7007007007000000LL));
d2 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d2), 0x0700700700700000LL));
d3 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d3), 0x0070070070070000LL));
d4 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d4), 0x0007007007007000LL));

d5 = d1 | d2;
d6 = d3 | d4;

d7 = AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d5), 4));
d8 = AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d6), 4));

d9 = AE_SEL8X8(d5, d7, d_shuffle_t1);
d10 = AE_SEL8X8(d6, d8, d_shuffle_t2);
d11 = AE_SEL8X8(d9, d10, d_shuffle_t3);

AE_DSEL8X8(d_out1, d_out2, d_value_0, d_value_1, d11);
AE_DSEL8X8(d_out1, d_out2, d_out1, d_out2, d_shuffle_t);

AE_SAV8X8X2_XP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp,
elements_per_channel_t_rem);
}

value_table = value_table + stride;
}
AE_SA128POS_FP(align_store, (ae_int8x16*)p_out_tmp);
}

void DecompressionStateXtensa::DecompressToBufferWidth2_Xtensa(int8_t* buffer) {
ScopedMicroProfiler scoped_profiler(__func__, micro_profiler_);

int i, j;
ae_int8* __restrict p_out_tmp = (ae_int8*)buffer;
ae_int8x8* pIn_tmp = (ae_int8x8*)compressed_indices_;
const uint8_t* __restrict value_table =
static_cast<const uint8_t*>(comp_data_.data.lut_data->value_table);

for (int i = 0; i < num_channels_t; i++) {
ae_int8x8 d_value_0_t = *(ae_int8x8*)&value_table[0];
ae_int8x8 d_value_1_t = *(ae_int8x8*)&value_table[8];
const uint8_t* __restrict value_table_t = value_table;

AE_DSEL8X8(d_value_0, d_value_1, d_value_0_t, d_value_1_t, d_shuffle_t);
int num_channels_t = num_channels_;
const size_t stride = comp_data_.data.lut_data->value_table_channel_stride;

for (int j = 0; j < elements_per_channel_t; j += 16) {
AE_L8X8_IP(d_index, pIn_tmp, 8);
AE_DSEL8X8(d_out1, d_out2, d_value_0, d_value_1, d_index);
AE_DSEL8X8(d_out1, d_out2, d_out1, d_out2, d_d_shuffle_t2);
AE_S8X8X2_IP(d_out1, d_out2, (ae_int8x16*)p_out_tmp, 16);
int elements_per_channel_t_by_5 = elements_per_channel_ >> 5;
int elements_per_channel_t_rem = elements_per_channel_ & 0x1F;
int elements_per_channel_t_rem_minus_16 = 0;
if (elements_per_channel_t_rem > 16) {
elements_per_channel_t_rem_minus_16 = elements_per_channel_t_rem - 16;
}

ae_int8x8 d_index, d_dummy;
ae_int8x8 d0, d1, d2, d3, d4, d5;
ae_int8x8 q0, q1, q2, q3;
ae_int8x8 d_out1, d_out2;

ae_valignx2 align_index = AE_LA128_PP(pIn_tmp);

ae_int8x8 d_shuffle_value_t = AE_MOVINT8X8_FROMINT64(0x08192A3B4C5D6E7FLL);
ae_int8x8 d_shuffle_t1 = AE_MOVINT8X8_FROMINT64(0xFB73EA62D951C840LL);
ae_int8x8 d_shuffle_t2 = AE_MOVINT8X8_FROMINT64(0xFBEA7362D9C85140LL);

ae_valignx2 align_store = AE_ZALIGN128();

for (i = 0; i < num_channels_t; i++) {
ae_int8x8 d_value_0 = AE_MOVINT8X8_FROMINT64(AE_ZERO());
ae_int8x8 d_value_1 = AE_MOVINT8X8_FROMINT64(AE_ZERO());

value_table_t = value_table;

ae_valign align_vtab = AE_LA64_PP(value_table_t);
AE_LA8X8_IP(d_value_0, align_vtab, (ae_int8x8*)value_table_t);
AE_DSEL8X8(d_value_0, d_value_1, d_value_0, d_value_1, d_shuffle_value_t);

for (j = 0; j < elements_per_channel_t_by_5; j++) {
// AE_LA8X8_IP( d_index, align_index, pIn_tmp ); /* Loading 64 bits
// for decoding 32 weight values */

AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
8); /* Loading 64 bits for decoding 32 weight values */
d0 = d_index;
d1 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 2));

d2 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d0),
0x3333333333333333LL)); // i1,i3,i5, ....
d3 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d1),
0x3333333333333333LL)); // i0,i2,i4, ....

AE_DSEL8X8(d4, d5, d3, d2,
d_shuffle_t1); // d4 = i0,i2,i1,i3,i4,i6,... d5 =
// i16,i18, i17,i19, ....

AE_DSEL8X8(q0, q1, d_value_0, d_value_1,
d4); // q0 = 0,1,4,5,8,9,12,13 q1 = 2,3,6,7,10,11,14,15
AE_DSEL8X8(
q2, q3, d_value_0, d_value_1,
d5); // q2 = 16,17,20,21,24,25,28,29 q3 = 18,19,22,23,26,27,30,31

AE_DSEL8X8(d_out1, d_out2, q0, q1, d_shuffle_t2);
AE_SA8X8X2_IP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp);

AE_DSEL8X8(d_out1, d_out2, q2, q3, d_shuffle_t2);
AE_SA8X8X2_IP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp);
}
if (elements_per_channel_t_rem) {
AE_LAV8X8X2_XP(d_index, d_dummy, align_index, (ae_int8x16*)pIn_tmp,
(elements_per_channel_t_rem >>
2)); /* Loading 48 bits for decoding 16 weight values */
d0 = d_index;
d1 =
AE_MOVINT8X8_FROMINT64(AE_SRLI64(AE_MOVINT64_FROMINT8X8(d_index), 2));
d2 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d0),
0x3333333333333333LL)); // i1,i3,i5, ....
d3 = AE_MOVINT8X8_FROMINT64(
AE_AND64(AE_MOVINT64_FROMINT8X8(d1),
0x3333333333333333LL)); // i0,i2,i4, ....

AE_DSEL8X8(d4, d5, d3, d2,
d_shuffle_t1); // d4 = i0,i2,i1,i3,i4,i6,... d5 =
// i16,i18, i17,i19, ....

AE_DSEL8X8(q0, q1, d_value_0, d_value_1,
d4); // q0 = 0,1,4,5,8,9,12,13 q1 = 2,3,6,7,10,11,14,15
AE_DSEL8X8(
q2, q3, d_value_0, d_value_1,
d5); // q2 = 16,17,20,21,24,25,28,29 q3 = 18,19,22,23,26,27,30,31

AE_DSEL8X8(d_out1, d_out2, q0, q1, d_shuffle_t2);

AE_SAV8X8X2_XP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp,
elements_per_channel_t_rem);

AE_DSEL8X8(d_out1, d_out2, q2, q3, d_shuffle_t2);

AE_SAV8X8X2_XP(d_out1, d_out2, align_store, (ae_int8x16*)p_out_tmp,
elements_per_channel_t_rem_minus_16);
}

value_table += stride;
value_table = value_table + stride;
}
AE_SA128POS_FP(align_store, (ae_int8x16*)p_out_tmp);
}

void DecompressionStateXtensa::DecompressToBufferWidthAnyInt8_Xtensa(
@@ -407,20 +607,25 @@ int8_t* DecompressionState::DecompressToBuffer<int8_t>(void* buffer) {

if (comp_data_.data.lut_data->compressed_bit_width == 4 &&
!comp_data_.data.lut_data->use_alternate_axis) {
if (!(elements_per_channel_ & 0x0F) &&
comp_data_.data.lut_data->value_table_channel_stride == 16) {
dsx.DecompressToBufferWidth4_Xtensa_Old(static_cast<int8_t*>(buffer));
if (!(elements_per_channel_ & 0x01)) {
dsx.DecompressToBufferWidth4_Xtensa(static_cast<int8_t*>(buffer));
} else {
dsx.DecompressToBufferWidth4_16(static_cast<int8_t*>(buffer));
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
}
} else if (comp_data_.data.lut_data->compressed_bit_width == 3 &&
!comp_data_.data.lut_data->use_alternate_axis) {
// TODO(ddavis-2015): placeholder
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
if (!(elements_per_channel_ & 0x07)) {
dsx.DecompressToBufferWidth3_Xtensa(static_cast<int8_t*>(buffer));
} else {
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
}
} else if (comp_data_.data.lut_data->compressed_bit_width == 2 &&
!comp_data_.data.lut_data->use_alternate_axis) {
// TODO(ddavis-2015): placeholder
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
if (!(elements_per_channel_ & 0x03)) {
dsx.DecompressToBufferWidth2_Xtensa(static_cast<int8_t*>(buffer));
} else {
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
}
} else {
dsx.DecompressToBufferWidthAnyInt8_Xtensa(static_cast<int8_t*>(buffer));
}
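
For orientation, a plain C++ sketch of the lookup-table decompression these
Xtensa kernels accelerate, reconstructed from the member names above
(value_table, value_table_channel_stride, elements_per_channel_, num_channels_).
The MSB-first packing shown is an assumption, and alternate-axis ordering is
omitted; the actual TFLM reference implementation may differ in detail:

#include <cstddef>
#include <cstdint>

// Each weight is a bit_width-bit index into its channel's slice of the value
// table; the specialized width-4/3/2 kernels above vectorize this inner loop
// and fall back to DecompressToBufferWidthAnyInt8_Xtensa when the channel
// length does not meet a fast path's alignment requirement.
inline void DecompressLutReference(const uint8_t* compressed_indices,
                                   const int8_t* value_table,
                                   size_t value_table_channel_stride,
                                   size_t num_channels,
                                   size_t elements_per_channel, int bit_width,
                                   int8_t* buffer) {
  size_t bit_pos = 0;  // position within the packed index bitstream
  for (size_t channel = 0; channel < num_channels; ++channel) {
    const int8_t* table = value_table + channel * value_table_channel_stride;
    for (size_t i = 0; i < elements_per_channel; ++i) {
      uint32_t index = 0;
      for (int b = 0; b < bit_width; ++b, ++bit_pos) {
        index = (index << 1) |
                ((compressed_indices[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1u);
      }
      *buffer++ = table[index];
    }
  }
}
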
4 changes: 0 additions & 4 deletions tensorflow/lite/micro/testing/BUILD
@@ -1,9 +1,5 @@
load("@rules_python//python:defs.bzl", "py_binary", "py_library")
load("@tflm_pip_deps//:requirements.bzl", "requirement")
load("//tensorflow/lite/micro:build_def.bzl",
"tflm_cc_library",
"tflm_cc_test",
)
load(
"//tensorflow/lite/micro:build_def.bzl",
"tflm_cc_library",
