Implement multithreading in qgemm_kleidi #26301

melkap01-Arm · 2025-10-14T14:33:35Z

Key changes

This PR improves the performance of dynamic QGEMMs by implementing tiling and threading across operations.

The changes introduce thread-local buffers for reusing memory during inference and use them in dynamic quantised MatMul operations backed by KleidiAI kernels.

The PR also updates KleidiAI to version 1.15.0.

Example performance

Single thread: (benchmark image not preserved)

2 threads: (benchmark image not preserved)

melkap01-Arm · 2025-10-14T14:55:37Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2025-10-14T16:47:50Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-14T16:48:09Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-16T17:09:19Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-16T17:09:37Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-24T20:10:26Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-24T20:10:45Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-24T20:31:35Z

cmake/deps.txt

 cudnn_frontend;https://github.com/NVIDIA/cudnn-frontend/archive/refs/tags/v1.12.0.zip;7e733cfdc410d777b76122d64232499205589a96
 dawn;https://github.com/google/dawn/archive/13c1635a14574ebb7116b56a69f5519301417fda.zip;0aadd28fc385cf7d657d5fc70a352372d2d3c76a
-kleidiai;https://github.com/ARM-software/kleidiai/archive/refs/tags/v1.10.0.tar.gz;11b62149cb2514b3b9069cc435c3aa7a4e82b97a
+kleidiai;https://github.com/ARM-software/kleidiai/archive/refs/tags/v1.15.0.tar.gz;62ccd24ab60bcef68766440fb42d79071ac2a5d2


With this update of the KAI version from 1.10 to 1.15, can SME/SME2 detection be enabled on Windows too, so that the kernels can be leveraged there?

https://github.com/microsoft/onnxruntime/pull/25187/files#r2223006773
https://github.com/microsoft/onnxruntime/pull/25760/files#r2325260570

patryk-kaiser-ARM · 2025-10-28T10:48:42Z

Can we get the workflows run, please?

hariharans29 · 2025-10-28T16:19:52Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-28T16:20:12Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-28T17:58:42Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+
+        g_kai_tls_qgemm.lhs_packed.reserve(LhsPackedStride * BatchSize);
+    }
+    g_kai_tls_qgemm.lhs_packed.resize(LhsPackedStride * BatchSize);


Can't we just do the resizing directly instead of reserve + resize ?

Yes. Both reserve() + resize() and resize() alone end up with one allocation plus one initialisation. There is, however, a very small performance difference between separating allocation from initialisation and doing both at once with resize(). (In the results, "after" is the case where the reserve() calls are removed and only resize() is used.)
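To make the point about resize() concrete, here is a minimal sketch of a grow-only, thread-local scratch buffer. The names `TlsScratch`, `g_tls_scratch`, and `AcquireLhsScratch` are hypothetical stand-ins, not the PR's actual identifiers. It relies only on standard `std::vector` behaviour: resize() reallocates when the requested size exceeds the current capacity, and shrinking never releases storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's thread-local buffer struct.
struct TlsScratch {
    std::vector<std::byte> lhs_packed;
};

// One instance per thread; its storage persists across calls on that thread.
thread_local TlsScratch g_tls_scratch;

// Returns a pointer to at least `bytes` bytes of per-thread scratch memory.
// resize() reallocates only when `bytes` exceeds the current capacity, so a
// preceding reserve() call is not required for correctness. Elements added by
// a growing resize() are value-initialised (zeroed for std::byte).
inline std::byte* AcquireLhsScratch(std::size_t bytes) {
    g_tls_scratch.lhs_packed.resize(bytes);
    return g_tls_scratch.lhs_packed.data();
}
```

With stable problem shapes, repeated calls on the same thread return the same storage without touching the allocator, which is the behaviour the thread-local buffers in this PR are after.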

hariharans29 · 2025-10-28T19:41:45Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+                g_kai_tls_qgemm.output_tile.reserve(tile_elems);
+            }
+            // resize the tile to the required size (doesn't effect memory)
+            g_kai_tls_qgemm.output_tile.resize(tile_elems);


Ditto - Is Reserve + Resize necessary ?

hariharans29 · 2025-10-28T19:45:00Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+// Thread-local reusable buffers to reduce allocation overhead across tiles.
+struct KaiTlsBuffersQgemm {
+    std::vector<float> output_tile;
+    std::vector<float> bias_zero;


Is bias_zero used somewhere ?

addressed in the new commit

hariharans29 · 2025-10-28T19:56:09Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+            g_kai_tls_qgemm.output_tile.resize(tile_elems);
+        }
+        float* temp_tile = g_kai_tls_qgemm.output_tile.data();
+        std::fill_n(temp_tile, TileSizeM * TileSizeN, 0.0f);


Is this buffer zeroing absolutely needed, i.e. does the micro-kernel accumulate into the existing contents ?

Is there a concept of disregarding existing contents in the output buffer in the micro-kernel's interface ?

We can remove the fill_n; the kernel handles zeroing of the tile.

hariharans29 · 2025-10-28T20:01:23Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+    LhsPackedData = g_kai_tls_qgemm.lhs_packed.data();
+
+    //Per-batch table of lhs
+    std::vector<const std::byte*> LhsBase(BatchSize);


Just a thought - Can this vector containing the per-batch addresses be moved into the KaiTlsBuffersQgemm struct and be resized when its size is less than the BatchSize ?

The pro of that approach:

We generally expect the BatchSize to be stable across runs and that will mean we can do away with the dynamic memory allocation latency variance that comes with using std::vector

The con of that approach:

The size of that caching vector will be bound by the highest batch size that the kernel will encounter.

Given that the batch sizes are generally stable across different runs, I am thinking the pro might outweigh the con ?

What are your thoughts on this ?

This is an idea worth trying, and worth measuring the impact of.
I implemented it; here are the results with a single thread:

and with 2 threads:

After: LhsBase is moved inside the TLS structure. Before: LhsBase is a local buffer shared with the threads.

here is the implementation:

//Per-batch table of lhs
if (g_kai_tls_qgemm.LhsBase.capacity() < BatchSize) {
    g_kai_tls_qgemm.LhsBase.reserve(BatchSize);
}
g_kai_tls_qgemm.LhsBase.resize(BatchSize);
// Capture the shared batch table pointer so worker threads use the same backing storage.
const std::byte** tls_lhs_base = g_kai_tls_qgemm.LhsBase.data();
// B batches require no packing
⋮
    kai_run_lhs_quant_pack_qai8dxp_f32(Shape.M, Shape.K, mr, kr, sr, 0, DataParams[batch_idx].A, DataParams[batch_idx].lda*sizeof(float), lhs);
    tls_lhs_base[batch_idx] = lhs;
});
⋮
const std::byte* A_base = tls_lhs_base[BIdx]; // LhsPackedData + LhsPackedStride * BIdx; OR DataParams[batch_idx].Workspace;
auto ATile = reinterpret_cast<const std::byte*>(A_base + lhs_packed_offset);

I suspect perf-wise there isn't much difference; the comment comes from a performance-variance POV. If we performed dynamic memory allocations on every Run(), I suspect we may see some latency variance. I was just wondering if this can be avoided since, in most cases, the Gemm problem shapes stay the same across invocations. Let us dynamically resize only when we encounter a change of shape (batch size). Hope the motivation of the comment is clear now.

The motivation behind the comment is clear: if we expect generally stable batches, reusing the capacity across calls makes sense. If the performance results are also acceptable, we are all good with this idea. Please find the implementation in the latest commit.

hariharans29 · 2025-10-28T20:02:21Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp


-        if (DataParams->Workspace && DataParams->WorkspaceSize >= lhs_size) {
-            lhs = static_cast<std::byte*>(DataParams->Workspace);
+    if (Shape.M == 0 || Shape.N == 0) {


Should there be a Shape.K check for completeness ?

addressed in the newest commit.

hariharans29 · 2025-10-28T20:07:18Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

edgchen1 · 2025-10-28T23:20:26Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+        return;
+    }
+    if ((Shape.M < m_step || Shape.N < n_step) && !DataParams->PackedB) {
+        // Fallback to MLAS


there is no fallback implementation of MlasDynamicQGemmBatch().

onnxruntime/onnxruntime/core/mlas/lib/qgemm.cpp

Lines 212 to 222 in 0f6cffc

#if defined(USE_KLEIDIAI) && !defined(_MSC_VER)
    //No fallback and putting in guards
    if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
        ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
    }
#endif

    MLAS_UNREFERENCED_PARAMETER(Shape);
    MLAS_UNREFERENCED_PARAMETER(DataParams);
    MLAS_UNREFERENCED_PARAMETER(BatchN);
    MLAS_UNREFERENCED_PARAMETER(ThreadPool);

if we get to this point, the computation should happen or (maybe less preferably) it should be a hard error.

We will investigate the fallback case further and try to provide a better implementation.
Until then, we would like your opinion on using ORT_ENFORCE:

ORT_ENFORCE(false, "ArmKleidiAI::MlasDynamicQGemmBatch(): unsupported small-shape case (M < m_step or N < n_step)");

Could we instead implement @edgchen1's suggestion in the other PR: #26302 (comment) to have a universal check that can be used in all places to check if MLAS supports QGemm for that problem shape, platform, etc. ?

Also, since we have a check on the M dimension, this might need some thinking - In the current setup, we turn off MLAS usage for QGemm in PrePack() if we don't detect SME or if the weight's shape doesn't match the requirements in PrePack(). See here and here. The M dimension won't be known in PrePack().

Just curious - what would happen if the M was < m_step ? Would there be a crash or would the perf be sub-optimal ? If so, we need to add a runtime check in the CPU kernel's Run() function which means we may need to perform pre-packing for both KAI and the "regular" path. See here.
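The constraint discussed above (N and K are known at PrePack(), M only at Run()) can be captured by a single predicate that accepts an optional M. The following is a rough sketch under stated assumptions: `DynQGemmQuery`, `IsDynamicQGemmSupported`, and the `min_m` threshold are hypothetical names for illustration, not MLAS API, and whether a small M is actually unsupported is still an open question in this thread.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical query object; none of these names exist in MLAS.
struct DynQGemmQuery {
    bool backend_available;        // e.g. the result of a platform/SME probe
    std::size_t N;                 // weight columns, known at PrePack()
    std::size_t K;                 // reduction dimension, known at PrePack()
    std::optional<std::size_t> M;  // rows of A; unknown until Run()
};

// One check usable from every call site. At PrePack() time M is left empty
// and only the platform and weight-shape conditions apply; at Run() time the
// caller fills in M so the same function can also apply the row constraint.
inline bool IsDynamicQGemmSupported(const DynQGemmQuery& q, std::size_t min_m = 1) {
    if (!q.backend_available) return false;
    if (q.N == 0 || q.K == 0) return false;
    if (q.M.has_value() && *q.M < min_m) return false;  // min_m is an assumed threshold
    return true;
}
```

If the Run()-time answer can differ from the PrePack()-time answer, the kernel would need a packed B for both paths, which is the cost raised in the comment above.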

edgchen1 · 2025-10-28T23:40:53Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+
+        // Final output tile pointer
+        float* dst_tile = reinterpret_cast<float*>(CTile);
+        std::memcpy(dst_tile, temp_tile, TileSizeM * TileSizeN * sizeof(float));


what's the benefit of writing to a temporary buffer (temp_tile) and then copying it to dst_tile instead of directly writing to dst_tile?

The idea behind it was to make the arithmetic on the temporary tile less error-prone, as was done in the SGEMMs. But I see that doing the calculations on the destination and writing to it directly lowers the complexity.

Instead of having the result in each TLS buffer and copying it to the destination tile, the destination tile can receive the result directly.
Measuring the impact :
single thread:

2 threads :

hariharans29 · 2025-10-30T21:11:42Z

Will trigger CI once you push commits addressing the PR feedback (right now I only see a rebase). Thanks.

melkap01-Arm · 2025-10-31T17:25:30Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In current implementation tests are supported for thread pool = null. We created a follow up ticket for test coverage.

hariharans29 · 2025-10-31T17:48:04Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In current implementation tests are supported for thread pool = null. We created a follow up ticket for test coverage.

If all the tests are with ThreadPool == null, does that mean the new threadpool based parallel code path(s) are not exercised ?

melkap01-Arm · 2025-11-04T10:46:55Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In current implementation tests are supported for thread pool = null. We created a follow up ticket for test coverage.

If all the tests are with ThreadPool == null, does that mean the new threadpool based parallel code path(s) are not exercised ?

It means the parallel path was not exercised in the onnxruntime_mlas_test run, but it is exercised in onnxruntime_perf_test. However, unit tests for the multithreaded code have now been added in the latest commit, so both can now use multiple threads.

Signed-off-by: melkap01 <[email protected]>

unused variable removed, unnecessary temp_tile use and copy removed, K==0 case checked Signed-off-by: melkap01 <[email protected]>

Signed-off-by: melkap01 <[email protected]>

hariharans29 · 2025-12-11T08:23:01Z

onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc

-  // Indicates that the biases are a constant input and thus already quantized / packed
-  bool dynamic_quant_mlas_bias_data_was_packed_{false};
-#endif
+  // Flag storage is handled by MatMulIntegerBase.


Comment on line 200 is dangling ("Indicates when....") - I guess it is no longer relevant given that the flags have moved....

will be addressed in the new commit

hariharans29 · 2025-12-11T08:29:19Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+    dim[2] = MlasDivRoundup(Shape.N, n_step);  // N
+
+    // Minimize the kernel call count for the number of available threads
+    auto RequiredTiles = std::min(static_cast<size_t>(MlasGetMaximumThreadCount(ThreadPool)), dim[0] * dim[1] * dim[2]);


Is there room for tuning this heuristic ? What if dim[0] * dim[1] * dim[2] is closer to 2 * Thread_count ? In that case, does it make sense to keep the required tiles as is as the tail processing is quite less or does it make sense to process bigger tiles and keep tile count smaller ?

This line takes the minimum of dim[0] * dim[1] * dim[2] and Thread_count in order to either keep the tile size as is or enlarge it accordingly. The later lines update m_step/n_step by the scale calculated in between; this rebalancing minimises the number of kernel calls. For example, with m_step/n_step = 16/64 initially and a 1x512 C matrix, scaling by the required tiles gives m_step/n_step = 16/256 when #threads=2, 16/192 when #threads=3, and 16/128 when #threads=6...
I believe this logic both minimises the kernel calls and leaves no room for tail processing. I am sorry if I didn't understand the question clearly, but I feel this logic is better for reducing tail processing as it tries to fit the tiles into the C tensor cleanly. Please point out anything this does not answer.
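The rebalancing described in this reply can be sketched as follows. This is an illustration under assumptions, not the PR's code: `TileSteps`, `DivRoundUp`, and `RebalanceTiles` are hypothetical names, and the arithmetic was chosen so that it reproduces the step sizes quoted above for a 1x512 C matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct TileSteps {
    std::size_t m_step;
    std::size_t n_step;
};

inline std::size_t DivRoundUp(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Enlarges the kernel's native step sizes so that the number of tiles is no
// larger than needed to give each available thread one tile.
inline TileSteps RebalanceTiles(std::size_t batch, std::size_t M, std::size_t N,
                                std::size_t m_step, std::size_t n_step,
                                std::size_t threads) {
    assert(batch > 0 && M > 0 && N > 0 && m_step > 0 && n_step > 0 && threads > 0);
    const std::size_t tiles_m = DivRoundUp(M, m_step);
    const std::size_t tiles_n = DivRoundUp(N, n_step);
    // Never ask for more tiles than the native tiling already provides.
    const std::size_t required = std::min(threads, batch * tiles_m * tiles_n);
    // Spread the required tile count over the M and N tile grids.
    const std::size_t per_batch = tiles_m * tiles_n;
    const std::size_t want_m = DivRoundUp(required * tiles_m, per_batch);
    const std::size_t want_n = DivRoundUp(required * tiles_n, per_batch);
    // New steps stay whole multiples of the kernel's native steps.
    m_step *= DivRoundUp(DivRoundUp(M, want_m), m_step);
    n_step *= DivRoundUp(DivRoundUp(N, want_n), n_step);
    return {m_step, n_step};
}
```

For batch = 1, M = 1, N = 512 and native steps 16/64, this yields n_step = 256, 192, and 128 for 2, 3, and 6 threads respectively, matching the figures in the reply. Keeping each step a multiple of the native step means every enlarged tile is still made of whole micro-kernel blocks.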

Fair enough, thanks

hariharans29 · 2025-12-11T08:30:34Z

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

+
+    // tile iteration dimensions
+    std::array<size_t, 3> dim;
+    dim[0] = BatchSize;                  // B


Is there room for optimization in any of the logic below if BatchSize == 1 ?

I am not sure I understood this clearly. BatchSize, dim[0], contributes to the multiplication in RequiredTiles. In the later lines all the calculations go over the other dimensions, not BatchSize. I would argue that the cost of multiplying the other two dims by 1 is negligible, and changing the code to treat this case differently, e.g. checking for BatchSize == 1, would complicate the code for no substantive gain.

Makes sense, thanks

hariharans29 · 2025-12-11T08:31:12Z

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

    is_packed = false;

-    // only pack Matrix B
+    // only pack Matrix B++


Nit: Is ++ a typo ?

yes it is , will be addressed in the new commit

hariharans29 · 2025-12-11T08:34:01Z

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

+    const size_t packed_b_size = MlasDynamicQgemmPackBSize(ctx.N, ctx.K);
+    if (packed_b_size == 0) {
+      can_use_dynamic_quant_mlas_ = false;
+      return true;


Should this return false if can_use_dynamic_quant_mlas_ = false ?

will be addressed in the new commit

hariharans29 · 2025-12-11T08:36:51Z

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

+  bool IsBShapeSupportedForDynamicQuant(const TensorShape& tensor_shape) {
+    b_shape_ = tensor_shape;
+    if (b_shape_.NumDimensions() < 2) {
+      return false;


Low priority question: Can 1-D shapes be promoted to 2-D shapes by pre-pending or appending 1 ?

It is implemented in the latest commit.
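As an illustration of the promotion being asked about, the sketch below appends a 1, which is the usual MatMul convention for a 1-D second operand ([K] is treated as [K, 1]). Whether the commit prepends or appends is not stated in this thread, so that choice is an assumption here, and `GetBMatrixDims` is a hypothetical helper, not part of ONNX Runtime.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Returns {K, N} for the trailing matrix of B, or std::nullopt for a scalar.
// A 1-D shape [K] is promoted to [K, 1] (assumed convention: append a 1).
inline std::optional<std::pair<std::int64_t, std::int64_t>>
GetBMatrixDims(const std::vector<std::int64_t>& dims) {
    if (dims.empty()) {
        return std::nullopt;  // rank 0: nothing to multiply against
    }
    if (dims.size() == 1) {
        return std::make_pair(dims[0], std::int64_t{1});
    }
    // Rank >= 2: the last two dimensions form the K x N matrix.
    return std::make_pair(dims[dims.size() - 2], dims[dims.size() - 1]);
}
```

With a helper of this shape, the rank check in IsBShapeSupportedForDynamicQuant() would only need to reject rank-0 inputs.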

hariharans29 · 2025-12-11T08:37:57Z

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

+    std::optional<Tensor> transposed_buffer;
+  };
+
+  bool TryKleidiaiDynamicPrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,


A brief description of what each of the following helper methods do and look for and when it returns true/false will help the reader.

hariharans29 · 2025-12-11T08:40:30Z

onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp

+ public:
+  void Test(size_t M, size_t N, size_t K, size_t BatchSize) {
+    // Currently, MlasDynamicQGemmBatch() and associated functions require SME or else they are no-ops.
+    if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME())


Could we use the MLAS API to check if dynamic Q Gemm functionality is available ?

hariharans29 · 2025-12-11T08:40:40Z

onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp

+ public:
+  void Test(size_t M, size_t N, size_t K, size_t BatchSize) {
+    // Currently, MlasDynamicQGemmBatch() and associated functions require SME or else they are no-ops.
+    if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME())


Same as above

-MlasIsDynamicQGemmAvailable() used instead of CPUIDInfo::GetCPUIDInfo().HasArm_SME() Signed-off-by: melkap01 <[email protected]>

hariharans29 · 2025-12-18T18:56:40Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-12-18T18:57:00Z

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: melkap01 <[email protected]>

edgchen1 · 2026-01-06T17:08:25Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-06T17:08:48Z

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: melkap01 <[email protected]>

edgchen1 · 2026-01-07T01:53:51Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-07T01:54:10Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2026-01-09T18:51:35Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-09T18:51:57Z

Azure Pipelines successfully started running 4 pipeline(s).

patryk-kaiser-ARM mentioned this pull request Oct 24, 2025

Fix: Disable KleidiAI on systems with SME1 but not SME2 #26399

Closed

hariharans29 reviewed Oct 24, 2025

hariharans29 reviewed Oct 28, 2025

hariharans29 mentioned this pull request Oct 28, 2025

Adding SME1 Convolution Kernel to convole_kleidiai.cpp #26402

Merged

edgchen1 reviewed Oct 28, 2025

hariharans29 mentioned this pull request Oct 30, 2025

Implement FP32 kleidiai Gemv #26302

Merged

melkap01-Arm added 4 commits November 28, 2025 13:31

Implement multithreading in qgemm_kleidi

517e166

Signed-off-by: melkap01 <[email protected]>

fixes addressed:

bd05fce

unused variable removed, unnecessary temp_tile use and copy removed, K==0 case checked Signed-off-by: melkap01 <[email protected]>

lhs_base_table buffer implemented inside TLS

75fee7a

Signed-off-by: melkap01 <[email protected]>

multithreaded qgemms coverage with single-multi threaded

e53e67b

Signed-off-by: melkap01 <[email protected]>

hariharans29 reviewed Dec 11, 2025

melkap01-Arm added 2 commits December 15, 2025 09:12

-Arm KleidiAI helper methods in Mlas space commented.

0c3748b

-MlasIsDynamicQGemmAvailable() used instead of CPUIDInfo::GetCPUIDInfo().HasArm_SME() Signed-off-by: melkap01 <[email protected]>

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

99fe8c5

melkap01-Arm marked this pull request as ready for review December 18, 2025 10:11

melkap01-Arm added 5 commits December 19, 2025 18:15

KleidiAI dynamic quantization supported by promoting 1D B tensor to 2D

47e4c92

Signed-off-by: melkap01 <[email protected]>

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

fb8eefb

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

d9a26bf

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

cd80e56

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

6356e68

melkap01-Arm added 2 commits January 6, 2026 19:31

lintrunner issue fixed

50dddaf

Signed-off-by: melkap01 <[email protected]>

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

017a425

melkap01-Arm added 3 commits January 7, 2026 11:39

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

2ad388c

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

8dc8bc3

Merge branch 'microsoft:main' into melkap01_implement_mt_qgemm

e000f04
