[rocsolver] new stedc code #3098
Codecov Report❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| Metric | develop | #3098 | +/- |
|---|---|---|---|
| Coverage | 84.07% | 75.16% | -8.91% |
| Files | 501 | 321 | -180 |
| Lines | 42479 | 31280 | -11199 |
| Branches | 5459 | 4717 | -742 |
| Hits | 35714 | 23510 | -12204 |
| Misses | 2499 | 4929 | +2430 |
| Partials | 4266 | 2841 | -1425 |
* block_m = 32
* ck block_m = 32
* aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format
* mxfp4_moe v1 pipe
* update format

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: d04eba4]
Pull request overview
Introduces a reworked STEDC (divide-and-conquer) implementation intended to simplify merge logic and improve performance, while reusing existing SYEV/HEEV argument validation.
Changes:
- Switch SYEVD/HEEVD argument checking to reuse `rocsolver_syev_heev_argCheck`.
- Replace large parts of the STEDC merge/deflation/vector-update flow with a new implementation and a new workspace layout.
- Add a shared `bisearch` helper and adjust test initialization to optionally generate the EIG7 case via GPU operations for `bc==1`.
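To illustrate the counting semantics such a binary-search helper provides, here is a minimal host-only sketch. It is not the PR's `bisearch`: the name `bisearch_sketch`, the use of plain `int` in place of `rocblas_int`, and the restriction to ascending input are assumptions made for illustration, and the `REVERSE` flag of the real signature is not modeled because its meaning is not spelled out in this conversation. With `strict == false` the function returns the number of elements smaller than or equal to `val`; with `strict == true` it is assumed to return the number of elements strictly smaller.

```cpp
#include <cassert>

// Hypothetical host-only stand-in for a bisearch-style helper.
// X[0..n) must be sorted in ascending order (an assumption of this sketch).
// Returns how many elements are <  val when strict == true,
//      or how many elements are <= val when strict == false.
template <typename T>
int bisearch_sketch(T val, const T* X, int n, bool strict)
{
    assert(n >= 0);
    int lo = 0;
    int hi = n; // the answer always lies in [lo, hi]
    while(lo < hi)
    {
        const int mid = lo + (hi - lo) / 2;
        // true when X[mid] belongs to the counted prefix
        const bool counted = strict ? (X[mid] < val) : (X[mid] <= val);
        if(counted)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Because the result is a count rather than a found/not-found flag, the same routine can serve as either a lower-bound or an upper-bound search depending on the flag.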
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rocsolver/library/src/lapack/roclapack_syevd_heevd_strided_batched.cpp | Reuses SYEV/HEEV arg checking in strided-batched entrypoint. |
| projects/rocsolver/library/src/lapack/roclapack_syevd_heevd_batched.cpp | Reuses SYEV/HEEV arg checking in batched entrypoint. |
| projects/rocsolver/library/src/lapack/roclapack_syevd_heevd.hpp | Removes local argCheck, wires STEDC call-site workspace changes, includes SYEV/HEEV header. |
| projects/rocsolver/library/src/lapack/roclapack_syevd_heevd.cpp | Reuses SYEV/HEEV arg checking in non-batched entrypoint. |
| projects/rocsolver/library/src/include/lib_device_helpers.hpp | Adds bisearch helper for device/host binary search. |
| projects/rocsolver/library/src/auxiliary/rocauxiliary_stedc.hpp | Major STEDC refactor: new merge/deflation kernels, new memory sizing and workspace usage. |
| projects/rocsolver/library/src/auxiliary/rocauxiliary_stedc.cpp | Updates STEDC workspace allocation to new memory-size API and new buffers. |
| projects/rocsolver/clients/common/lapack/testing_syevd_heevd.hpp | Adds GPU-based EIG7 init for bc==1 and toggles selection via macro. |
extern __shared__ rocblas_int shmem[];
rocblas_int* posi = shmem;
rocblas_int* posf = posi + (1 << (k + 1));
S* shmaxz = reinterpret_cast<S*>(posf + (1 << (k + 1)));
strideD, V, 0, ldv, strideV, workSvecs, workInt);

rocblas_int ngps = blks / (1 << (k + 1));
size_t lmemsize = sizeof(S) * blks + sizeof(rocblas_int) * 2 * (1 << (k + 1));
@@ -39,213 +39,27 @@
#include "rocsolver/rocsolver.h"

#include <algorithm>

ROCSOLVER_BEGIN_NAMESPACE

#define STEDC_BDIM 512 // Number of threads per thread-block used in main stedc kernels
#define STEDC_SOLVE_BDIM 4 // Number of threads per thread-block used in solver kernel
#define STEDC_BDIM_VALUES 4 // Number of threads per thread-block used in mergeValues kernel
- This kernel is to be called with as many groups in x as needed to cover all
the batch_count problems. Each thread will work with a matrix in the batch.
divides the input matrix into 'blks' sub-blocks.
- This kernel is to be called with as many groups in x as needed to cover all
If STRICT = false, it returns the number of elements in 'X' that are smaller than or
equal to 'val' **/
template <typename T>
__device__ __host__ rocblas_int bisearch(T val, T* X, rocblas_int n, bool STRICT, bool REVERSE)
#define USE_GPU true
if((std::getenv("TEST_EIG7") != nullptr) || (std::getenv("SYEVD_TEST_EIG7") != nullptr))
{
    syevd_heevd_eig7_initData<CPU, GPU>(handle, evect, n, dA, lda, bc, hA, A, test);
    if(USE_GPU)
        syevd_heevd_eig7_initData_gpu<CPU, GPU>(handle, evect, n, dA, lda, bc, hA, A, test);
    else
        syevd_heevd_eig7_initData<CPU, GPU>(handle, evect, n, dA, lda, bc, hA, A, test);
This pull request has been inactive for 25 days and will be marked as stale. If you would like to keep this PR open, please:
This PR will be automatically closed in 5 days if no further activity occurs.

This PR is still under review and should stay open. It has been delayed by higher priority PRs.
Motivation
Add a better implementation of the divide-and-conquer algorithm. The main objective is to have cleaner, simpler STEDC code.
Technical Details
This new implementation starts the merge process with the eigenvalues/vectors of the leaf nodes in the tree already sorted.
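One reason sorted leaves help is that merging two already-sorted spectra reduces to rank computation: each value's final position is its own index plus the number of values in the other list that precede it. The host-side sketch below shows that idea under stated assumptions. The function name `merge_positions` and the use of `std::lower_bound`/`std::upper_bound` are illustrative choices, not the kernels in this PR, and deflation and the eigenvector update are left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: given two ascending eigenvalue lists a and b,
// return pos such that element i of the concatenation [a, b] lands at
// index pos[i] of the merged ascending list. Ties place a before b.
std::vector<int> merge_positions(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> pos(a.size() + b.size());

    // a[i] goes after the i earlier entries of a and every entry of b that is < a[i]
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        const auto smaller_in_b = std::lower_bound(b.begin(), b.end(), a[i]) - b.begin();
        pos[i] = static_cast<int>(i) + static_cast<int>(smaller_in_b);
    }

    // b[j] goes after the j earlier entries of b and every entry of a that is <= b[j]
    for(std::size_t j = 0; j < b.size(); ++j)
    {
        const auto not_larger_in_a = std::upper_bound(a.begin(), a.end(), b[j]) - a.begin();
        pos[a.size() + j] = static_cast<int>(j) + static_cast<int>(not_larger_in_a);
    }

    return pos;
}
```

Each position depends only on one element and one binary search, so the computation is independent per element, which is the property that makes this style of merge attractive on a GPU. Using a strict comparison in one direction and a non-strict one in the other keeps the result a valid permutation when the two lists share a value.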
Test Plan
The normal unit tests, the extended stress tests, and the performance tests were executed.
Test Result
The new code is simpler and a bit faster than the current code (up to 8 ms for the larger sizes across all matrix initializations), without any relevant changes in accuracy.