Skip to content

Adding multiple enhancement to improve the execution time as well as debugging#5740

Open
pdhirajkumarprasad wants to merge 6 commits into
developfrom
users/dhirajp/hipblaslt_test_improvement
Open

Adding multiple enhancement to improve the execution time as well as debugging#5740
pdhirajkumarprasad wants to merge 6 commits into
developfrom
users/dhirajp/hipblaslt_test_improvement

Conversation

@pdhirajkumarprasad
Copy link
Copy Markdown
Contributor

@pdhirajkumarprasad pdhirajkumarprasad commented Mar 24, 2026

Motivation

Through this PR, we want to achieve multiple improvement in hipblaslt-test

  1. Currently hipblaslt-test runs on single GPU so even when we have multi-gpu system, we don't have way to utilize the complete system
  2. Add string like [Test #3/180] (CurrentTestRunning/TotalTest) to existing [ RUN ] so that it's easy to understand the progress

Technical Details

Test Plan

Test Result

Execution time

./clients/hipblaslt-test --gtest_filter="MatrixTransformTest" -> 12961ms (current behavior)

./clients/hipblaslt-test --parallel_gpus 4 --gtest_filter="MatrixTransformTest" -> 6309ms

./clients/hipblaslt-test --parallel_gpus 4 --gtest_filter="MatrixTransformTest" -> 3286ms

./clients/hipblaslt-test --parallel_gpus 8 --gtest_filter="MatrixTransformTest" -> 1609ms

on complete hipblaslt-test

Before Change [==========] 22051 tests from 12 test suites ran. (1266297 ms total)

** After Change **

OVERALL SUMMARY (across all GPUs):
Total tests run:  22051
Total PASSED:     22051
Total FAILED:     0
Average time:     516142 ms

So we have 2.5X improvement

Further improvement by handling the OpenMP thread correctly


OVERALL SUMMARY (across all GPUs):
Total tests run: 22051
Total PASSED: 22051
Total FAILED: 0
Average time: 194332 ms

so we have > 6x improvement

On Debugability

** Old o/p **
[ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1

** New o/p **
[Test #3/2757] [ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[Test #4/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[Test #5/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1

Submission Checklist

…debugging

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
@math-ci-webhook
Copy link
Copy Markdown

perfci run on commit 34eaf44

math-ci run

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (68.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5740      +/-   ##
===========================================
- Coverage    70.41%   66.63%   -3.78%     
===========================================
  Files         1106     1855     +749     
  Lines       203542   286754   +83212     
  Branches     30104    40241   +10137     
===========================================
+ Hits        143309   191066   +47757     
- Misses       48332    79218   +30886     
- Partials     11901    16470    +4569     
Flag Coverage Δ *Carryforward flag
hipBLAS 90.67% <ø> (+0.02%) ⬆️ Carriedforward from 5566a64
hipBLASLt 40.00% <ø> (ø)
hipCUB 82.21% <ø> (ø) Carriedforward from 5566a64
hipDNN 85.79% <ø> (+5.33%) ⬆️ Carriedforward from 5566a64
hipFFT 56.31% <ø> (?) Carriedforward from 5566a64
hipRAND 76.12% <ø> (?) Carriedforward from 5566a64
hipSOLVER 68.73% <ø> (-0.13%) ⬇️ Carriedforward from 5566a64
hipSPARSE 84.70% <ø> (?) Carriedforward from 5566a64
rocBLAS 47.97% <ø> (?) Carriedforward from 5566a64
rocFFT 47.88% <ø> (-2.05%) ⬇️ Carriedforward from 5566a64
rocRAND 57.07% <ø> (?) Carriedforward from 5566a64
rocSOLVER 77.21% <ø> (?) Carriedforward from 5566a64
rocSPARSE 71.48% <ø> (-0.17%) ⬇️ Carriedforward from 5566a64

*This pull request uses carry forward flags. Click here to find out more.
see 1241 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…it timeout

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
@pdhirajkumarprasad pdhirajkumarprasad marked this pull request as ready for review March 26, 2026 08:34
@pdhirajkumarprasad pdhirajkumarprasad requested a review from a team as a code owner March 26, 2026 08:34
@pdhirajkumarprasad pdhirajkumarprasad changed the title [WIP]: Adding multiple enhancement to improve the execution time as well as debugging Adding multiple enhancement to improve the execution time as well as debugging Mar 26, 2026
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Copy link
Copy Markdown
Contributor

@talumbau talumbau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall

The multi-GPU parallel testing feature is valuable — the 2.5-6x speedup numbers are compelling and the approach (fork/exec + GTest sharding via GTEST_TOTAL_SHARDS/GTEST_SHARD_INDEX) is the right one. There are some changes needed before this merges.

Structural: move parallel execution code to its own file

hipblaslt_gtest_main.cpp goes from a clean 265-line file (setup + main) to 900+ lines. run_tests_parallel_gpus() and the supporting code are a separate concern from GTest setup. Please move them to a new file (e.g. hipblaslt_parallel_test.cpp / .hpp) and add it to CMakeLists.txt.

Remove --internal-parallel-child

This flag exists to gate the verbose timing instrumentation and to suppress one line of output. But the child processes already have stdout/stderr redirected to log files, so suppressing output is moot. And the timing blocks are development scaffolding that shouldn't ship. The child processes don't need to know they're children — the env vars (GTEST_TOTAL_SHARDS, HIP_VISIBLE_DEVICES) are sufficient. Please remove --internal-parallel-child and all the is_parallel_child conditionals.

JSON merge as a follow-up

The merge_gtest_json_files function adds ~150 lines of hand-rolled JSON string parsing which is fragile (substring matching, unsigned underflow risk on empty vectors, repeated linear scans). I'd prefer to land the parallel execution feature without the JSON merge and follow up with a proper implementation. For now, just document that --num_gpus with --gtest_output=json:file.json produces per-GPU files (file_gpu0.json, file_gpu1.json, etc.).

Rename --parallel_gpus to --num_gpus

Shorter, clearer, and reads better: --num_gpus 4 vs --parallel_gpus 4.

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_test.cpp Outdated
@pdhirajkumarprasad pdhirajkumarprasad requested a review from a team as a code owner April 6, 2026 07:10
@pdhirajkumarprasad
Copy link
Copy Markdown
Contributor Author

pdhirajkumarprasad commented Apr 6, 2026

@talumbau I have updated and pushed the change based on your feedback. Here are brief about the change

1> Created separate file for parallel execution logic
2> Updated printing test-count when GTEST_LISTENER=NO_PASS_LINE_IN_LOG is set
3> Updated the timeout logic
4> Removed the json merging logic when we are running in parallel
5> removing the output json file from /tmp when it passes

@talumbau talumbau self-requested a review April 8, 2026 15:05
Copy link
Copy Markdown
Contributor

@talumbau talumbau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing many of the previous comments -- the file split, removing JSON merge, removing --internal-parallel-child, the --num_gpus rename, respecting GTEST_LISTENER, and cleaning up /tmp files are all done well.

There are still a few open items from last round, and a structural issue I want to flag that I think is the root cause of most of them.

The arg parsing doesn't belong in run_tests_parallel_gpus().

Right now run_tests_parallel_gpus(int argc, char** argv, int num_gpus) receives the raw command line and then scans through argv twice -- once to build a command string for system(), and again to construct a child argv for execvp. This is the source of multiple bugs:

  • The system() call has the command injection risk I flagged last round (unescaped quotes/backticks in arguments) -- still unfixed.
  • The --num_gpus 4 (space-separated) form isn't fully stripped: "--num_gpus" is skipped but "4" leaks through to the child as a stray argument.
  • The skip check uses arg.find("--num_gpus") != std::string::npos (substring match), which is overly broad.

But the deeper issue is: this function shouldn't be parsing argv at all. Its job is to fork children, set up env vars, and wait for results. Argument parsing belongs in main().

main() should:

  1. Parse and strip --num_gpus (already done).
  2. Parse and strip --gtest_output if present, noting the filename for per-GPU renaming.
  3. Pass a clean argv (custom flags already removed) to the parallel runner, along with any parsed values it needs.

Then run_tests_parallel_gpus doesn't need to scan argv at all -- it just passes the clean argv straight to execvp, only adding env vars (HIP_VISIBLE_DEVICES, GTEST_TOTAL_SHARDS, GTEST_SHARD_INDEX, OMP_NUM_THREADS) and modifying the output filename. The signature would look something like:

int run_tests_parallel_gpus(int argc, char** argv, int num_gpus,
                            const std::string& json_output_base);

where argv has already had --num_gpus removed, and json_output_base is the parsed output path (empty string if not specified).

This also eliminates the need for the system() call entirely -- see inline comments.

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp Outdated
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp
Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp Outdated
Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
Copy link
Copy Markdown
Contributor

@talumbau talumbau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- this revision addresses all the blocking issues from the last review. The system() block is gone, arg parsing is in main(), the child logic is in its own function, and fork failure cleans up properly. Nice work.

One suggestion on the argv stripping -- see inline comment.

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated
@talumbau
Copy link
Copy Markdown
Contributor

Also in the future, please go back and "resolve" or otherwise comment on my comments so that I can see you have addressed them and haven't missed the comment I left - thanks!

@talumbau talumbau requested a review from bnemanich April 10, 2026 18:01
@talumbau
Copy link
Copy Markdown
Contributor

hey @bnemanich this PR is about ready in my view. Is there anyone else you would suggest to review for changes in this part of the code?

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
@pdhirajkumarprasad
Copy link
Copy Markdown
Contributor Author

@davidd-amd can you review this once?

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

  • Add new commits
  • Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

@github-actions github-actions Bot added the Stale PR has no activity for 25+ days label May 17, 2026
@davidd-amd
Copy link
Copy Markdown
Contributor

My biggest concern with this change is that it will introduce orphaned and/or zombie processes. I am working on building a list of scenarios that could lead to orphaned processes and I will also provide some alternatives that we could consider through either ctest or gtest so we don't need to worry about managing process creation and reaping.

@github-actions github-actions Bot removed the Stale PR has no activity for 25+ days label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants