Adding multiple enhancement to improve the execution time as well as debugging by pdhirajkumarprasad · Pull Request #5740 · ROCm/rocm-libraries

pdhirajkumarprasad · 2026-03-24T06:53:52Z

Motivation

This PR makes several improvements to hipblaslt-test:

hipblaslt-test currently runs on a single GPU, so on a multi-GPU system there is no way to use the whole machine. This PR adds an option to run the tests in parallel across GPUs.
Add a string such as [Test #3/180] (current test / total tests) in front of the existing [ RUN ] marker so that progress is easy to follow.

Technical Details

Test Plan

Test Result

Execution time

./clients/hipblaslt-test --gtest_filter="MatrixTransformTest" -> 12961ms (current behavior)

./clients/hipblaslt-test --parallel_gpus 2 --gtest_filter="MatrixTransformTest" -> 6309ms

./clients/hipblaslt-test --parallel_gpus 4 --gtest_filter="MatrixTransformTest" -> 3286ms

./clients/hipblaslt-test --parallel_gpus 8 --gtest_filter="MatrixTransformTest" -> 1609ms

On the complete hipblaslt-test run

Before change: [==========] 22051 tests from 12 test suites ran. (1266297 ms total)

After change:

OVERALL SUMMARY (across all GPUs):
Total tests run:  22051
Total PASSED:     22051
Total FAILED:     0
Average time:     516142 ms

This is roughly a 2.5x improvement.

Further improvement from handling the OpenMP threads correctly:

OVERALL SUMMARY (across all GPUs):
Total tests run: 22051
Total PASSED: 22051
Total FAILED: 0
Average time: 194332 ms

This is a more than 6x improvement over the single-GPU baseline.
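The two ratios quoted above can be checked directly from the reported timings. The sketch below is only an arithmetic check. It assumes that the "Average time" in the summaries is a fair stand-in for the wall-clock time of the parallel run, which the PR does not state explicitly, and the function and constant names are illustrative.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of a run relative to a baseline; both times are in milliseconds.
inline double speedup(double baseline_ms, double new_ms)
{
    assert(new_ms > 0.0);
    return baseline_ms / new_ms;
}

// Timings reported in this PR for the full hipblaslt-test run.
const double kBaselineMs = 1266297.0; // single GPU, before the change
const double kParallelMs = 516142.0;  // average time, after the change
const double kOmpTunedMs = 194332.0;  // average time, with OpenMP threads handled

// Prints both ratios with two decimal places.
void print_speedups()
{
    std::printf("parallel:          %.2fx\n", speedup(kBaselineMs, kParallelMs));
    std::printf("parallel + OpenMP: %.2fx\n", speedup(kBaselineMs, kOmpTunedMs));
}
```

The first ratio comes out just under 2.5 and the second just over 6.5, which matches the "2.5x" and "> 6x" figures.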

On Debuggability

Old output:
[ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1

New output:
[Test #3/2757] [ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[Test #4/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[Test #5/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
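The progress prefix in the new output is a plain formatting step. The sketch below shows one way to build it. The function names are hypothetical and not taken from the PR. In a real GoogleTest binary this would probably be called from a test event listener when a test starts, but the PR's actual hook is not shown here.

```cpp
#include <cassert>
#include <string>

// Builds the "[Test #current/total] " prefix printed before "[ RUN ]".
// `current` is 1-based. Both parameter names are illustrative.
std::string progress_prefix(int current, int total)
{
    assert(current >= 1 && current <= total);
    return "[Test #" + std::to_string(current) + "/" + std::to_string(total) + "] ";
}

// Prepends the progress prefix to a GTest-style "[ RUN ] name" line.
std::string annotate_run_line(int current, int total, const std::string& test_name)
{
    return progress_prefix(current, total) + "[ RUN ] " + test_name;
}
```

For example, `annotate_run_line(3, 2757, name)` yields a line beginning with `[Test #3/2757] [ RUN ]`, as in the new output above.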

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…debugging Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

math-ci-webhook · 2026-03-24T08:19:32Z

perfci run on commit `34eaf44`

math-ci run

codecov-commenter · 2026-03-24T08:21:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (68.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #5740      +/-   ##
===========================================
- Coverage    70.41%   66.63%   -3.78%     
===========================================
  Files         1106     1855     +749     
  Lines       203542   286754   +83212     
  Branches     30104    40241   +10137     
===========================================
+ Hits        143309   191066   +47757     
- Misses       48332    79218   +30886     
- Partials     11901    16470    +4569

| Flag | Coverage Δ | | *Carryforward flag |
|---|---|---|---|
| hipBLAS | `90.67% <ø> (+0.02%)` | ⬆️ | Carriedforward from 5566a64 |
| hipBLASLt | `40.00% <ø> (ø)` | | |
| hipCUB | `82.21% <ø> (ø)` | | Carriedforward from 5566a64 |
| hipDNN | `85.79% <ø> (+5.33%)` | ⬆️ | Carriedforward from 5566a64 |
| hipFFT | `56.31% <ø> (?)` | | Carriedforward from 5566a64 |
| hipRAND | `76.12% <ø> (?)` | | Carriedforward from 5566a64 |
| hipSOLVER | `68.73% <ø> (-0.13%)` | ⬇️ | Carriedforward from 5566a64 |
| hipSPARSE | `84.70% <ø> (?)` | | Carriedforward from 5566a64 |
| rocBLAS | `47.97% <ø> (?)` | | Carriedforward from 5566a64 |
| rocFFT | `47.88% <ø> (-2.05%)` | ⬇️ | Carriedforward from 5566a64 |
| rocRAND | `57.07% <ø> (?)` | | Carriedforward from 5566a64 |
| rocSOLVER | `77.21% <ø> (?)` | | Carriedforward from 5566a64 |
| rocSPARSE | `71.48% <ø> (-0.17%)` | ⬇️ | Carriedforward from 5566a64 |

*This pull request uses carry forward flags.
see 1241 files with indirect coverage changes

…it timeout Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

talumbau

Overall

The multi-GPU parallel testing feature is valuable — the 2.5-6x speedup numbers are compelling and the approach (fork/exec + GTest sharding via GTEST_TOTAL_SHARDS/GTEST_SHARD_INDEX) is the right one. There are some changes needed before this merges.
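For readers unfamiliar with the approach named above, here is a minimal POSIX sketch of fork/exec combined with GoogleTest sharding. It is not the PR's implementation. The function name `run_sharded` is hypothetical, and mapping shard index N to `HIP_VISIBLE_DEVICES=N` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Launches `total_shards` copies of the command in `argv` (argv[0] is the
// program), one per shard, and returns how many children failed.
// Returns -1 if a fork() failed before all shards were started.
int run_sharded(const std::vector<std::string>& argv, int total_shards)
{
    assert(!argv.empty() && total_shards > 0);

    // execvp wants a null-terminated array of char*.
    std::vector<char*> child_argv;
    for(const auto& a : argv)
        child_argv.push_back(const_cast<char*>(a.c_str()));
    child_argv.push_back(nullptr);

    std::vector<pid_t> children;
    for(int shard = 0; shard < total_shards; ++shard)
    {
        pid_t pid = fork();
        if(pid < 0)
            break; // fork failed; still wait for the children already started
        if(pid == 0)
        {
            // Child: GoogleTest reads these two variables to select its shard.
            const std::string index = std::to_string(shard);
            setenv("GTEST_TOTAL_SHARDS", std::to_string(total_shards).c_str(), 1);
            setenv("GTEST_SHARD_INDEX", index.c_str(), 1);
            setenv("HIP_VISIBLE_DEVICES", index.c_str(), 1); // assumed 1:1 mapping
            execvp(child_argv[0], child_argv.data());
            _exit(127); // reached only if execvp failed
        }
        children.push_back(pid);
    }

    int failures = 0;
    for(pid_t pid : children)
    {
        int status = 0;
        if(waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ++failures;
    }
    return children.size() == static_cast<std::size_t>(total_shards) ? failures : -1;
}
```

Because the arguments go to `execvp` as an array, no shell is involved and quoting in the arguments is not interpreted. Calling `setenv` between `fork` and `exec` is acceptable in a single-threaded parent. A multi-threaded parent would more safely build the environment before forking.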

Structural: move parallel execution code to its own file

hipblaslt_gtest_main.cpp goes from a clean 265-line file (setup + main) to 900+ lines. run_tests_parallel_gpus() and the supporting code are a separate concern from GTest setup. Please move them to a new file (e.g. hipblaslt_parallel_test.cpp / .hpp) and add it to CMakeLists.txt.

Remove `--internal-parallel-child`

This flag exists to gate the verbose timing instrumentation and to suppress one line of output. But the child processes already have stdout/stderr redirected to log files, so suppressing output is moot. And the timing blocks are development scaffolding that shouldn't ship. The child processes don't need to know they're children — the env vars (GTEST_TOTAL_SHARDS, HIP_VISIBLE_DEVICES) are sufficient. Please remove --internal-parallel-child and all the is_parallel_child conditionals.

JSON merge as a follow-up

The merge_gtest_json_files function adds ~150 lines of hand-rolled JSON string parsing which is fragile (substring matching, unsigned underflow risk on empty vectors, repeated linear scans). I'd prefer to land the parallel execution feature without the JSON merge and follow up with a proper implementation. For now, just document that --num_gpus with --gtest_output=json:file.json produces per-GPU files (file_gpu0.json, file_gpu1.json, etc.).
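The per-GPU file naming suggested above (file.json becoming file_gpu0.json, file_gpu1.json, and so on) could be derived as in the sketch below. The helper name `per_gpu_output_path` is hypothetical, and the handling of paths without an extension is an assumption, since the review only gives the file.json example.

```cpp
#include <cassert>
#include <string>

// Derives the per-GPU output path, e.g. "file.json" and GPU 1 -> "file_gpu1.json".
// The suffix goes before the extension of the last path component. A path
// without an extension simply gets the suffix appended.
std::string per_gpu_output_path(const std::string& base, int gpu)
{
    assert(gpu >= 0);
    const std::string suffix = "_gpu" + std::to_string(gpu);
    const auto slash = base.find_last_of('/');
    const auto dot   = base.find_last_of('.');
    // A dot counts as an extension separator only if it lies inside the last
    // path component and is not that component's first character.
    const bool has_ext = dot != std::string::npos && dot != 0
                         && (slash == std::string::npos || dot > slash + 1);
    if(!has_ext)
        return base + suffix;
    return base.substr(0, dot) + suffix + base.substr(dot);
}
```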

Rename `--parallel_gpus` to `--num_gpus`

Shorter, clearer, and reads better: --num_gpus 4 vs --parallel_gpus 4.

pdhirajkumarprasad · 2026-04-06T07:14:29Z

@talumbau I have updated and pushed the changes based on your feedback. Here is a brief summary:

1> Created a separate file for the parallel execution logic
2> Updated printing of the test count when GTEST_LISTENER=NO_PASS_LINE_IN_LOG is set
3> Updated the timeout logic
4> Removed the JSON merging logic when running in parallel
5> Removed the output JSON file from /tmp when it passes

talumbau

Thanks for addressing many of the previous comments -- the file split, removing JSON merge, removing --internal-parallel-child, the --num_gpus rename, respecting GTEST_LISTENER, and cleaning up /tmp files are all done well.

There are still a few open items from last round, and a structural issue I want to flag that I think is the root cause of most of them.

The arg parsing doesn't belong in run_tests_parallel_gpus().

Right now run_tests_parallel_gpus(int argc, char** argv, int num_gpus) receives the raw command line and then scans through argv twice -- once to build a command string for system(), and again to construct a child argv for execvp. This is the source of multiple bugs:

The system() call has the command injection risk I flagged last round (unescaped quotes/backticks in arguments) -- still unfixed.
The --num_gpus 4 (space-separated) form isn't fully stripped: "--num_gpus" is skipped but "4" leaks through to the child as a stray argument.
The skip check uses arg.find("--num_gpus") != std::string::npos (substring match), which is overly broad.

But the deeper issue is: this function shouldn't be parsing argv at all. Its job is to fork children, set up env vars, and wait for results. Argument parsing belongs in main().

main() should:

Parse and strip --num_gpus (already done).
Parse and strip --gtest_output if present, noting the filename for per-GPU renaming.
Pass a clean argv (custom flags already removed) to the parallel runner, along with any parsed values it needs.

Then run_tests_parallel_gpus doesn't need to scan argv at all -- it just passes the clean argv straight to execvp, only adding env vars (HIP_VISIBLE_DEVICES, GTEST_TOTAL_SHARDS, GTEST_SHARD_INDEX, OMP_NUM_THREADS) and modifying the output filename. The signature would look something like:

int run_tests_parallel_gpus(int argc, char** argv, int num_gpus,
                            const std::string& json_output_base);

where argv has already had --num_gpus removed, and json_output_base is the parsed output path (empty string if not specified).

This also eliminates the need for the system() call entirely -- see inline comments.
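The stripping step described in this review can be sketched as follows. This is an illustration under stated assumptions, not the PR's code. `ParsedArgs` and `strip_num_gpus` are hypothetical names, and the value is parsed with `std::atoi`, so validation of bad input is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct ParsedArgs
{
    int                      num_gpus = 1; // default: single GPU
    std::vector<std::string> rest;         // arguments with --num_gpus removed
};

// Removes "--num_gpus N" and "--num_gpus=N" using exact matches, so flags
// that merely contain the text "--num_gpus" are passed through untouched.
ParsedArgs strip_num_gpus(const std::vector<std::string>& args)
{
    const std::string flag = "--num_gpus";
    ParsedArgs        out;
    for(std::size_t i = 0; i < args.size(); ++i)
    {
        const std::string& arg = args[i];
        if(arg == flag && i + 1 < args.size())
            out.num_gpus = std::atoi(args[++i].c_str()); // consume the value too
        else if(arg.compare(0, flag.size() + 1, flag + "=") == 0)
            out.num_gpus = std::atoi(arg.c_str() + flag.size() + 1);
        else
            out.rest.push_back(arg); // includes a trailing "--num_gpus" with no value
    }
    return out;
}

// Usage: the space-separated value no longer leaks through to the child.
void example_usage()
{
    ParsedArgs p = strip_num_gpus({"hipblaslt-test", "--num_gpus", "4", "--gtest_filter=*"});
    assert(p.num_gpus == 4);
    assert(p.rest.size() == 2);
}
```

With the flag removed in `main()`, the runner can hand `rest` to `execvp` unchanged.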

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

talumbau

Thanks -- this revision addresses all the blocking issues from the last review. The system() block is gone, arg parsing is in main(), the child logic is in its own function, and fork failure cleans up properly. Nice work.

One suggestion on the argv stripping -- see inline comment.

talumbau · 2026-04-10T18:00:31Z

Also in the future, please go back and "resolve" or otherwise comment on my comments so that I can see you have addressed them and haven't missed the comment I left - thanks!

talumbau · 2026-04-10T18:01:53Z

hey @bnemanich this PR is about ready in my view. Is there anyone else you would suggest to review for changes in this part of the code?

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

pdhirajkumarprasad · 2026-04-15T16:03:13Z

@davidd-amd can you review this once?

github-actions · 2026-05-17T01:09:41Z

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

Add new commits
Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

davidd-amd · 2026-05-18T14:52:16Z

My biggest concern with this change is that it will introduce orphaned and/or zombie processes. I am working on building a list of scenarios that could lead to orphaned processes and I will also provide some alternatives that we could consider through either ctest or gtest so we don't need to worry about managing process creation and reaping.
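To make the zombie half of this concern concrete, the sketch below shows the kind of cleanup a hand-rolled runner needs on its exit paths. It is a hypothetical illustration, not code from the PR. It signals every child it still tracks and then reaps each one with `waitpid`. The function name is an assumption.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Terminates any still-running children and reaps every pid in `children`
// so none is left behind as a zombie. Every pid must be a direct child that
// has not been waited for yet. Returns how many pids were reaped.
int terminate_and_reap(const std::vector<pid_t>& children)
{
    for(pid_t pid : children)
    {
        assert(pid > 0);    // 0 and -1 would address whole process groups
        kill(pid, SIGTERM); // harmless if the child has exited but is unreaped
    }

    int reaped = 0;
    for(pid_t pid : children)
    {
        pid_t r;
        do
        {
            r = waitpid(pid, nullptr, 0); // retry if interrupted by a signal
        } while(r < 0 && errno == EINTR);
        if(r == pid)
            ++reaped;
    }
    return reaped;
}
```

This covers zombies only. Orphans, where the parent dies first, need a separate mechanism. On Linux a child can ask to be signalled when its parent exits via `prctl(PR_SET_PDEATHSIG, ...)`, which is the sort of bookkeeping that delegating to ctest or gtest would avoid.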

Adding multiple enhancement to improve the execution time as well as …

34eaf44

…debugging Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

github-actions Bot added the project: hipblaslt label Mar 24, 2026

assistant-librarian Bot added the organization: ROCm label Mar 24, 2026

Added better handling of OpenMP and option to continue even when we h…

8e5a542

…it timeout Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

pdhirajkumarprasad marked this pull request as ready for review March 26, 2026 08:34

pdhirajkumarprasad requested a review from a team as a code owner March 26, 2026 08:34

pdhirajkumarprasad requested a review from draganmladjenovic March 26, 2026 08:35

pdhirajkumarprasad changed the title ~~[WIP]: Adding multiple enhancement to improve the execution time as well as debugging~~ Adding multiple enhancement to improve the execution time as well as debugging Mar 26, 2026

draganmladjenovic reviewed Mar 31, 2026

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated

pdhirajkumarprasad requested a review from talumbau March 31, 2026 15:52

talumbau requested changes Apr 1, 2026

Updated the change based on feedback on PR

1b905f6

pdhirajkumarprasad requested a review from a team as a code owner April 6, 2026 07:10

talumbau self-requested a review April 8, 2026 15:05

talumbau requested changes Apr 8, 2026

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp Outdated

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_parallel_test.cpp Outdated

Updated the change based on feedback

8d0ddf1

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

pdhirajkumarprasad requested a review from talumbau April 9, 2026 06:16

talumbau reviewed Apr 10, 2026

Comment thread projects/hipblaslt/clients/tests/src/hipblaslt_gtest_main.cpp Outdated

talumbau requested a review from bnemanich April 10, 2026 18:01

parallel execution when num_gpus > 1 and not on Windows

5566a64

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

pdhirajkumarprasad requested review from davidd-amd and talumbau April 13, 2026 04:10

Merge branch 'develop' into users/dhirajp/hipblaslt_test_improvement

b700719

github-actions Bot added the Stale PR has no activity for 25+ days label May 17, 2026

github-actions Bot removed the Stale PR has no activity for 25+ days label May 19, 2026

5 participants
