Adding multiple enhancement to improve the execution time as well as debugging#5740
Adding multiple enhancement to improve the execution time as well as debugging#5740pdhirajkumarprasad wants to merge 6 commits into
Conversation
…debugging Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
perfci run on commit 34eaf44 |
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (68.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #5740 +/- ##
===========================================
- Coverage 70.41% 66.63% -3.78%
===========================================
Files 1106 1855 +749
Lines 203542 286754 +83212
Branches 30104 40241 +10137
===========================================
+ Hits 143309 191066 +47757
- Misses 48332 79218 +30886
- Partials 11901 16470 +4569
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
…it timeout Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
There was a problem hiding this comment.
Overall
The multi-GPU parallel testing feature is valuable — the 2.5-6x speedup numbers are compelling and the approach (fork/exec + GTest sharding via GTEST_TOTAL_SHARDS/GTEST_SHARD_INDEX) is the right one. There are some changes needed before this merges.
Structural: move parallel execution code to its own file
hipblaslt_gtest_main.cpp goes from a clean 265-line file (setup + main) to 900+ lines. run_tests_parallel_gpus() and the supporting code are a separate concern from GTest setup. Please move them to a new file (e.g. hipblaslt_parallel_test.cpp / .hpp) and add it to CMakeLists.txt.
Remove --internal-parallel-child
This flag exists to gate the verbose timing instrumentation and to suppress one line of output. But the child processes already have stdout/stderr redirected to log files, so suppressing output is moot. And the timing blocks are development scaffolding that shouldn't ship. The child processes don't need to know they're children — the env vars (GTEST_TOTAL_SHARDS, HIP_VISIBLE_DEVICES) are sufficient. Please remove --internal-parallel-child and all the is_parallel_child conditionals.
JSON merge as a follow-up
The merge_gtest_json_files function adds ~150 lines of hand-rolled JSON string parsing which is fragile (substring matching, unsigned underflow risk on empty vectors, repeated linear scans). I'd prefer to land the parallel execution feature without the JSON merge and follow up with a proper implementation. For now, just document that --num_gpus with --gtest_output=json:file.json produces per-GPU files (file_gpu0.json, file_gpu1.json, etc.).
Rename --parallel_gpus to --num_gpus
Shorter, clearer, and reads better: --num_gpus 4 vs --parallel_gpus 4.
|
@talumbau I have updated and pushed the change based on your feedback. Here are brief about the change 1> Created separate file for parallel execution logic |
talumbau
left a comment
There was a problem hiding this comment.
Thanks for addressing many of the previous comments -- the file split, removing JSON merge, removing --internal-parallel-child, the --num_gpus rename, respecting GTEST_LISTENER, and cleaning up /tmp files are all done well.
There are still a few open items from last round, and a structural issue I want to flag that I think is the root cause of most of them.
The arg parsing doesn't belong in run_tests_parallel_gpus().
Right now run_tests_parallel_gpus(int argc, char** argv, int num_gpus) receives the raw command line and then scans through argv twice -- once to build a command string for system(), and again to construct a child argv for execvp. This is the source of multiple bugs:
- The
system()call has the command injection risk I flagged last round (unescaped quotes/backticks in arguments) -- still unfixed. - The
--num_gpus 4(space-separated) form isn't fully stripped:"--num_gpus"is skipped but"4"leaks through to the child as a stray argument. - The skip check uses
arg.find("--num_gpus") != std::string::npos(substring match), which is overly broad.
But the deeper issue is: this function shouldn't be parsing argv at all. Its job is to fork children, set up env vars, and wait for results. Argument parsing belongs in main().
main() should:
- Parse and strip
--num_gpus(already done). - Parse and strip
--gtest_outputif present, noting the filename for per-GPU renaming. - Pass a clean argv (custom flags already removed) to the parallel runner, along with any parsed values it needs.
Then run_tests_parallel_gpus doesn't need to scan argv at all -- it just passes the clean argv straight to execvp, only adding env vars (HIP_VISIBLE_DEVICES, GTEST_TOTAL_SHARDS, GTEST_SHARD_INDEX, OMP_NUM_THREADS) and modifying the output filename. The signature would look something like:
int run_tests_parallel_gpus(int argc, char** argv, int num_gpus,
const std::string& json_output_base);where argv has already had --num_gpus removed, and json_output_base is the parsed output path (empty string if not specified).
This also eliminates the need for the system() call entirely -- see inline comments.
Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
talumbau
left a comment
There was a problem hiding this comment.
Thanks -- this revision addresses all the blocking issues from the last review. The system() block is gone, arg parsing is in main(), the child logic is in its own function, and fork failure cleans up properly. Nice work.
One suggestion on the argv stripping -- see inline comment.
|
Also in the future, please go back and "resolve" or otherwise comment on my comments so that I can see you have addressed them and haven't missed the comment I left - thanks! |
|
hey @bnemanich this PR is about ready in my view. Is there anyone else you would suggest to review for changes in this part of the code? |
Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
|
@davidd-amd can you review this once? |
|
This pull request has been inactive for 25 days and will be marked as stale. If you would like to keep this PR open, please:
This PR will be automatically closed in 5 days if no further activity occurs. |
|
My biggest concern with this change is that it will introduce orphaned and/or zombie processes. I am working on building a list of scenarios that could lead to orphaned processes and I will also provide some alternatives that we could consider through either ctest or gtest so we don't need to worry about managing process creation and reaping. |
Motivation
Through this PR, we want to achieve multiple improvement in hipblaslt-test
hipblaslt-testruns on single GPU so even when we have multi-gpu system, we don't have way to utilize the complete system[Test #3/180](CurrentTestRunning/TotalTest) to existing[ RUN ]so that it's easy to understand the progressTechnical Details
Test Plan
Test Result
Execution time
./clients/hipblaslt-test --gtest_filter="MatrixTransformTest" -> 12961ms (current behavior)
./clients/hipblaslt-test --parallel_gpus 4 --gtest_filter="MatrixTransformTest" -> 6309ms
./clients/hipblaslt-test --parallel_gpus 4 --gtest_filter="MatrixTransformTest" -> 3286ms
./clients/hipblaslt-test --parallel_gpus 8 --gtest_filter="MatrixTransformTest" -> 1609ms
on complete hipblaslt-test
Before Change [==========] 22051 tests from 12 test suites ran. (1266297 ms total)
** After Change **
So we have 2.5X improvement
Further improvement by handling the OpenMP thread correctly
OVERALL SUMMARY (across all GPUs):
Total tests run: 22051
Total PASSED: 22051
Total FAILED: 0
Average time: 194332 ms
so we have > 6x improvement
On Debugability
** Old o/p **
[ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
** New o/p **
[Test #3/2757] [ RUN ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1
[ OK ] _/matmul_test.matmul/pre_checkin_alpha_beta_zero_NaN_bf16_rbf16_rbf16_rbf16_rf32_r_NN_256_128_64_nnan_256_64_2_256_256_1 (4016 ms)
[Test #4/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
[ OK ] _/matmul_test.matmul/quick_matmul_one_f16_rf16_rf16_rf16_rf32_r_NN_1_1_1_0_1_1_2_1_1_1 (2224 ms)
[Test #5/2757] [ RUN ] _/matmul_test.matmul/quick_matmul_one_f32_rf32_rf32_rf32_rf32_r_NN_1_1_1_0_1_1_2_1_1_1
Submission Checklist