Benchmark for distributed matmul with overlap #4326
!test
Review updated until commit c798fe2
nsarka left a comment:
Can you post the results? The benchmark looks good.
Force-pushed 67638f9 to 0de9e90
!test
cc @Priya2698 to make sure this is aligned with how you plan to set up Python benchmarks
As multidevice benchmarks grow, it would be useful to have them in the dashboard for visualization; see http://nv/eGB for examples. We can use … — this can unify the multidevice benchmarks with the existing utilities. I am not sure, though, whether pytest-benchmark can interfere with any multidevice calls, such as barrier. I would think it works, since this PR already uses …
I fully agree with the idea. I started to do as you describe, but then got caught up debugging and exploring some strange performance behavior, so I simplified the structure. Tbh, this PR was not really ready for review :) Let me tidy up the PR and ping you again when it is ready.
Force-pushed b1f3e76 to dfe4eaf
!test
@Priya2698 @wujingyue the PR is ready for review. This PR now depends on another PR, #4466, which resolves the strange benchmark results I obtained. The benchmark now uses the pytest structure and measures each iteration separately (as is preferable, as we discussed) through CUDA events.
Force-pushed e80c834 to c798fe2
I removed the dependency of this PR on the allocation cache (#4466).
!test
```python
return None

# Set our custom CUDA event timer
benchmark._timer = CUDAEventTimer()
```
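pytest-benchmark treats the timer as a zero-argument callable that returns seconds, so a CUDA-event-based replacement has to map GPU-timeline timestamps onto that interface. The PR's actual `CUDAEventTimer` is not shown in this thread; the following is only a minimal sketch of how such a timer could look, with all implementation details assumed:

```python
class CUDAEventTimer:
    """Sketch of a pytest-benchmark-compatible timer backed by CUDA events.

    pytest-benchmark calls the timer before and after a measured run and
    subtracts the two readings, so this sketch returns a monotonically
    accumulating count of GPU-measured seconds.
    """

    def __init__(self):
        self._prev = None      # last recorded CUDA event
        self._elapsed_s = 0.0  # accumulated GPU time in seconds

    def __call__(self):
        import torch  # deferred import: construction works without torch/CUDA

        event = torch.cuda.Event(enable_timing=True)
        event.record()
        event.synchronize()
        if self._prev is not None:
            # Event.elapsed_time returns milliseconds
            self._elapsed_s += self._prev.elapsed_time(event) / 1e3
        self._prev = event
        return self._elapsed_s
```

Assigning an instance to `benchmark._timer` (a private attribute, as in the snippet above) would then make every measured interval a GPU-side duration rather than host wall clock.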
Did you run into issues around the precision of the timer?
If no timer precision is set, pytest-benchmark does its own calibration, which can sometimes produce invalid results (see Fuser/benchmarks/python/core.py, lines 112 to 115 in c229fd5).
No, I didn't get any issue with the timer precision. Do you want me to set the precision manually anyway?
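For context, pytest-benchmark's calibration is driven by the measured resolution of the timer: it grows the number of iterations per round until a round is comfortably above what the timer can resolve. A self-contained way to probe a timer's resolution (a rough stand-in for that logic, not pytest-benchmark's actual code):

```python
import time

def probe_timer_resolution(timer=time.perf_counter, samples=30):
    """Return the smallest nonzero delta observed between consecutive
    timer readings, as a rough estimate of the timer's resolution."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = timer()
        t1 = timer()
        while t1 == t0:  # spin until the timer visibly ticks
            t1 = timer()
        smallest = min(smallest, t1 - t0)
    return smallest
```

A CUDA-event-based timer has a much coarser resolution than `perf_counter` (on the order of microseconds), which is why calibration behavior is worth checking when swapping the timer.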
```python
@pytest.mark.parametrize("k", [2**12, 2**16])
@pytest.mark.parametrize("n", [2**10])
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
def test_overlap_allgather_matmul_stream_outermost(
```
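Stacked `parametrize` decorators expand into the full cross-product of their values, so the snippet above generates one benchmark case per (k, n, dtype) combination. A quick pure-Python check of the grid size (dtype strings stand in for the torch dtypes):

```python
from itertools import product

ks = [2**12, 2**16]
ns = [2**10]
dtypes = ["float16", "float32"]  # stand-ins for torch.float16 / torch.float32

cases = list(product(ks, ns, dtypes))
print(len(cases))  # 2 * 1 * 2 = 4 parameter combinations
```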
Is this similar to …? If so, we should remove that instance now.
They are close but slightly different (e.g., the datatype), and I would rather keep them separate because we might not want to test and benchmark the same instances.
> They are close but slightly different (e.g., the datatype)

bfloat16 can be added to the datatypes parameter you already have, though.

> we might not want to test and benchmark the same instances

I am not clear: why would this be problematic? That test is also running a benchmark with validation, similar to the current test case.
Nothing is problematic. I just think it is useful, in general, to separate the instance used for validation from the one used for performance; I thought this was the idea behind having two separate folders, and we should allow the instances to diverge more in the future. Anyway, nothing crucial for now, so if you think it is important, let me know and I'll remove the test and/or remove the validation in the current benchmark.
Got it.
If you intend to modify the test instance, we can keep it unchanged. In its current state it is actually benchmarking as well, which seems redundant with this benchmark addition. Maybe we can make the other one validation-only? It can be a separate PR.
I agree we do not need to measure execution time for the "tests". Btw, I wouldn't trust the numbers there, which IIUC represent host wall-clock time. In our case, the numbers from the CI look wrong, so I would consider them not relevant:
```
test_overlap_allgather_matmul_stream_outermost[s=1-backend_type=CommunicatorBackend.nccl] 248.8578 (1.0)   686.4527 (1.0)  275.4573 (1.0)  86.5904 (1.0)  254.6739 (1.0)   9.0676 (1.0)  1;3 3,630.3265 (1.0)  25 1
test_overlap_allgather_matmul_stream_outermost[s=1-backend_type=CommunicatorBackend.ucc]  256.7070 (1.03)  718.1512 (1.05) 284.6673 (1.03) 91.3450 (1.05) 263.6444 (1.04)  9.1677 (1.01) 1;3 3,512.8730 (0.97) 25 1
test_overlap_allgather_matmul_stream_outermost[s=1-backend_type=CommunicatorBackend.ucc]  262.9953 (1.06)  669.2354 (1.0)  292.9457 (1.06) 80.4101 (1.0)  271.6640 (1.07) 17.0944 (1.43) 1;2 3,413.6025 (0.94) 25 1
test_overlap_allgather_matmul_stream_outermost[s=8-backend_type=CommunicatorBackend.nccl] 1,227.1553 (4.95) 2,761.1768 (4.13) 1,348.3307 (4.90) 307.3344 (3.82) 1,269.3135 (5.00) 62.8864 (5.27) 2;3 741.6578 (0.20) 25 1
test_overlap_allgather_matmul_stream_outermost[s=8-backend_type=CommunicatorBackend.ucc]  1,237.6998 (4.99) 1,651.3430 (2.47) 1,275.0503 (4.63) 81.1988 (1.01) 1,257.0452 (4.95) 28.9301 (2.43) 1;2 784.2828 (0.22) 25 1
```
Got it. How can we monitor whether that issue shows up again?
By having the benchmark run nightly and monitoring that the performance remains in an acceptable range. This could be set up in nvFuser's CI, but the plan is also to include this benchmark in my team's benchmark tool and have a consistent nightly check there.
Add a benchmark unit test for AG+matmul with and without overlap. The test uses the pytest framework and defines a custom timer based on CUDA events.
Results show significant performance benefits of pipelining with UCC for large matrices. Results obtained on a DGX with 8 H100 GPUs:
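Why pipelining helps can be seen with a toy cost model (illustrative assumptions only, not measured data): splitting the allgather and matmul into s chunks exposes only the first communication chunk, while the remaining communication hides behind compute.

```python
def sequential_time(comm, compute):
    """Total time when the full allgather precedes the matmul."""
    return comm + compute

def pipelined_time(s, comm, compute):
    """Estimated total time with s chunks, overlapping chunk i+1's
    allgather with chunk i's matmul."""
    c, m = comm / s, compute / s
    # First comm chunk is exposed, then s-1 overlapped stages run at the
    # rate of the slower side, then the last compute chunk drains.
    return c + (s - 1) * max(c, m) + m

print(sequential_time(8.0, 8.0))    # 16.0
print(pipelined_time(8, 8.0, 8.0))  # 1 + 7 + 1 = 9.0
```

With comm and compute perfectly balanced, the model approaches a 2x speedup as s grows, at the cost of launching more, smaller kernels; real chunked matmuls lose some efficiency, which is why the measured gains appear mainly for large matrices.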