[BACKEND] Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU layout attributes. #2682

Merged
ThomasRaoux merged 2 commits into triton-lang:main from chengjunlu:chengjun/tritongpu_refactor on Nov 29, 2023

@chengjunlu
Contributor

Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU attributes and its inheritance.

@chengjunlu
Contributor Author

Hi @ThomasRaoux,
The CI run failed because the H100 test cases fail:

FAILED hopper/test_persistent_warp_specialized_gemm.py::test_full_static_persistent_matmul_kernel[256-64-16-4-2-256-256-64-False-True-add-rows-float16-False-3-True]

I am looking for an H100 platform to debug this case.
I cannot find any helpful information about the failure in the test log. Do you know where I can find more detailed information about the crash?

@ptillet
Collaborator

ptillet commented Nov 22, 2023

Judging from the crash log:

Thread 0x00007f782dc00700 (most recent call first):
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 474 in read
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 507 in from_io
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1049 in _thread_receiver
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 296 in run
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 361 in _perform_spawn
Current thread 0x00007f782e5dd740 (most recent call first):
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/compiler/compiler.py", line 123 in optimize_ttgir
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/compiler/compiler.py", line 437 in <lambda>
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/compiler/compiler.py", line 543 in compile
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/runtime/jit.py", line 532 in run
  File "/home/ptillet/actions-runner/_work/triton/triton/python/test/unit/hopper/test_persistent_warp_specialized_gemm.py", line 889 in test_full_static_persistent_matmul_kernel
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/python.py", line 1792 in runtest
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/home/ptillet/.local/lib/python3.8/site-packages/xdist/remote.py", line 174 in run_one_test
  File "/home/ptillet/.local/lib/python3.8/site-packages/xdist/remote.py", line 157 in pytest_runtestloop
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/main.py", line 325 in _main
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/home/ptillet/.local/lib/python3.8/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/home/ptillet/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/home/ptillet/.local/lib/python3.8/site-packages/xdist/remote.py", line 355 in <module>
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1157 in executetask
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 296 in run
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 361 in _perform_spawn
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 343 in integrate_as_primary_thread
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1142 in serve
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1640 in serve
  File "<string>", line 8 in <module>
  File "<string>", line 1 in <module>
................................................[gw4] node down: Not properly terminated
F
replacing crashed worker gw4

it seems like a segfault in one of the TTGIR optimization passes.
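
The dump above has the shape produced by Python's `faulthandler` module: each thread lists its frames most recent call first, so the first frame under "Current thread" is the Python call that was active when the process died (here `optimize_ttgir`). The following is a minimal sketch of extracting that frame from such a dump. The names `FRAME_RE` and `crash_site` are hypothetical helpers written for this illustration, not part of Triton or pytest.

```python
import re

# Matches a frame line such as:
#   File "/path/compiler.py", line 123 in optimize_ttgir
FRAME_RE = re.compile(r'\s*File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>.+)')


def crash_site(dump: str):
    """Return (path, line, func) of the innermost frame of the current thread.

    Returns None if the dump has no "Current thread" section with a frame.
    """
    in_current = False
    for text in dump.splitlines():
        if text.startswith("Current thread"):
            in_current = True
            continue
        if in_current:
            match = FRAME_RE.match(text)
            if match is None:
                return None  # section ended before any frame was seen
            return match["path"], int(match["line"]), match["func"].strip()
    return None


# Shortened copy of the dump quoted in this thread.
sample = '''Thread 0x00007f782dc00700 (most recent call first):
  File "/home/ptillet/.local/lib/python3.8/site-packages/execnet/gateway_base.py", line 474 in read
Current thread 0x00007f782e5dd740 (most recent call first):
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/compiler/compiler.py", line 123 in optimize_ttgir
  File "/home/ptillet/.local/lib/python3.8/site-packages/triton/compiler/compiler.py", line 437 in <lambda>
'''

print(crash_site(sample))
```

This only identifies the Python-level call site. The segfault itself happens in native code below that frame, so locating the failing pass still requires a native debugger or similar tooling.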

@chengjunlu chengjunlu marked this pull request as draft November 23, 2023 01:19
Collaborator

@ThomasRaoux ThomasRaoux left a comment

Nice cleanup, thanks!
I think the crash may be related to the splitNum problem I pointed out. Looking at which tests fail, it is definitely related to cases with multiple CTAs.
Let me know if you are not able to get an H100 to debug and I can try to help.

Comment thread lib/Dialect/TritonGPU/IR/Dialect.cpp Outdated
Comment thread lib/Dialect/TritonGPU/IR/Dialect.cpp
@chengjunlu
Contributor Author

> Nice cleanup, thanks! I think the crash may be related to the splitNum problem I pointed out. Looking at which tests fail, it is definitely related to cases with multiple CTAs. Let me know if you are not able to get an H100 to debug and I can try to help.

@ThomasRaoux I really appreciate your careful review of the code. I think the code you pointed out is the root cause of the bug.
I have just got access to an H100. I will fix the issue soon.

@chengjunlu chengjunlu force-pushed the chengjun/tritongpu_refactor branch 2 times, most recently from 57b04f6 to 3912fdb Compare November 27, 2023 06:10
@chengjunlu chengjunlu marked this pull request as ready for review November 27, 2023 06:10
@chengjunlu
Contributor Author

Hi @ThomasRaoux,
I have fixed the issue and changed the code based on your comments.
Could you please review it again?

Collaborator

@ThomasRaoux ThomasRaoux left a comment

LGTM

@ThomasRaoux ThomasRaoux enabled auto-merge (squash) November 29, 2023 07:55
@ThomasRaoux ThomasRaoux disabled auto-merge November 29, 2023 07:56
@ThomasRaoux ThomasRaoux merged commit 2310943 into triton-lang:main Nov 29, 2023
feihugis pushed a commit to feihugis/triton that referenced this pull request Feb 13, 2024
…e TritonGPU layout attributes. (triton-lang#2682)

Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU
attributes and its inheritance.
pingzhuu pushed a commit to siliconflow/triton that referenced this pull request Apr 2, 2024
…e TritonGPU layout attributes. (triton-lang#2682)

Refactor the TritonGPU dialect utils to call the APIs of the TritonGPU
attributes and its inheritance.