[Feature] Add diagnostic modules to dispatch and combine #95
Summary of Changes
Hello @oagniqgnat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the deep_ep library by integrating diagnostic capabilities into its core dispatch and combine operations. These new modules enable the collection of crucial performance metrics related to token communication waiting times, providing valuable insights for identifying and resolving bottlenecks in distributed deep learning models. The changes span across C++ and Python components, ensuring comprehensive support for performance analysis.
Highlights
- Diagnostic Modules for Dispatch: Introduced an optional diagnostic module to the `intranode_dispatch` function, allowing it to record the time taken to receive all tokens from each slave rank in the current rank during the dispatch phase. This data is stored in a `torch.Tensor`.
- Diagnostic Modules for Combine: Added an optional diagnostic module to the `intranode_combine` function, which records the time when the current rank sends all tokens to other ranks during the combine phase. This information is also captured in a `torch.Tensor`.
- C++ API and Kernel Integration: The C++ API (`deep_ep.hpp`, `deep_ep.cpp`) and underlying kernel operations (`cam_moe_dispatch_normal`, `cam_moe_combine_normal`) have been updated to accept and process these new diagnostic tensors, including modifications to their tiling functions and op_api definitions.
- Python Binding and Testing: The Python bindings (`buffer.py`) for `dispatch` and `combine` now expose these diagnostic parameters. A new test case (`test_intranode.py`) has been added to validate the diagnostic functionality, including a `diagnose_matrix` utility to analyze the collected statistics.
Code Review
This pull request introduces diagnostic modules to the dispatch and combine functions, allowing for performance monitoring by collecting cost statistics. The changes are propagated through the entire stack, from the Python API down to the C++ kernels. While the overall approach is sound and the necessary plumbing is in place, there are critical flaws in the core profiling logic within the kernels. Specifically, the implementation incorrectly accumulates timestamps instead of time durations, which will lead to incorrect metrics. Additionally, there are a few areas for improvement regarding code clarity, such as a magic number for time conversion and misleading variable names in a utility function. Addressing these issues will ensure the new diagnostic feature provides accurate and meaningful data.
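The timestamp-vs-duration flaw the review describes can be illustrated with a minimal Python sketch (the actual kernels are C++ and use device-side timers; the function and variable names here are hypothetical):

```python
import time

def wait_for_token(delay_s):
    """Stand-in for waiting on a token from a peer rank."""
    time.sleep(delay_s)

# Buggy pattern: accumulating raw timestamps. The running total grows
# with absolute clock time, not with how long each wait actually took.
buggy_total = 0.0
for delay in (0.01, 0.02):
    wait_for_token(delay)
    buggy_total += time.monotonic()  # a timestamp, not a duration

# Correct pattern: accumulate (end - start) durations per wait.
correct_total = 0.0
for delay in (0.01, 0.02):
    start = time.monotonic()
    wait_for_token(delay)
    correct_total += time.monotonic() - start

# correct_total is close to 0.03 s; buggy_total is a meaningless sum
# of clock readings, far larger than the real waiting time.
```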
Code Review
This pull request introduces a valuable diagnostic feature to detect communication anomalies in dispatch and combine operations by measuring wait times within the kernels. The changes are comprehensive, spanning C++ kernels, op definitions, and Python bindings, and include a new Python utility for analyzing the collected data. My review focuses on improving the accuracy of the time measurement and the efficiency of the Python utility function.
Introduce
This feature is primarily used to detect slow anomalies in the dispatch and combine communications within super nodes (Reference #311).
It maintains a statistical matrix of average receive/send wait times: Matrix[src_rank, dst_rank], where each row represents a source rank and each column represents a destination rank.
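As an illustration of how this matrix is laid out, here is a minimal plain-Python sketch (the rank count, values, and the `local_stats` dictionary are fabricated for the example; in the real flow each rank's vector comes from the kernels and would be gathered across ranks, e.g. with a `torch.distributed` collective):

```python
num_ranks = 4

# Each rank's local stats vector: wait time observed for traffic
# with every peer (fabricated values; diagonal is self-traffic).
local_stats = {
    rank: [0 if peer == rank else 100 + 10 * peer for peer in range(num_ranks)]
    for rank in range(num_ranks)
}

# Stacking the per-rank vectors yields Matrix[src_rank, dst_rank]:
# row r is the vector reported by rank r, column c refers to peer c.
matrix = [local_stats[rank] for rank in range(num_ranks)]

for row in matrix:
    print(row)
```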
Usage
dispatchoperator, pass a tensor nameddispatch_wait_recv_cost_statswith a size ofnum_ranks.combineoperator, pass a tensor namedcombine_send_cost_statswith a size ofnum_ranks.The waiting time for receiving/sending will be stored in
dispatch_wait_recv_cost_statsandcombine_send_cost_stats.Then use the
diagnose_matrixinterface inutils.pyto calculate the rank of the anomaly.Test
Example: set thres_col=1.5, thres_row=1.5, thres_point=1.5.
Dispatch phase:
Calculated abnormal ranks:
[Diagnose Dispatch wait recv cost] abnormal_rows [[0, 148229.125, 1.7111639000711096], [1, 240550.0625, 2.7769210882803845], [5, 185996.1875, 2.14714862278825], [13, 236245.5, 2.7272290189546036]], abnormal_cols [], abnormal_points []
Combine phase:
Combine send cost stats: tensor([[ 17535, 67014, 501899, 394013, 380369, 421478, 683155, 592656, 425904, 403497, 414773, 514238, 724636, 708598, 437623, 394191], [ 60848, 15988, 502524, 474103, 513227, 420470, 539840, 485127, 464411, 415381, 365740, 450094, 651227, 522109, 400606, 363934], [491042, 483476, 9967, 56094, 442099, 495325, 497743, 463416, 457214, 412039, 388844, 479295, 469587, 457530, 428783, 352615], [510904, 510450, 62957, 13052, 429710, 480017, 512519, 392768, 393602, 374604, 375297, 538673, 549245, 416121, 360176, 321353], [399463, 617925, 543199, 454542, 12542, 65434, 614647, 424236, 398032, 388118, 329475, 481818, 758660, 564984, 382908, 330476], [478183, 578249, 585834, 523876, 66545, 15937, 784692, 577432, 436999, 393947, 363914, 445326, 726018, 710072, 436198, 403622], [464003, 609982, 501194, 402355, 306554, 321554, 15766, 64817, 459165, 437715, 432248, 571087, 713795, 506062, 342844, 288210], [479972, 512734, 474235, 423101, 391073, 437962, 60894, 11336, 448688, 469527, 456334, 475064, 512557, 463246, 395979, 348762], [527042, 592813, 461439, 375299, 362390, 466999, 593294, 442352, 13326, 62114, 460688, 587821, 678208, 437098, 320981, 355646], [558229, 662661, 570484, 479456, 336138, 440521, 743352, 534235, 61249, 13436, 430387, 579100, 821487, 718146, 411456, 386770], [435739, 613712, 583898, 478798, 315877, 361587, 721543, 519360, 429456, 389244, 12301, 68793, 876324, 615263, 376605, 358820], [517460, 597047, 592513, 486437, 323178, 395890, 762980, 579128, 420911, 393029, 67082, 26066, 868488, 581504, 382181, 332309], [553443, 484408, 526832, 489950, 411433, 440452, 500062, 495189, 511627, 430218, 403097, 516341, 10905, 58856, 482921, 509737], [545272, 500621, 481087, 422905, 387628, 459312, 499742, 466025, 458888, 391166, 419738, 543670, 60611, 12215, 436960, 484767], [560601, 660786, 570498, 412770, 318250, 437199, 721577, 472127, 372920, 354698, 422344, 537366, 826942, 683831, 10575, 66856], [521747, 593666, 606340, 518684, 361110, 
425254, 726548, 621850, 482865, 432528, 374748, 476399, 739924, 756717, 71028, 17964]], device='npu:0', dtype=torch.int32)
Calculated abnormal ranks:
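The thresholding that diagnose_matrix performs can be approximated as follows. This is a simplified re-implementation for illustration, not the actual utils.py code: it flags rows, columns, and individual points whose mean (or value) exceeds the given threshold times the overall off-diagonal mean.

```python
def diagnose_matrix_sketch(matrix, thres_row=1.5, thres_col=1.5, thres_point=1.5):
    """Flag rows, columns, and points whose average wait time is
    disproportionately high compared to the overall mean."""
    n = len(matrix)
    # Ignore the diagonal: self-communication is always cheap.
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    overall = sum(off_diag) / len(off_diag)

    row_means = [sum(v for j, v in enumerate(row) if j != i) / (n - 1)
                 for i, row in enumerate(matrix)]
    col_means = [sum(matrix[i][j] for i in range(n) if i != j) / (n - 1)
                 for j in range(n)]

    # Entries are [index, mean, ratio-to-overall], mirroring the
    # [rank, value, ratio] triples shown in the dispatch output above.
    abnormal_rows = [[i, m, m / overall] for i, m in enumerate(row_means)
                     if m > thres_row * overall]
    abnormal_cols = [[j, m, m / overall] for j, m in enumerate(col_means)
                     if m > thres_col * overall]
    abnormal_points = [[i, j, matrix[i][j] / overall]
                       for i in range(n) for j in range(n)
                       if i != j and matrix[i][j] > thres_point * overall]
    return abnormal_rows, abnormal_cols, abnormal_points

# Rank 2 is slow to receive from everyone, so its row should be flagged.
m = [[0, 10, 10, 10],
     [10, 0, 10, 10],
     [50, 50, 0, 50],
     [10, 10, 10, 0]]
rows, cols, points = diagnose_matrix_sketch(m)
```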
Performance
Across multiple test runs, the measured performance loss in the dispatch and combine stages is less than 1%.