@wangfakang
Contributor

Diagnostic proposal

Large-scale distributed expert parallelism (EP) is gradually becoming the main deployment strategy for MoE models. However, as the EP scale grows, so does the risk of slowdowns in the Dispatch and Combine communication operators. Possible causes range from GPU hardware anomalies and imbalanced MoE computation to faulty communication links, which makes detecting and localizing DeepEP slowdowns extremely challenging.

To address this, we have designed a diagnosis module. Each rank collects the average time it waits to receive each token from every other rank and reports these statistics to rank 0. Using the mean-normalized characteristics of the resulting matrix, rank 0 can detect and precisely localize slow anomalies in distributed communication. The performance overhead of the diagnosis module is negligible. It supports identifying:

    1. Slowdown caused by the destination rank.
    2. Slowdown caused by the source rank.
    3. Slowdown caused by the communication path between a specific source and destination rank.
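
As a rough illustration of the collection step described above, the sketch below shows how per-destination statistics could be averaged and stacked into a `Matrix[src_rank, dst_rank]` layout on rank 0. It is not the code from this PR: the function names, the accumulator inputs, and the 3-rank numbers are all hypothetical, and the real module gathers its statistics inside the communication kernels rather than in NumPy.

```python
import numpy as np


def average_wait_per_token(total_wait, tokens_received):
    """Average wait per token for each source rank, as seen by one destination rank."""
    total_wait = np.asarray(total_wait, dtype=np.float64)
    tokens_received = np.asarray(tokens_received, dtype=np.float64)
    averages = np.zeros_like(total_wait)
    # Leave the average at 0 for sources that sent no tokens.
    np.divide(total_wait, tokens_received, out=averages, where=tokens_received > 0)
    return averages


def assemble_matrix(reports_by_dst):
    """Stack each destination rank's report as a column: result[src_rank, dst_rank]."""
    return np.stack(reports_by_dst, axis=1)


# Hypothetical 3-rank example: one report per destination rank,
# each built from (total wait per source, tokens received per source).
reports = [
    average_wait_per_token([160, 200, 90], [10, 10, 10]),     # from dst rank 0
    average_wait_per_token([130, 190, 0], [10, 10, 0]),       # from dst rank 1
    average_wait_per_token([1000, 1100, 120], [10, 10, 10]),  # from dst rank 2
]
matrix = assemble_matrix(reports)
print(matrix)
```

Stacking reports as columns keeps the convention used in this proposal: a slow destination shows up as a column, a slow source as a row.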

The module maintains a statistical matrix of average receive wait times, Matrix[src_rank, dst_rank], where each row represents a source rank and each column represents a destination rank. Rank indices are zero-based. Example anomaly localization:

  1. Abnormal column 3: indicates destination rank 3 is slow.

      16   13   10  117   18   18   19   12
      10   19   11  118   16   16   16   13
      18   18   12  110   18   19   18   13
      13   18   16   12   12   11   18   18
      14   20   10  114   14   16   18   16
      20   20   15  114   19   13   15   18
      18   17   19  116   10   17   17   19
      15   17   20  118   13   13   15   14

  2. Abnormal row 6: indicates source rank 6 is slow.

      16   13   10   17   18   18   19   12
      10   19   11   18   16   16   16   13
      18   18   12   10   18   19   18   13
      13   18   16   12   12   11   18   18
      14   20   10   14   14   16   18   16
      20   20   15   14   19   13   15   18
     138  137  139  137  130  137   13  139
      15   17   20   18   13   13   15   14

  3. Abnormal entry (3, 4): indicates the path from src=3 to dst=4 is slow.

      16   13   10   17   18   18   19   12
      10   19   11   18   16   16   16   13
      18   18   12   10   18   19   18   15
      13   18   16   12  125   11   18   18
      14   20   10   14   14   16   18   16
      20   20   15   14   19   13   15   18
      18   17   19   17   10   17   17   19
      15   17   20   18   13   13   15   14
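
The three patterns above can be separated with a simple mean-normalization pass. The sketch below is a simplified stand-in, not the analyzer merged in this PR: the `diagnose` function, the fixed threshold of 3.0, and the synthetic 8x8 matrices are assumptions chosen only to reproduce the column, row, and single-entry cases. The output keys mirror the `abnormal_cols`, `abnormal_rows`, and `abnormal_points` fields seen in the diagnostic logs.

```python
import numpy as np


def diagnose(matrix, threshold=3.0):
    """Flag columns, rows, and single entries that stand out after mean normalization."""
    m = np.asarray(matrix, dtype=np.float64)
    norm = m / m.mean()  # assumes the matrix is not all zeros
    abnormal_cols = [int(c) for c in np.flatnonzero(norm.mean(axis=0) > threshold)]
    abnormal_rows = [int(r) for r in np.flatnonzero(norm.mean(axis=1) > threshold)]
    # Report a single entry only when its whole row or column is not already flagged.
    abnormal_points = [
        (int(r), int(c))
        for r, c in np.argwhere(norm > threshold)
        if r not in abnormal_rows and c not in abnormal_cols
    ]
    return {
        "abnormal_cols": abnormal_cols,
        "abnormal_rows": abnormal_rows,
        "abnormal_points": abnormal_points,
    }


# Synthetic matrices (illustrative values, not measurements).
baseline = np.full((8, 8), 15.0)

slow_dst = baseline.copy()
slow_dst[:, 3] = 115.0   # every source waits on destination rank 3

slow_src = baseline.copy()
slow_src[6, :] = 135.0   # every destination waits on source rank 6

slow_path = baseline.copy()
slow_path[3, 4] = 125.0  # only the path src=3 -> dst=4 is slow

for name, mat in [("slow_dst", slow_dst), ("slow_src", slow_src), ("slow_path", slow_path)]:
    print(name, diagnose(mat))
```

A single global threshold is the simplest possible choice. With several simultaneous anomalies the global mean shifts, so a production analyzer would likely need a more robust baseline.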

Test

The following test case simulates slow behavior: rank 2 sleeps for 1 ms before Dispatch, and rank 3 sleeps for 1 ms before Combine. With the diagnosis feature enabled, the abnormal ranks can be efficiently and accurately identified through the diagnostic logs.

MASTER_ADDR=x1 WORLD_SIZE=2 RANK=0 python ./test_low_latency.py --enable-diagnose
MASTER_ADDR=x1 WORLD_SIZE=2 RANK=1 python ./test_low_latency.py --enable-diagnose

[Diagnose] test successful!!! [Dispatch] slow_rank: 2 diagnose info: {'abnormal_cols': [2], 'abnormal_rows': [], 'abnormal_points': []}
[Diagnose] test successful!!! [Combine] slow_rank: 3 diagnose info: {'abnormal_cols': [3], 'abnormal_rows': [], 'abnormal_points': []}
[2025-07-17 17:40:46] [Diagnose] InstanceID: a36ddd16-a8d0-43bf-9e8a-f196f23010d5 EPSize: 16, diagnose: {'abnormal_cols': [2], 'abnormal_rows': [], 'abnormal_points': []}, Dispatch Wait Recv Cost Per Token Matrix[src_rank, dst_rank]:
[     458      436  2089715      414      409      442      489      446   631083   630701   600662   603353   623662   601509   648353   607069]
[    1824      428  2109761      592      420      467     1334      445   645833   631392   604618   613074   631749   613036   661512   613172]
[     482      540      400      398      531      431      413      468      462      417      435      435      412      443      503      443]
[     496      593  2102578      433      397      495      446      516   635600   620213   616995   607728   607846   623413   647436   622338]
[     454      504  2106236      504      438      483      560      441   640223   616259   615725   611239   626197   614345   661815   625302]
[     464      493  2107535      662      499      455      528      524   635921   624701   618855   614211   621635   618019   648553   625600]
[     575      488  2103425      455      521      470      400      509   644805   615734   605293   611079   632458   608621   651734   611419]
[     423      458  2106031      588      429      415      440      426   646507   614305   610228   617194   635773   609468   656905   613554]
[     477      434  2083882      451      560      488      396      449      456      466      454      432      441      419     3791      468]
[     602      551  2091881      450      511      452      410      540     8318      416      481      593     2854      504    13156      406]
[    2792     2244  2104466      497      537      467     1904      453    18365     5799      458      462    16904     4561    27280     7967]
[   12787     9139  2100530     5827     7124     5457    14654     1651    28623    17839     8014      436    24580    16211    37229    18754]
[     453      679  2086719      598      413      533      511      436     3055      533      464      468      510      427     5974      447]
[    3123      462  2098418     3769      460     1135     3716      423     9868     1782      427      471     3711      412    15862      857]
[     588      595  2074080      569      561      541      416      491     2404      427      530      528      418      508      537      487]
[     449      491  2088717      455      611      482      520      553     5556     1405      584      502     1128      526     8714      445]
[2025-07-17 17:40:46] [Diagnose] InstanceID: a36ddd16-a8d0-43bf-9e8a-f196f23010d5 EPSize: 16, diagnose: {'abnormal_cols': [3], 'abnormal_rows': [], 'abnormal_points': []}, Combine Wait Recv Cost Per Token Matrix[src_rank, dst_rank]:
[   19072     7279    20561  2072268    15997    13573    21674      673  1263147  1148066  1167954  1252067  1285783  1180916  1290758  1178798]
[   47104    26949    25853  2085599    20598    12922    17269     7592  1213723  1115401  1167540  1282320  1297273  1169337  1283833  1168333]
[    6296    21619    24966  2076229    37408    14371    12938     1016  1246273  1139305  1153039  1129926  1233979  1181000  1259534  1149558]
[     739      633      584     7801      656      667      611      604      674      572      587      584      603      562      567      663]
[   25043    17020    21670  2084261    25272    25071    15836     3110  1230170  1165741  1157909  1195771  1230103  1182279  1247737  1189596]
[   21088    17349    31263  2078697    21184    26455    16925    18211  1240208  1137336  1137735  1180834  1256534  1194188  1253146  1169898]
[   13794    23841    23786  2089254    18234    25111    23887    32920  1242799  1124483  1171214  1194640  1227864  1192370  1257611  1192630]
[   29019    10497    22257  2076324    16723    10625    20293    31106  1240523  1170728  1170950  1178887  1219817  1191883  1250422  1170812]
[    8392     8423    25567  2069109    18190      580     9496    36811    15278    12900     2577    22902     9648     6556    18032      761]
[   31205     9006    13939  2076012      736    16169    17915    20850    29518    19928    26908    26724    22044    33417    11850     4338]
[    6611      568    21173  2040542    12961      549     1226    31314    29411    26835    25064    27549    20819    26131    35882     8051]
[    7102      553      551  2068760      565      558      793    14701     6394     1799     1266    17533     3819     1104      588      602]
[   19826      566    16792  2071241    10065      608     1522    39026    25604     9060    22092    56178    16785    29829    25597     7929]
[   16069      611    27698  2079775    44330      589      558    28203    21146    16350      702    21377     6320    22852    18327      789]
[   39488      554    15426  2045748    42412      596      555    31918    21832    16259    10681    32967     5057    15117    26542     3061]
[   50092      639    50833  2069901    38149      564    27683    43023    42475    31316    16153    41648    27319    36471    45751    20772]
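
For offline inspection, diagnostic output like the excerpt above can be loaded back into Python. The parser below is a sketch that assumes the line layout seen in this excerpt (a `[Diagnose]` header followed by bracketed rows of integers). The layout is inferred from these logs rather than from a documented format, and `parse_diagnose_log` and the 2-rank `SAMPLE` text are hypothetical.

```python
import ast
import re

HEADER = re.compile(
    r"\[Diagnose\] InstanceID: (?P<instance>\S+) EPSize: (?P<ep_size>\d+), "
    r"diagnose: (?P<diagnose>\{.*?\}), "
    r"(?P<phase>Dispatch|Combine) Wait Recv Cost Per Token Matrix"
)


def parse_diagnose_log(text):
    """Return one dict per diagnostic report: metadata, diagnose result, and matrix rows."""
    reports = []
    current = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            current = {
                "instance": header["instance"],
                "ep_size": int(header["ep_size"]),
                "phase": header["phase"],
                "diagnose": ast.literal_eval(header["diagnose"]),
                "matrix": [],
            }
            reports.append(current)
        elif current is not None and line.strip().startswith("["):
            tokens = line.strip().strip("[]").split()
            # Only accept rows made up entirely of integers; skip other bracketed lines.
            if tokens and all(tok.isdigit() for tok in tokens):
                current["matrix"].append([int(tok) for tok in tokens])
    return reports


# Made-up 2-rank sample in the same layout as the logs above.
SAMPLE = """\
[2025-07-17 17:40:46] [Diagnose] InstanceID: example-id EPSize: 2, diagnose: {'abnormal_cols': [1], 'abnormal_rows': [], 'abnormal_points': []}, Dispatch Wait Recv Cost Per Token Matrix[src_rank, dst_rank]:
[     458  2089715]
[     482      400]
"""

for report in parse_diagnose_log(SAMPLE):
    print(report["phase"], report["diagnose"], report["matrix"])
```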

@alpha-baby
Contributor

Integrating self-diagnostic capabilities into DeepEP would be highly beneficial, particularly for its deployment in production environments. Should issues arise in production, this capability facilitates rapid root cause identification, enabling us to formulate appropriate contingency strategies for different scenarios. This is crucial for ensuring the stability of inference services across the entire cluster.

Moreover, from the perspective of the overall AI training and inference service ecosystem, it would be even more ideal if the underlying systems could autonomously handle fault localization and recovery effectively.

@sphish
Collaborator

sphish commented Jul 30, 2025

Thanks!

@sphish merged commit 4b67064 into deepseek-ai:main Jul 30, 2025
@wangfakang
Contributor Author

A diagnosis analyzer for efficient and precise localization of slow ranks is available at https://github.com/antgroup/DeepXTrace. You are welcome to try it.


3 participants