ditorch

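ditorch is a device-independent torch. It aims to hide the differences between hardware vendors' torch implementations and give users a consistent experience. With ditorch, developers can adapt to the operator libraries of multiple hardware platforms. ditorch also provides the basic tools needed during training to address the pain points that come up in model training.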

Core features

1. Switch PyTorch to domestic chips transparently

Add just two lines of code and domestic chips can be used the same way as official PyTorch.

>>> import torch
>>> import ditorch
ditorch.framework: torch_npu:2.1.0.post3  pid: 1729023
>>> x = torch.randn(3,4,device="cuda")
>>>
>>> y = x + x
>>> x
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 1.3310,  1.0011, -1.0679, -1.5444],
        [-0.7345, -0.9888, -1.7310, -0.3305],
        [-0.6676, -1.7792,  0.7108, -0.9981]], device='cuda:0')
>>> y
tensor([[ 2.6619,  2.0023, -2.1359, -3.0887],
        [-1.4691, -1.9777, -3.4620, -0.6609],
        [-1.3353, -3.5583,  1.4216, -1.9962]], device='cuda:0')
>>>

ditorch + Ascend910: pass status of native PyTorch test cases

ditorch + mlu370_m8: pass status of native PyTorch test cases

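2. Basic tools for solving problems during training

ditorch provides the basic operator tools needed during model training to address the pain points that come up in training.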

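No.	Tool	Description
1	Operator argument capture tool	Captures the actual inputs and outputs of operators during real model training
2	Accuracy analysis tool	Performs offline and real-time accuracy analysis
3	Speed analysis tool	Performs offline and real-time timing analysis to assist performance optimization
4	Operator fallback	Falls back specified or all operators from device execution to CPU computation
5	Operator data type conversion tool	Converts a specific data type of specified or all operators to a given data type for computation
6	Overflow detection tool	Performs overflow detection on specified or all operators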

Operator argument capture tool

Capture the actual inputs and outputs of operators during real model training:

# usage1
import op_tools
capture = op_tools.OpCapture()
capture.start()
code_snippet_to_capture()
capture.stop()
...

# usage2
import op_tools
with op_tools.OpCapture():
    code_snippet_to_capture()
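The two usage styles above (explicit start()/stop() and a with block) follow a common Python pattern. The sketch below shows how one object can support both. ToyCapture and its record() method are hypothetical stand-ins written for illustration; they are not op_tools.OpCapture and say nothing about how op_tools is implemented.

```python
class ToyCapture:
    """Hypothetical stand-in for illustration only, not op_tools.OpCapture."""

    def __init__(self):
        self.active = False
        self.records = []

    def start(self):
        self.active = True

    def stop(self):
        self.active = False

    def record(self, name):
        # Only record while capture is active.
        if self.active:
            self.records.append(name)

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # do not swallow exceptions raised in the body


def run_both_styles():
    # Style 1: explicit start/stop around the code of interest.
    cap = ToyCapture()
    cap.record("before")  # ignored, capture not started yet
    cap.start()
    cap.record("add")
    cap.stop()
    cap.record("after")  # ignored, capture already stopped

    # Style 2: context manager, stop() runs even if the body raises.
    with ToyCapture() as cap2:
        cap2.record("mul")
    return cap.records, cap2.records, cap2.active
```

With the explicit style, wrapping the captured code in try/finally gives the same guarantee that stop() is always called.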

Capture all forward and backward inputs and outputs

...
apply OpCaptureHook on torch.Tensor.add
op_tools_results/op_capture_results/torch.Tensor.add/283699/161/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.add/283699/161/2024-10-09-11-42-15/output.pth saved
torch.Tensor.add    forward_id:161/2024-10-09-11-42-15    /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
+----------------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+-------+
|            name            | device |  dtype  | numel |     shape     |     stride    | requires_grad |     layout    |    data_ptr    | value |
+----------------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+-------+
| torch.Tensor.add inputs[0] | npu:0  | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) |     False     | torch.strided | 20067180408320 |       |
| torch.Tensor.add inputs[1] |        |         |       |               |               |               |               |                | 1e-05 |
|  torch.Tensor.add outputs  | npu:0  | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) |     False     | torch.strided | 20067180474368 |       |
+----------------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+-------+




apply OpCaptureHook on torch.rsqrt
op_tools_results/op_capture_results/torch.rsqrt/283699/162/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.rsqrt/283699/162/2024-10-09-11-42-15/output.pth saved
torch.rsqrt    forward_id:162/2024-10-09-11-42-15    /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
+---------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+
|         name        | device |  dtype  | numel |     shape     |     stride    | requires_grad |     layout    |    data_ptr    |
+---------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+
|  torch.rsqrt inputs | npu:0  | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) |     False     | torch.strided | 20067180474368 |
| torch.rsqrt outputs | npu:0  | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) |     False     | torch.strided | 20067180540416 |
+---------------------+--------+---------+-------+---------------+---------------+---------------+---------------+----------------+




apply OpCaptureHook on torch.Tensor.mul
op_tools_results/op_capture_results/torch.Tensor.mul/283699/163/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.mul/283699/163/2024-10-09-11-42-15/output.pth saved
torch.Tensor.mul    forward_id:163/2024-10-09-11-42-15    /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
+----------------------------+--------+----------+----------+------------------+---------------------+---------------+---------------+----------------+
|            name            | device |  dtype   |  numel   |      shape       |        stride       | requires_grad |     layout    |    data_ptr    |
+----------------------------+--------+----------+----------+------------------+---------------------+---------------+---------------+----------------+
| torch.Tensor.mul inputs[0] | npu:0  | bfloat16 | 33554432 | (1, 16384, 2048) | (33554432, 2048, 1) |      True     | torch.strided | 20074677141504 |
| torch.Tensor.mul inputs[1] | npu:0  | float32  |  16384   |  (1, 16384, 1)   |    (16384, 1, 1)    |     False     | torch.strided | 20067180540416 |
|  torch.Tensor.mul outputs  | npu:0  | float32  | 33554432 | (1, 16384, 2048) | (33554432, 2048, 1) |     False     | torch.strided | 20075012687360 |
+----------------------------+--------+----------+----------+------------------+---------------------+---------------+---------------+----------------+
...
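The "saved" lines in the log above share a regular path layout, which makes it easy to index captured files after a run. The following is a minimal sketch that parses such a line. It assumes the layout op_tools_results/op_capture_results/<op>/<run id>/<forward_id>/<timestamp>/<file>, read off the log itself; the meaning of the fourth component (called run_id here, possibly a process id) is a guess, and parse_saved_line is a hypothetical helper, not part of op_tools.

```python
from pathlib import PurePosixPath

SUFFIX = " saved"


def parse_saved_line(line):
    """Return a dict describing a '<path> saved' log line, or None if it does not match."""
    line = line.strip()
    if not line.endswith(SUFFIX):
        return None
    parts = PurePosixPath(line[: -len(SUFFIX)]).parts
    # Assumed layout: root / results dir / op / run id / forward id / timestamp / file
    if len(parts) != 7 or parts[:2] != ("op_tools_results", "op_capture_results"):
        return None
    _, _, op_name, run_id, forward_id, timestamp, filename = parts
    if not forward_id.isdigit():
        return None
    return {
        "op": op_name,
        "run_id": run_id,  # assumption: looks like a process id in the log
        "forward_id": int(forward_id),
        "timestamp": timestamp,
        "kind": PurePosixPath(filename).stem,  # e.g. "input" or "output"
    }
```

Lines that use a different layout, or that are not "saved" lines at all, simply return None.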

Capture only the arguments of the sort operator and ignore all other operators: OP_CAPTURE_LIST=torch.Tensor.sort

...
skip OpCaptureHook on torch.Tensor.mul
skip OpCaptureHook on torch.Tensor.div
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/output.pth saved
torch.Tensor.sort    forward_id:59/2024-10-09-11-40-14    /deeplink_afs/zhaoguochun/ditorch2/op_tools/test/test_op_capture.py:15 f: sorted, indices = e.sort()  # return torch.return_type.sort
+----------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+
|               name               | device |  dtype  | numel |  shape   |  stride | requires_grad |     layout    |    data_ptr    |
+----------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+
|     torch.Tensor.sort inputs     | npu:0  | float32 |  200  | (10, 20) | (20, 1) |      True     | torch.strided | 20067179830784 |
| torch.Tensor.sort outputs [0][0] | npu:0  | float32 |  200  | (10, 20) | (20, 1) |      True     | torch.strided | 20067179831808 |
| torch.Tensor.sort outputs [0][1] | npu:0  |  int64  |  200  | (10, 20) | (20, 1) |     False     | torch.strided | 20067179832832 |
+----------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+




skip OpCaptureHook on torch.Tensor.__getitem__
skip OpCaptureHook on torch.Tensor.sum
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/grad_inputs.pth saved
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/grad_outputs.pth saved
torch.Tensor.sort forward_id:<built-in function id>
+-------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+
|              name             | device |  dtype  | numel |  shape   |  stride | requires_grad |     layout    |    data_ptr    |
+-------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+
| torch.Tensor.sort grad_output | npu:0  | float32 |  200  | (10, 20) | (20, 1) |     False     | torch.strided | 20067179835904 |
| torch.Tensor.sort grad_inputs | npu:0  | float32 |  200  | (10, 20) | (20, 1) |     False     | torch.strided | 20067179836928 |
+-------------------------------+--------+---------+-------+----------+---------+---------------+---------------+----------------+
...

Exclude the specified operators and capture all others: OP_CAPTURE_DISABLE_LIST="torch.Tensor.add,torch.Tensor.sub"

apply OpCaptureHook on torch.Tensor.to
op_capture_result/0/2024-08-06--11-46/torch.Tensor.to/29/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.to/29/output.pth saved
apply OpCaptureHook on torch.Tensor.mul
op_capture_result/0/2024-08-06--11-46/torch.Tensor.mul/30/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.mul/30/output.pth saved
skip OpCaptureHook on torch.Tensor.add
skip OpCaptureHook on torch.Tensor.sub
apply OpCaptureHook on torch.Tensor.div
op_capture_result/0/2024-08-06--11-46/torch.Tensor.div/31/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.div/31/output.pth saved
apply OpCaptureHook on torch.Tensor.sort
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sort/32/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sort/32/output.pth saved
apply OpCaptureHook on torch.Tensor.sum
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sum/33/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sum/33/output.pth saved
...
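The "apply" and "skip" lines in the logs above suggest a per-operator decision driven by OP_CAPTURE_LIST and OP_CAPTURE_DISABLE_LIST. The sketch below models that decision under stated assumptions: names are matched exactly, an empty OP_CAPTURE_LIST means "capture everything", and the disable list takes precedence when both are set. None of these rules is confirmed by the text above, and should_capture is a hypothetical helper, not an op_tools function.

```python
def parse_op_list(value):
    """Split a comma-separated operator list into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def should_capture(op_name, env):
    """Decide whether op_name would be captured (assumed semantics, see lead-in)."""
    allow = parse_op_list(env.get("OP_CAPTURE_LIST", ""))
    deny = parse_op_list(env.get("OP_CAPTURE_DISABLE_LIST", ""))
    if op_name in deny:
        return False  # assumption: the disable list wins
    if allow:
        return op_name in allow  # assumption: exact-name matching
    return True  # no allow list set: capture everything not disabled
```

Passing os.environ as env would apply the same check to the variables of the current process.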

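Accuracy analysis tool

The accuracy analysis tool supports:

Offline analysis: compare offline, using the actual inputs and outputs from model training.
Real-time accuracy comparison: compare against the CPU in real time during model training.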

# usage1
import op_tools
with op_tools.OpAutoCompare():
    code_snippet_to_autocompare()

# usage2
import op_tools
autocompare = op_tools.OpAutoCompare()
autocompare.start()
code_snippet_to_autocompare()
autocompare.stop()

The accuracy thresholds can be customized through the environment variables AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16, AUTOCOMPARE_ERROR_TOLERANCE_FLOAT32, AUTOCOMPARE_ERROR_TOLERANCE_FLOAT64, and AUTOCOMPARE_ERROR_TOLERANCE. Each value is an "atol,rtol" pair:

# for float16
export AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16="1e-3,1e-4" # atol=1e-3, rtol=1e-4
# for bfloat16
export AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16="1e-2,1e-3" # atol=1e-2, rtol=1e-3
# for other dtype
export AUTOCOMPARE_ERROR_TOLERANCE="1e-4,1e-5" # atol=1e-4, rtol=1e-5
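Each of the variables above carries an "atol,rtol" pair. The following sketch shows one way such settings could be resolved for a given dtype. It assumes that a dtype-specific variable takes priority over the generic AUTOCOMPARE_ERROR_TOLERANCE, as the "for other dtype" comment suggests; tolerance_for and parse_tolerance are hypothetical helpers, and the built-in defaults of the tool are not modelled.

```python
def parse_tolerance(value):
    """Parse an 'atol,rtol' string into a pair of floats."""
    atol_text, rtol_text = value.split(",")
    return float(atol_text), float(rtol_text)


def tolerance_for(dtype_name, env):
    """Return (atol, rtol) for a dtype name such as 'float16', or None if unset."""
    specific = env.get("AUTOCOMPARE_ERROR_TOLERANCE_" + dtype_name.upper())
    generic = env.get("AUTOCOMPARE_ERROR_TOLERANCE")
    if specific:
        return parse_tolerance(specific)  # assumption: dtype-specific wins
    if generic:
        return parse_tolerance(generic)
    return None  # nothing set; the tool's own defaults are not modelled here
```

For example, with the exports shown above, tolerance_for("bfloat16", os.environ) would resolve to the pair from AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16 under these assumptions.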

Excerpt of real-time accuracy analysis output from InternEvo + ditorch + torch_npu on a Huawei 910B

...
torch.Tensor.mul forward_id: 165    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:26 manual_rms_norm: return weight * my_input
+---------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+
|               name              |      shape       |        stride       |  numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    |
+---------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+
|    torch.Tensor.mul inputs[0]   |     (2048,)      |         (1,)        |   2048   | bfloat16 | npu:0  |      True     | torch.strided |  20076634832896 |
|    torch.Tensor.mul inputs[1]   | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20067477619200 |
|     torch.Tensor.mul outputs    | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20067544728576 |
| torch.Tensor.mul inputs(cpu)[0] |     (2048,)      |         (1,)        |   2048   | float64  |  cpu   |     False     | torch.strided |   34492000832   |
| torch.Tensor.mul inputs(cpu)[1] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64  |  cpu   |     False     | torch.strided | 140503487049792 |
|  torch.Tensor.mul outputs(cpu)  | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64  |  cpu   |     False     | torch.strided | 140503218610240 |
+---------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.Tensor.mul input[0]      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul input[1]      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul output        |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+

...
torch.Tensor.reshape forward_id: 173    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modeling_internlm2.py:421 _packed_forward: q = rearrange(q, "b t h gs d -> b t (h gs) d")
+-----------------------------------------+-----------------------+-------------------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
|                   name                  |         shape         |             stride            |  numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    | value |
+-----------------------------------------+-----------------------+-------------------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
|      torch.Tensor.reshape inputs[0]     | (1, 16384, 8, 2, 128) | (67108864, 4096, 512, 128, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20067611837952 |       |
|    torch.Tensor.reshape inputs [1][0]   |                       |                               |          |          |        |               |               |                 |   1   |
|    torch.Tensor.reshape inputs [1][1]   |                       |                               |          |          |        |               |               |                 | 16384 |
|    torch.Tensor.reshape inputs [1][2]   |                       |                               |          |          |        |               |               |                 |   16  |
|    torch.Tensor.reshape inputs [1][3]   |                       |                               |          |          |        |               |               |                 |  128  |
|       torch.Tensor.reshape outputs      |  (1, 16384, 16, 128)  |    (33554432, 2048, 128, 1)   | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20067477619200 |       |
|   torch.Tensor.reshape inputs(cpu)[0]   | (1, 16384, 8, 2, 128) | (33554432, 2048, 256, 128, 1) | 33554432 | float64  |  cpu   |     False     | torch.strided | 140502405996608 |       |
| torch.Tensor.reshape inputs(cpu) [1][0] |                       |                               |          |          |        |               |               |                 |   1   |
| torch.Tensor.reshape inputs(cpu) [1][1] |                       |                               |          |          |        |               |               |                 | 16384 |
| torch.Tensor.reshape inputs(cpu) [1][2] |                       |                               |          |          |        |               |               |                 |   16  |
| torch.Tensor.reshape inputs(cpu) [1][3] |                       |                               |          |          |        |               |               |                 |  128  |
|    torch.Tensor.reshape outputs(cpu)    |  (1, 16384, 16, 128)  |    (33554432, 2048, 128, 1)   | 33554432 | float64  |  cpu   |     False     | torch.strided | 140502405996608 |       |
+-----------------------------------------+-----------------------+-------------------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.Tensor.reshape input[0]  |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.reshape input[1]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                   |
| torch.Tensor.reshape input[2]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                   |
| torch.Tensor.reshape input[3]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                   |
| torch.Tensor.reshape input[4]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                   |
| torch.Tensor.reshape output    |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
...
torch.nn.functional.linear forward_id: 231    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:287 forward: output = F.linear(total_x, weight, bias)  # pylint: disable=E1102
+-------------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
|                    name                   |      shape       |        stride       |  numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    | value |
+-------------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
|    torch.nn.functional.linear inputs[0]   | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20088138762240 |       |
|    torch.nn.functional.linear inputs[1]   |   (2048, 2048)   |      (2048, 1)      | 4194304  | bfloat16 | npu:0  |      True     | torch.strided |  20076626444288 |       |
|    torch.nn.functional.linear inputs[2]   |                  |                     |          |          |        |               |               |                 |  None |
|     torch.nn.functional.linear outputs    | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided |  20076634845696 |       |
| torch.nn.functional.linear inputs(cpu)[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64  |  cpu   |     False     | torch.strided | 140502875758656 |       |
| torch.nn.functional.linear inputs(cpu)[1] |   (2048, 2048)   |      (2048, 1)      | 4194304  | float64  |  cpu   |     False     | torch.strided | 140560622940224 |       |
| torch.nn.functional.linear inputs(cpu)[2] |                  |                     |          |          |        |               |               |                 |  None |
|  torch.nn.functional.linear outputs(cpu)  | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64  |  cpu   |     False     | torch.strided | 140502607319104 |       |
+-------------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+-----------------+-------+
+-------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|                 name                | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+-------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.nn.functional.linear input[0] |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.linear input[1] |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.linear input[2] |   True   | 0.000000000  |    0.000000000    | 0.000000000 | 0.000000000 |                                                   |
|  torch.nn.functional.linear output  |  False   | 0.003906250  |    0.083984375    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+-------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+



op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/device/input.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/device/output.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/cpu/input.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/cpu/output.pth saved
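In the torch.nn.functional.linear table above, the output row reports allclose False because max_abs_diff (0.003906250) is well above atol (0.001). The sketch below shows how the three reported fields could be computed for two flat lists of numbers. It assumes the same criterion as torch.allclose, |device - cpu| <= atol + rtol * |cpu|, with the CPU result as the reference; how op_tools actually computes max_relative_diff is not stated here, so the definition below is an assumption and compare is a hypothetical helper.

```python
def compare(device_values, cpu_values, atol, rtol):
    """Compare device results against a CPU reference, element by element."""
    abs_diffs = [abs(d - c) for d, c in zip(device_values, cpu_values)]
    # Assumption: relative difference is measured against the CPU reference,
    # skipping elements where the reference is exactly zero.
    rel_diffs = [diff / abs(c) for diff, c in zip(abs_diffs, cpu_values) if c != 0]
    allclose = all(
        diff <= atol + rtol * abs(c) for diff, c in zip(abs_diffs, cpu_values)
    )
    return {
        "allclose": allclose,
        "max_abs_diff": max(abs_diffs, default=0.0),
        "max_relative_diff": max(rel_diffs, default=0.0),
    }


# Made-up values chosen so that one element differs by 2**-8 = 0.00390625.
result = compare([1.0, 0.50390625], [1.0, 0.5], atol=1e-3, rtol=1e-3)
```

Here the second element exceeds the bound 0.001 + 0.001 * 0.5, so result["allclose"] is False even though the first element matches exactly.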

...
torch.outer forward_id: 175    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:351 _update_cos_sin_cache: freqs = torch.outer(t, self.inv_freq.to(device=t.device))
+----------------------------+------------+---------+-------+---------+--------+---------------+---------------+----------------+
|            name            |   shape    |  stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------+------------+---------+-------+---------+--------+---------------+---------------+----------------+
|   torch.outer inputs[0]    |  (1025,)   |   (1,)  |  1025 | float32 | npu:0  |     False     | torch.strided | 20067180423168 |
|   torch.outer inputs[1]    |   (64,)    |   (1,)  |   64  | float32 | npu:0  |     False     | torch.strided | 20067179825152 |
|    torch.outer outputs     | (1025, 64) | (64, 1) | 65600 | float32 | npu:0  |     False     | torch.strided | 20067180428800 |
| torch.outer inputs(cpu)[0] |  (1025,)   |   (1,)  |  1025 | float64 |  cpu   |     False     | torch.strided |  36028922944   |
| torch.outer inputs(cpu)[1] |   (64,)    |   (1,)  |   64  | float64 |  cpu   |     False     | torch.strided |  33746424192   |
|  torch.outer outputs(cpu)  | (1025, 64) | (64, 1) | 65600 | float64 |  cpu   |     False     | torch.strided |  33745266368   |
+----------------------------+------------+---------+-------+---------+--------+---------------+---------------+----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.outer input[0]           |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.outer input[1]           |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.outer output             |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...

torch.Tensor.add forward_id: 214    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:100 _torch_apply_rotary_func: out2.copy_(x1 * sin + x2 * cos)
+---------------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+-----------------+
|               name              |       shape       |         stride        |  numel  |  dtype  | device | requires_grad |     layout    |     data_ptr    |
+---------------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+-----------------+
|    torch.Tensor.add inputs[0]   | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided |  20088138762240 |
|    torch.Tensor.add inputs[1]   | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided |  20088172317184 |
|     torch.Tensor.add outputs    | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided |  20088205872128 |
| torch.Tensor.add inputs(cpu)[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 |  cpu   |     False     | torch.strided | 140504225255488 |
| torch.Tensor.add inputs(cpu)[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 |  cpu   |     False     | torch.strided | 140504158142528 |
|  torch.Tensor.add outputs(cpu)  | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 |  cpu   |     False     | torch.strided | 140503009972288 |
+---------------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+-----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.add input[0]      |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add input[1]      |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add output        |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+






torch.Tensor.copy_ forward_id: 215    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64, torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:100 _torch_apply_rotary_func: out2.copy_(x1 * sin + x2 * cos)
+-----------------------------------+-------------------+--------------------------+---------+----------+--------+---------------+---------------+-----------------+
|                name               |       shape       |          stride          |  numel  |  dtype   | device | requires_grad |     layout    |     data_ptr    |
+-----------------------------------+-------------------+--------------------------+---------+----------+--------+---------------+---------------+-----------------+
|    torch.Tensor.copy_ inputs[0]   | (1, 16384, 8, 64) | (16777216, 1024, 128, 1) | 8388608 | bfloat16 | npu:0  |     False     | torch.strided |  20067746056320 |
|    torch.Tensor.copy_ inputs[1]   | (1, 16384, 8, 64) |  (8388608, 512, 64, 1)   | 8388608 | float32  | npu:0  |     False     | torch.strided |  20088205872128 |
|     torch.Tensor.copy_ outputs    | (1, 16384, 8, 64) | (16777216, 1024, 128, 1) | 8388608 | bfloat16 | npu:0  |     False     | torch.strided |  20067746056320 |
| torch.Tensor.copy_ inputs(cpu)[0] | (1, 16384, 8, 64) |  (8388608, 512, 64, 1)   | 8388608 | float64  |  cpu   |     False     | torch.strided | 140503144198208 |
| torch.Tensor.copy_ inputs(cpu)[1] | (1, 16384, 8, 64) |  (8388608, 512, 64, 1)   | 8388608 | float64  |  cpu   |     False     | torch.strided | 140503077085248 |
|  torch.Tensor.copy_ outputs(cpu)  | (1, 16384, 8, 64) |  (8388608, 512, 64, 1)   | 8388608 | float64  |  cpu   |     False     | torch.strided | 140503144198208 |
+-----------------------------------+-------------------+--------------------------+---------+----------+--------+---------------+---------------+-----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.Tensor.copy_ input[0]    |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.copy_ input[1]    |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 |  Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.copy_ output      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
...
torch.Tensor.tolist forward_id: 225
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/multi_head_attention.py:205 _forward: actual_seq_qlen = actual_seq_qlen[1:].tolist()
+-----------------------------------------+-------+--------+-------+-------+--------+---------------+---------------+----------------+-------+
|                   name                  | shape | stride | numel | dtype | device | requires_grad |     layout    |    data_ptr    | value |
+-----------------------------------------+-------+--------+-------+-------+--------+---------------+---------------+----------------+-------+
|        torch.Tensor.tolist inputs       |  (4,) |  (1,)  |   4   | int32 | npu:0  |     False     | torch.strided | 20067179825668 |       |
|    torch.Tensor.tolist outputs [0][0]   |       |        |       |       |        |               |               |                |  4096 |
|    torch.Tensor.tolist outputs [0][1]   |       |        |       |       |        |               |               |                |  8192 |
|    torch.Tensor.tolist outputs [0][2]   |       |        |       |       |        |               |               |                | 12288 |
|    torch.Tensor.tolist outputs [0][3]   |       |        |       |       |        |               |               |                | 16384 |
|     torch.Tensor.tolist inputs(cpu)     |  (4,) |  (1,)  |   4   | int32 |  cpu   |     False     | torch.strided |  33694948224   |       |
| torch.Tensor.tolist outputs(cpu) [0][0] |       |        |       |       |        |               |               |                |  4096 |
| torch.Tensor.tolist outputs(cpu) [0][1] |       |        |       |       |        |               |               |                |  8192 |
| torch.Tensor.tolist outputs(cpu) [0][2] |       |        |       |       |        |               |               |                | 12288 |
| torch.Tensor.tolist outputs(cpu) [0][3] |       |        |       |       |        |               |               |                | 16384 |
+-----------------------------------------+-------+--------+-------+-------+--------+---------------+---------------+----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    | error_info |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+
| torch.Tensor.tolist input      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |            |
| torch.Tensor.tolist output[0]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |            |
| torch.Tensor.tolist output[1]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |            |
| torch.Tensor.tolist output[2]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |            |
| torch.Tensor.tolist output[3]  |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |            |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+
...
torch.Tensor.mul forward_id: 252    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:667 Silu: return F.silu(w1_o) * w2_o
+---------------------------------+------------------+----------------------+-----------+----------+--------+---------------+---------------+-----------------+
|               name              |      shape       |        stride        |   numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    |
+---------------------------------+------------------+----------------------+-----------+----------+--------+---------------+---------------+-----------------+
|    torch.Tensor.mul inputs[0]   | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0  |     False     | torch.strided |  20077980155904 |
|    torch.Tensor.mul inputs[1]   | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0  |     False     | torch.strided |  20076769064448 |
|     torch.Tensor.mul outputs    | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0  |     False     | torch.strided |  20078282145792 |
| torch.Tensor.mul inputs(cpu)[0] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64  |  cpu   |     False     | torch.strided | 140501332246592 |
| torch.Tensor.mul inputs(cpu)[1] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64  |  cpu   |     False     | torch.strided | 140497574117440 |
|  torch.Tensor.mul outputs(cpu)  | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64  |  cpu   |     False     | torch.strided | 140496500371520 |
+---------------------------------+------------------+----------------------+-----------+----------+--------+---------------+---------------+-----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.Tensor.mul input[0]      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul input[1]      |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul output        |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
...
torch.Tensor.mean forward_id: 262    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:13 manual_rms_norm: variance = my_input.to(torch.float32).pow(2).mean(dims, keepdim=True)
+---------------------------------------+------------------+---------------------+----------+---------+--------+---------------+---------------+-----------------+-------+
|                  name                 |      shape       |        stride       |  numel   |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+---------------------------------------+------------------+---------------------+----------+---------+--------+---------------+---------------+-----------------+-------+
|      torch.Tensor.mean inputs[0]      | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float32 | npu:0  |      True     | torch.strided |  20078114374144 |       |
|    torch.Tensor.mean inputs [1][0]    |                  |                     |          |         |        |               |               |                 |   -1  |
|    torch.Tensor.mean inputs keepdim   |                  |                     |          |         |        |               |               |                 |  True |
|       torch.Tensor.mean outputs       |  (1, 16384, 1)   |    (16384, 1, 1)    |  16384   | float32 | npu:0  |      True     | torch.strided |  20067180823040 |       |
|    torch.Tensor.mean inputs(cpu)[0]   | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 |  cpu   |      True     | torch.strided | 140500392681536 |       |
|  torch.Tensor.mean inputs(cpu) [1][0] |                  |                     |          |         |        |               |               |                 |   -1  |
| torch.Tensor.mean inputs(cpu) keepdim |                  |                     |          |         |        |               |               |                 |  True |
|     torch.Tensor.mean outputs(cpu)    |  (1, 16384, 1)   |    (16384, 1, 1)    |  16384   | float64 |  cpu   |      True     | torch.strided |   33746053632   |       |
+---------------------------------------+------------------+---------------------+----------+---------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.mean input[0]     |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.mean input[1]     |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
| torch.Tensor.mean output       |   True   | 0.000000060  |    0.000000152    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
torch.Tensor.argmax forward_id: 296    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:115 update: (shift_labels == (shift_logits.argmax(dim=-1) + pred_shift)), logits_global
+-------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+-------+
|                 name                |     shape      |   stride   |   numel    |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+-------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+-------+
|      torch.Tensor.argmax inputs     | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0  |      True     | torch.strided |  20088635785216 |       |
|    torch.Tensor.argmax inputs dim   |                |            |            |         |        |               |               |                 |   -1  |
|     torch.Tensor.argmax outputs     |    (16384,)    |    (1,)    |   16384    |  int64  | npu:0  |     False     | torch.strided |  20067181522432 |       |
|   torch.Tensor.argmax inputs(cpu)   | (16384, 92544) | (92544, 1) | 1516240896 | float64 |  cpu   |     False     | torch.strided | 140336718184512 |       |
| torch.Tensor.argmax inputs(cpu) dim |                |            |            |         |        |               |               |                 |   -1  |
|   torch.Tensor.argmax outputs(cpu)  |    (16384,)    |    (1,)    |   16384    |  int64  |  cpu   |     False     | torch.strided |   35724171200   |       |
+-------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.argmax input      |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.argmax output     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
torch.Tensor.mul forward_id: 381    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/hybrid_zero_optim.py:609 backward: loss = self.loss_scale * loss
+--------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|                 name                 | shape | stride | numel |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+--------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|  torch.Tensor.mul grad_output(cpu)   |  (1,) |  (1,)  |   1   | float32 |  cpu   |     False     | torch.strided | 139747700714432 |       |
|   torch.Tensor.mul grad_inputs[0]    |       |        |       |         |        |               |               |                 |  None |
|   torch.Tensor.mul grad_inputs[1]    |   ()  |   ()   |   1   | float32 | npu:0  |     False     | torch.strided |  20067181907968 |       |
| torch.Tensor.mul grad_inputs(cpu)[0] |       |        |       |         |        |               |               |                 |  None |
| torch.Tensor.mul grad_inputs(cpu)[1] |   ()  |   ()   |   1   | float64 |  cpu   |     False     | torch.strided | 139747700718656 |       |
+--------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.mul grad[0]       |   True   | 0.000000000  |    0.000000000    | 0.000000000 | 0.000000000 |                                                  |
| torch.Tensor.mul grad[1]       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
torch.Tensor.add_ forward_id: 289    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float32, torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/core/scheduler/no_pipeline_scheduler.py:143 _train_one_batch: loss += moe_loss
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|                  name                 | shape | stride | numel |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|   torch.Tensor.add_ grad_output(cpu)  |   ()  |   ()   |   1   | float32 |  cpu   |     False     | torch.strided | 139747700723072 |       |
|    torch.Tensor.add_ grad_inputs[0]   |   ()  |   ()   |   1   | float32 | npu:0  |     False     | torch.strided |  20067181907968 |       |
|    torch.Tensor.add_ grad_inputs[1]   |       |        |       |         |        |               |               |                 |  None |
| torch.Tensor.add_ grad_inputs(cpu)[0] |   ()  |   ()   |   1   | float64 |  cpu   |     False     | torch.strided | 139747700726528 |       |
| torch.Tensor.add_ grad_inputs(cpu)[1] |       |        |       |         |        |               |               |                 |  None |
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.add_ grad[0]      |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add_ grad[1]      |   True   | 0.000000000  |    0.000000000    | 0.000000000 | 0.000000000 |                                                  |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
torch.Tensor.div_ forward_id: 288    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/core/scheduler/no_pipeline_scheduler.py:142 _train_one_batch: loss /= scale_loss
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|                  name                 | shape | stride | numel |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
|   torch.Tensor.div_ grad_output(cpu)  |   ()  |   ()   |   1   | float32 |  cpu   |     False     | torch.strided | 139747700733376 |       |
|    torch.Tensor.div_ grad_inputs[0]   |   ()  |   ()   |   1   | float32 | npu:0  |     False     | torch.strided |  20067247392256 |       |
|    torch.Tensor.div_ grad_inputs[1]   |       |        |       |         |        |               |               |                 |  None |
| torch.Tensor.div_ grad_inputs(cpu)[0] |   ()  |   ()   |   1   | float64 |  cpu   |     False     | torch.strided | 139747700740672 |       |
| torch.Tensor.div_ grad_inputs(cpu)[1] |       |        |       |         |        |               |               |                 |  None |
+---------------------------------------+-------+--------+-------+---------+--------+---------------+---------------+-----------------+-------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.div_ grad[0]      |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.div_ grad[1]      |   True   | 0.000000000  |    0.000000000    | 0.000000000 | 0.000000000 |                                                  |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
torch.Tensor.sum forward_id: 279    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/losses/ce_loss.py:645 forward: loss = loss_list.sum() / (cond).sum()
+-----------------------------------+----------+--------+-------+---------+--------+---------------+---------------+-----------------+
|                name               |  shape   | stride | numel |  dtype  | device | requires_grad |     layout    |     data_ptr    |
+-----------------------------------+----------+--------+-------+---------+--------+---------------+---------------+-----------------+
| torch.Tensor.sum grad_output(cpu) |    ()    |   ()   |   1   | float32 |  cpu   |     False     | torch.strided | 139747700752832 |
|    torch.Tensor.sum grad_inputs   | (16384,) |  (0,)  | 16384 | float32 | npu:0  |     False     | torch.strided |  20067181907968 |
| torch.Tensor.sum grad_inputs(cpu) | (16384,) |  (0,)  | 16384 | float64 |  cpu   |     False     | torch.strided | 139747699960640 |
+-----------------------------------+----------+--------+-------+---------+--------+---------------+---------------+-----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.Tensor.sum grad          |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
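
The comparison tables above report `allclose`, `max_abs_diff`, and `max_relative_diff` against per-dtype `atol`/`rtol` values. The following is a minimal sketch of how such a row could be derived. It assumes the same closeness rule that `torch.allclose` documents (`|a - b| <= atol + rtol * |b|`). The helper name `compare_to_reference`, the small epsilon in the relative difference, and the sample values are hypothetical illustrations, not op_tools internals or captured data.

```python
def compare_to_reference(device_values, reference_values, atol, rtol):
    """Compare device results against a higher-precision CPU reference.

    Returns (allclose, max_abs_diff, max_relative_diff).
    Hypothetical helper: the real op_tools implementation may differ.
    """
    allclose = True
    max_abs_diff = 0.0
    max_relative_diff = 0.0
    for dev, ref in zip(device_values, reference_values):
        abs_diff = abs(dev - ref)
        max_abs_diff = max(max_abs_diff, abs_diff)
        # Assumption: a tiny epsilon avoids division by zero for ref == 0.
        max_relative_diff = max(max_relative_diff, abs_diff / (abs(ref) + 1e-12))
        # Same element-wise rule as torch.allclose.
        if abs_diff > atol + rtol * abs(ref):
            allclose = False
    return allclose, max_abs_diff, max_relative_diff


# Identical values pass with zero differences.
same = compare_to_reference([1.0, 2.0], [1.0, 2.0], atol=1e-5, rtol=1e-5)

# Illustrative values only: a large-magnitude element can fail the check
# even though its relative difference looks small.
large = compare_to_reference([100000.0], [100002.0], atol=1e-5, rtol=1e-5)
```

Under this rule the tolerance scales with the magnitude of the reference value, so a row can show a sizeable `max_abs_diff` alongside a small `max_relative_diff` and still be reported as not close when the relative difference exceeds `rtol`.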

torch.nn.functional.cross_entropy forward_id: 277    cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/losses/ce_loss.py:635 forward: loss_list = self.loss_fn(
+----------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+
|                        name                        |     shape      |   stride   |   numel    |  dtype  | device | requires_grad |     layout    |     data_ptr    |
+----------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+
| torch.nn.functional.cross_entropy grad_output(cpu) |    (16384,)    |    (1,)    |   16384    | float32 |  cpu   |     False     | torch.strided | 139747701147904 |
|   torch.nn.functional.cross_entropy grad_inputs    | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0  |     False     | torch.strided |  20110110621696 |
| torch.nn.functional.cross_entropy grad_inputs(cpu) | (16384, 92544) | (92544, 1) | 1516240896 | float64 |  cpu   |     False     | torch.strided | 139473962684480 |
+----------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+-----------------+
+----------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|                  name                  | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+----------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.nn.functional.cross_entropy grad |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+----------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
...
+------------+------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
| forward_id |                name                | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+------------+------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+
|    358     |   torch.Tensor.expand output       |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    359     |   torch.Tensor.max input           |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    359     |   torch.Tensor.max output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    360     |   torch.Tensor.int input           |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    360     |   torch.Tensor.int output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    361     |   torch.Tensor.add input[0]        |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    361     |   torch.Tensor.add input[1]        |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    361     |   torch.Tensor.add output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    362     |   torch.Tensor.__index__ input     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    362     |   torch.Tensor.__index__ output    |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    363     |   torch.Tensor.__index__ input     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    363     |   torch.Tensor.__index__ output    |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    364     | torch.Tensor.scatter_add_ input[0] |  False   | 2.062500000  |    0.000011804    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    364     | torch.Tensor.scatter_add_ input[1] |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    364     | torch.Tensor.scatter_add_ input[2] |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    364     | torch.Tensor.scatter_add_ input[3] |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    364     |  torch.Tensor.scatter_add_ output  |  False   | 2.062500000  |    0.000011804    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    365     |   torch.Tensor.expand input[0]     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    365     |   torch.Tensor.expand input[1]     |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    365     |   torch.Tensor.expand output       |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    366     |   torch.Tensor.max input           |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    366     |   torch.Tensor.max output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    367     |   torch.Tensor.int input           |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    367     |   torch.Tensor.int output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    368     |   torch.Tensor.add input[0]        |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    368     |   torch.Tensor.add input[1]        |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    368     |   torch.Tensor.add output          |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    369     |   torch.Tensor.__index__ input     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    369     |   torch.Tensor.__index__ output    |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    370     |   torch.Tensor.__index__ input     |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    370     |   torch.Tensor.__index__ output    |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    371     | torch.Tensor.scatter_add_ input[0] |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    371     | torch.Tensor.scatter_add_ input[1] |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    371     | torch.Tensor.scatter_add_ input[2] |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 |                                                  |
|    371     | torch.Tensor.scatter_add_ input[3] |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    371     |  torch.Tensor.scatter_add_ output  |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    372     |   torch.Tensor.__len__ input       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    372     |   torch.Tensor.__len__ output      |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    373     |   torch.Tensor.__len__ input       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    373     |   torch.Tensor.__len__ output      |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    374     |  torch.Tensor.new_zeros input[0]   |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    374     |  torch.Tensor.new_zeros input[1]   |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    374     |   torch.Tensor.new_zeros output    |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    375     |   torch.cat input[0]               |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    375     |   torch.cat input[1]               |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    375     |   torch.cat output                 |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    376     |   torch.Tensor.__len__ input       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    376     |   torch.Tensor.__len__ output      |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    377     |  torch.Tensor.new_zeros input[0]   |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    377     |  torch.Tensor.new_zeros input[1]   |   True   | 0.000000000  |    0.000000000    | 0.000001000 | 0.000001000 |                                                  |
|    377     |   torch.Tensor.new_zeros output    |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    378     |   torch.cat input[0]               |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    378     |   torch.cat input[1]               |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    378     |   torch.cat output                 |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    379     |   torch.Tensor.add_ input[0]       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    379     |   torch.Tensor.add_ input[1]       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    379     |   torch.Tensor.add_ output         |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    380     |   torch.Tensor.add_ input[0]       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    380     |   torch.Tensor.add_ input[1]       |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    380     |   torch.Tensor.add_ output         |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    381     |   torch.Tensor.mul input[0]        |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    381     |   torch.Tensor.mul input[1]        |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    381     |   torch.Tensor.mul output          |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
|    381     |   torch.Tensor.mul grad[0]         |   True   | 0.000000000  |    0.000000000    | 0.000000000 | 0.000000000 |                                                  |
|    381     |   torch.Tensor.mul grad[1]         |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+------------+------------------------------------+----------+--------------+-------------------+-------------+-------------+--------------------------------------------------+

Offline operator accuracy test

python op_tools/run_op_from_data.py /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/ --acc_check --run_times 1
ditorch.framework: torch_npu:2.1.0.post3
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/1556339/autocompare/268/2024-10-08-16-14-55/device


torch.nn.functional.normalize forward_id: 1    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
+-----------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+-------+
|                      name                     |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    | value |
+-----------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+-------+
|      torch.nn.functional.normalize inputs     | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided |  20067179823104 |       |
|     torch.nn.functional.normalize inputs p    |               |           |           |          |        |               |               |                 |  2.0  |
|    torch.nn.functional.normalize inputs dim   |               |           |           |          |        |               |               |                 |   1   |
|    torch.nn.functional.normalize inputs eps   |               |           |           |          |        |               |               |                 | 1e-12 |
|    torch.nn.functional.normalize inputs out   |               |           |           |          |        |               |               |                 |  None |
|     torch.nn.functional.normalize outputs     | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided |  20076824625152 |       |
|   torch.nn.functional.normalize inputs(cpu)   | (92544, 2048) | (2048, 1) | 189530112 | float64  |  cpu   |      True     | torch.strided | 140583390142528 |       |
|  torch.nn.functional.normalize inputs(cpu) p  |               |           |           |          |        |               |               |                 |  2.0  |
| torch.nn.functional.normalize inputs(cpu) dim |               |           |           |          |        |               |               |                 |   1   |
| torch.nn.functional.normalize inputs(cpu) eps |               |           |           |          |        |               |               |                 | 1e-12 |
| torch.nn.functional.normalize inputs(cpu) out |               |           |           |          |        |               |               |                 |  None |
|   torch.nn.functional.normalize outputs(cpu)  | (92544, 2048) | (2048, 1) | 189530112 | float64  |  cpu   |      True     | torch.strided | 140573367332928 |       |
+-----------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+-------+
+--------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
|                 name                 | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    |                     error_info                    |
+--------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+
| torch.nn.functional.normalize input  |   True   | 0.000000000  |    0.000000000    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.normalize output |   True   | 0.000976562  |    0.007751465    | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
+--------------------------------------+----------+--------------+-------------------+-------------+-------------+---------------------------------------------------+

torch.nn.functional.normalize forward_id: 1    cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+
|                        name                       |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+
|   torch.nn.functional.normalize grad_output(cpu)  | (92544, 2048) | (2048, 1) | 189530112 | float32  |  cpu   |     False     | torch.strided | 140572514840640 |
|    torch.nn.functional.normalize grad_inputs[0]   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided |  20077898366976 |
|    torch.nn.functional.normalize grad_inputs[1]   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided |  20079351693312 |
| torch.nn.functional.normalize grad_inputs(cpu)[0] | (92544, 2048) | (2048, 1) | 189530112 | float64  |  cpu   |     False     | torch.strided | 140559542906944 |
| torch.nn.functional.normalize grad_inputs(cpu)[1] | (92544, 2048) | (2048, 1) | 189530112 | float64  |  cpu   |     False     | torch.strided | 140554994171968 |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+-----------------+
+---------------------------------------+----------+---------------+-------------------+-------------+-------------+--------------------------------------------------+
|                  name                 | allclose |  max_abs_diff | max_relative_diff |     atol    |     rtol    |                    error_info                    |
+---------------------------------------+----------+---------------+-------------------+-------------+-------------+--------------------------------------------------+
| torch.nn.functional.normalize grad[0] |  False   | 104.576171875 |    0.005955214    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.nn.functional.normalize grad[1] |  False   |  5.168151855  |    0.014576809    | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
+---------------------------------------+----------+---------------+-------------------+-------------+-------------+--------------------------------------------------+
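The comparison tables above report `allclose`, `max_abs_diff` and `max_relative_diff` against `atol` and `rtol`. The following is a minimal sketch of how such metrics can be computed, assuming the tool follows the same criterion as `torch.allclose`, that is `|device - reference| <= atol + rtol * |reference|`. The function name `compare` and the exact definition of the relative difference are illustrative assumptions, not taken from the ditorch source.

```python
def compare(device_values, reference_values, atol=1e-3, rtol=1e-3):
    """Compare two flat lists of floats and return table-style metrics.

    Hypothetical helper: it mirrors the columns of the comparison table,
    not the actual op_tools implementation.
    """
    if len(device_values) != len(reference_values):
        raise ValueError("inputs must have the same length")

    max_abs_diff = 0.0
    max_relative_diff = 0.0
    allclose = True
    for dev, ref in zip(device_values, reference_values):
        abs_diff = abs(dev - ref)
        max_abs_diff = max(max_abs_diff, abs_diff)
        # Assumption: relative difference is measured against the reference.
        if ref != 0.0:
            max_relative_diff = max(max_relative_diff, abs_diff / abs(ref))
        # Same element-wise criterion as torch.allclose.
        if abs_diff > atol + rtol * abs(ref):
            allclose = False
    return {
        "allclose": allclose,
        "max_abs_diff": max_abs_diff,
        "max_relative_diff": max_relative_diff,
    }


device = [0.1259765625, -0.5]   # e.g. a low-precision device result
reference = [0.125, -0.5]       # e.g. a float64 CPU reference

loose = compare(device, reference, atol=1e-3, rtol=1e-3)
strict = compare(device, reference, atol=1e-5, rtol=1e-5)
print(loose)
print(strict)
```

Because the absolute and relative tolerances are added together, a result can pass with a `max_relative_diff` larger than `rtol`, as in the forward `output` row above. The same data fails once the tolerances are tightened to 1e-5, as in the `grad` rows.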

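Speed analysis tool

The speed analysis tool likewise supports (1) offline analysis and (2) real-time analysis.

Analyze the time spent in operators and communication using the real inputs and outputs from model training, in order to locate performance bottlenecks.

# Measure operator time (the input is real data captured during model training with the operator capture tool)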
python op_tools/run_op_from_data.py /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/ --sync_time_measure  --run_times 3
ditorch.framework: torch_npu:2.1.0.post3
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/1556339/autocompare/268/2024-10-08-16-14-55/device


/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|                   name                   |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    | value |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|   torch.nn.functional.normalize inputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20067179823104 |       |
|  torch.nn.functional.normalize inputs p  |               |           |           |          |        |               |               |                |  2.0  |
| torch.nn.functional.normalize inputs dim |               |           |           |          |        |               |               |                |   1   |
| torch.nn.functional.normalize inputs eps |               |           |           |          |        |               |               |                | 1e-12 |
| torch.nn.functional.normalize inputs out |               |           |           |          |        |               |               |                |  None |
|  torch.nn.functional.normalize outputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20076824625152 |       |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
+-------------------------------+------------+-----------------+------+
|              name             | forward_id | forward_elasped | unit |
+-------------------------------+------------+-----------------+------+
| torch.nn.functional.normalize |     1      |   41.59259796   |  ms  |
+-------------------------------+------------+-----------------+------+



+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
|                        name                       |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20077204209664 |
|  torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20077898366976 |
|  torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20079351693312 |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
+-------------------------------+------------+------------------+------+
|              name             | forward_id | backward_elasped | unit |
+-------------------------------+------------+------------------+------+
| torch.nn.functional.normalize |     1      |    6.45375252    |  ms  |
+-------------------------------+------------+------------------+------+
/opt/miniconda3/envs/torch_npu_py39/lib/python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at torch_npu/csrc/aten/common/TensorFactories.cpp:74.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass



/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|                   name                   |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    | value |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|   torch.nn.functional.normalize inputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20067179823104 |       |
|  torch.nn.functional.normalize inputs p  |               |           |           |          |        |               |               |                |  2.0  |
| torch.nn.functional.normalize inputs dim |               |           |           |          |        |               |               |                |   1   |
| torch.nn.functional.normalize inputs eps |               |           |           |          |        |               |               |                | 1e-12 |
| torch.nn.functional.normalize inputs out |               |           |           |          |        |               |               |                |  None |
|  torch.nn.functional.normalize outputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20077204209664 |       |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
+-------------------------------+------------+-----------------+------+
|              name             | forward_id | forward_elasped | unit |
+-------------------------------+------------+-----------------+------+
| torch.nn.functional.normalize |     2      |    2.06089020   |  ms  |
+-------------------------------+------------+-----------------+------+



+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
|                        name                       |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20076824625152 |
|  torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20077898366976 |
|  torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20079351693312 |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
+-------------------------------+------------+------------------+------+
|              name             | forward_id | backward_elasped | unit |
+-------------------------------+------------+------------------+------+
| torch.nn.functional.normalize |     2      |    5.55872917    |  ms  |
+-------------------------------+------------+------------------+------+



/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|                   name                   |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    | value |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
|   torch.nn.functional.normalize inputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20067179823104 |       |
|  torch.nn.functional.normalize inputs p  |               |           |           |          |        |               |               |                |  2.0  |
| torch.nn.functional.normalize inputs dim |               |           |           |          |        |               |               |                |   1   |
| torch.nn.functional.normalize inputs eps |               |           |           |          |        |               |               |                | 1e-12 |
| torch.nn.functional.normalize inputs out |               |           |           |          |        |               |               |                |  None |
|  torch.nn.functional.normalize outputs   | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |      True     | torch.strided | 20076824625152 |       |
+------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+-------+
+-------------------------------+------------+-----------------+------+
|              name             | forward_id | forward_elasped | unit |
+-------------------------------+------------+-----------------+------+
| torch.nn.functional.normalize |     3      |    2.00271606   |  ms  |
+-------------------------------+------------+-----------------+------+



+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
|                        name                       |     shape     |   stride  |   numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20077204209664 |
|  torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20077898366976 |
|  torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0  |     False     | torch.strided | 20079351693312 |
+---------------------------------------------------+---------------+-----------+-----------+----------+--------+---------------+---------------+----------------+
+-------------------------------+------------+------------------+------+
|              name             | forward_id | backward_elasped | unit |
+-------------------------------+------------+------------------+------+
| torch.nn.functional.normalize |     3      |    5.68151474    |  ms  |
+-------------------------------+------------+------------------+------+
+-------------------------------+------------+-----------------+------------------+------+
|              name             | forward_id | forward_elasped | backward_elasped | unit |
+-------------------------------+------------+-----------------+------------------+------+
| torch.nn.functional.normalize |     1      |   41.59259796   |    6.45375252    |  ms  |
| torch.nn.functional.normalize |     2      |    2.06089020   |    5.55872917    |  ms  |
| torch.nn.functional.normalize |     3      |    2.00271606   |    5.68151474    |  ms  |
+-------------------------------+------------+-----------------+------------------+------+
op elasped info saved to op_tools_results/op_time_measure_result/op_elasped_info_pid3722879_2024-10-08-17-00-42.csv
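In the summary table above, the first forward run takes far longer than the following two. A plausible reading, stated here as an assumption, is that the first iteration includes one-time warm-up cost. The sketch below copies the three rows from that table and averages only the later runs; the helper name `steady_state_mean` and the choice of one warm-up run are illustrative.

```python
from statistics import mean

# (forward_id, forward_ms, backward_ms), copied from the summary table above.
runs = [
    (1, 41.59259796, 6.45375252),
    (2, 2.06089020, 5.55872917),
    (3, 2.00271606, 5.68151474),
]


def steady_state_mean(rows, column, warmup=1):
    """Average one column after skipping the first `warmup` rows."""
    samples = [row[column] for row in rows[warmup:]]
    if not samples:
        raise ValueError("no samples left after skipping warm-up runs")
    return mean(samples)


forward_ms = steady_state_mean(runs, column=1)
backward_ms = steady_state_mean(runs, column=2)
# Ratio of the first forward run to the steady-state forward time.
first_run_ratio = runs[0][1] / forward_ms

print(f"forward: {forward_ms:.3f} ms, backward: {backward_ms:.3f} ms")
print(f"first forward run is {first_run_ratio:.1f}x the steady-state time")
```

With these numbers the steady-state forward time is roughly 2.03 ms and the backward time roughly 5.62 ms, so running with a larger `--run_times` and discarding the first iteration gives a more representative figure than a single run.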

Run only the forward pass of a specified operator three times

ditorch/op_tools# python run_op_from_data.py /op_capture_result/torch.Tensor.div/2278281/5  --run_times 3 --only_run_forward --sync_time_measure
...
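The `--sync_time_measure` flag used in the commands above suggests that the device is synchronized around each timed call. On accelerators that launch kernels asynchronously, reading the clock without synchronizing measures only launch overhead. The following is a minimal sketch of that pattern, not the op_tools implementation: `measure_ms` and its `synchronize` parameter are hypothetical names, and on a real device the callback would be something like `torch.cuda.synchronize`.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_ms(name, results, synchronize=None):
    """Append (name, elapsed milliseconds) to `results` for the enclosed block."""
    if synchronize is not None:
        synchronize()  # drain work queued before the timed region
    start = time.perf_counter()
    try:
        yield
    finally:
        if synchronize is not None:
            synchronize()  # wait for work launched inside the timed region
        results.append((name, (time.perf_counter() - start) * 1000.0))


results = []
sync_calls = []  # records how often the stand-in synchronize callback runs

# A plain CPU computation stands in for an operator call.
with measure_ms("demo_op", results, synchronize=lambda: sync_calls.append(1)):
    total = sum(i * i for i in range(10_000))

for op_name, elapsed in results:
    print(f"{op_name}: {elapsed:.5f} ms")
```

Synchronizing both before and after the region keeps earlier queued work from being charged to the operator under test, at the cost of serializing execution while measuring.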

Operator timing analysis during model training (forward + backward)

# usage1
import op_tools
with op_tools.OpTimeMeasure():
    code_snippet_to_time_measure()

# usage2
import op_tools
timemeasure = op_tools.OpTimeMeasure()
timemeasure.start()
code_snippet_to_time_measure()
timemeasure.stop()

...
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
|            name            |       shape       |         stride        |  numel  |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
| torch.Tensor.mul inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080180069376 |
| torch.Tensor.mul inputs[1] |   (16384, 1, 64)  |      (64, 64, 1)      | 1048576 | float32 | npu:0  |     False     | torch.strided | 20081048292864 |
|  torch.Tensor.mul outputs  | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080247179264 |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
+------------------+------------+-----------------+------+
|       name       | forward_id | forward_elasped | unit |
+------------------+------------+-----------------+------+
| torch.Tensor.mul |    6195    |    0.07629395   |  ms  |
+------------------+------------+-----------------+------+






/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
|            name            |       shape       |         stride        |  numel  |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
| torch.Tensor.mul inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080213624320 |
| torch.Tensor.mul inputs[1] |   (16384, 1, 64)  |      (64, 64, 1)      | 1048576 | float32 | npu:0  |     False     | torch.strided | 20081052487680 |
|  torch.Tensor.mul outputs  | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080280734208 |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
+------------------+------------+-----------------+------+
|       name       | forward_id | forward_elasped | unit |
+------------------+------------+-----------------+------+
| torch.Tensor.mul |    6196    |    0.06532669   |  ms  |
+------------------+------------+-----------------+------+






/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
|            name            |       shape       |         stride        |  numel  |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
| torch.Tensor.sub inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080247179264 |
| torch.Tensor.sub inputs[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080280734208 |
|  torch.Tensor.sub outputs  | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0  |     False     | torch.strided | 20080316383232 |
+----------------------------+-------------------+-----------------------+---------+---------+--------+---------------+---------------+----------------+
+------------------+------------+-----------------+------+
|       name       | forward_id | forward_elasped | unit |
+------------------+------------+-----------------+------+
| torch.Tensor.sub |    6197    |    0.06794930   |  ms  |
+------------------+------------+-----------------+------+
...
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/scatter.py:14 broadcast: src = src.expand(other.size())
+-------------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+---------------------+
|              name             |  shape   | stride | numel | dtype | device | requires_grad |     layout    |    data_ptr    |        value        |
+-------------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+---------------------+
| torch.Tensor.expand inputs[0] | (16384,) |  (1,)  | 16384 | int64 | npu:0  |     False     | torch.strided | 20067181255168 |                     |
| torch.Tensor.expand inputs[1] |          |        |       |       |        |               |               |                | torch.Size([16384]) |
|  torch.Tensor.expand outputs  | (16384,) |  (1,)  | 16384 | int64 | npu:0  |     False     | torch.strided | 20067181255168 |                     |
+-------------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+---------------------+
+---------------------+------------+-----------------+------+
|         name        | forward_id | forward_elasped | unit |
+---------------------+------------+-----------------+------+
| torch.Tensor.expand |    5496    |    0.03361702   |  ms  |
+---------------------+------------+-----------------+------+






/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/scatter.py:35 vanilla_scatter: size[dim] = index.max().int() + 1
+--------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+
|           name           |  shape   | stride | numel | dtype | device | requires_grad |     layout    |    data_ptr    |
+--------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+
| torch.Tensor.max inputs  | (16384,) |  (1,)  | 16384 | int64 | npu:0  |     False     | torch.strided | 20067181255168 |
| torch.Tensor.max outputs |    ()    |   ()   |   1   | int64 | npu:0  |     False     | torch.strided | 20067179832832 |
+--------------------------+----------+--------+-------+-------+--------+---------------+---------------+----------------+
+------------------+------------+-----------------+------+
|       name       | forward_id | forward_elasped | unit |
+------------------+------------+-----------------+------+
| torch.Tensor.max |    5497    |    0.21767616   |  ms  |
+------------------+------------+-----------------+------+

...
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/utils.py:177 multi_tensor_l2norm_torch: l2_norm = torch.norm(norms_tensor, p=2).unsqueeze(0)
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
|               name               | shape | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    | value |
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
| torch.Tensor.unsqueeze inputs[0] |   ()  |   ()   |   1   | float32 | npu:0  |     False     | torch.strided | 20067181123072 |       |
| torch.Tensor.unsqueeze inputs[1] |       |        |       |         |        |               |               |                |   0   |
|  torch.Tensor.unsqueeze outputs  |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067181123072 |       |
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
+------------------------+------------+-----------------+------+
|          name          | forward_id | forward_elasped | unit |
+------------------------+------------+-----------------+------+
| torch.Tensor.unsqueeze |   18299    |    0.01835823   |  ms  |
+------------------------+------------+-----------------+------+






/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/utils.py:224 get_norm: grad_norm = calc_l2_norm(grads) ** norm_type
+--------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
|              name              | shape | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    | value |
+--------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
| torch.Tensor.__pow__ inputs[0] |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067181123072 |       |
| torch.Tensor.__pow__ inputs[1] |       |        |       |         |        |               |               |                |  2.0  |
|  torch.Tensor.__pow__ outputs  |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067181917184 |       |
+--------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+-------+
+----------------------+------------+-----------------+------+
|         name         | forward_id | forward_elasped | unit |
+----------------------+------------+-----------------+------+
| torch.Tensor.__pow__ |   18300    |    0.04887581   |  ms  |
+----------------------+------------+-----------------+------+
...

+--------------------------------------+----------+--------+-------+---------+--------+---------------+---------------+----------------+
|                 name                 |  shape   | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+--------------------------------------+----------+--------+-------+---------+--------+---------------+---------------+----------------+
| torch.Tensor.sum grad_outputs [0][0] |    ()    |   ()   |   1   | float32 | npu:0  |     False     | torch.strided | 20067179833856 |
| torch.Tensor.sum grad_inputs [0][0]  | (16384,) |  (0,)  | 16384 | float32 | npu:0  |     False     | torch.strided | 20067179833856 |
+--------------------------------------+----------+--------+-------+---------+--------+---------------+---------------+----------------+
+------------------+------------+------------------+------+
|       name       | forward_id | backward_elasped | unit |
+------------------+------------+------------------+------+
| torch.Tensor.sum |   25750    |    0.02026558    |  ms  |
+------------------+------------+------------------+------+

...
+-------------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+----------------+
|                          name                         |     shape      |   stride   |   numel    |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+-------------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+----------------+
| torch.nn.functional.cross_entropy grad_outputs [0][0] |    (16384,)    |    (0,)    |   16384    | float32 | npu:0  |     False     | torch.strided | 20067179833856 |
|  torch.nn.functional.cross_entropy grad_inputs [0][0] | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0  |     False     | torch.strided | 20111184363520 |
+-------------------------------------------------------+----------------+------------+------------+---------+--------+---------------+---------------+----------------+
+-----------------------------------+------------+------------------+------+
|                name               | forward_id | backward_elasped | unit |
+-----------------------------------+------------+------------------+------+
| torch.nn.functional.cross_entropy |   25748    |    4.23026085    |  ms  |
+-----------------------------------+------------+------------------+------+
...
+--------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+----------------+
|                 name                 |      shape       |        stride       |  numel   |  dtype   | device | requires_grad |     layout    |    data_ptr    |
+--------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+----------------+
| torch.Tensor.mul grad_outputs [0][0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided | 20067414704128 |
| torch.Tensor.mul grad_inputs [0][0]  |     (2048,)      |         (1,)        |   2048   | bfloat16 | npu:0  |     False     | torch.strided | 20067181914112 |
| torch.Tensor.mul grad_inputs [0][1]  | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0  |     False     | torch.strided | 20080586915840 |
+--------------------------------------+------------------+---------------------+----------+----------+--------+---------------+---------------+----------------+
+------------------+------------+------------------+------+
|       name       | forward_id | backward_elasped | unit |
+------------------+------------+------------------+------+
| torch.Tensor.mul |   26134    |    0.42295456    |  ms  |
+------------------+------------+------------------+------+

Operator fallback

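import op_tools

# usage 1: fall back every operator executed inside the block
with op_tools.OpFallback():
    code_snippet_op_to_be_fallbacked()

# usage 2: explicit start/stop
fallback = op_tools.OpFallback()
fallback.start()
code_snippet_op_to_be_fallbacked()
fallback.stop()  # same start()/stop() pair as the other op_tools modes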

Fall back only the specified operators (export OP_FALLBACK_LIST="torch.nn.functional.linear")

skip OpFallbackHook on torch.Tensor.float
skip OpFallbackHook on torch.Tensor.add
skip OpFallbackHook on torch.Tensor.div
skip OpFallbackHook on torch.Tensor.item
skip OpFallbackHook on torch.Tensor.float
skip OpFallbackHook on torch.Tensor.div
skip OpFallbackHook on torch.Tensor.fill_
skip OpFallbackHook on torch.Tensor.is_complex
skip OpFallbackHook on torch.Tensor.numel
skip OpFallbackHook on torch.Tensor.unbind
skip OpFallbackHook on torch.Tensor.sub
skip OpFallbackHook on torch.Tensor.max
...
OpFallbackHook: torch.nn.functional.linear                         input: {'args': ({'shape': torch.Size([1, 16384, 2048]), 'stride': (33554432, 2048, 1), 'numel': 33554432, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': False, 'layout': 'torch.strided', 'data': 20076203868160}, {'shape': torch.Size([4096, 2048]), 'stride': (2048, 1), 'numel': 8388608, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': True, 'layout': 'torch.strided', 'data': 20077985398784}, 'None')}
OpFallbackHook: torch.nn.functional.linear                         output: ({'shape': torch.Size([1, 16384, 4096]), 'stride': (67108864, 4096, 1), 'numel': 67108864, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': False, 'layout': 'torch.strided', 'data': 20075820089344},) cpu output: ({'shape': torch.Size([1, 16384, 4096]), 'stride': (67108864, 4096, 1), 'numel': 67108864, 'dtype': 'torch.bfloat16', 'device': 'cpu', 'requires_grad': False, 'layout': 'torch.strided', 'data': 139743270527040},) dtype_convert_back_dict:{}
skip OpFallbackHook on torch.Tensor.shape.__get__
...

Fall back every operator except the specified ones (export OP_FALLBACK_DISABLE_LIST="torch.nn.functional.linear")

...
torch.Tensor.std    forward_id: 475
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
+---------------------------------+--------------+-----------+----------+---------+--------+---------------+---------------+-----------------+
|               name              |    shape     |   stride  |  numel   |  dtype  | device | requires_grad |     layout    |     data_ptr    |
+---------------------------------+--------------+-----------+----------+---------+--------+---------------+---------------+-----------------+
|  torch.Tensor.std input(device) | (8192, 2048) | (2048, 1) | 16777216 | float64 | npu:0  |      True     | torch.strided |  20079859204096 |
|   torch.Tensor.std input(cpu)   | (8192, 2048) | (2048, 1) | 16777216 | float64 |  cpu   |      True     | torch.strided | 139687015878720 |
| torch.Tensor.std output(device) |      ()      |     ()    |    1     | float64 | npu:0  |      True     | torch.strided |  20067180001792 |
|   torch.Tensor.std output(cpu)  |      ()      |     ()    |    1     | float64 |  cpu   |      True     | torch.strided |   33658540800   |
+---------------------------------+--------------+-----------+----------+---------+--------+---------------+---------------+-----------------+
...
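
The two environment variables above behave like an allow list and a deny list. The sketch below illustrates that selection rule in plain Python. The helpers `parse_op_list` and `should_fallback` are hypothetical names introduced for illustration, and the rule that an empty allow list means "every operator" is an assumption; this is not ditorch's actual implementation.

```python
import os


def parse_op_list(raw):
    """Split a comma-separated operator list, ignoring blank entries."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def should_fallback(op_name, environ=None):
    """Hypothetical selection rule for illustration only."""
    environ = os.environ if environ is None else environ
    allow = parse_op_list(environ.get("OP_FALLBACK_LIST", ""))
    deny = parse_op_list(environ.get("OP_FALLBACK_DISABLE_LIST", ""))
    if op_name in deny:
        return False
    # Assumption: an empty allow list means "fall back every operator".
    return not allow or op_name in allow


only_linear = {"OP_FALLBACK_LIST": "torch.nn.functional.linear"}
all_but_linear = {"OP_FALLBACK_DISABLE_LIST": "torch.nn.functional.linear"}

print(should_fallback("torch.nn.functional.linear", only_linear))     # True
print(should_fallback("torch.Tensor.add", only_linear))               # False
print(should_fallback("torch.Tensor.std", all_but_linear))            # True
print(should_fallback("torch.nn.functional.linear", all_but_linear))  # False
```

This matches the logs above: with OP_FALLBACK_LIST set, torch.Tensor.add is skipped while torch.nn.functional.linear is fallen back; with OP_FALLBACK_DISABLE_LIST set, torch.Tensor.std is fallen back.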

Partial output when falling back all operators

...
torch.Tensor.to    forward_id: 474
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
+-------------------------------------+--------------+-----------+----------+----------+--------+---------------+---------------+-----------------+---------------+
|                 name                |    shape     |   stride  |  numel   |  dtype   | device | requires_grad |     layout    |     data_ptr    |     value     |
+-------------------------------------+--------------+-----------+----------+----------+--------+---------------+---------------+-----------------+---------------+
|    torch.Tensor.to input(device)    | (8192, 2048) | (2048, 1) | 16777216 | bfloat16 | npu:0  |      True     | torch.strided |  20067790094336 |               |
| torch.Tensor.to input(device) dtype |              |           |          |          |        |               |               |                 | torch.float64 |
|      torch.Tensor.to input(cpu)     | (8192, 2048) | (2048, 1) | 16777216 | bfloat16 |  cpu   |      True     | torch.strided | 139738740682816 |               |
|   torch.Tensor.to input(cpu) dtype  |              |           |          |          |        |               |               |                 | torch.float64 |
|    torch.Tensor.to output(device)   | (8192, 2048) | (2048, 1) | 16777216 | float64  | npu:0  |      True     | torch.strided |  20079859204096 |               |
|     torch.Tensor.to output(cpu)     | (8192, 2048) | (2048, 1) | 16777216 | float64  |  cpu   |      True     | torch.strided | 139687150100544 |               |
+-------------------------------------+--------------+-----------+----------+----------+--------+---------------+---------------+-----------------+---------------+

...
torch.Tensor.mean    forward_id: 477
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:27 check_parallel_statistic_equality: named_mean = params.to(dtype=torch.float64).mean()
+----------------------------------+---------+--------+-------+---------+--------+---------------+---------------+----------------+
|               name               |  shape  | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------------+---------+--------+-------+---------+--------+---------------+---------------+----------------+
| torch.Tensor.mean input(device)  | (2048,) |  (1,)  |  2048 | float64 | npu:0  |      True     | torch.strided | 20067181881344 |
|   torch.Tensor.mean input(cpu)   | (2048,) |  (1,)  |  2048 | float64 |  cpu   |      True     | torch.strided |  33649506688   |
| torch.Tensor.mean output(device) |    ()   |   ()   |   1   | float64 | npu:0  |      True     | torch.strided | 20067180002304 |
|  torch.Tensor.mean output(cpu)   |    ()   |   ()   |   1   | float64 |  cpu   |      True     | torch.strided |  33658542784   |
+----------------------------------+---------+--------+-------+---------+--------+---------------+---------------+----------------+






torch.Tensor.to    forward_id: 478
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
+-------------------------------------+---------+--------+-------+----------+--------+---------------+---------------+----------------+---------------+
|                 name                |  shape  | stride | numel |  dtype   | device | requires_grad |     layout    |    data_ptr    |     value     |
+-------------------------------------+---------+--------+-------+----------+--------+---------------+---------------+----------------+---------------+
|    torch.Tensor.to input(device)    | (2048,) |  (1,)  |  2048 | bfloat16 | npu:0  |      True     | torch.strided | 20067179883008 |               |
| torch.Tensor.to input(device) dtype |         |        |       |          |        |               |               |                | torch.float64 |
|      torch.Tensor.to input(cpu)     | (2048,) |  (1,)  |  2048 | bfloat16 |  cpu   |      True     | torch.strided |  33649527040   |               |
|   torch.Tensor.to input(cpu) dtype  |         |        |       |          |        |               |               |                | torch.float64 |
|    torch.Tensor.to output(device)   | (2048,) |  (1,)  |  2048 | float64  | npu:0  |      True     | torch.strided | 20067181898240 |               |
|     torch.Tensor.to output(cpu)     | (2048,) |  (1,)  |  2048 | float64  |  cpu   |      True     | torch.strided |  33649532480   |               |
+-------------------------------------+---------+--------+-------+----------+--------+---------------+---------------+----------------+---------------+
...
torch.exp    forward_id: 487
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:210 get_metric: perplexity = round(torch.exp(self.total_log_probs / self.total).item(), 4)
+--------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+
|           name           | shape | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+--------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+
| torch.exp input(device)  |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067179862016 |
|   torch.exp input(cpu)   |  (1,) |  (1,)  |   1   | float32 |  cpu   |     False     | torch.strided |   399798976    |
| torch.exp output(device) |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067179862528 |
|  torch.exp output(cpu)   |  (1,) |  (1,)  |   1   | float32 |  cpu   |     False     | torch.strided |   399799360    |
+--------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+






torch.Tensor.item    forward_id: 488
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:210 get_metric: perplexity = round(torch.exp(self.total_log_probs / self.total).item(), 4)
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+--------------+
|               name               | shape | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |    value     |
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+--------------+
| torch.Tensor.item input(device)  |  (1,) |  (1,)  |   1   | float32 | npu:0  |     False     | torch.strided | 20067179862528 |              |
|   torch.Tensor.item input(cpu)   |  (1,) |  (1,)  |   1   | float32 |  cpu   |     False     | torch.strided |  33581209792   |              |
| torch.Tensor.item output(device) |       |        |       |         |        |               |               |                | 147525.78125 |
|  torch.Tensor.item output(cpu)   |       |        |       |         |        |               |               |                | 147525.78125 |
+----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+--------------+






torch.Tensor.float    forward_id: 489
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:218 get_metric: (self.ds_right[i].float() / (self.ds_tokens[i].float() + 1e-5)).item(), 4
+-----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+
|                name               | shape | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+-----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+
|  torch.Tensor.float input(device) |   ()  |   ()   |   1   |  int64  | npu:0  |     False     | torch.strided | 20067180395520 |
|   torch.Tensor.float input(cpu)   |   ()  |   ()   |   1   |  int64  |  cpu   |     False     | torch.strided |  33652536256   |
| torch.Tensor.float output(device) |   ()  |   ()   |   1   | float32 | npu:0  |     False     | torch.strided | 20067179863040 |
|   torch.Tensor.float output(cpu)  |   ()  |   ()   |   1   | float32 |  cpu   |     False     | torch.strided |  33652536640   |
+-----------------------------------+-------+--------+-------+---------+--------+---------------+---------------+----------------+

Operator dtype cast tool

# usage1: set the cast rules in the shell before launching Python
#   export OP_DTYPE_CAST_DICT="torch.float16->torch.float32,torch.bfloat16->torch.float32"
import os
import op_tools

with op_tools.OpDtypeCast():
    f()

# usage2: explicit start/stop around several calls
dtype_caster = op_tools.OpDtypeCast()
dtype_caster.start()
for i in range(3):
    f()
dtype_caster.stop()

# usage3: skip the listed operators when casting
os.environ["OP_DTYPE_CAST_DISABLE_LIST"] = "torch.Tensor.add,torch.Tensor.sub"
dtype_caster.start()
f()
dtype_caster.stop()

# usage4: cast only the listed operators, with a different rule
os.environ["OP_DTYPE_CAST_DISABLE_LIST"] = ""
os.environ["OP_DTYPE_CAST_LIST"] = "torch.Tensor.sort,torch.Tensor.add"  # only cast these ops
os.environ["OP_DTYPE_CAST_DICT"] = "torch.half->torch.bfloat16"
dtype_caster.start()
f()
dtype_caster.stop()
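
The OP_DTYPE_CAST_DICT value is a comma-separated list of `source->target` rules. As a rough illustration of that format, the sketch below parses such a string and works out which dtypes an operator would compute in and which would need to be restored on its outputs. `parse_dtype_cast_dict` and `plan_casts` are hypothetical helpers that operate on dtype names as plain strings; they are assumptions for illustration, not ditorch's parser.

```python
def parse_dtype_cast_dict(raw):
    """Parse "src->dst,src->dst" into a {src: dst} mapping of dtype names."""
    mapping = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        src, sep, dst = item.partition("->")
        if not sep or not src.strip() or not dst.strip():
            raise ValueError(f"malformed cast rule: {item!r}")
        mapping[src.strip()] = dst.strip()
    return mapping


def plan_casts(input_dtypes, mapping):
    """Return (dtypes used for compute, {compute dtype: original dtype})."""
    compute = [mapping.get(dtype, dtype) for dtype in input_dtypes]
    restore = {new: old for old, new in zip(input_dtypes, compute) if new != old}
    return compute, restore


rules = parse_dtype_cast_dict(
    "torch.float16->torch.float32,torch.bfloat16->torch.float32"
)
compute, restore = plan_casts(["torch.bfloat16", "torch.bfloat16"], rules)
print(compute)  # ['torch.float32', 'torch.float32']
print(restore)  # {'torch.float32': 'torch.bfloat16'}
```

Because the sketch compares names as strings, it does not know that `torch.half` is an alias of `torch.float16`; a real implementation would have to resolve such aliases.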

torch.nn.functional.linear    forward_id: 490
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:287 forward: output = F.linear(total_x, weight, bias)  # pylint: disable=E1102
+---------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+
|                     name                    |      shape       |        stride        |   numel   |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+---------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+
| torch.nn.functional.linear input(device)[0] | (1, 16384, 2048) | (33554432, 2048, 1)  |  33554432 | float32 | npu:0  |     False     | torch.strided |  20082537267200 |       |
| torch.nn.functional.linear input(device)[1] |   (8192, 2048)   |      (2048, 1)       |  16777216 | float32 | npu:0  |     False     | torch.strided |  20082673582080 |       |
| torch.nn.functional.linear input(device)[2] |                  |                      |           |         |        |               |               |                 |  None |
|   torch.nn.functional.linear input(cpu)[0]  | (1, 16384, 2048) | (33554432, 2048, 1)  |  33554432 | float32 |  cpu   |     False     | torch.strided | 139842851389504 |       |
|   torch.nn.functional.linear input(cpu)[1]  |   (8192, 2048)   |      (2048, 1)       |  16777216 | float32 |  cpu   |     False     | torch.strided | 139842720587840 |       |
|   torch.nn.functional.linear input(cpu)[2]  |                  |                      |           |         |        |               |               |                 |  None |
|  torch.nn.functional.linear output(device)  | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0  |     False     | torch.strided |  20083267076096 |       |
|    torch.nn.functional.linear output(cpu)   | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 |  cpu   |     False     | torch.strided | 139833330622528 |       |
+---------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+



+----------------------------+-----------+---------------------------------+------------------------------------------------------------+
|            name            |   target  |              action             |                           config                           |
+----------------------------+-----------+---------------------------------+------------------------------------------------------------+
| torch.nn.functional.linear |  input[0] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.linear |  input[1] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.linear | output[0] | torch.float32 -> torch.bfloat16 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
+----------------------------+-----------+---------------------------------+------------------------------------------------------------+
apply OpDtypeCastHook on torch.nn.functional.silu



torch.nn.functional.silu    forward_id: 492
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:667 Silu: return F.silu(w1_o) * w2_o
+------------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+
|                      name                      |      shape       |        stride        |   numel   |  dtype  | device | requires_grad |     layout    |     data_ptr    | value |
+------------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+
|     torch.nn.functional.silu input(device)     | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0  |     False     | torch.strided |  20084340817920 |       |
| torch.nn.functional.silu input(device) inplace |                  |                      |           |         |        |               |               |                 | False |
|      torch.nn.functional.silu input(cpu)       | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 |  cpu   |     False     | torch.strided | 139832793747520 |       |
|  torch.nn.functional.silu input(cpu) inplace   |                  |                      |           |         |        |               |               |                 | False |
|    torch.nn.functional.silu output(device)     | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0  |     False     | torch.strided |  20085414559744 |       |
|      torch.nn.functional.silu output(cpu)      | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 |  cpu   |     False     | torch.strided | 139832256872512 |       |
+------------------------------------------------+------------------+----------------------+-----------+---------+--------+---------------+---------------+-----------------+-------+



+--------------------------+-----------+---------------------------------+------------------------------------------------------------+
|           name           |   target  |              action             |                           config                           |
+--------------------------+-----------+---------------------------------+------------------------------------------------------------+
| torch.nn.functional.silu |  input[0] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.silu | output[0] | torch.float32 -> torch.bfloat16 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
+--------------------------+-----------+---------------------------------+------------------------------------------------------------+

Overflow detection tool

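import torch
import ditorch
import op_tools

with op_tools.OpOverflowCheck():
    x = torch.randn(3, 4, 5, dtype=torch.float32, device="cuda", requires_grad=True)
    y = torch.zeros_like(x)
    z = x / y  # division by zero: the result contains inf/nan
    z.backward(torch.ones_like(z))
    # 3.402823466e38 is the largest finite float32, so both ops below overflow
    x = torch.full((3, 4, 5,), dtype=torch.float32, device="cuda", fill_value=3.402823466e38)
    y = x + x
    z = x * x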
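
For each operator, the overflow check reports the columns inf_or_nan, min, max, mean, std and norm. The sketch below shows how those columns could be computed for a flat list of values in plain Python. `summarize` is a hypothetical helper, and the use of the sample standard deviation (dividing by n - 1, which is `torch.std`'s default) and of the L2 norm are assumptions; it is not the tool's implementation.

```python
import math


def summarize(values):
    """Compute overflow-check style summary columns for a list of floats."""
    n = len(values)
    has_bad = any(math.isinf(v) or math.isnan(v) for v in values)
    mean = sum(values) / n
    if n > 1:
        # Sample standard deviation (n - 1); assumed to match torch.std.
        std = math.sqrt(sum((v - mean) * (v - mean) for v in values) / (n - 1))
    else:
        std = float("nan")
    # L2 norm; v * v is used instead of v ** 2 so that overflow yields inf
    # rather than raising OverflowError.
    norm = math.sqrt(sum(v * v for v in values))
    return {
        "inf_or_nan": has_bad,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "norm": norm,
    }


ones = summarize([1.0] * 60)  # same element count as a (3, 4, 5) tensor
print(ones["inf_or_nan"], ones["std"], round(ones["norm"], 3))  # False 0.0 7.746

bad = summarize([float("-inf"), 1.0, float("inf")])
print(bad["inf_or_nan"], bad["min"], bad["max"])  # True -inf inf
```

The sketch works in double precision, so its norm for sixty ones differs in the trailing digits from a float32 computation on the device.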

Output of the OpOverflowCheck snippet:

torch.randn    forward_id: 1
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:114 test_overflow4: x = torch.randn(3, 4, 5, dtype=torch.float32, device="cuda", requires_grad=True)
+----------------------------------+---------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|               name               |     value     | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+----------------------------------+---------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|      torch.randn inputs[0]       |       3       |        |         |       |           |            |               |               |                |
|      torch.randn inputs[1]       |       4       |        |         |       |           |            |               |               |                |
|      torch.randn inputs[2]       |       5       |        |         |       |           |            |               |               |                |
|     torch.randn inputs dtype     | torch.float32 |        |         |       |           |            |               |               |                |
|    torch.randn inputs device     |      npu      |        |         |       |           |            |               |               |                |
| torch.randn inputs requires_grad |      True     |        |         |       |           |            |               |               |                |
|       torch.randn outputs        |               | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |      True     | torch.strided | 20067179823104 |
+----------------------------------+---------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+-----------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
|          name         | inf_or_nan |        min         |        max         |         mean        |        std         |        norm       |
+-----------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
|  torch.randn input[0] |   False    |                    |                    |                     |                    |                   |
|  torch.randn input[1] |   False    |                    |                    |                     |                    |                   |
|  torch.randn input[2] |   False    |                    |                    |                     |                    |                   |
| torch.randn output[0] |   False    | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
+-----------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+



GarbageCollectEvaluate:  host_memory_usage: 1735 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
GarbageCollectEvaluate: after collect : rss: 1735 MB, current_rss: 1735 MB, max_diff: 1024 MB, device_memory_usage: 0 MB, current_device_memory_usage: 0 MB
apply OpOverflowCheckHook on torch.zeros_like



torch.zeros_like    forward_id: 2
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:115 test_overflow4: y = torch.zeros_like(x)
+--------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|           name           | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+--------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
| torch.zeros_like inputs  | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |      True     | torch.strided | 20067179823104 |
| torch.zeros_like outputs | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179823616 |
+--------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
|            name            | inf_or_nan |        min         |        max         |         mean        |        std         |        norm       |
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
| torch.zeros_like input[0]  |   False    | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
| torch.zeros_like output[0] |   False    |        0.0         |        0.0         |         0.0         |        0.0         |        0.0        |
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+



GarbageCollectEvaluate:  host_memory_usage: 1736 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.div



torch.Tensor.div    forward_id: 3
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:116 test_overflow4: z = x / y
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|            name            | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
| torch.Tensor.div inputs[0] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |      True     | torch.strided | 20067179823104 |
| torch.Tensor.div inputs[1] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179823616 |
|  torch.Tensor.div outputs  | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |      True     | torch.strided | 20067179824128 |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
|            name            | inf_or_nan |        min         |        max         |         mean        |        std         |        norm       |
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+
| torch.Tensor.div input[0]  |   False    | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
| torch.Tensor.div input[1]  |   False    |        0.0         |        0.0         |         0.0         |        0.0         |        0.0        |
| torch.Tensor.div output[0] |    True    |        -inf        |        inf         |         nan         |        nan         |        inf        |
+----------------------------+------------+--------------------+--------------------+---------------------+--------------------+-------------------+



GarbageCollectEvaluate:  host_memory_usage: 1737 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.ones_like



torch.ones_like    forward_id: 4
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:117 test_overflow4: z.backward(torch.ones_like(z))
+-------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|           name          | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+-------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|  torch.ones_like inputs | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |      True     | torch.strided | 20067179824128 |
| torch.ones_like outputs | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179824640 |
+-------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+---------------------------+------------+------+-----+------+-----+-------------------+
|            name           | inf_or_nan | min  | max | mean | std |        norm       |
+---------------------------+------------+------+-----+------+-----+-------------------+
|  torch.ones_like input[0] |    True    | -inf | inf | nan  | nan |        inf        |
| torch.ones_like output[0] |   False    | 1.0  | 1.0 | 1.0  | 0.0 | 7.745966911315918 |
+---------------------------+------------+------+-----+------+-----+-------------------+



GarbageCollectEvaluate:  host_memory_usage: 1737 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB



torch.Tensor.div     forward_id: 3
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:116 test_overflow4: z = x / y
+---------------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+-------+
|               name              | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    | value |
+---------------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+-------+
|   torch.Tensor.div grad_output  | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179824640 |       |
| torch.Tensor.div grad_inputs[0] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825152 |       |
| torch.Tensor.div grad_inputs[1] |        |         |       |           |            |               |               |                |  None |
+---------------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+-------+
+----------------------------------+------------+-----+-----+------+-----+-------------------+
|               name               | inf_or_nan | min | max | mean | std |        norm       |
+----------------------------------+------------+-----+-----+------+-----+-------------------+
| torch.Tensor.div grad_inputs[0]  |    True    | inf | inf | inf  | nan |        inf        |
| torch.Tensor.div grad_outputs[0] |   False    | 1.0 | 1.0 | 1.0  | 0.0 | 7.745966911315918 |
+----------------------------------+------------+-----+-----+------+-----+-------------------+
GarbageCollectEvaluate:  host_memory_usage: 1738 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
skip OpOverflowCheckHook on torch.Tensor.backward
apply OpOverflowCheckHook on torch.full



torch.full    forward_id: 5
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:118 test_overflow4: x = torch.full((3, 4, 5,), dtype=torch.float32, device="cuda", fill_value=3.402823466e38)
+------------------------------+-----------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|             name             |      value      | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+------------------------------+-----------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|   torch.full inputs [0][0]   |        3        |        |         |       |           |            |               |               |                |
|   torch.full inputs [0][1]   |        4        |        |         |       |           |            |               |               |                |
|   torch.full inputs [0][2]   |        5        |        |         |       |           |            |               |               |                |
|   torch.full inputs dtype    |  torch.float32  |        |         |       |           |            |               |               |                |
|   torch.full inputs device   |       npu       |        |         |       |           |            |               |               |                |
| torch.full inputs fill_value | 3.402823466e+38 |        |         |       |           |            |               |               |                |
|      torch.full outputs      |                 | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825664 |
+------------------------------+-----------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+----------------------+------------+------------------------+------------------------+------+-----+------+
|         name         | inf_or_nan |          min           |          max           | mean | std | norm |
+----------------------+------------+------------------------+------------------------+------+-----+------+
| torch.full input[0]  |   False    |                        |                        |      |     |      |
| torch.full input[1]  |   False    |                        |                        |      |     |      |
| torch.full input[2]  |   False    |                        |                        |      |     |      |
| torch.full output[0] |   False    | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf  | inf | inf  |
+----------------------+------------+------------------------+------------------------+------+-----+------+



GarbageCollectEvaluate:  host_memory_usage: 1739 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.add



torch.Tensor.add    forward_id: 6
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:119 test_overflow4: y = x + x
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|            name            | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
| torch.Tensor.add inputs[0] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825664 |
| torch.Tensor.add inputs[1] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825664 |
|  torch.Tensor.add outputs  | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179826176 |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+----------------------------+------------+------------------------+------------------------+------+-----+------+
|            name            | inf_or_nan |          min           |          max           | mean | std | norm |
+----------------------------+------------+------------------------+------------------------+------+-----+------+
| torch.Tensor.add input[0]  |   False    | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf  | inf | inf  |
| torch.Tensor.add input[1]  |   False    | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf  | inf | inf  |
| torch.Tensor.add output[0] |    True    |          inf           |          inf           | inf  | nan | inf  |
+----------------------------+------------+------------------------+------------------------+------+-----+------+



GarbageCollectEvaluate:  host_memory_usage: 1739 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.mul



torch.Tensor.mul    forward_id: 7
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:120 test_overflow4: z = x * x
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
|            name            | device |  dtype  | numel |   shape   |   stride   | requires_grad |     layout    |    data_ptr    |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
| torch.Tensor.mul inputs[0] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825664 |
| torch.Tensor.mul inputs[1] | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179825664 |
|  torch.Tensor.mul outputs  | npu:0  | float32 |   60  | (3, 4, 5) | (20, 5, 1) |     False     | torch.strided | 20067179826688 |
+----------------------------+--------+---------+-------+-----------+------------+---------------+---------------+----------------+
+----------------------------+------------+------------------------+------------------------+------+-----+------+
|            name            | inf_or_nan |          min           |          max           | mean | std | norm |
+----------------------------+------------+------------------------+------------------------+------+-----+------+
| torch.Tensor.mul input[0]  |   False    | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf  | inf | inf  |
| torch.Tensor.mul input[1]  |   False    | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf  | inf | inf  |
| torch.Tensor.mul output[0] |    True    |          inf           |          inf           | inf  | nan | inf  |
+----------------------------+------------+------------------------+------------------------+------+-----+------+



GarbageCollectEvaluate:  host_memory_usage: 1740 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
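
The inf_or_nan column in the tables above flags tensors that contain an infinity or a NaN. As a rough illustration of that check, and not ditorch's actual implementation, the following pure-Python sketch reproduces the x + x and x * x overflow from test_overflow4, using Python's float64 in place of float32 on the device. The helper name has_inf_or_nan is hypothetical.

```python
import math
import sys


def has_inf_or_nan(values):
    """Return True if any element is +inf, -inf, or NaN."""
    return any(math.isinf(v) or math.isnan(v) for v in values)


# Largest finite float64, playing the role of 3.402823466e38 for float32.
x = [sys.float_info.max] * 4

y = [v + v for v in x]  # addition overflows to inf
z = [v * v for v in x]  # multiplication overflows to inf

print(has_inf_or_nan(x))  # the inputs themselves are still finite
print(has_inf_or_nan(y))
print(has_inf_or_nan(z))
```

As in the log, the inputs are finite while the outputs of the addition and the multiplication are flagged.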

Customizing when operator tools take effect

def apply_feature(ops, feature, condition_func=lambda *args, **kwargs: True):
    ...

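The op_tools.apply_feature interface can be applied to torch interfaces as well as other third-party interfaces. The condition_func parameter defines a custom activation condition: the tool takes effect when condition_func returns True and is skipped otherwise. condition_func receives the same arguments as the operator itself. The feature parameter selects the functionality; the following types are currently supported:

fallback: operator fallback
cast_dtype: operator data type conversion
op_capture: operator argument capture
autocompare: operator precision comparison (the device implementation and the CPU implementation must share the same call interface)
dump_op_args: operator argument printing
measure_op_time: operator execution time measurement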
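
To make the role of condition_func concrete, here is a minimal pure-Python sketch of condition-gated hooking. It is not how op_tools implements apply_feature; gated_hook, the calls log, and the stand-in add function are all hypothetical names chosen for illustration.

```python
import functools


def gated_hook(op, condition_func=lambda *args, **kwargs: True):
    """Wrap op so that the hook is marked as applied only when condition_func approves."""
    calls = []  # records whether the hook was applied on each call

    @functools.wraps(op)
    def wrapper(*args, **kwargs):
        # condition_func sees exactly the arguments passed to the operator.
        enabled = bool(condition_func(*args, **kwargs))
        calls.append("apply" if enabled else "skip")
        return op(*args, **kwargs)

    wrapper.calls = calls
    return wrapper


def add(a, b):
    return a + b


# Enable the hook only when the first argument is an int.
hooked_add = gated_hook(add, condition_func=lambda a, b: isinstance(a, int))

r1 = hooked_add(1, 2)      # condition holds, hook applied
r2 = hooked_add(1.5, 2.5)  # condition fails, hook skipped
print(hooked_add.calls)
```

The operator result is returned unchanged in both cases; only the side effect of the hook depends on the condition.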

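import torch
import ditorch
import op_tools
import os

# Enable the hook only for float16 inputs or 2-D inputs. The arguments
# mirror those of the hooked operator (binary ops in the examples below).
def custom_condition(a, b):
    if a.dtype == torch.float16:
        print("hook enable because a.dtype is float16")
        return True
    elif a.dim() == 2:
        print("hook enable because a.dim() is 2")
        return True
    else:
        print("hook disable")
        return False

x = torch.randn(2, 3, 4, dtype=torch.float16).cuda()  # float16, 3-D: matches the dtype branch
y = torch.randn(4, 2, dtype=torch.float).cuda()  # float32, 2-D: matches the dim branch
z = torch.randn(2, 3, 4, dtype=torch.float).cuda()  # float32, 3-D: matches neither branch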

1. Fallback on demand

op_tools.apply_feature("torch.add", feature="fallback", condition_func=custom_condition)
torch.add(x, x)

output:

hook enable because a.dtype is float16
apply OpFallbackHook on torch.add



torch.add    forward_id: 1
<stdin>:1 <module>:
+----------------------------+-----------+------------+-------+---------+--------+---------------+---------------+----------------+
|            name            |   shape   |   stride   | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+----------------------------+-----------+------------+-------+---------+--------+---------------+---------------+----------------+
| torch.add input(device)[0] | (2, 3, 4) | (12, 4, 1) |   24  | float16 | npu:0  |     False     | torch.strided | 20067179823104 |
| torch.add input(device)[1] | (2, 3, 4) | (12, 4, 1) |   24  | float16 | npu:0  |     False     | torch.strided | 20067179823104 |
|  torch.add input(cpu)[0]   | (2, 3, 4) | (12, 4, 1) |   24  | float16 |  cpu   |     False     | torch.strided |   541245376    |
|  torch.add input(cpu)[1]   | (2, 3, 4) | (12, 4, 1) |   24  | float16 |  cpu   |     False     | torch.strided |   541245504    |
|  torch.add output(device)  | (2, 3, 4) | (12, 4, 1) |   24  | float16 | npu:0  |     False     | torch.strided | 20067179824640 |
|   torch.add output(cpu)    | (2, 3, 4) | (12, 4, 1) |   24  | float16 |  cpu   |     False     | torch.strided |   541249984    |
+----------------------------+-----------+------------+-------+---------+--------+---------------+---------------+----------------+



Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[[ 0.8032,  1.8779, -0.6846,  1.5342],
         [-3.9688,  4.1055,  0.8447,  0.6836],
         [ 1.1914, -2.4746,  4.8086, -0.3574]],

        [[-2.1758,  2.1816,  1.1768, -1.0342],
         [-0.0070,  1.8252,  1.7373,  3.2109],
         [ 1.0361, -1.8564,  4.2070,  0.6558]]], device='npu:0',
       dtype=torch.float16)

2. Autocompare on demand

op_tools.apply_feature("torch.sub", feature="autocompare", condition_func=custom_condition)
torch.sub(y, y)
torch.sub(z, z)

output:

hook enable because a.dim() is 2
apply OpAutoCompareHook on torch.sub



torch.sub forward_id: 2
<stdin>:1 <module>:
+--------------------------+--------+--------+-------+---------+--------+---------------+---------------+----------------+
|           name           | shape  | stride | numel |  dtype  | device | requires_grad |     layout    |    data_ptr    |
+--------------------------+--------+--------+-------+---------+--------+---------------+---------------+----------------+
|   torch.sub inputs[0]    | (4, 2) | (2, 1) |   8   | float32 | npu:0  |     False     | torch.strided | 20067179823616 |
|   torch.sub inputs[1]    | (4, 2) | (2, 1) |   8   | float32 | npu:0  |     False     | torch.strided | 20067179823616 |
|    torch.sub outputs     | (4, 2) | (2, 1) |   8   | float32 | npu:0  |     False     | torch.strided | 20067179825152 |
| torch.sub inputs(cpu)[0] | (4, 2) | (2, 1) |   8   | float32 |  cpu   |     False     | torch.strided |   555367360    |
| torch.sub inputs(cpu)[1] | (4, 2) | (2, 1) |   8   | float32 |  cpu   |     False     | torch.strided |   554988416    |
|  torch.sub outputs(cpu)  | (4, 2) | (2, 1) |   8   | float32 |  cpu   |     False     | torch.strided |   555613184    |
+--------------------------+--------+--------+-------+---------+--------+---------------+---------------+----------------+
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+
|              name              | allclose | max_abs_diff | max_relative_diff |     atol    |     rtol    | error_info |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+
| torch.sub input[0]             |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 |            |
| torch.sub input[1]             |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 |            |
| torch.sub output               |   True   | 0.000000000  |    0.000000000    | 0.000010000 | 0.000010000 |            |
+--------------------------------+----------+--------------+-------------------+-------------+-------------+------------+


hook disable
skip OpAutoCompareHook on torch.sub
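
The comparison table reports allclose, max_abs_diff, max_relative_diff, atol, and rtol. The sketch below shows one way such numbers could be computed on flat lists of floats. The allclose test follows the torch.allclose rule |a - b| <= atol + rtol * |b|. The definition of max_relative_diff used here (absolute difference divided by |b| plus a small epsilon) is an assumption and may differ from what ditorch computes, and compare is a hypothetical helper name.

```python
def compare(device_vals, cpu_vals, atol=1e-5, rtol=1e-5, eps=1e-12):
    """Compare two equally sized lists of floats element by element."""
    abs_diffs = [abs(a - b) for a, b in zip(device_vals, cpu_vals)]
    # Assumed definition: relative to the CPU reference, eps avoids division by zero.
    rel_diffs = [d / (abs(b) + eps) for d, b in zip(abs_diffs, cpu_vals)]
    close = all(
        d <= atol + rtol * abs(b) for d, b in zip(abs_diffs, cpu_vals)
    )
    return {
        "allclose": close,
        "max_abs_diff": max(abs_diffs),
        "max_relative_diff": max(rel_diffs),
    }


same = compare([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
off = compare([1.0, 2.0, 3.5], [1.0, 2.0, 3.0])
print(same)
print(off)
```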

3. Capture operator inputs and outputs for specific inputs

op_tools.apply_feature("torch.mul", feature="op_capture", condition_func=custom_condition)
torch.mul(x, x)

output:

hook enable because a.dtype is float16
apply OpCaptureHook on torch.mul
op_capture_result/torch.mul/366650/3/input.pth saved
op_capture_result/torch.mul/366650/3/output.pth saved

4. Cast the inputs of specific operators to a given data type for computation

op_tools.apply_feature("torch.div", feature="cast_dtype", condition_func=custom_condition)
os.environ["OP_DTYPE_CAST_DICT"] = "torch.float32->torch.float16"
torch.div(y, y)

output:

hook enable because a.dim() is 2
+-----------+-----------+--------------------------------+------------------------------+
|    name   |   target  |             action             |            config            |
+-----------+-----------+--------------------------------+------------------------------+
| torch.div |  input[0] | torch.float32 -> torch.float16 | torch.float32->torch.float16 |
| torch.div |  input[1] | torch.float32 -> torch.float16 | torch.float32->torch.float16 |
| torch.div | output[0] | torch.float16 -> torch.float32 | torch.float32->torch.float16 |
+-----------+-----------+--------------------------------+------------------------------+
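
The config column above comes from the OP_DTYPE_CAST_DICT environment variable, whose value is a comma-separated list of source->target pairs. A minimal sketch of parsing such a string follows. parse_cast_dict is a hypothetical helper, not part of op_tools, and it keeps dtype names as plain strings so that it runs without torch.

```python
def parse_cast_dict(spec):
    """Parse 'src->dst,src->dst' into a {src: dst} dictionary of strings."""
    mapping = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        src, sep, dst = pair.partition("->")
        if not sep or not src.strip() or not dst.strip():
            raise ValueError(f"malformed cast rule: {pair!r}")
        mapping[src.strip()] = dst.strip()
    return mapping


rules = parse_cast_dict("torch.float16->torch.float32,torch.bfloat16->torch.float32")
single = parse_cast_dict("torch.float32->torch.float16")
print(rules)
print(single)
```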


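Related environment variables

Tool	Environment variable	Value	Description	Notes
Operator argument capture tool	OP_CAPTURE_DISABLE_LIST	torch.add,torch.nn.functional.linear,torch.Tensor.relu_	Do not capture the arguments of these operators	Full operator names; separate multiple operators with commas
Operator argument capture tool	OP_CAPTURE_LIST	Same as above	Capture the arguments of only these operators	Same as above
Precision analysis tool	OP_AUTOCOMPARE_LIST	Same as above	Run precision comparison only on the specified operators	Same as above
Precision analysis tool	OP_AUTOCOMPARE_DISABLE_LIST	Same as above	Skip the specified operators during precision comparison	Same as above
Operator data type conversion tool	OP_DTYPE_CAST_DISABLE_LIST	Same as above	Skip the specified operators during type conversion	Same as above
Operator data type conversion tool	OP_DTYPE_CAST_LIST	Same as above	Apply type conversion only to the specified operators	Same as above
Precision analysis tool	AUTOCOMPARE_ERROR_TOLERANCE	atol,rtol	allclose parameters	If set, the given error thresholds override the defaults
Precision analysis tool	AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16	atol,rtol	allclose parameters	If set and the data type matches, the given error thresholds are used
Precision analysis tool	AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16	atol,rtol	allclose parameters	Same as above
Precision analysis tool	AUTOCOMPARE_ERROR_TOLERANCE_FLOAT32	atol,rtol	allclose parameters	Same as above
Precision analysis tool	AUTOCOMPARE_ERROR_TOLERANCE_FLOAT64	atol,rtol	allclose parameters	Same as above
Precision analysis tool	LINEAR_AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16	atol,rtol	allclose parameters	If set and both the operator name and the data type match, the given error thresholds are used. The operator name is the part of the full operator name to the right of the last '.'; for torch.add the operator name is ADD_, and for torch.nn.functional.linear it is LINEAR_
Operator data type conversion tool	OP_DTYPE_CAST_DICT	torch.float16->torch.float32,torch.bfloat16->torch.float32	Specifies the source data types and their target data types	Separate multiple pairs with commas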
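
The tolerance variables in the table are layered: a general override, per-dtype overrides, and per-operator-and-dtype overrides. The sketch below shows one plausible lookup order, most specific first. The precedence, the default values, and the helper name resolve_tolerance are assumptions made for illustration; the op_tools source is the authority on the actual behavior.

```python
def resolve_tolerance(op_name, dtype_name, env, default=(1e-5, 1e-5)):
    """Return (atol, rtol) for an operator/dtype pair from an env-like mapping."""
    # Operator prefix: the part after the last '.', upper-cased (torch.add -> ADD).
    op_key = op_name.rsplit(".", 1)[-1].upper()
    dtype_key = dtype_name.upper()
    # Assumed precedence: operator+dtype, then dtype, then the general override.
    candidates = [
        f"{op_key}_AUTOCOMPARE_ERROR_TOLERANCE_{dtype_key}",
        f"AUTOCOMPARE_ERROR_TOLERANCE_{dtype_key}",
        "AUTOCOMPARE_ERROR_TOLERANCE",
    ]
    for name in candidates:
        value = env.get(name)
        if value:
            atol, rtol = (float(part) for part in value.split(","))
            return atol, rtol
    return default


env = {
    "AUTOCOMPARE_ERROR_TOLERANCE": "1e-4,1e-4",
    "LINEAR_AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16": "1e-2,1e-2",
}
linear_tol = resolve_tolerance("torch.nn.functional.linear", "float16", env)
add_tol = resolve_tolerance("torch.add", "float32", env)
none_tol = resolve_tolerance("torch.add", "float32", {})
print(linear_tol, add_tol, none_tol)
```

In real use the env argument would be os.environ; a plain dictionary is used here so the example has no side effects.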
