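ditorch is a device-independent torch. It aims to hide the differences between hardware vendors' torch implementations and give users a consistent experience. With ditorch, developers can adapt to multiple hardware operator libraries. ditorch also provides the basic tools needed during training to address the pain points that arise in model training.
Just add two lines of code to use domestic chips the same way as official PyTorch.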
>>> import torch
>>> import ditorch
ditorch.framework: torch_npu:2.1.0.post3 pid: 1729023
>>> x = torch.randn(3,4,device="cuda")
>>> y = x + x
>>> x
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 1.3310, 1.0011, -1.0679, -1.5444],
[-0.7345, -0.9888, -1.7310, -0.3305],
[-0.6676, -1.7792, 0.7108, -0.9981]], device='cuda:0')
>>> y
tensor([[ 2.6619, 2.0023, -2.1359, -3.0887],
[-1.4691, -1.9777, -3.4620, -0.6609],
[-1.3353, -3.5583, 1.4216, -1.9962]], device='cuda:0')
ditorch + Ascend910: pass status of native PyTorch test cases
ditorch + mlu370_m8: pass status of native PyTorch test cases
ditorch provides the basic tools needed during model training to address the pain points that arise there: the operator tools.
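| No. | Tool | Description |
| 1 | Operator argument capture tool | Captures the real inputs and outputs seen during actual model training |
| 2 | Accuracy analysis tool | Performs offline and real-time accuracy analysis |
| 3 | Speed analysis tool | Performs offline and real-time timing analysis to assist performance optimization |
| 4 | Operator fallback | Falls back selected or all operators from the device to CPU computation |
| 5 | Operator dtype conversion tool | Casts a specific data type of selected or all operators to a given data type for computation |
| 6 | Overflow detection tool | Detects overflow in selected or all operators |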
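To make the overflow detection row concrete, here is a minimal conceptual sketch of what such a check amounts to: scanning an operator's outputs for `inf` or `nan`. It works on plain Python floats, and `find_overflow` is a hypothetical helper written for illustration; it does not show how op_tools implements its check.

```python
import math
from typing import Iterable, List, Tuple


def find_overflow(values: Iterable[float]) -> List[Tuple[int, float]]:
    """Return (index, value) pairs for every inf or nan entry.

    Hypothetical helper for illustration only; not part of op_tools.
    """
    return [
        (i, v) for i, v in enumerate(values)
        if math.isinf(v) or math.isnan(v)
    ]


# float16 can represent magnitudes up to 65504; a larger result becomes inf.
# Here such values are simply injected by hand.
outputs = [1.0, float("inf"), -2.5, float("nan")]
bad_indices = [i for i, _ in find_overflow(outputs)]
print(bad_indices)  # [1, 3]
```

A real check would run on device tensors after each operator; the point here is only that overflow shows up as non-finite values in the output.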
# usage1
import op_tools
capture = op_tools.OpCapture()
# usage2
import op_tools
with op_tools.OpCapture():
    ...  # run the code whose operators should be captured
apply OpCaptureHook on torch.Tensor.add
op_tools_results/op_capture_results/torch.Tensor.add/283699/161/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.add/283699/161/2024-10-09-11-42-15/output.pth saved
torch.Tensor.add forward_id:161/2024-10-09-11-42-15 /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr | value |
| torch.Tensor.add inputs[0] | npu:0 | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) | False | torch.strided | 20067180408320 | |
| torch.Tensor.add inputs[1] | | | | | | | | | 1e-05 |
| torch.Tensor.add outputs | npu:0 | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) | False | torch.strided | 20067180474368 | |
apply OpCaptureHook on torch.rsqrt
op_tools_results/op_capture_results/torch.rsqrt/283699/162/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.rsqrt/283699/162/2024-10-09-11-42-15/output.pth saved
torch.rsqrt forward_id:162/2024-10-09-11-42-15 /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.rsqrt inputs | npu:0 | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) | False | torch.strided | 20067180474368 |
| torch.rsqrt outputs | npu:0 | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) | False | torch.strided | 20067180540416 |
apply OpCaptureHook on torch.Tensor.mul
op_tools_results/op_capture_results/torch.Tensor.mul/283699/163/2024-10-09-11-42-15/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.mul/283699/163/2024-10-09-11-42-15/output.pth saved
torch.Tensor.mul forward_id:163/2024-10-09-11-42-15 /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:14 manual_rms_norm: my_input = my_input * torch.rsqrt(variance + eps)
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | npu:0 | bfloat16 | 33554432 | (1, 16384, 2048) | (33554432, 2048, 1) | True | torch.strided | 20074677141504 |
| torch.Tensor.mul inputs[1] | npu:0 | float32 | 16384 | (1, 16384, 1) | (16384, 1, 1) | False | torch.strided | 20067180540416 |
| torch.Tensor.mul outputs | npu:0 | float32 | 33554432 | (1, 16384, 2048) | (33554432, 2048, 1) | False | torch.strided | 20075012687360 |
skip OpCaptureHook on torch.Tensor.mul
skip OpCaptureHook on torch.Tensor.div
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/input.pth saved
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/output.pth saved
torch.Tensor.sort forward_id:59/2024-10-09-11-40-14 /deeplink_afs/zhaoguochun/ditorch2/op_tools/test/test_op_capture.py:15 f: sorted, indices = e.sort() # return torch.return_type.sort
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.sort inputs | npu:0 | float32 | 200 | (10, 20) | (20, 1) | True | torch.strided | 20067179830784 |
| torch.Tensor.sort outputs [0][0] | npu:0 | float32 | 200 | (10, 20) | (20, 1) | True | torch.strided | 20067179831808 |
| torch.Tensor.sort outputs [0][1] | npu:0 | int64 | 200 | (10, 20) | (20, 1) | False | torch.strided | 20067179832832 |
skip OpCaptureHook on torch.Tensor.__getitem__
skip OpCaptureHook on torch.Tensor.sum
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/grad_inputs.pth saved
op_tools_results/op_capture_results/torch.Tensor.sort/3834328/59/2024-10-09-11-40-14/grad_outputs.pth saved
torch.Tensor.sort forward_id:<built-in function id>
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.sort grad_output | npu:0 | float32 | 200 | (10, 20) | (20, 1) | False | torch.strided | 20067179835904 |
| torch.Tensor.sort grad_inputs | npu:0 | float32 | 200 | (10, 20) | (20, 1) | False | torch.strided | 20067179836928 |
apply OpCaptureHook on torch.Tensor.to
op_capture_result/0/2024-08-06--11-46/torch.Tensor.to/29/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.to/29/output.pth saved
apply OpCaptureHook on torch.Tensor.mul
op_capture_result/0/2024-08-06--11-46/torch.Tensor.mul/30/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.mul/30/output.pth saved
skip OpCaptureHook on torch.Tensor.add
skip OpCaptureHook on torch.Tensor.sub
apply OpCaptureHook on torch.Tensor.div
op_capture_result/0/2024-08-06--11-46/torch.Tensor.div/31/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.div/31/output.pth saved
apply OpCaptureHook on torch.Tensor.sort
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sort/32/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sort/32/output.pth saved
apply OpCaptureHook on torch.Tensor.sum
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sum/33/input.pth saved
op_capture_result/0/2024-08-06--11-46/torch.Tensor.sum/33/output.pth saved
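The saved paths in the logs above follow a regular layout, for example `op_tools_results/op_capture_results/torch.Tensor.add/283699/161/2024-10-09-11-42-15/input.pth`. The sketch below splits such a path into its parts so that captured files can be grouped by operator or forward id. `CaptureRecord` and `parse_capture_path` are hypothetical names, and reading the first numeric component as a process id is an assumption; the older `op_capture_result/...` layout shown last is not handled.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class CaptureRecord(NamedTuple):
    op_name: str
    process_id: str  # assumed meaning of the first numeric component
    forward_id: str
    timestamp: str
    kind: str        # "input", "output", "grad_inputs", ...


def parse_capture_path(path: str) -> CaptureRecord:
    """Split <root>/<op name>/<id>/<forward_id>/<timestamp>/<kind>.pth."""
    p = PurePosixPath(path)
    if p.suffix != ".pth" or len(p.parts) < 5:
        raise ValueError(f"unexpected capture path: {path}")
    op_name, process_id, forward_id, timestamp = p.parts[-5:-1]
    return CaptureRecord(op_name, process_id, forward_id, timestamp, p.stem)


record = parse_capture_path(
    "op_tools_results/op_capture_results/torch.Tensor.add/"
    "283699/161/2024-10-09-11-42-15/input.pth"
)
print(record.op_name, record.forward_id, record.kind)
```

The forward id recovered here (`161`) is the same one printed in the matching `forward_id:161/...` log line, which makes it a convenient key for pairing a file with its table.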
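- Offline analysis: compare results offline, using the real inputs and outputs recorded during model training.
- Real-time accuracy comparison: compare accuracy against the CPU in real time while the model trains.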
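Both modes come down to comparing a device result against a reference under an absolute and a relative tolerance. The sketch below computes the three quantities such a comparison typically reports (`allclose`, `max_abs_diff`, `max_relative_diff`) on plain Python lists. The pass criterion is the one `torch.allclose` uses, `|d - r| <= atol + rtol * |r|`; that op_tools applies exactly this rule, and the definition of the relative difference used here, are assumptions. `compare` is a hypothetical helper.

```python
from typing import Dict, Sequence


def compare(device: Sequence[float], reference: Sequence[float],
            atol: float, rtol: float) -> Dict[str, object]:
    """Element-wise comparison of a device result against a CPU reference."""
    if len(device) != len(reference):
        raise ValueError("length mismatch")
    abs_diffs = [abs(d - r) for d, r in zip(device, reference)]
    # Assumed definition: relative to the reference magnitude, with a tiny
    # epsilon so that a zero reference does not divide by zero.
    rel_diffs = [a / (abs(r) + 1e-12) for a, r in zip(abs_diffs, reference)]
    # Same criterion as torch.allclose: |d - r| <= atol + rtol * |r|
    ok = all(a <= atol + rtol * abs(r) for a, r in zip(abs_diffs, reference))
    return {
        "allclose": ok,
        "max_abs_diff": max(abs_diffs, default=0.0),
        "max_relative_diff": max(rel_diffs, default=0.0),
    }


same = compare([1.0, 2.0], [1.0, 2.0], atol=1e-3, rtol=1e-3)
off = compare([1.0, 2.1], [1.0, 2.0], atol=1e-3, rtol=1e-3)
print(same["allclose"], off["allclose"])  # True False
```

Because the allowed error grows with the reference magnitude, a large absolute difference can still pass when the values themselves are large, and a small one can fail when the tolerances are tight.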
# usage1
import op_tools
with op_tools.OpAutoCompare():
    ...  # run the code whose operators should be compared against the CPU
# usage2
import op_tools
autocompare = op_tools.OpAutoCompare()
# for float16
export AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16="1e-3,1e-4" # atol=1e-3, rtol=1e-4
# for bfloat16
export AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16="1e-2,1e-3" # atol=1e-2, rtol=1e-3
# for other dtype
export AUTOCOMPARE_ERROR_TOLERANCE="1e-4,1e-5" # atol=1e-4, rtol=1e-5
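Each of these variables holds an `atol,rtol` pair as a comma-separated string. As a rough illustration of how such a setting can be read, the sketch below parses the string and picks the variable by dtype name. `parse_tolerance` and `tolerance_for` are hypothetical helpers, the lookup rule (float16 and bfloat16 read only their own variable, every other dtype reads the generic one) is an assumption drawn from the comments above, and the `fallback` value is supplied by the caller because the tool's built-in defaults are not stated here.

```python
import os
from typing import Mapping, Tuple

# Only the variable names shown above are used; no others are assumed to exist.
_DTYPE_SPECIFIC = {
    "float16": "AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16",
    "bfloat16": "AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16",
}
_GENERIC = "AUTOCOMPARE_ERROR_TOLERANCE"


def parse_tolerance(value: str) -> Tuple[float, float]:
    """Parse an "atol,rtol" string such as "1e-3,1e-4"."""
    atol_text, rtol_text = value.split(",")
    return float(atol_text), float(rtol_text)


def tolerance_for(dtype_name: str, env: Mapping[str, str] = os.environ, *,
                  fallback: Tuple[float, float]) -> Tuple[float, float]:
    """Return (atol, rtol) for a dtype name, or `fallback` if unset."""
    key = _DTYPE_SPECIFIC.get(dtype_name, _GENERIC)
    return parse_tolerance(env[key]) if key in env else fallback


env = {
    "AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16": "1e-2,1e-3",
    "AUTOCOMPARE_ERROR_TOLERANCE": "1e-4,1e-5",
}
print(tolerance_for("bfloat16", env, fallback=(0.0, 0.0)))
print(tolerance_for("float32", env, fallback=(0.0, 0.0)))
```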
torch.Tensor.mul forward_id: 165 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:26 manual_rms_norm: return weight * my_input
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | (2048,) | (1,) | 2048 | bfloat16 | npu:0 | True | torch.strided | 20076634832896 |
| torch.Tensor.mul inputs[1] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20067477619200 |
| torch.Tensor.mul outputs | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20067544728576 |
| torch.Tensor.mul inputs(cpu)[0] | (2048,) | (1,) | 2048 | float64 | cpu | False | torch.strided | 34492000832 |
| torch.Tensor.mul inputs(cpu)[1] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 | cpu | False | torch.strided | 140503487049792 |
| torch.Tensor.mul outputs(cpu) | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 | cpu | False | torch.strided | 140503218610240 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.mul input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul input[1] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
torch.Tensor.reshape forward_id: 173 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modeling_internlm2.py:421 _packed_forward: q = rearrange(q, "b t h gs d -> b t (h gs) d")
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.reshape inputs[0] | (1, 16384, 8, 2, 128) | (67108864, 4096, 512, 128, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20067611837952 | |
| torch.Tensor.reshape inputs [1][0] | | | | | | | | | 1 |
| torch.Tensor.reshape inputs [1][1] | | | | | | | | | 16384 |
| torch.Tensor.reshape inputs [1][2] | | | | | | | | | 16 |
| torch.Tensor.reshape inputs [1][3] | | | | | | | | | 128 |
| torch.Tensor.reshape outputs | (1, 16384, 16, 128) | (33554432, 2048, 128, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20067477619200 | |
| torch.Tensor.reshape inputs(cpu)[0] | (1, 16384, 8, 2, 128) | (33554432, 2048, 256, 128, 1) | 33554432 | float64 | cpu | False | torch.strided | 140502405996608 | |
| torch.Tensor.reshape inputs(cpu) [1][0] | | | | | | | | | 1 |
| torch.Tensor.reshape inputs(cpu) [1][1] | | | | | | | | | 16384 |
| torch.Tensor.reshape inputs(cpu) [1][2] | | | | | | | | | 16 |
| torch.Tensor.reshape inputs(cpu) [1][3] | | | | | | | | | 128 |
| torch.Tensor.reshape outputs(cpu) | (1, 16384, 16, 128) | (33554432, 2048, 128, 1) | 33554432 | float64 | cpu | False | torch.strided | 140502405996608 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.reshape input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.reshape input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.reshape input[2] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.reshape input[3] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.reshape input[4] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.reshape output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
torch.nn.functional.linear forward_id: 231 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:287 forward: output = F.linear(total_x, weight, bias) # pylint: disable=E1102
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.linear inputs[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20088138762240 | |
| torch.nn.functional.linear inputs[1] | (2048, 2048) | (2048, 1) | 4194304 | bfloat16 | npu:0 | True | torch.strided | 20076626444288 | |
| torch.nn.functional.linear inputs[2] | | | | | | | | | None |
| torch.nn.functional.linear outputs | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20076634845696 | |
| torch.nn.functional.linear inputs(cpu)[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 | cpu | False | torch.strided | 140502875758656 | |
| torch.nn.functional.linear inputs(cpu)[1] | (2048, 2048) | (2048, 1) | 4194304 | float64 | cpu | False | torch.strided | 140560622940224 | |
| torch.nn.functional.linear inputs(cpu)[2] | | | | | | | | | None |
| torch.nn.functional.linear outputs(cpu) | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 | cpu | False | torch.strided | 140502607319104 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.nn.functional.linear input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.linear input[1] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.linear input[2] | True | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | |
| torch.nn.functional.linear output | False | 0.003906250 | 0.083984375 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/device/input.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/device/output.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/cpu/input.pth saved
op_tools_results/op_capture_results/torch.nn.functional.linear/1527072/autocompare/231/2024-10-08-15-20-39/cpu/output.pth saved
torch.outer forward_id: 175 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:351 _update_cos_sin_cache: freqs = torch.outer(t, self.inv_freq.to(device=t.device))
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.outer inputs[0] | (1025,) | (1,) | 1025 | float32 | npu:0 | False | torch.strided | 20067180423168 |
| torch.outer inputs[1] | (64,) | (1,) | 64 | float32 | npu:0 | False | torch.strided | 20067179825152 |
| torch.outer outputs | (1025, 64) | (64, 1) | 65600 | float32 | npu:0 | False | torch.strided | 20067180428800 |
| torch.outer inputs(cpu)[0] | (1025,) | (1,) | 1025 | float64 | cpu | False | torch.strided | 36028922944 |
| torch.outer inputs(cpu)[1] | (64,) | (1,) | 64 | float64 | cpu | False | torch.strided | 33746424192 |
| torch.outer outputs(cpu) | (1025, 64) | (64, 1) | 65600 | float64 | cpu | False | torch.strided | 33745266368 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.outer input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.outer input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.outer output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
torch.Tensor.add forward_id: 214 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:100 _torch_apply_rotary_func: out2.copy_(x1 * sin + x2 * cos)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.add inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20088138762240 |
| torch.Tensor.add inputs[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20088172317184 |
| torch.Tensor.add outputs | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20088205872128 |
| torch.Tensor.add inputs(cpu)[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140504225255488 |
| torch.Tensor.add inputs(cpu)[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140504158142528 |
| torch.Tensor.add outputs(cpu) | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140503009972288 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.add input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
torch.Tensor.copy_ forward_id: 215 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64, torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:100 _torch_apply_rotary_func: out2.copy_(x1 * sin + x2 * cos)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.copy_ inputs[0] | (1, 16384, 8, 64) | (16777216, 1024, 128, 1) | 8388608 | bfloat16 | npu:0 | False | torch.strided | 20067746056320 |
| torch.Tensor.copy_ inputs[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20088205872128 |
| torch.Tensor.copy_ outputs | (1, 16384, 8, 64) | (16777216, 1024, 128, 1) | 8388608 | bfloat16 | npu:0 | False | torch.strided | 20067746056320 |
| torch.Tensor.copy_ inputs(cpu)[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140503144198208 |
| torch.Tensor.copy_ inputs(cpu)[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140503077085248 |
| torch.Tensor.copy_ outputs(cpu) | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float64 | cpu | False | torch.strided | 140503144198208 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.copy_ input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.copy_ input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.copy_ output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
torch.Tensor.tolist forward_id: 225
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/multi_head_attention.py:205 _forward: actual_seq_qlen = actual_seq_qlen[1:].tolist()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.tolist inputs | (4,) | (1,) | 4 | int32 | npu:0 | False | torch.strided | 20067179825668 | |
| torch.Tensor.tolist outputs [0][0] | | | | | | | | | 4096 |
| torch.Tensor.tolist outputs [0][1] | | | | | | | | | 8192 |
| torch.Tensor.tolist outputs [0][2] | | | | | | | | | 12288 |
| torch.Tensor.tolist outputs [0][3] | | | | | | | | | 16384 |
| torch.Tensor.tolist inputs(cpu) | (4,) | (1,) | 4 | int32 | cpu | False | torch.strided | 33694948224 | |
| torch.Tensor.tolist outputs(cpu) [0][0] | | | | | | | | | 4096 |
| torch.Tensor.tolist outputs(cpu) [0][1] | | | | | | | | | 8192 |
| torch.Tensor.tolist outputs(cpu) [0][2] | | | | | | | | | 12288 |
| torch.Tensor.tolist outputs(cpu) [0][3] | | | | | | | | | 16384 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.tolist input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| torch.Tensor.tolist output[0] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.tolist output[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.tolist output[2] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.tolist output[3] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
torch.Tensor.mul forward_id: 252 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:667 Silu: return F.silu(w1_o) * w2_o
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0 | False | torch.strided | 20077980155904 |
| torch.Tensor.mul inputs[1] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0 | False | torch.strided | 20076769064448 |
| torch.Tensor.mul outputs | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | bfloat16 | npu:0 | False | torch.strided | 20078282145792 |
| torch.Tensor.mul inputs(cpu)[0] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64 | cpu | False | torch.strided | 140501332246592 |
| torch.Tensor.mul inputs(cpu)[1] | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64 | cpu | False | torch.strided | 140497574117440 |
| torch.Tensor.mul outputs(cpu) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float64 | cpu | False | torch.strided | 140496500371520 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.mul input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul input[1] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.Tensor.mul output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
torch.Tensor.mean forward_id: 262 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/norm.py:13 manual_rms_norm: variance = my_input.to(torch.float32).pow(2).mean(dims, keepdim=True)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.mean inputs[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float32 | npu:0 | True | torch.strided | 20078114374144 | |
| torch.Tensor.mean inputs [1][0] | | | | | | | | | -1 |
| torch.Tensor.mean inputs keepdim | | | | | | | | | True |
| torch.Tensor.mean outputs | (1, 16384, 1) | (16384, 1, 1) | 16384 | float32 | npu:0 | True | torch.strided | 20067180823040 | |
| torch.Tensor.mean inputs(cpu)[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float64 | cpu | True | torch.strided | 140500392681536 | |
| torch.Tensor.mean inputs(cpu) [1][0] | | | | | | | | | -1 |
| torch.Tensor.mean inputs(cpu) keepdim | | | | | | | | | True |
| torch.Tensor.mean outputs(cpu) | (1, 16384, 1) | (16384, 1, 1) | 16384 | float64 | cpu | True | torch.strided | 33746053632 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.mean input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.mean input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| torch.Tensor.mean output | True | 0.000000060 | 0.000000152 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
torch.Tensor.argmax forward_id: 296 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:115 update: (shift_labels == (shift_logits.argmax(dim=-1) + pred_shift)), logits_global
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.argmax inputs | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0 | True | torch.strided | 20088635785216 | |
| torch.Tensor.argmax inputs dim | | | | | | | | | -1 |
| torch.Tensor.argmax outputs | (16384,) | (1,) | 16384 | int64 | npu:0 | False | torch.strided | 20067181522432 | |
| torch.Tensor.argmax inputs(cpu) | (16384, 92544) | (92544, 1) | 1516240896 | float64 | cpu | False | torch.strided | 140336718184512 | |
| torch.Tensor.argmax inputs(cpu) dim | | | | | | | | | -1 |
| torch.Tensor.argmax outputs(cpu) | (16384,) | (1,) | 16384 | int64 | cpu | False | torch.strided | 35724171200 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.argmax input | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.argmax output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
torch.Tensor.mul forward_id: 381 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/hybrid_zero_optim.py:609 backward: loss = self.loss_scale * loss
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.mul grad_output(cpu) | (1,) | (1,) | 1 | float32 | cpu | False | torch.strided | 139747700714432 | |
| torch.Tensor.mul grad_inputs[0] | | | | | | | | | None |
| torch.Tensor.mul grad_inputs[1] | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067181907968 | |
| torch.Tensor.mul grad_inputs(cpu)[0] | | | | | | | | | None |
| torch.Tensor.mul grad_inputs(cpu)[1] | () | () | 1 | float64 | cpu | False | torch.strided | 139747700718656 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.mul grad[0] | True | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | |
| torch.Tensor.mul grad[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
torch.Tensor.add_ forward_id: 289 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float32, torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/core/scheduler/no_pipeline_scheduler.py:143 _train_one_batch: loss += moe_loss
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.add_ grad_output(cpu) | () | () | 1 | float32 | cpu | False | torch.strided | 139747700723072 | |
| torch.Tensor.add_ grad_inputs[0] | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067181907968 | |
| torch.Tensor.add_ grad_inputs[1] | | | | | | | | | None |
| torch.Tensor.add_ grad_inputs(cpu)[0] | () | () | 1 | float64 | cpu | False | torch.strided | 139747700726528 | |
| torch.Tensor.add_ grad_inputs(cpu)[1] | | | | | | | | | None |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.add_ grad[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.add_ grad[1] | True | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | |
torch.Tensor.div_ forward_id: 288 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/core/scheduler/no_pipeline_scheduler.py:142 _train_one_batch: loss /= scale_loss
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.div_ grad_output(cpu) | () | () | 1 | float32 | cpu | False | torch.strided | 139747700733376 | |
| torch.Tensor.div_ grad_inputs[0] | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067247392256 | |
| torch.Tensor.div_ grad_inputs[1] | | | | | | | | | None |
| torch.Tensor.div_ grad_inputs(cpu)[0] | () | () | 1 | float64 | cpu | False | torch.strided | 139747700740672 | |
| torch.Tensor.div_ grad_inputs(cpu)[1] | | | | | | | | | None |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.div_ grad[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.Tensor.div_ grad[1] | True | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | |
torch.Tensor.sum forward_id: 279 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/losses/ce_loss.py:645 forward: loss = loss_list.sum() / (cond).sum()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.sum grad_output(cpu) | () | () | 1 | float32 | cpu | False | torch.strided | 139747700752832 |
| torch.Tensor.sum grad_inputs | (16384,) | (0,) | 16384 | float32 | npu:0 | False | torch.strided | 20067181907968 |
| torch.Tensor.sum grad_inputs(cpu) | (16384,) | (0,) | 16384 | float64 | cpu | False | torch.strided | 139747699960640 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.Tensor.sum grad | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
torch.nn.functional.cross_entropy forward_id: 277 cpu_dtype_cast_info(from:to): {torch.float32: torch.float64}
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/losses/ce_loss.py:635 forward: loss_list = self.loss_fn(
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.cross_entropy grad_output(cpu) | (16384,) | (1,) | 16384 | float32 | cpu | False | torch.strided | 139747701147904 |
| torch.nn.functional.cross_entropy grad_inputs | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0 | False | torch.strided | 20110110621696 |
| torch.nn.functional.cross_entropy grad_inputs(cpu) | (16384, 92544) | (92544, 1) | 1516240896 | float64 | cpu | False | torch.strided | 139473962684480 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.nn.functional.cross_entropy grad | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| forward_id | name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| 358 | torch.Tensor.expand output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 359 | torch.Tensor.max input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 359 | torch.Tensor.max output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 360 | torch.Tensor.int input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 360 | torch.Tensor.int output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 361 | torch.Tensor.add input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 361 | torch.Tensor.add input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 361 | torch.Tensor.add output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 362 | torch.Tensor.__index__ input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 362 | torch.Tensor.__index__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 363 | torch.Tensor.__index__ input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 363 | torch.Tensor.__index__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 364 | torch.Tensor.scatter_add_ input[0] | False | 2.062500000 | 0.000011804 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 364 | torch.Tensor.scatter_add_ input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 364 | torch.Tensor.scatter_add_ input[2] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 364 | torch.Tensor.scatter_add_ input[3] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 364 | torch.Tensor.scatter_add_ output | False | 2.062500000 | 0.000011804 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 365 | torch.Tensor.expand input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 365 | torch.Tensor.expand input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 365 | torch.Tensor.expand output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 366 | torch.Tensor.max input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 366 | torch.Tensor.max output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 367 | torch.Tensor.int input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 367 | torch.Tensor.int output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 368 | torch.Tensor.add input[0] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 368 | torch.Tensor.add input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 368 | torch.Tensor.add output | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 369 | torch.Tensor.__index__ input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 369 | torch.Tensor.__index__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 370 | torch.Tensor.__index__ input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 370 | torch.Tensor.__index__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 371 | torch.Tensor.scatter_add_ input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 371 | torch.Tensor.scatter_add_ input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 371 | torch.Tensor.scatter_add_ input[2] | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | |
| 371 | torch.Tensor.scatter_add_ input[3] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 371 | torch.Tensor.scatter_add_ output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 372 | torch.Tensor.__len__ input | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 372 | torch.Tensor.__len__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 373 | torch.Tensor.__len__ input | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 373 | torch.Tensor.__len__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 374 | torch.Tensor.new_zeros input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 374 | torch.Tensor.new_zeros input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 374 | torch.Tensor.new_zeros output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 375 | torch.cat input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 375 | torch.cat input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 375 | torch.cat output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 376 | torch.Tensor.__len__ input | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 376 | torch.Tensor.__len__ output | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 377 | torch.Tensor.new_zeros input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 377 | torch.Tensor.new_zeros input[1] | True | 0.000000000 | 0.000000000 | 0.000001000 | 0.000001000 | |
| 377 | torch.Tensor.new_zeros output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 378 | torch.cat input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 378 | torch.cat input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 378 | torch.cat output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 379 | torch.Tensor.add_ input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 379 | torch.Tensor.add_ input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 379 | torch.Tensor.add_ output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 380 | torch.Tensor.add_ input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 380 | torch.Tensor.add_ input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 380 | torch.Tensor.add_ output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 381 | torch.Tensor.mul input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 381 | torch.Tensor.mul input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 381 | torch.Tensor.mul output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| 381 | torch.Tensor.mul grad[0] | True | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | |
| 381 | torch.Tensor.mul grad[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
python op_tools/run_op_from_data.py /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/ --acc_check --run_times 1
ditorch.framework: torch_npu:2.1.0.post3
torch.nn.functional.normalize forward_id: 1 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.normalize inputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20067179823104 | |
| torch.nn.functional.normalize inputs p | | | | | | | | | 2.0 |
| torch.nn.functional.normalize inputs dim | | | | | | | | | 1 |
| torch.nn.functional.normalize inputs eps | | | | | | | | | 1e-12 |
| torch.nn.functional.normalize inputs out | | | | | | | | | None |
| torch.nn.functional.normalize outputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20076824625152 | |
| torch.nn.functional.normalize inputs(cpu) | (92544, 2048) | (2048, 1) | 189530112 | float64 | cpu | True | torch.strided | 140583390142528 | |
| torch.nn.functional.normalize inputs(cpu) p | | | | | | | | | 2.0 |
| torch.nn.functional.normalize inputs(cpu) dim | | | | | | | | | 1 |
| torch.nn.functional.normalize inputs(cpu) eps | | | | | | | | | 1e-12 |
| torch.nn.functional.normalize inputs(cpu) out | | | | | | | | | None |
| torch.nn.functional.normalize outputs(cpu) | (92544, 2048) | (2048, 1) | 189530112 | float64 | cpu | True | torch.strided | 140573367332928 | |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.nn.functional.normalize input | True | 0.000000000 | 0.000000000 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
| torch.nn.functional.normalize output | True | 0.000976562 | 0.007751465 | 0.001000000 | 0.001000000 | Inconsistent dtypes: torch.bfloat16 torch.float64 |
torch.nn.functional.normalize forward_id: 1 cpu_dtype_cast_info(from:to): {torch.bfloat16: torch.float64}
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.normalize grad_output(cpu) | (92544, 2048) | (2048, 1) | 189530112 | float32 | cpu | False | torch.strided | 140572514840640 |
| torch.nn.functional.normalize grad_inputs[0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077898366976 |
| torch.nn.functional.normalize grad_inputs[1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20079351693312 |
| torch.nn.functional.normalize grad_inputs(cpu)[0] | (92544, 2048) | (2048, 1) | 189530112 | float64 | cpu | False | torch.strided | 140559542906944 |
| torch.nn.functional.normalize grad_inputs(cpu)[1] | (92544, 2048) | (2048, 1) | 189530112 | float64 | cpu | False | torch.strided | 140554994171968 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.nn.functional.normalize grad[0] | False | 104.576171875 | 0.005955214 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
| torch.nn.functional.normalize grad[1] | False | 5.168151855 | 0.014576809 | 0.000010000 | 0.000010000 | Inconsistent dtypes: torch.float32 torch.float64 |
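The comparison columns above (allclose, max_abs_diff, max_relative_diff, atol, rtol) can be read with the following minimal sketch. It is not the tool's implementation: it assumes the per-element criterion documented for `torch.allclose`, `|a - b| <= atol + rtol * |b|`, and it assumes the relative difference is taken against the reference (CPU) value. The function name `compare` is hypothetical, and plain Python lists stand in for tensors.

```python
def compare(device_values, reference_values, atol, rtol):
    """Compare two equal-length lists of floats element by element.

    Hypothetical helper for illustration only; it mirrors the columns of the
    accuracy table, not the actual op_tools code.
    """
    max_abs_diff = 0.0
    max_relative_diff = 0.0
    allclose = True
    for dev, ref in zip(device_values, reference_values):
        abs_diff = abs(dev - ref)
        max_abs_diff = max(max_abs_diff, abs_diff)
        # Assumption: relative difference is measured against the reference.
        if ref != 0.0:
            max_relative_diff = max(max_relative_diff, abs_diff / abs(ref))
        # Same per-element tolerance rule as torch.allclose.
        if abs_diff > atol + rtol * abs(ref):
            allclose = False
    return {
        "allclose": allclose,
        "max_abs_diff": max_abs_diff,
        "max_relative_diff": max_relative_diff,
    }


device = [1.0, 2.0009765625]   # e.g. a low-precision device result
reference = [1.0, 2.0]         # e.g. a float64 CPU result

loose = compare(device, reference, atol=1e-3, rtol=1e-3)
strict = compare(device, reference, atol=1e-5, rtol=1e-5)
print(loose["allclose"], strict["allclose"])
```

Under this rule the same absolute error can pass a loose tolerance and fail a strict one, which is why the atol and rtol columns matter when reading a False in the allclose column.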
# Measure operator time (inputs are real data captured during model training with the operator capture tool)
python op_tools/run_op_from_data.py /deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/op_tools_results/op_capture_results/torch.nn.functional.normalize/ --sync_time_measure --run_times 3
ditorch.framework: torch_npu:2.1.0.post3
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.normalize inputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20067179823104 | |
| torch.nn.functional.normalize inputs p | | | | | | | | | 2.0 |
| torch.nn.functional.normalize inputs dim | | | | | | | | | 1 |
| torch.nn.functional.normalize inputs eps | | | | | | | | | 1e-12 |
| torch.nn.functional.normalize inputs out | | | | | | | | | None |
| torch.nn.functional.normalize outputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20076824625152 | |
| name | forward_id | forward_elasped | unit |
| torch.nn.functional.normalize | 1 | 41.59259796 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077204209664 |
| torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077898366976 |
| torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20079351693312 |
| name | forward_id | backward_elasped | unit |
| torch.nn.functional.normalize | 1 | 6.45375252 | ms |
/opt/miniconda3/envs/torch_npu_py39/lib/python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at torch_npu/csrc/aten/common/TensorFactories.cpp:74.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.normalize inputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20067179823104 | |
| torch.nn.functional.normalize inputs p | | | | | | | | | 2.0 |
| torch.nn.functional.normalize inputs dim | | | | | | | | | 1 |
| torch.nn.functional.normalize inputs eps | | | | | | | | | 1e-12 |
| torch.nn.functional.normalize inputs out | | | | | | | | | None |
| torch.nn.functional.normalize outputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20077204209664 | |
| name | forward_id | forward_elasped | unit |
| torch.nn.functional.normalize | 2 | 2.06089020 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20076824625152 |
| torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077898366976 |
| torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20079351693312 |
| name | forward_id | backward_elasped | unit |
| torch.nn.functional.normalize | 2 | 5.55872917 | ms |
/deeplink_afs/zhaoguochun/ditorch2/op_tools/op_runner.py:124 run_forward: self.result = self.func(*self.args, **self.kwargs)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.normalize inputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20067179823104 | |
| torch.nn.functional.normalize inputs p | | | | | | | | | 2.0 |
| torch.nn.functional.normalize inputs dim | | | | | | | | | 1 |
| torch.nn.functional.normalize inputs eps | | | | | | | | | 1e-12 |
| torch.nn.functional.normalize inputs out | | | | | | | | | None |
| torch.nn.functional.normalize outputs | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | True | torch.strided | 20076824625152 | |
| name | forward_id | forward_elasped | unit |
| torch.nn.functional.normalize | 3 | 2.00271606 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.normalize grad_outputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077204209664 |
| torch.nn.functional.normalize grad_inputs [0][0] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20077898366976 |
| torch.nn.functional.normalize grad_inputs [0][1] | (92544, 2048) | (2048, 1) | 189530112 | bfloat16 | npu:0 | False | torch.strided | 20079351693312 |
| name | forward_id | backward_elasped | unit |
| torch.nn.functional.normalize | 3 | 5.68151474 | ms |
| name | forward_id | forward_elasped | backward_elasped | unit |
| torch.nn.functional.normalize | 1 | 41.59259796 | 6.45375252 | ms |
| torch.nn.functional.normalize | 2 | 2.06089020 | 5.55872917 | ms |
| torch.nn.functional.normalize | 3 | 2.00271606 | 5.68151474 | ms |
op elasped info saved to op_tools_results/op_time_measure_result/op_elasped_info_pid3722879_2024-10-08-17-00-42.csv
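In the summary table above, the first forward run (41.59 ms) is far slower than the second and third (about 2 ms). The source does not state the cause; one-off warm-up cost is a plausible assumption. A minimal sketch of averaging the runs while skipping the first one, with the numbers copied from the table and a hypothetical helper name `steady_state`:

```python
from statistics import mean

# (forward_id, forward_ms, backward_ms) copied from the summary table.
runs = [
    (1, 41.59259796, 6.45375252),
    (2, 2.06089020, 5.55872917),
    (3, 2.00271606, 5.68151474),
]


def steady_state(rows, warmup=1):
    """Average forward and backward time after dropping `warmup` leading runs."""
    kept = rows[warmup:]
    forward = mean(row[1] for row in kept)
    backward = mean(row[2] for row in kept)
    return forward, backward


forward_ms, backward_ms = steady_state(runs)
print(f"forward: {forward_ms:.3f} ms, backward: {backward_ms:.3f} ms")
```

Using `--run_times` greater than 1, as in the command above, is what makes this kind of comparison possible.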
ditorch/op_tools# python run_op_from_data.py /op_capture_result/torch.Tensor.div/2278281/5 --run_times 3 --only_run_forward --sync_time_measure
# usage1
import op_tools
with op_tools.OpTimeMeasure():
    ...  # run the code to be timed here
# usage2
import op_tools
timemeasure = op_tools.OpTimeMeasure()
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080180069376 |
| torch.Tensor.mul inputs[1] | (16384, 1, 64) | (64, 64, 1) | 1048576 | float32 | npu:0 | False | torch.strided | 20081048292864 |
| torch.Tensor.mul outputs | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080247179264 |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.mul | 6195 | 0.07629395 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080213624320 |
| torch.Tensor.mul inputs[1] | (16384, 1, 64) | (64, 64, 1) | 1048576 | float32 | npu:0 | False | torch.strided | 20081052487680 |
| torch.Tensor.mul outputs | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080280734208 |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.mul | 6196 | 0.06532669 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/modules/embedding.py:99 _torch_apply_rotary_func: out1.copy_(x1 * cos - x2 * sin)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.sub inputs[0] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080247179264 |
| torch.Tensor.sub inputs[1] | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080280734208 |
| torch.Tensor.sub outputs | (1, 16384, 8, 64) | (8388608, 512, 64, 1) | 8388608 | float32 | npu:0 | False | torch.strided | 20080316383232 |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.sub | 6197 | 0.06794930 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/scatter.py:14 broadcast: src = src.expand(other.size())
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.expand inputs[0] | (16384,) | (1,) | 16384 | int64 | npu:0 | False | torch.strided | 20067181255168 | |
| torch.Tensor.expand inputs[1] | | | | | | | | | torch.Size([16384]) |
| torch.Tensor.expand outputs | (16384,) | (1,) | 16384 | int64 | npu:0 | False | torch.strided | 20067181255168 | |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.expand | 5496 | 0.03361702 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/ops/scatter.py:35 vanilla_scatter: size[dim] = index.max().int() + 1
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.max inputs | (16384,) | (1,) | 16384 | int64 | npu:0 | False | torch.strided | 20067181255168 |
| torch.Tensor.max outputs | () | () | 1 | int64 | npu:0 | False | torch.strided | 20067179832832 |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.max | 5497 | 0.21767616 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/utils.py:177 multi_tensor_l2norm_torch: l2_norm = torch.norm(norms_tensor, p=2).unsqueeze(0)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.unsqueeze inputs[0] | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067181123072 | |
| torch.Tensor.unsqueeze inputs[1] | | | | | | | | | 0 |
| torch.Tensor.unsqueeze outputs | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067181123072 | |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.unsqueeze | 18299 | 0.01835823 | ms |
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/solver/optimizer/utils.py:224 get_norm: grad_norm = calc_l2_norm(grads) ** norm_type
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.__pow__ inputs[0] | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067181123072 | |
| torch.Tensor.__pow__ inputs[1] | | | | | | | | | 2.0 |
| torch.Tensor.__pow__ outputs | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067181917184 | |
| name | forward_id | forward_elasped | unit |
| torch.Tensor.__pow__ | 18300 | 0.04887581 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.sum grad_outputs [0][0] | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067179833856 |
| torch.Tensor.sum grad_inputs [0][0] | (16384,) | (0,) | 16384 | float32 | npu:0 | False | torch.strided | 20067179833856 |
| name | forward_id | backward_elasped | unit |
| torch.Tensor.sum | 25750 | 0.02026558 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.nn.functional.cross_entropy grad_outputs [0][0] | (16384,) | (0,) | 16384 | float32 | npu:0 | False | torch.strided | 20067179833856 |
| torch.nn.functional.cross_entropy grad_inputs [0][0] | (16384, 92544) | (92544, 1) | 1516240896 | float32 | npu:0 | False | torch.strided | 20111184363520 |
| name | forward_id | backward_elasped | unit |
| torch.nn.functional.cross_entropy | 25748 | 4.23026085 | ms |
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mul grad_outputs [0][0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20067414704128 |
| torch.Tensor.mul grad_inputs [0][0] | (2048,) | (1,) | 2048 | bfloat16 | npu:0 | False | torch.strided | 20067181914112 |
| torch.Tensor.mul grad_inputs [0][1] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | bfloat16 | npu:0 | False | torch.strided | 20080586915840 |
| name | forward_id | backward_elasped | unit |
| torch.Tensor.mul | 26134 | 0.42295456 | ms |
# usage 1
with op_tools.OpFallback():
    ...  # operators run here fall back to the CPU
# usage 2
fallback = op_tools.OpFallback()
skip OpFallbackHook on torch.Tensor.float
skip OpFallbackHook on torch.Tensor.add
skip OpFallbackHook on torch.Tensor.div
skip OpFallbackHook on torch.Tensor.item
skip OpFallbackHook on torch.Tensor.float
skip OpFallbackHook on torch.Tensor.div
skip OpFallbackHook on torch.Tensor.fill_
skip OpFallbackHook on torch.Tensor.is_complex
skip OpFallbackHook on torch.Tensor.numel
skip OpFallbackHook on torch.Tensor.unbind
skip OpFallbackHook on torch.Tensor.sub
skip OpFallbackHook on torch.Tensor.max
OpFallbackHook: torch.nn.functional.linear input: {'args': ({'shape': torch.Size([1, 16384, 2048]), 'stride': (33554432, 2048, 1), 'numel': 33554432, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': False, 'layout': 'torch.strided', 'data': 20076203868160}, {'shape': torch.Size([4096, 2048]), 'stride': (2048, 1), 'numel': 8388608, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': True, 'layout': 'torch.strided', 'data': 20077985398784}, 'None')}
OpFallbackHook: torch.nn.functional.linear output: ({'shape': torch.Size([1, 16384, 4096]), 'stride': (67108864, 4096, 1), 'numel': 67108864, 'dtype': 'torch.bfloat16', 'device': 'npu:0', 'requires_grad': False, 'layout': 'torch.strided', 'data': 20075820089344},) cpu output: ({'shape': torch.Size([1, 16384, 4096]), 'stride': (67108864, 4096, 1), 'numel': 67108864, 'dtype': 'torch.bfloat16', 'device': 'cpu', 'requires_grad': False, 'layout': 'torch.strided', 'data': 139743270527040},) dtype_convert_back_dict:{}
skip OpFallbackHook on torch.Tensor.shape.__get__
torch.Tensor.std forward_id: 475
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.std input(device) | (8192, 2048) | (2048, 1) | 16777216 | float64 | npu:0 | True | torch.strided | 20079859204096 |
| torch.Tensor.std input(cpu) | (8192, 2048) | (2048, 1) | 16777216 | float64 | cpu | True | torch.strided | 139687015878720 |
| torch.Tensor.std output(device) | () | () | 1 | float64 | npu:0 | True | torch.strided | 20067180001792 |
| torch.Tensor.std output(cpu) | () | () | 1 | float64 | cpu | True | torch.strided | 33658540800 |
torch.Tensor.to forward_id: 474
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.to input(device) | (8192, 2048) | (2048, 1) | 16777216 | bfloat16 | npu:0 | True | torch.strided | 20067790094336 | |
| torch.Tensor.to input(device) dtype | | | | | | | | | torch.float64 |
| torch.Tensor.to input(cpu) | (8192, 2048) | (2048, 1) | 16777216 | bfloat16 | cpu | True | torch.strided | 139738740682816 | |
| torch.Tensor.to input(cpu) dtype | | | | | | | | | torch.float64 |
| torch.Tensor.to output(device) | (8192, 2048) | (2048, 1) | 16777216 | float64 | npu:0 | True | torch.strided | 20079859204096 | |
| torch.Tensor.to output(cpu) | (8192, 2048) | (2048, 1) | 16777216 | float64 | cpu | True | torch.strided | 139687150100544 | |
torch.Tensor.mean forward_id: 477
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:27 check_parallel_statistic_equality: named_mean = params.to(dtype=torch.float64).mean()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.mean input(device) | (2048,) | (1,) | 2048 | float64 | npu:0 | True | torch.strided | 20067181881344 |
| torch.Tensor.mean input(cpu) | (2048,) | (1,) | 2048 | float64 | cpu | True | torch.strided | 33649506688 |
| torch.Tensor.mean output(device) | () | () | 1 | float64 | npu:0 | True | torch.strided | 20067180002304 |
| torch.Tensor.mean output(cpu) | () | () | 1 | float64 | cpu | True | torch.strided | 33658542784 |
torch.Tensor.to forward_id: 478
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/utils/verifiers.py:28 check_parallel_statistic_equality: named_std = params.to(dtype=torch.float64).std()
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.to input(device) | (2048,) | (1,) | 2048 | bfloat16 | npu:0 | True | torch.strided | 20067179883008 | |
| torch.Tensor.to input(device) dtype | | | | | | | | | torch.float64 |
| torch.Tensor.to input(cpu) | (2048,) | (1,) | 2048 | bfloat16 | cpu | True | torch.strided | 33649527040 | |
| torch.Tensor.to input(cpu) dtype | | | | | | | | | torch.float64 |
| torch.Tensor.to output(device) | (2048,) | (1,) | 2048 | float64 | npu:0 | True | torch.strided | 20067181898240 | |
| torch.Tensor.to output(cpu) | (2048,) | (1,) | 2048 | float64 | cpu | True | torch.strided | 33649532480 | |
torch.exp forward_id: 487
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:210 get_metric: perplexity = round(torch.exp(self.total_log_probs / self.total).item(), 4)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.exp input(device) | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067179862016 |
| torch.exp input(cpu) | (1,) | (1,) | 1 | float32 | cpu | False | torch.strided | 399798976 |
| torch.exp output(device) | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067179862528 |
| torch.exp output(cpu) | (1,) | (1,) | 1 | float32 | cpu | False | torch.strided | 399799360 |
torch.Tensor.item forward_id: 488
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:210 get_metric: perplexity = round(torch.exp(self.total_log_probs / self.total).item(), 4)
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.Tensor.item input(device) | (1,) | (1,) | 1 | float32 | npu:0 | False | torch.strided | 20067179862528 | |
| torch.Tensor.item input(cpu) | (1,) | (1,) | 1 | float32 | cpu | False | torch.strided | 33581209792 | |
| torch.Tensor.item output(device) | | | | | | | | | 147525.78125 |
| torch.Tensor.item output(cpu) | | | | | | | | | 147525.78125 |
torch.Tensor.float forward_id: 489
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/metrics/base.py:218 get_metric: (self.ds_right[i].float() / (self.ds_tokens[i].float() + 1e-5)).item(), 4
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.Tensor.float input(device) | () | () | 1 | int64 | npu:0 | False | torch.strided | 20067180395520 |
| torch.Tensor.float input(cpu) | () | () | 1 | int64 | cpu | False | torch.strided | 33652536256 |
| torch.Tensor.float output(device) | () | () | 1 | float32 | npu:0 | False | torch.strided | 20067179863040 |
| torch.Tensor.float output(cpu) | () | () | 1 | float32 | cpu | False | torch.strided | 33652536640 |
# usage1 (the export line is a shell command, run before launching Python)
export OP_DTYPE_CAST_DICT="torch.float16->torch.float32,torch.bfloat16->torch.float32"
with op_tools.OpDtypeCast():
    ...  # operators run here are cast as configured
# usage2
dtype_caster = op_tools.OpDtypeCast()
for i in range(3):
    ...
# usage3
import os
os.environ["OP_DTYPE_CAST_DISABLE_LIST"] = "torch.Tensor.add,torch.Tensor.sub"
# usage4
os.environ["OP_DTYPE_CAST_DISABLE_LIST"] = ""
os.environ["OP_DTYPE_CAST_LIST"] = "torch.Tensor.sort,torch.Tensor.add"  # only cast these ops
os.environ["OP_DTYPE_CAST_DICT"] = "torch.half->torch.bfloat16"
torch.nn.functional.linear forward_id: 490
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:287 forward: output = F.linear(total_x, weight, bias) # pylint: disable=E1102
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.linear input(device)[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float32 | npu:0 | False | torch.strided | 20082537267200 | |
| torch.nn.functional.linear input(device)[1] | (8192, 2048) | (2048, 1) | 16777216 | float32 | npu:0 | False | torch.strided | 20082673582080 | |
| torch.nn.functional.linear input(device)[2] | | | | | | | | | None |
| torch.nn.functional.linear input(cpu)[0] | (1, 16384, 2048) | (33554432, 2048, 1) | 33554432 | float32 | cpu | False | torch.strided | 139842851389504 | |
| torch.nn.functional.linear input(cpu)[1] | (8192, 2048) | (2048, 1) | 16777216 | float32 | cpu | False | torch.strided | 139842720587840 | |
| torch.nn.functional.linear input(cpu)[2] | | | | | | | | | None |
| torch.nn.functional.linear output(device) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0 | False | torch.strided | 20083267076096 | |
| torch.nn.functional.linear output(cpu) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | cpu | False | torch.strided | 139833330622528 | |
| name | target | action | config |
| torch.nn.functional.linear | input[0] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.linear | input[1] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.linear | output[0] | torch.float32 -> torch.bfloat16 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
apply OpDtypeCastHook on torch.nn.functional.silu
torch.nn.functional.silu forward_id: 492
/deeplink_afs/zhaoguochun/SmallModelOptimize/InternTrain/internlm/model/utils.py:667 Silu: return F.silu(w1_o) * w2_o
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr | value |
| torch.nn.functional.silu input(device) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0 | False | torch.strided | 20084340817920 | |
| torch.nn.functional.silu input(device) inplace | | | | | | | | | False |
| torch.nn.functional.silu input(cpu) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | cpu | False | torch.strided | 139832793747520 | |
| torch.nn.functional.silu input(cpu) inplace | | | | | | | | | False |
| torch.nn.functional.silu output(device) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | npu:0 | False | torch.strided | 20085414559744 | |
| torch.nn.functional.silu output(cpu) | (1, 16384, 8192) | (134217728, 8192, 1) | 134217728 | float32 | cpu | False | torch.strided | 139832256872512 | |
| name | target | action | config |
| torch.nn.functional.silu | input[0] | torch.bfloat16 -> torch.float32 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
| torch.nn.functional.silu | output[0] | torch.float32 -> torch.bfloat16 | torch.float16->torch.float32,torch.bfloat16->torch.float32 |
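with op_tools.OpOverflowCheck():
    x = torch.randn(3, 4, 5, dtype=torch.float32, device="cuda", requires_grad=True)
    y = torch.zeros_like(x)
    z = x / y  # division by zero produces inf
    z.backward(torch.ones_like(z))  # backward pass, matching the torch.ones_like and grad entries in the log below
    x = torch.full((3, 4, 5,), dtype=torch.float32, device="cuda", fill_value=3.402823466e38)
    y = x + x  # exceeds the float32 maximum
    z = x * x  # exceeds the float32 maximum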
torch.randn forward_id: 1
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:114 test_overflow4: x = torch.randn(3, 4, 5, dtype=torch.float32, device="cuda", requires_grad=True)
| name | value | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.randn inputs[0] | 3 | | | | | | | | |
| torch.randn inputs[1] | 4 | | | | | | | | |
| torch.randn inputs[2] | 5 | | | | | | | | |
| torch.randn inputs dtype | torch.float32 | | | | | | | | |
| torch.randn inputs device | npu | | | | | | | | |
| torch.randn inputs requires_grad | True | | | | | | | | |
| torch.randn outputs | | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | True | torch.strided | 20067179823104 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.randn input[0] | False | | | | | |
| torch.randn input[1] | False | | | | | |
| torch.randn input[2] | False | | | | | |
| torch.randn output[0] | False | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
GarbageCollectEvaluate: host_memory_usage: 1735 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
GarbageCollectEvaluate: after collect : rss: 1735 MB, current_rss: 1735 MB, max_diff: 1024 MB, device_memory_usage: 0 MB, current_device_memory_usage: 0 MB
apply OpOverflowCheckHook on torch.zeros_like
torch.zeros_like forward_id: 2
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:115 test_overflow4: y = torch.zeros_like(x)
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.zeros_like inputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | True | torch.strided | 20067179823104 |
| torch.zeros_like outputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179823616 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.zeros_like input[0] | False | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
| torch.zeros_like output[0] | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
GarbageCollectEvaluate: host_memory_usage: 1736 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.div
torch.Tensor.div forward_id: 3
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:116 test_overflow4: z = x / y
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.div inputs[0] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | True | torch.strided | 20067179823104 |
| torch.Tensor.div inputs[1] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179823616 |
| torch.Tensor.div outputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | True | torch.strided | 20067179824128 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.Tensor.div input[0] | False | -2.577465772628784 | 1.8474770784378052 | -0.0570245198905468 | 1.0004568099975586 | 7.697339057922363 |
| torch.Tensor.div input[1] | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| torch.Tensor.div output[0] | True | -inf | inf | nan | nan | inf |
GarbageCollectEvaluate: host_memory_usage: 1737 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.ones_like
torch.ones_like forward_id: 4
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:117 test_overflow4: z.backward(torch.ones_like(z))
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.ones_like inputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | True | torch.strided | 20067179824128 |
| torch.ones_like outputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179824640 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.ones_like input[0] | True | -inf | inf | nan | nan | inf |
| torch.ones_like output[0] | False | 1.0 | 1.0 | 1.0 | 0.0 | 7.745966911315918 |
GarbageCollectEvaluate: host_memory_usage: 1737 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
torch.Tensor.div forward_id: 3
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:116 test_overflow4: z = x / y
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr | value |
| torch.Tensor.div grad_output | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179824640 | |
| torch.Tensor.div grad_inputs[0] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825152 | |
| torch.Tensor.div grad_inputs[1] | | | | | | | | | None |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.Tensor.div grad_inputs[0] | True | inf | inf | inf | nan | inf |
| torch.Tensor.div grad_outputs[0] | False | 1.0 | 1.0 | 1.0 | 0.0 | 7.745966911315918 |
GarbageCollectEvaluate: host_memory_usage: 1738 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
skip OpOverflowCheckHook on torch.Tensor.backward
apply OpOverflowCheckHook on torch.full
torch.full forward_id: 5
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:118 test_overflow4: x = torch.full((3, 4, 5,), dtype=torch.float32, device="cuda", fill_value=3.402823466e38)
| name | value | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.full inputs [0][0] | 3 | | | | | | | | |
| torch.full inputs [0][1] | 4 | | | | | | | | |
| torch.full inputs [0][2] | 5 | | | | | | | | |
| torch.full inputs dtype | torch.float32 | | | | | | | | |
| torch.full inputs device | npu | | | | | | | | |
| torch.full inputs fill_value | 3.402823466e+38 | | | | | | | | |
| torch.full outputs | | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825664 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.full input[0] | False | | | | | |
| torch.full input[1] | False | | | | | |
| torch.full input[2] | False | | | | | |
| torch.full output[0] | False | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf | inf | inf |
GarbageCollectEvaluate: host_memory_usage: 1739 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.add
torch.Tensor.add forward_id: 6
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:119 test_overflow4: y = x + x
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.add inputs[0] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825664 |
| torch.Tensor.add inputs[1] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825664 |
| torch.Tensor.add outputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179826176 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.Tensor.add input[0] | False | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf | inf | inf |
| torch.Tensor.add input[1] | False | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf | inf | inf |
| torch.Tensor.add output[0] | True | inf | inf | inf | nan | inf |
GarbageCollectEvaluate: host_memory_usage: 1739 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
apply OpOverflowCheckHook on torch.Tensor.mul
torch.Tensor.mul forward_id: 7
/deeplink_afs/zhaoguochun/ditorch/op_tools/test/test_tool_with_special_op.py:120 test_overflow4: z = x * x
| name | device | dtype | numel | shape | stride | requires_grad | layout | data_ptr |
| torch.Tensor.mul inputs[0] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825664 |
| torch.Tensor.mul inputs[1] | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179825664 |
| torch.Tensor.mul outputs | npu:0 | float32 | 60 | (3, 4, 5) | (20, 5, 1) | False | torch.strided | 20067179826688 |
| name | inf_or_nan | min | max | mean | std | norm |
| torch.Tensor.mul input[0] | False | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf | inf | inf |
| torch.Tensor.mul input[1] | False | 3.4028234663852886e+38 | 3.4028234663852886e+38 | inf | inf | inf |
| torch.Tensor.mul output[0] | True | inf | inf | inf | nan | inf |
GarbageCollectEvaluate: host_memory_usage: 1740 MB, device_memory_usage: 0 MB, device_memory_reserved: 2 MB
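The inf_or_nan, mean and norm columns in the tables above can be reproduced in plain Python. The following is a minimal sketch and not ditorch's implementation: `summarize` is a hypothetical helper, and it operates on Python floats (float64) rather than float32 device tensors, so float32-only effects such as the `inf` mean reported for the `torch.full` output will not show up here.

```python
import math


def summarize(values):
    """Overflow statistics for a flat list of floats (hypothetical helper)."""
    inf_or_nan = any(math.isinf(v) or math.isnan(v) for v in values)
    mean = sum(values) / len(values)
    norm = math.sqrt(sum(v * v for v in values))  # L2 norm
    return {"inf_or_nan": inf_or_nan, "mean": mean, "norm": norm}


# 60 ones, like the torch.ones_like output in the log
ones = summarize([1.0] * 60)
# x / 0 gives +inf or -inf depending on the sign of x, like the torch.Tensor.div output
div = summarize([math.inf, -math.inf, math.inf])

print(ones)
print(div)
```

Mixing `+inf` and `-inf` makes the sum undefined, which is why the div output row reports `nan` for the mean while the norm is `inf`.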
def apply_feature(ops, feature, condition_func=lambda *args, **kwargs: True):
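The op_tools.apply_feature interface can be applied to torch interfaces as well as to other third-party interfaces. The condition_func parameter lets you define when the tool takes effect: the tool is active when condition_func returns True and inactive otherwise. condition_func receives the same arguments as the operator itself. The feature parameter selects which tool to apply; the following values are currently supported:
- fallback: operator fallback
- cast_dtype: operator data type conversion
- op_capture: operator argument capture
- autocompare: operator precision comparison (the device implementation and the CPU implementation must share the same calling interface)
- dump_op_args: print operator arguments
- measure_op_time: measure operator execution time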
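The gating behaviour of condition_func can be illustrated without torch. This is a minimal sketch under stated assumptions, not how ditorch is implemented: `apply_hook_if`, `events` and `only_two_lists` are hypothetical names, and the "hook" only records whether it would have been applied.

```python
import functools

events = []  # records ("apply" | "skip", function name) for each call


def apply_hook_if(condition_func=lambda *args, **kwargs: True):
    """Wrap a function so a hook fires only when condition_func(*args, **kwargs) is True."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # condition_func sees exactly the arguments passed to the wrapped function
            if condition_func(*args, **kwargs):
                events.append(("apply", func.__name__))
            else:
                events.append(("skip", func.__name__))
            return func(*args, **kwargs)
        return wrapper
    return decorator


def only_two_lists(a, b):
    return isinstance(a, list) and isinstance(b, list)


@apply_hook_if(condition_func=only_two_lists)
def add(a, b):
    return a + b


print(add([1], [2]))  # condition holds, hook applied
print(add(1, 2))      # condition fails, hook skipped
print(events)
```

The wrapped function always runs; the condition only decides whether the tool is attached to that particular call.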
import torch
import ditorch
import op_tools
import os
def custom_condition(a, b):
    if a.dtype == torch.float16:
        print("hook enable because a.dtype is float16")
        return True
    elif a.dim() == 2:
        print("hook enable because a.dim() is 2")
        return True
    print("hook disable")
    return False
x = torch.randn(2, 3, 4, dtype=torch.float16).cuda()
y = torch.randn(4, 2, dtype=torch.float).cuda()
z = torch.randn(2, 3, 4, dtype=torch.float).cuda()
op_tools.apply_feature("torch.add", feature="fallback", condition_func=custom_condition)
torch.add(x, x)
hook enable because a.dtype is float16
apply OpFallbackHook on torch.add
torch.add forward_id: 1
<stdin>:1 <module>:
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.add input(device)[0] | (2, 3, 4) | (12, 4, 1) | 24 | float16 | npu:0 | False | torch.strided | 20067179823104 |
| torch.add input(device)[1] | (2, 3, 4) | (12, 4, 1) | 24 | float16 | npu:0 | False | torch.strided | 20067179823104 |
| torch.add input(cpu)[0] | (2, 3, 4) | (12, 4, 1) | 24 | float16 | cpu | False | torch.strided | 541245376 |
| torch.add input(cpu)[1] | (2, 3, 4) | (12, 4, 1) | 24 | float16 | cpu | False | torch.strided | 541245504 |
| torch.add output(device) | (2, 3, 4) | (12, 4, 1) | 24 | float16 | npu:0 | False | torch.strided | 20067179824640 |
| torch.add output(cpu) | (2, 3, 4) | (12, 4, 1) | 24 | float16 | cpu | False | torch.strided | 541249984 |
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[[ 0.8032, 1.8779, -0.6846, 1.5342],
[-3.9688, 4.1055, 0.8447, 0.6836],
[ 1.1914, -2.4746, 4.8086, -0.3574]],
[[-2.1758, 2.1816, 1.1768, -1.0342],
[-0.0070, 1.8252, 1.7373, 3.2109],
[ 1.0361, -1.8564, 4.2070, 0.6558]]], device='npu:0',
op_tools.apply_feature("torch.sub", feature="autocompare", condition_func=custom_condition)
torch.sub(y, y)
torch.sub(z, z)
hook enable because a.dim() is 2
apply OpAutoCompareHook on torch.sub
torch.sub forward_id: 2
<stdin>:1 <module>:
| name | shape | stride | numel | dtype | device | requires_grad | layout | data_ptr |
| torch.sub inputs[0] | (4, 2) | (2, 1) | 8 | float32 | npu:0 | False | torch.strided | 20067179823616 |
| torch.sub inputs[1] | (4, 2) | (2, 1) | 8 | float32 | npu:0 | False | torch.strided | 20067179823616 |
| torch.sub outputs | (4, 2) | (2, 1) | 8 | float32 | npu:0 | False | torch.strided | 20067179825152 |
| torch.sub inputs(cpu)[0] | (4, 2) | (2, 1) | 8 | float32 | cpu | False | torch.strided | 555367360 |
| torch.sub inputs(cpu)[1] | (4, 2) | (2, 1) | 8 | float32 | cpu | False | torch.strided | 554988416 |
| torch.sub outputs(cpu) | (4, 2) | (2, 1) | 8 | float32 | cpu | False | torch.strided | 555613184 |
| name | allclose | max_abs_diff | max_relative_diff | atol | rtol | error_info |
| torch.sub input[0] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | |
| torch.sub input[1] | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | |
| torch.sub output | True | 0.000000000 | 0.000000000 | 0.000010000 | 0.000010000 | |
hook disable
skip OpAutoCompareHook on torch.sub
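The allclose, max_abs_diff and max_relative_diff columns of the comparison table above can be approximated as follows. This is a sketch under assumptions: `compare` is a hypothetical helper, the closeness test uses the same criterion as `torch.allclose` (`|a - b| <= atol + rtol * |b|`), and the relative difference formula (with a small `eps` guarding against division by zero) is a guess, since the source does not say how ditorch computes it.

```python
def compare(device_out, cpu_out, atol=1e-5, rtol=1e-5, eps=1e-12):
    """Compare two flat lists of floats, treating cpu_out as the reference."""
    abs_diffs = [abs(a - b) for a, b in zip(device_out, cpu_out)]
    # Assumed definition of relative difference; eps avoids dividing by zero.
    rel_diffs = [d / (abs(b) + eps) for d, b in zip(abs_diffs, cpu_out)]
    allclose = all(d <= atol + rtol * abs(b) for d, b in zip(abs_diffs, cpu_out))
    return {
        "allclose": allclose,
        "max_abs_diff": max(abs_diffs),
        "max_relative_diff": max(rel_diffs),
        "atol": atol,
        "rtol": rtol,
    }


same = compare([0.0, 0.0], [0.0, 0.0])   # like torch.sub(y, y) on both devices
off = compare([1.0, 2.001], [1.0, 2.0])  # second element is outside the tolerance

print(same)
print(off)
```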
op_tools.apply_feature("torch.mul", feature="op_capture", condition_func=custom_condition)
torch.mul(x, x)
hook enable because a.dtype is float16
apply OpCaptureHook on torch.mul
op_capture_result/torch.mul/366650/3/input.pth saved
op_capture_result/torch.mul/366650/3/output.pth saved
op_tools.apply_feature("torch.div", feature="cast_dtype", condition_func=custom_condition)
os.environ["OP_DTYPE_CAST_DICT"] = "torch.float32->torch.float16"
torch.div(y, y)
hook enable because a.dim() is 2
| name | target | action | config |
| torch.div | input[0] | torch.float32 -> torch.float16 | torch.float32->torch.float16 |
| torch.div | input[1] | torch.float32 -> torch.float16 | torch.float32->torch.float16 |
| torch.div | output[0] | torch.float16 -> torch.float32 | torch.float32->torch.float16 |
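The relation between the OP_DTYPE_CAST_DICT string and the action column above can be sketched in plain Python. Both helpers are hypothetical and work on dtype names as strings; the rule that the output is cast back to the original input dtype is inferred from the log above, not taken from ditorch's source.

```python
def parse_dtype_cast_dict(spec):
    """Parse 'src->dst,src->dst' into a {src: dst} mapping."""
    mapping = {}
    for pair in spec.split(","):
        src, dst = pair.split("->")
        mapping[src.strip()] = dst.strip()
    return mapping


def plan_casts(op_name, input_dtypes, mapping):
    """Return (name, target, action) rows like those printed in the log."""
    rows = []
    for i, dtype in enumerate(input_dtypes):
        if dtype in mapping:
            rows.append((op_name, f"input[{i}]", f"{dtype} -> {mapping[dtype]}"))
    # Assumption: the result is cast back to the dtype of the first converted input.
    converted = [d for d in input_dtypes if d in mapping]
    if converted:
        src = converted[0]
        rows.append((op_name, "output[0]", f"{mapping[src]} -> {src}"))
    return rows


mapping = parse_dtype_cast_dict("torch.float32->torch.float16")
rows = plan_casts("torch.div", ["torch.float32", "torch.float32"], mapping)
for row in rows:
    print(" | ".join(row))
```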
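Tool | Environment variable | Value | Description | Notes |
Operator argument capture tool | OP_CAPTURE_DISABLE_LIST | torch.add,torch.nn.functional.linear,torch.Tensor.relu_ | Do not capture the arguments of these operators | Full operator names; separate multiple operators with commas |
Operator argument capture tool | OP_CAPTURE_LIST | Same as above | Capture the arguments of these operators only | Same as above |
Precision analysis tool | OP_AUTOCOMPARE_LIST | Same as above | Run the precision comparison on the specified operators only | Same as above |
Precision analysis tool | OP_AUTOCOMPARE_DISABLE_LIST | Same as above | Skip the specified operators during precision comparison | Same as above |
Operator data type conversion tool | OP_DTYPE_CAST_DISABLE_LIST | Same as above | Skip the specified operators during type conversion | Same as above |
Operator data type conversion tool | OP_DTYPE_CAST_LIST | Same as above | Apply type conversion to the specified operators only | Same as above |
Precision analysis tool | AUTOCOMPARE_ERROR_TOLERANCE | atol,rtol | allclose parameters | If set, the given error tolerances override the defaults |
Precision analysis tool | AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16 | atol,rtol | allclose parameters | If set and the data type matches, the given error tolerances are used |
Precision analysis tool | AUTOCOMPARE_ERROR_TOLERANCE_BFLOAT16 | atol,rtol | allclose parameters | Same as above |
Precision analysis tool | AUTOCOMPARE_ERROR_TOLERANCE_FLOAT32 | atol,rtol | allclose parameters | Same as above |
Precision analysis tool | AUTOCOMPARE_ERROR_TOLERANCE_FLOAT64 | atol,rtol | allclose parameters | Same as above |
Precision analysis tool | LINEAR_AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16 | atol,rtol | allclose parameters | If set and both the operator name and the data type match, the given error tolerances are used. The operator-name prefix is the part of the full operator name to the right of the last '.', in upper case and followed by an underscore: ADD_ for torch.add, LINEAR_ for torch.nn.functional.linear |
Operator data type conversion tool | OP_DTYPE_CAST_DICT | torch.float16->torch.float32,torch.bfloat16->torch.float32 | The source data types to convert and their target data types | Separate multiple pairs with commas |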
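The tolerance variables in the table follow a naming pattern that can be derived from the operator name and the data type. The sketch below shows one way to build the candidate names and read an `atol,rtol` pair. The helper names are hypothetical, and the lookup order (operator-specific, then dtype-specific, then generic) is an assumption; the table does not state which variable wins when several are set.

```python
import os

BASE = "AUTOCOMPARE_ERROR_TOLERANCE"


def tolerance_env_candidates(op_name, dtype):
    """Candidate variable names, most specific first (assumed precedence)."""
    op_prefix = op_name.rsplit(".", 1)[-1].upper()   # torch.nn.functional.linear -> LINEAR
    dtype_suffix = dtype.rsplit(".", 1)[-1].upper()  # torch.float16 -> FLOAT16
    return [
        f"{op_prefix}_{BASE}_{dtype_suffix}",
        f"{BASE}_{dtype_suffix}",
        BASE,
    ]


def lookup_tolerance(op_name, dtype, default, env=None):
    """Return (atol, rtol) from the first candidate variable that is set, else default."""
    env = os.environ if env is None else env
    for name in tolerance_env_candidates(op_name, dtype):
        if name in env:
            atol, rtol = (float(v) for v in env[name].split(","))
            return atol, rtol
    return default


names = tolerance_env_candidates("torch.nn.functional.linear", "torch.float16")
print(names)

# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"AUTOCOMPARE_ERROR_TOLERANCE_FLOAT16": "1e-3,1e-3"}
print(lookup_tolerance("torch.add", "torch.float16", default=(1e-5, 1e-5), env=fake_env))
```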