
Why does a kernel like CUDAFunctor_add appear when testing MixedFusedRMSNorm? #1753

Open
HangJie720 opened this issue Nov 26, 2023 · 0 comments
Labels
bug Something isn't working

Comments


HangJie720 commented Nov 26, 2023

When I execute the following Python code, I don't understand why a kernel like CUDAFunctor_add appears.

```python
import torch
from apex.normalization import MixedFusedRMSNorm

datatype = torch.bfloat16
input = torch.randn([1024, 8192], dtype=datatype).cuda()
input1 = torch.rand_like(input)
input1.requires_grad = True

input1_nofuse = input1.detach().requires_grad_(True)

grad = torch.rand_like(input)

for i in range(10):
    norm = MixedFusedRMSNorm(input.size()[1:]).to(device='cuda', dtype=datatype)
    input3 = input1_nofuse
    output_ = norm(input3)
    output_ = output_.to('cuda:0')
    output_.backward(gradient=grad)
```
Running it under the nvprof tool shows the following kernels:

```
==14326== NVPROF is profiling process 14326, command: python3 test_rmsnorm.py
==14326== Profiling application: python3 test_rmsnorm.py
==14326== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  76.46%  12.271ms     11  1.1155ms  3.3270us  12.237ms  [CUDA memcpy HtoD]
                   9.53%  1.5299ms     10  152.99us  151.65us  154.56us  void cuComputeGradInput<c10::BFloat16, float, c10::BFloat16>(c10::BFloat16 const *, c10::BFloat16 const *, int, int, float const *, float const *, float, c10::BFloat16 const *, c10::BFloat16*, bool)
                   6.61%  1.0608ms     10  106.08us  104.38us  108.22us  void cuApplyRMSNorm<c10::BFloat16, float, c10::BFloat16>(c10::BFloat16*, float*, c10::BFloat16 const *, int, int, float, c10::BFloat16 const *)
                   3.40%  545.94us     10  54.594us  53.439us  56.767us  void cuComputePartGradGammaBeta<c10::BFloat16, float, c10::BFloat16>(c10::BFloat16 const *, c10::BFloat16 const *, int, int, float const *, float const *, float, float*, float*, bool)
                   3.03%  486.97us      9  54.107us  53.248us  55.103us  void at::native::vectorized_elementwise_kernel<int=4, at::native::CUDAFunctor_add<c10::BFloat16>, at::detail::Array<char*, int=3>>(int, c10::BFloat16, at::native::CUDAFunctor_add<c10::BFloat16>)
                   0.59%  95.392us      2  47.696us  47.040us  48.352us  _ZN2at6native55_GLOBAL__N__722798bb_22_DistributionUniform_cu_f2fea07d43distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda21uniform_and_transformIN3c108BFloat16EfLm4EPNS_17CUDAGeneratorImplEZZZNS4_14uniform_kernelIS9_EEvRNS_18TensorIteratorBaseEddT_ENKUlvE_clEvENKUlvE2_clEvEUlfE_EEvSC_T2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SL_SG_EEvSC_SH_RKSI_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SH_
                   0.37%  59.137us     10  5.9130us  5.6000us  6.5600us  void cuComputeGradGammaBeta<float, c10::BFloat16>(float const *, float const *, int, int, int, c10::BFloat16*, c10::BFloat16*, bool)
      API calls:  97.73%  1.24793s      1  1.24793s  1.24793s  1.24793s  cudaDeviceGetStreamPriorityRange
```
There is a CUDAFunctor_add kernel in the profile, and I want to ask why it appears.
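
Not part of the original report, but one way to narrow this down: a minimal sketch that reruns the same loop under torch.profiler with Python stack capture, so each CUDA kernel launch can be attributed to the Python line that triggered it. One possibility worth checking (an assumption, not confirmed here) is that the add kernel comes from autograd accumulating gradients into `input1_nofuse.grad` across the 10 loop iterations; the profile shows the kernel launched 9 times for 10 `backward()` calls, which is consistent with the first backward assigning `.grad` and the next nine adding into it, and the captured stacks would confirm or rule that out.

```python
# Hypothetical profiling sketch (not from the original report): attribute each
# CUDA kernel to the Python line that launched it using torch.profiler.
import torch
from torch.profiler import profile, ProfilerActivity
from apex.normalization import MixedFusedRMSNorm

datatype = torch.bfloat16
input = torch.randn([1024, 8192], dtype=datatype, device='cuda')
input1_nofuse = torch.rand_like(input).requires_grad_(True)
grad = torch.rand_like(input)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_stack=True) as prof:
    for i in range(10):
        norm = MixedFusedRMSNorm(input.size()[1:]).to(device='cuda', dtype=datatype)
        output_ = norm(input1_nofuse)
        # If the add kernel is gradient accumulation, it should appear under
        # this backward() call from the second iteration onward, since
        # input1_nofuse.grad is assigned on the first backward and added to
        # on the following ones (assumption to verify via the stack traces).
        output_.backward(gradient=grad)

# Group by captured stack so any CUDAFunctor_add entry points at its source line.
print(prof.key_averages(group_by_stack_n=5).table(sort_by="cuda_time_total", row_limit=20))
```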

HangJie720 added the bug (Something isn't working) label Nov 26, 2023