Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test_conv2d] Memory access fault #142

Closed
shurale-nkn opened this issue Apr 8, 2020 · 18 comments
Closed

[test_conv2d] Memory access fault #142

shurale-nkn opened this issue Apr 8, 2020 · 18 comments

Comments

@shurale-nkn
Copy link
Contributor

shurale-nkn commented Apr 8, 2020

Error in test_conv2d from MIOpenConvBwdWrWS2.cl

System:
Vega 20
HIP
ROCm 3.0.6
checked commits: ce986b2 , 3a716fe, fcd5563, 481d6b9, cea6064, a02171c

For reproduce:
Reproduceable in combination of BWD + WRW only. The error is present in all combinations, but only in this one it can be seen as "Memory access fault".

./bin/test_conv2d --float --cmode conv  --group-count 1 --input 1, 1, 8, 8 --weights 1, 1, 2, 2 --pads_strides_dilations 1 1 1 1 1 1 --pmode same --disable-forward
MIOpen(HIP): Info [Handle] stream: 0x2352180, device_id: 0
MIOpen(HIP): Info [BackwardDataGetWorkSpaceSize] 
MIOpen(HIP): Info [HipGetHccVersionImpl] HCC base: 3.0.19493
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv2_COv3, MIOpen version 2.4.0.7759-cea60643c
MIOpen(HIP): Info [ForwardGetWorkSpaceSize] 
MIOpen(HIP): Info [BackwardWeightsGetWorkSpaceSize] 
MIOpen(HIP): Info [FindSolutionImpl] ConvBinWinogradRxSf2x3 (db access disabled)
MIOpen(HIP): Info [GetPerformanceConfig] 1
MIOpen(HIP): Info [BackwardDataGetWorkSpaceSize] 
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] requestAlgoCount = 1, workspace = 1024
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info [FindSolutionImpl] ConvBinWinogradRxSf2x3
MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvBinWinogradRxSf2x3
MIOpen(HIP): Info [GetPerformanceConfig] 1
MIOpen(HIP): Info [EvaluateInvokers] ConvBinWinogradRxSf2x3: miopenSp3AsmConv_group_20_5_23_M_stride1: 0.02224 < 3.40282e+38
MIOpen(HIP): Info [EvaluateInvokers] Selected: ConvBinWinogradRxSf2x3: miopenSp3AsmConv_group_20_5_23_M_stride1: 0.02224, workspce_sz = 0
MIOpen(HIP): Info [SetValues] 1-8-8-2x2-1-8-8-1-0x0-1x1-1x1-0-NCHW-FP32-B, content inserted: miopenConvolutionBwdDataAlgoWinograd:ConvBinWinogradRxSf2x3,0.02224,0,miopenConvolutionBwdDataAlgoWinograd,<unused>
MIOpen(HIP): Info [SetValues] 1-8-8-2x2-1-8-8-1-0x0-1x1-1x1-0-NCHW-FP32-B, content inserted: miopenConvolutionBwdDataAlgoGEMM:gemm,1851.86,1024,rocBlas,<unused>
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] miopenConvolutionBwdDataAlgoWinograd       0.02224 0
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] miopenConvolutionBwdDataAlgoGEMM   1851.86 1024
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] BWD Chosen Algorithm: ConvBinWinogradRxSf2x3 , 0, 0.02224
MIOpen(HIP): Info [GetBackwardSolutions] 
MIOpen(HIP): Info [Measure] Db::Prefetch time: 47.3772 ms
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 3, workspace = 1024
MIOpen(HIP): Info [BackwardWeightsGetWorkSpaceSize] 
MIOpen(HIP): Info [FindSolutionImpl] ConvBinWinogradRxSf2x3 (db access disabled)
MIOpen(HIP): Info [GetPerformanceConfig] 1
MIOpen(HIP): Info [FindConvBwdWeightsAlgorithm] requestAlgoCount = 1, workspace = 1024
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info [SetValues] 1-8-8-2x2-1-8-8-1-0x0-1x1-1x1-0-NCHW-FP32-W, content inserted: miopenConvolutionBwdWeightsAlgoGEMM:gemm,0.522397,1024,rocBlas,<unused>
MIOpen(HIP): Info [FindSolutionImpl] ConvOclBwdWrW2<1>
MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvOclBwdWrW2<1>
Memory access fault by GPU node-2 (Agent handle: 0x1487eb0) on address 0x7fabb8400000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
@daniellowell
Copy link
Contributor

@shurale-nkn Please delete your local user performance database and retest.

@atamazov
Copy link
Contributor

The problem related to a02171c is probably fixed. However I think that the initial issue still persists. I mean this one:

...
MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvOclBwdWrW2<1>
Memory access fault by GPU node-2 (Agent handle: 0x1487eb0) on address 0x7fabb8400000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

When there is no perfdb record (or the record is invalid), the Solver should provide valid default perf-config

@TejashShah

This comment has been minimized.

@atamazov
Copy link
Contributor

This shouldn't be released.

Also we see memory access faults on Jenkins from time to time, hardly reproducible. This corresponds the random nature of this issue, so it is the "suspect zero".

@atamazov
Copy link
Contributor

The reason of the issue is that OpenCL kernel writes to wrong memory locations. @TejashShah do you have time for this? If not, then we'll introduce a workaround that would disable this solver.

@aserio
Copy link
Contributor

aserio commented May 8, 2020

@TejashShah, @shurale-nkn, and @atamazov, is this issue still at large? Do we have a plan to address it?

@atamazov
Copy link
Contributor

atamazov commented May 8, 2020

Further TODOs:

  • Disable asymmetric padding for all OCL kernels
  • Disable asymmetric padding for asm kernels that do not support it

Originally posted by @atamazov in #179 (comment)

@Kirpich30000

Single and multi pass Winograd works with any padding. Other kernels were designed for symmetrical padding.

@aserio
Copy link
Contributor

aserio commented May 15, 2020

@atamazov do we have a plan to implement these to dos? (or did you mean to add the "priority_medium" ticket :D )

@atamazov
Copy link
Contributor

The plan is to resolve this ticket, see my previous comment.

@atamazov
Copy link
Contributor

@aserio This priority_high is correct here, because of possibility of undefined behavior (UB).

@aserio
Copy link
Contributor

aserio commented May 29, 2020

@atamazov and @TejashShah, have you been able to disable asymmetric padding for all OCL kernels and disable asymmetric padding for asm kernels that do not support it? Who is responsible for creating this pull request?

@atamazov
Copy link
Contributor

No (no time). Me.

@aserio
Copy link
Contributor

aserio commented Jun 30, 2020

@TejashShah will look into this issue.
This is still a high-priority issue as it can lead to UB

@ROCm ROCm deleted a comment from aserio Jun 30, 2020
@ROCm ROCm deleted a comment from aserio Jun 30, 2020
@ROCm ROCm deleted a comment from aserio Jun 30, 2020
@TejashShah
Copy link
Contributor

@atamazov I spent some time to understand this issue.

You are right to point out that we should disable asymmetric padding across kernels. Even if, supposedly, one writes kernel that supports asymmetric padding, how does one (frameworks) describe the problem with asymmetric padding? Our current problem description seems to assume symmetric padding with the following variables defined in ProblemDescription

int pad_h = 0;
int pad_w = 0;
int pad_d = 0;

There doesn't exist any pad_h_top or pad_h_bottom or pad_w_left or pad_w_right to convey asymmetric padding. So, I believe, in general, we pretty much, disable asymmetric padding across the board. I attempted the following configs which I thought to be asymmetrical cases. None of them attempted solver except gemm. So, at present, there is some sort of guard preventing running solvers in asymmetrical cases.


MIOpenDriver conv -V 1 -F 1 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 1 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 1 -c 1 -H 8 -W 8 -k 1 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 1 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 1 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 64 -c 1024 -H 28 -W 28 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 1 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 1 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:957: Forward Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 2 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:2651: Backward Data Convolution cannot be executed due to incorrect params
MIOpenDriver conv -V 1 -F 4 -n 64 -c 1024 -H 14 -W 14 -k 1024 -y 2 -x 2 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -t 1 -i 1
MIOpen Error: /PublicMIOpen/MIOpen/src/ocl/convolutionocl.cpp:4029: Backward Weights Convolution cannot be executed due to incorrect params

Do you have specific configs that you find problematic in MIOpen due to missing asymmetric padding check?

@TejashShah
Copy link
Contributor

@atamazov Essentially, I am trying to find failed (segfault/verification fail) asymmetrical configs which are currently exercised by MIOpen which would later be un-exercised once I have changed applicability to explicitly disable asymmetrical cases.

@atamazov
Copy link
Contributor

@TejashShah

@atamazov What about igemm solvers?

@asroy This is question for you. Do these support paddimg mode = same && even filters? Such configs hav asymmetric padding.

@daniellowell
Copy link
Contributor

Let's open a ticket for asymmetrical padding support.

@atamazov
Copy link
Contributor

#341 merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants