This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI] illegal memory access #15925

Open
ChaiBapchya opened this issue Aug 16, 2019 · 13 comments

@ChaiBapchya
Contributor

ChaiBapchya commented Aug 16, 2019

Multiple GPU tests fail due to an illegal memory access:

Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

PR - #15736
Pipeline - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15736/9/pipeline

Excerpt:

test_operator_gpu.test_np_flatten ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1294594193 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:52:29] src/operator/tensor/./.././../common/../operator/mxnet_op.h:845: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 880:     6 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
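
For anyone triaging a log like the excerpt above, here is a minimal sketch that pulls the `MXNET_TEST_SEED` values out of the nosetests output and turns them into rerun commands. The helper names `extract_seeds` and `reproduce_command` are hypothetical, and the command layout (plain `nosetests` plus a `tests/python/gpu/<module>.py:<test>` path) is an assumption based on the test directory shown in the log, not the exact CI invocation. Note that `test_np_linspace` prints no seed before the abort, so only `test_np_flatten` is picked up.

```python
import re

# Matches lines such as:
#   test_operator_gpu.test_np_flatten ... [INFO] ... use MXNET_TEST_SEED=1294594193 to reproduce.
SEED_LINE = re.compile(r"^(?P<test>[\w.]+) \.\.\. .*?MXNET_TEST_SEED=(?P<seed>\d+)")


def extract_seeds(log_text):
    """Return {test_id: seed} for every test line that reports a seed."""
    seeds = {}
    for line in log_text.splitlines():
        match = SEED_LINE.match(line.strip())
        if match:
            seeds[match.group("test")] = int(match.group("seed"))
    return seeds


def reproduce_command(test_id, seed):
    """Build a rerun command for one test (command layout is an assumption)."""
    module, _, func = test_id.rpartition(".")
    return (f"MXNET_TEST_SEED={seed} nosetests --verbose "
            f"tests/python/gpu/{module}.py:{func}")


# Log lines copied from the excerpt in this issue.
LOG = """\
test_operator_gpu.test_np_flatten ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1294594193 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:52:29] src/operator/tensor/./.././../common/../operator/mxnet_op.h:845: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
"""

for test_id, seed in extract_seeds(LOG).items():
    print(reproduce_command(test_id, seed))
```

Running the printed command requires a GPU build of MXNet; the parsing itself has no dependencies beyond the standard library.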
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug, CI

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [CI, Bug]

@ChaiBapchya
Contributor Author

@aaronmarkham
Contributor

@wkcn
Member

wkcn commented Dec 7, 2019

```
reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=976443772 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_slogdet ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_svd ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:08:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:1113: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 1114: 146 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS
```

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/1359/pipeline
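
One caveat when reading logs like this: CUDA reports kernel faults asynchronously, so the test printed next to the error is not necessarily the one that launched the faulting kernel. A possible way to narrow it down is to rerun a suspect test with synchronous execution. The sketch below only builds the environment and argument list; `CUDA_LAUNCH_BLOCKING` and `MXNET_ENGINE_TYPE=NaiveEngine` are existing CUDA and MXNet settings, while the helper names, the plain `nosetests` executable, and the test file path are assumptions for illustration.

```python
import os


def synchronous_debug_env(seed, base_env=None):
    """Copy an environment and add settings that make GPU work run synchronously."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_TEST_SEED"] = str(seed)        # seed printed in the CI log
    env["CUDA_LAUNCH_BLOCKING"] = "1"         # CUDA: make kernel launches synchronous
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"  # MXNet: execute operators synchronously
    return env


def nosetests_argv(test_id, test_dir="tests/python/gpu"):
    """Build the argument list for one test (executable name and path are assumptions)."""
    module, _, func = test_id.rpartition(".")
    return ["nosetests", "--verbose", f"{test_dir}/{module}.py:{func}"]


# Seed and test name taken from the log above; base_env={} keeps the demo output small.
env = synchronous_debug_env(1913742322, base_env={})
argv = nosetests_argv("test_operator_gpu.test_np_linalg_svd")
print(env)
print(argv)
```

On a machine with a GPU build of MXNet, the pair could be passed to `subprocess.run(argv, env=synchronous_debug_env(seed))`. Synchronous execution is much slower, so it is better suited to a single test than to the full suite.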

@leezu
Contributor

leezu commented Feb 27, 2020

Could the CI issue be related to #17713? That one can be reproduced deterministically on a G4 instance.

@ChaiBapchya
Contributor Author

A G4 instance with CUDA 10.0, that is?

@leezu
Contributor

leezu commented Feb 28, 2020

Yes

@leezu
Contributor

leezu commented Apr 30, 2020

@ChaiBapchya does this issue still occur on the dev environment with the updated AMI (in particular, with updated drivers)?

Given that the issue in #17713 was due to a bug in CUDA, it appears possible that this issue is due to a bug in the driver.

CC @zhreshold

@ChaiBapchya
Contributor Author

It hasn't occurred so far (15 tests on commits merged into master for the unix-gpu pipeline).
I'll keep monitoring and report back.
