This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Error in sst2 mission with BERT #691

Closed
chaojieji opened this issue May 3, 2019 · 6 comments

Comments

chaojieji commented May 3, 2019

Hi,
Thanks for sharing this.
When I run the command below in a terminal, it crashes with an error.
The command is "python finetune_classifier.py --task_name SST --epochs 4 --batch_size 16 --accumulate 1 --optimizer bertadam --lr 2e-5 --log_interval 500".
The error is "Segmentation fault: 11
Stack trace:
[bt] (0) /home/bit0427/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2981500) [0x7f16104ba500]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f16478cff20]
[bt] (2) /home/bit0427/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x285a260) [0x7f1610393260]
[bt] (3) /home/bit0427/anaconda3/bin/../lib/libgomp.so.1(+0x11bef) [0x7f1642d50bef]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f1647c896db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f16479b288f]
"
My Python is 3.6.8, MXNet is 1.5, and gluon is the most recent version.
Thanks for your support.
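
A segmentation fault inside libmxnet.so reports only native frames, as in the trace above. As a minimal sketch, Python's standard `faulthandler` module can also dump the Python-level traceback when the process crashes, which helps show which call triggered the fault. The helper name `enable_crash_trace` and the log file name are hypothetical, and placing this at the top of finetune_classifier.py is an assumption for illustration, not something the script does itself.

```python
import faulthandler


def enable_crash_trace(path="segfault_trace.log"):
    """Write Python tracebacks of all threads to `path` on a fatal signal.

    Hypothetical helper: the returned file object must stay open for as
    long as the handler is enabled.
    """
    log_file = open(path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the tracebacks.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_trace()
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be turned on without editing the script by starting the interpreter with `python -X faulthandler`, which writes the traceback to stderr.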

szha (Member) commented May 3, 2019

@GeorgieJi thank you for reporting the issue. Since it's an error in the mxnet library, I think this issue can best be handled there. Would you mind following this issue template and reporting the error there? I will help you there.

eric-haibin-lin (Member) commented May 3, 2019

We found a similar regression in apache/mxnet#14872, where an invalid index access was recently introduced into MXNet. Would you mind installing an MXNet build from early April? For example:
pip uninstall mxnet-cu90 -y; pip install mxnet-cu90==1.5.0b20190412 --user

BTW the issue will be fixed in apache/mxnet#14873
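
To check whether an installed pre-release build predates a regression like this one, the build date can be read from the version string. The sketch below assumes the version follows the `1.5.0bYYYYMMDD` pattern used in the pip command above. The function names and the cutoff date are hypothetical; the thread does not state the exact date on which the regression was introduced.

```python
import re
from datetime import date, datetime
from typing import Optional


def nightly_build_date(version: str) -> Optional[date]:
    """Return the build date encoded in a version such as '1.5.0b20190412'."""
    match = re.fullmatch(r"\d+\.\d+\.\d+b(\d{8})", version)
    if match is None:
        return None  # not a dated pre-release build
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def is_build_on_or_before(version: str, cutoff: date) -> bool:
    """True if the version encodes a build date no later than `cutoff`."""
    built = nightly_build_date(version)
    return built is not None and built <= cutoff


if __name__ == "__main__":
    # In practice the string would come from `mxnet.__version__`.
    installed = "1.5.0b20190412"
    assumed_cutoff = date(2019, 4, 30)  # hypothetical cutoff, not from the thread
    print(installed, "is an early build:", is_build_on_or_before(installed, assumed_cutoff))
```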

chaojieji (Author) commented

@eric-haibin-lin Thanks for your suggestion. I downgraded MXNet to 1.5.0b20190412, and it now seems to be working well. I have another question, though: is the training process limited to a single CPU? I would like all CPUs to run the task in parallel.
The log message printed in the terminal is "Now we are doing BERT classification training on cpu(0)!".
@szha Thanks for your great support. Since I wanted a quick solution and have no constraint on the MXNet version, I chose to downgrade MXNet, and it works now. :)
Thanks for all your consideration.

szha (Member) commented May 4, 2019

@GeorgieJi sounds good. We will note the anecdote here to make sure the bug can be fixed and verified in mxnet.

szha closed this as completed May 4, 2019

eric-haibin-lin (Member) commented

cpu(0) uses all available cores on the CPU :)

chaojieji (Author) commented

@eric-haibin-lin Hmm, that's interesting. I use htop to monitor CPU usage on Ubuntu 18.04. When gluon with BERT is running, only some of the cores are busy.
[image]
PS: When I run google-research/bert, all of the cores are fully loaded.
I'm not sure whether this is normal.
Thanks.
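
One way to narrow this down is to look at the settings that usually bound CPU parallelism before the training script starts. The sketch below only reads values and makes no claim about the cause here. `OMP_NUM_THREADS` is the standard OpenMP variable, and `MXNET_CPU_WORKER_NTHREADS` is listed in MXNet's environment variable documentation. The helper name `report_cpu_threading` is hypothetical, and whether either variable explains the htop picture above is an assumption to verify.

```python
import os


def report_cpu_threading() -> dict:
    """Collect settings that commonly limit how many cores a process uses."""
    info = {
        "logical_cpus": os.cpu_count(),
        # None means the variable is unset and the library default applies.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MXNET_CPU_WORKER_NTHREADS": os.environ.get("MXNET_CPU_WORKER_NTHREADS"),
    }
    # On Linux, the affinity mask shows which CPUs this process may run on.
    if hasattr(os, "sched_getaffinity"):
        info["allowed_cpus"] = len(os.sched_getaffinity(0))
    return info


if __name__ == "__main__":
    for key, value in report_cpu_threading().items():
        print(f"{key}: {value}")
```

If a thread-count variable needs changing, it typically has to be set in the environment before the library is loaded, for example on the command line that launches the script.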
