core dump in running tensorflow benchmark #18

myotheone · 2019-06-28T06:07:23Z

I installed BytePS successfully using "python setup.py install".
When I run the benchmark, BytePS crashes with a core dump.

core backtrace:

env:
1. TensorFlow version: 1.14
2. CUDA version: 9.0
3. NCCL version: 2.4.7 for CUDA 9.0
4. OS: Ubuntu 16.04
5. g++: 5.4.0

script: copied from step_by_step_tutorial.md
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/:$LD_LIBRARY_PATH
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=x.x.x.x
export DMLC_PS_ROOT_PORT=9999
export EVAL_TYPE=benchmark

python /home/mark/mark/code/byteps/launcher/launch.py \
/home/mark/mark/code/byteps/example/tensorflow/run_tensorflow_byteps.sh \
--model ResNet50 --num-iters 1000
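
Before launching, it can help to confirm that the variables exported above are actually visible to the Python process. The following is only a minimal sketch: `EXPORTED_VARS` and `missing_vars` are hypothetical names, and the list simply mirrors the DMLC_* exports in the script above rather than any authoritative list from BytePS.

```python
import os

# Assumption: these are just the DMLC_* names exported in the script above,
# not an official list of what the BytePS launcher requires.
EXPORTED_VARS = (
    "DMLC_WORKER_ID",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_NUM_SERVER",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
)


def missing_vars(env=None):
    """Return the names from EXPORTED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in EXPORTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("not set: " + ", ".join(missing))
    else:
        print("all DMLC_* variables from the script are set")
```

Running it in the same shell as the exports should report nothing missing; if it does not, the launcher would see the same incomplete environment.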

ymjiang · 2019-06-28T06:20:32Z

Looks like your gcc version is higher than our tested gcc-4.9.

Can you try to pin down gcc before you build BytePS? Here is an example: https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.worker.tensorflow#L115-L123
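
To see which gcc a build would pick up, a quick check along these lines may help. It is a rough sketch, not part of BytePS: the helper names are hypothetical, and the (4, 9) threshold only reflects the "tested gcc-4.9" mentioned above.

```python
import re
import subprocess


def parse_gcc_version(text):
    """Turn `gcc -dumpversion` output such as '5.4.0' or '7' into a tuple of ints."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("unrecognized gcc version string: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def local_gcc_version(compiler="gcc"):
    """Return the version of the compiler found on PATH, or None if it cannot be run."""
    try:
        out = subprocess.check_output([compiler, "-dumpversion"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_gcc_version(out.decode())


def is_newer_than_tested(version, tested=(4, 9)):
    """True if major.minor is above the tested version (assumed here to be 4.9)."""
    return version[:2] > tested


if __name__ == "__main__":
    version = local_gcc_version()
    if version is None:
        print("gcc not found on PATH")
    else:
        print(version, "newer than tested:", is_newer_than_tested(version))
```

With the g++ 5.4.0 reported in this issue, `is_newer_than_tested((5, 4, 0))` is true, which matches the diagnosis above.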

bobzhuyb · 2019-06-28T08:01:27Z

I think this is very similar to the issue described in tensorflow/tensorflow#13308 (comment).
It happens when BytePS is compiled with gcc 5 while TF is compiled with gcc 4.

It's still an open issue.

If your TF was compiled with gcc 4 (and it seems so), you have to use gcc-4.9 to build BytePS. You can try the suggestion given by @ymjiang, or use our pre-built docker image.
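
One way to check how the installed TF was built is to ask TF itself. The sketch below assumes that the mismatch shows up in TF's reported compiler version and in the `_GLIBCXX_USE_CXX11_ABI` define (gcc 5 introduced a new libstdc++ ABI controlled by that macro); the thread does not confirm this is the exact cause, and the helper names are hypothetical. `tf.__compiler_version__` and `tf.sysconfig.get_compile_flags()` are existing TensorFlow APIs.

```python
import re


def cxx11_abi_from_flags(flags):
    """Return the _GLIBCXX_USE_CXX11_ABI value found in a list of compiler flags, or None."""
    for flag in flags:
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=(\d)", flag)
        if match:
            return int(match.group(1))
    return None


def describe_tensorflow_build():
    """Report the compiler version and ABI flag TF was built with, or None if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "compiler_version": tf.__compiler_version__,
        "cxx11_abi": cxx11_abi_from_flags(tf.sysconfig.get_compile_flags()),
    }


if __name__ == "__main__":
    print(describe_tensorflow_build())
```

If the reported compiler version is a gcc 4.x release, building BytePS with gcc-4.9 as suggested above keeps the two on the same side of the ABI change.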

ymjiang · 2019-06-28T10:40:56Z

@un-knight Can you share more information about your OS and env?

bobzhuyb · 2019-06-28T10:41:48Z

@un-knight Would you reply in the issue thread you already opened (#20)? From your log, I don't see how your question is related to this issue.

byronyi · 2019-06-29T02:24:46Z

Just provide a binary release. There is no need to build on users' environments; we do not need mpicc or mpicxx.

bobzhuyb · 2019-06-30T04:42:23Z

Our plan is to release binary installation packages built for different frameworks/CUDA versions, in order to ease the installation process for users.

ymjiang · 2019-07-02T10:08:30Z

@myotheone We just published a list of PyPI packages for easier installation. See https://github.com/bytedance/byteps/blob/master/docs/pip-list.md

bobzhuyb · 2019-07-03T07:52:54Z

Closing this since we started providing pypi packages. Feel free to reopen.

bobzhuyb added the enhancement label Jun 30, 2019

bobzhuyb closed this as completed Jul 3, 2019

ghost mentioned this issue Feb 25, 2020

Coredump at ps::Meta::~Meta called in ReleaseFirstMsg #211

Closed

DeruiLiu mentioned this issue Aug 6, 2020

some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282

Closed

pleasantrabbit pushed a commit that referenced this issue Nov 3, 2020

rdma: only reg_mr for vals (#18)

7a708d6
