
Unable to run demo.sh #24

Open

abhigyan2001 opened this issue Nov 30, 2019 · 12 comments

abhigyan2001 commented Nov 30, 2019

I've installed all the prerequisites, so the program should run, but it doesn't. I'm getting an error at line 70 of colmap_wrapper.py, where the call to the COLMAP subprocess fails.

(My setup: Nvidia MX250, Ubuntu 19.10, TensorFlow 1.13.)
(Is this a known issue with Ubuntu 19.10, or has that version not been tested yet?)

Here's my entire traceback:

output.txt
errors.txt:

PC: @     0x7f92c7012ae8 ceres::internal::ProgramEvaluator<>::Evaluate()
*** SIGSEGV (@0x0) received by PID 22202 (TID 0x7f92acda5700) from PID 0; stack trace: ***
    @     0x7f92c7452641 (unknown)
    @     0x7f92c67dd540 (unknown)
    @     0x7f92c7012ae8 ceres::internal::ProgramEvaluator<>::Evaluate()
    @     0x7f92c709265f ceres::internal::TrustRegionMinimizer::EvaluateGradientAndJacobian()
    @     0x7f92c7092f4a ceres::internal::TrustRegionMinimizer::IterationZero()
    @     0x7f92c70972d4 ceres::internal::TrustRegionMinimizer::Minimize()
    @     0x7f92c7088cbc ceres::Solver::Solve()
    @     0x7f92c70899b9 ceres::Solve()
    @     0x55bb3bb1a2eb colmap::BundleAdjuster::Solve()
    @     0x55bb3bb78037 colmap::IncrementalMapper::AdjustGlobalBundle()
    @     0x55bb3bac3f0c (unknown)
    @     0x55bb3bac521d colmap::IncrementalMapperController::Reconstruct()
    @     0x55bb3bac6a9b colmap::IncrementalMapperController::Run()
    @     0x55bb3bbd7dfc colmap::Thread::RunFunc()
    @     0x7f92c5ad9f74 (unknown)
    @     0x7f92c67d1669 start_thread
    @     0x7f92c578f323 clone
Traceback (most recent call last):
  File "imgs2poses.py", line 11, in <module>
    gen_poses(args.scenedir)
  File "/home/abhigyan/Code/CVProject/LLFF/llff/poses/pose_utils.py", line 265, in gen_poses
    run_colmap(basedir)
  File "/home/abhigyan/Code/CVProject/LLFF/llff/poses/colmap_wrapper.py", line 70, in run_colmap
    map_output = ( subprocess.check_output(mapper_args, universal_newlines=True) )
  File "/home/abhigyan/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "/home/abhigyan/anaconda3/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['colmap', 'mapper', '--database_path', 'data/testscene/database.db', '--image_path', 'data/testscene/images', '--output_path', 'data/testscene/sparse', '--Mapper.num_threads', '16', '--Mapper.init_min_tri_angle', '4', '--Mapper.multiple_models', '0', '--Mapper.extract_colors', '0']' died with <Signals.SIGSEGV: 11>.
Traceback (most recent call last):
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/abhigyan/anaconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/abhigyan/anaconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "imgs2mpis.py", line 10, in <module>
    from llff.inference.mpi_utils import run_inference
  File "/home/abhigyan/Code/CVProject/LLFF/llff/inference/mpi_utils.py", line 6, in <module>
    from llff.inference.mpi_tester import DeepIBR
  File "/home/abhigyan/Code/CVProject/LLFF/llff/inference/mpi_tester.py", line 1, in <module>
    import tensorflow as tf
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/abhigyan/anaconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/abhigyan/anaconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
mkdir: cannot create directory ‘data/testscene/outputs/’: File exists
Traceback (most recent call last):
  File "imgs2renderpath.py", line 34, in <module>
    poses, bds = load_data(args.scenedir, load_imgs=False)
  File "/home/abhigyan/Code/CVProject/LLFF/llff/poses/pose_utils.py", line 195, in load_data
    poses_arr = np.load(os.path.join(basedir, 'poses_bounds.npy'))
  File "/home/abhigyan/anaconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'data/testscene/poses_bounds.npy'
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
demo.sh: line 24: 22221 Aborted                 (core dumped) cuda_renderer/cuda_renderer data/testscene/mpis_360 data/testscene/outputs/test_path.txt data/testscene/outputs/test_vid.mp4 360 .8 18

Please tell me what I need to do to get it to run.

bmild commented Dec 2, 2019

Hmm, it looks like there are a couple of potential errors...

TensorFlow failed to load, based on the error ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory -- does TensorFlow work if you run import tensorflow as tf at a Python prompt?

Also, it looks like the poses_bounds.npy file doesn't exist, based on FileNotFoundError: [Errno 2] No such file or directory: 'data/testscene/poses_bounds.npy' -- is the example dataset stored in data/testscene/?
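
A minimal check for both points, assuming the Anaconda Python 3.7 environment from the traceback (the ctypes probe is just one direct way to test for the missing CUDA library, not part of the repo):

import ctypes
ctypes.CDLL('libcublas.so.10.0')  # raises OSError if the CUDA 10.0 runtime is not on the loader path
import tensorflow as tf           # fails with the ImportError above if the native runtime cannot load
print(tf.__version__)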

abhigyan2001 commented:

@bmild I managed to get rid of the import error; it was caused by my using Nvidia CUDA 10.1, whereas this TensorFlow build supports only CUDA 10.0.

I did download the data using the download.sh script, but I still don't have poses_bounds.npy in that directory... Any pointers as to why it didn't work?

bmild commented Dec 3, 2019

That's strange. Could be a permissions issue. Maybe try just running the steps in the script one by one on the command line:

cd checkpoints
wget http://people.eecs.berkeley.edu/~bmild/llff/data/llff_trained_model.zip
unzip llff_trained_model.zip
cd ..

cd data
wget http://people.eecs.berkeley.edu/~bmild/llff/data/testscene.zip
unzip testscene.zip
cd ..

abhigyan2001 commented Dec 6, 2019

I tried the above as well, and there's still no poses_bounds.npy after running it... I have no clue why this is happening, but I have a feeling the program is getting stuck somewhere before it reaches the point where poses_bounds.npy is generated.

How do I find out where the problem is? I've tried everything I can think of, and yet there's always one problem or another...

bmild commented Dec 7, 2019

Sorry, I got confused -- the poses_bounds.npy file is not included in the download; it should be generated by the demo script.
Is the error message still the same as the original one you posted? I noticed the line

subprocess.CalledProcessError: Command '['colmap', 'mapper', '--database_path', 'data/testscene/database.db', '--image_path', 'data/testscene/images', '--output_path', 'data/testscene/sparse', '--Mapper.num_threads', '16', '--Mapper.init_min_tri_angle', '4', '--Mapper.multiple_models', '0', '--Mapper.extract_colors', '0']' died with <Signals.SIGSEGV: 11>.

which looks like COLMAP segfaulted and died. This may be related to this COLMAP issue? colmap/colmap#40
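
To isolate the segfault from the Python wrapper, the same mapper invocation can be replayed on its own -- a sketch that simply reuses the argument list from the traceback above:

import subprocess

# the exact argument list from the CalledProcessError above
mapper_args = ['colmap', 'mapper',
               '--database_path', 'data/testscene/database.db',
               '--image_path', 'data/testscene/images',
               '--output_path', 'data/testscene/sparse',
               '--Mapper.num_threads', '16',
               '--Mapper.init_min_tri_angle', '4',
               '--Mapper.multiple_models', '0',
               '--Mapper.extract_colors', '0']
# check_output raises CalledProcessError with a negative returncode when COLMAP dies from a signal
print(subprocess.check_output(mapper_args, universal_newlines=True))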

abhigyan2001 commented Dec 14, 2019

@bmild I've now downloaded and compiled COLMAP from source, and all the above errors have disappeared... Thanks so much for that COLMAP link, and for all the support!

However, I've now hit a new error: the program tries to allocate 446341359079 MB, which triggers a bad_alloc... Any hints for that? Or does this only work on a supercomputer with that much RAM?

Here's the output of bash demo.sh:

Don't need to run COLMAP
Post-colmap
Cameras 5
Images # 20
Points (9907, 3) Visibility (9907, 20)
Depth stats 13.713585634774399 120.24088554413024 30.44841722354554
Done with imgs2poses
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

factor/width/height args: [None, None, 360]
demo.sh: line 8: 11663 Killed                  python imgs2mpis.py data/testscene/ data/testscene/mpis_360 --height 360
mkdir: cannot create directory ‘data/testscene/outputs/’: File exists
Path components [False, False, False, False, True]
Saved to data/testscene/outputs/test_path.txt
make: 'cuda_renderer' is up to date.
Loading data/testscene/mpis_360
loading 1041242782 mpis
32579 x 1040763848
32579 1040763848 32579 1041242782
Big request (host) 446341359079 MB
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
demo.sh: line 24: 11702 Aborted                 (core dumped) cuda_renderer/cuda_renderer data/testscene/mpis_360 data/testscene/outputs/test_path.txt data/testscene/outputs/test_vid.mp4 360 .8 18

(I understand that this probably warrants a new issue be created)

bmild commented Dec 14, 2019

Ah, that's just bad error checking on my part. It looks like imgs2mpis.py was terminated early because of mkdir: cannot create directory ‘data/testscene/outputs/’: File exists, and then the cuda renderer tried to load MPIs that didn't exist and read garbage numbers. Hopefully it will be fine if you just delete data/testscene/outputs/ and rerun demo.sh!
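
For reference, the bare mkdir in demo.sh is what fails here; a tolerant equivalent, sketched in Python rather than the repo's actual shell code, would be:

import os
# succeeds whether or not the directory already exists
os.makedirs('data/testscene/outputs/', exist_ok=True)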

abhigyan2001 commented:

That didn't work :/

I deleted that directory, and now the mkdir error line is gone, but the output is still mostly the same...

factor/width/height args: [None, None, 360]
demo.sh: line 8:  9660 Killed                  python imgs2mpis.py data/testscene/ data/testscene/mpis_360 --height 360
Path components [False, False, False, False, True]
Saved to data/testscene/outputs/test_path.txt
make: 'cuda_renderer' is up to date.
Loading data/testscene/mpis_360
loading 2003558560 mpis
22020 x 1697023584
32765 1697023584 22020 2003558560
Big request (host) 3528743056298 MB
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
demo.sh: line 24:  9697 Aborted                 (core dumped) cuda_renderer/cuda_renderer data/testscene/mpis_360 data/testscene/outputs/test_path.txt data/testscene/outputs/test_vid.mp4 360 .8 18

abhigyan2001 commented Dec 15, 2019

Also, I tried running the imgs2mpis.py script alone, with the same arguments as in demo.sh: python imgs2mpis.py data/testscene data/testscene/mpis_360 --height 360

It gets killed for some reason after calling the gen_mpis() function...

Here's the error log:

data/testscene data/testscene/mpis_360 --height 360
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/abhigyan/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

factor/width/height args: [None, None, 360]
Killed

I believe that because of this error my mpis_360 directory stays empty (I have checked), which is why the cuda renderer ends up with those huge garbage MPI counts...
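
A bare "Killed" with no Python traceback usually means the kernel's OOM killer terminated the process when host memory ran out; the kernel log records this. A quick check (may need root depending on the distro; Python 3.7+ for capture_output):

import subprocess
# scan the kernel ring buffer for OOM-killer entries
log = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
print('\n'.join(line for line in log.splitlines() if 'Killed process' in line or 'oom' in line.lower()))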

bmild commented Dec 15, 2019

Hmm, that is strange -- no useful error message. I just realized you are using a smaller GPU; how much GPU memory does it have? You may have to adjust the patched settings (here), since the MPIs may not fit completely in memory.
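
One quick way to confirm how much memory TensorFlow 1.x actually sees on the GPU (a sketch using TF's device listing, not part of the repo):

from tensorflow.python.client import device_lib
# memory_limit is the number of bytes TensorFlow may allocate on each device
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.physical_device_desc, d.memory_limit / 1e9, 'GB')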

abhigyan2001 commented:

@bmild I have only 2 GB of graphics memory. I changed the two lines that you mentioned in the linked issue, but I'm still getting the exact same error message...

Is it possible that this program simply won't run on my setup? And has such an error ever been encountered before, even on higher-end systems?

MoisesFelipe commented:

Any hints on how to solve the "out-of-memory" error? I'm struggling with the same problem.
