Building CUDA 12.1 (nv23.02) from source fails #84

Open
bergentruckung opened this issue Mar 23, 2023 · 5 comments

Comments

@bergentruckung

We're trying to build NVIDIA TensorFlow (HEAD of r1.15.5+nv23.02) from source against CUDA 12.1 on both RHEL 7 and RHEL 8 (kernels 3.10.0-1160.88.1 and 4.18.0-425.13.1, respectively). Both builds fail towards the end, at the same place.

CUDNN: 8.8.1.3
NCCL: 2.17.1
bazel: 0.25.3 (and a bunch of others as well)
gcc: 12.1.1 (from devtoolset-12)
Python: 3.10

Relevant trace:

ERROR: /local/apps/bergentruckung/nv-tensorflow/tensorflow/contrib/ignite/BUILD:146:21: Linking of rule '//tensorflow/contrib/ignite:gen_gen_dataset_ops_py_wrappers_cc' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/contrib/ignite/gen_gen_dataset_ops_py_wrappers_cc-2.params
bazel-out/host/bin/tensorflow/python/libpython_op_gen_main.a(python_op_gen_main.o): In function `tensorflow::(anonymous namespace)::ReadOpListFromFile(std::string const&, std::vector<std::string, std::allocator<std::string> >*)':
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0xa3): undefined reference to `tensorflow::io::InputBuffer::InputBuffer(tensorflow::RandomAccessFile*, unsigned long)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0xd8): undefined reference to `tensorflow::io::InputBuffer::ReadLine(std::string*)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x19f): undefined reference to `tensorflow::io::InputBuffer::ReadLine(std::string*)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x330): undefined reference to `tensorflow::io::InputBuffer::~InputBuffer()'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x44d): undefined reference to `tensorflow::io::InputBuffer::~InputBuffer()'
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen.o): In function `tensorflow::(anonymous namespace)::GenEagerPythonOp::Code()':
python_op_gen.cc:(.text._ZN10tensorflow12_GLOBAL__N_116GenEagerPythonOp4CodeEv+0x2d1): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                                   
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen_internal.o): In function `tensorflow::python_op_gen_internal::GenPythonOp::AddDocStringAttrs()':                                                                                                                                                      
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp17AddDocStringAttrsEv+0x170): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                       
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp17AddDocStringAttrsEv+0x3f1): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                       
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen_internal.o): In function `tensorflow::python_op_gen_internal::GenPythonOp::Code()':                                                                                                                                                                   
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp4CodeEv+0x203): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                     
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `std::__detail::_Map_base<std::string, std::pair<std::string const, tensorflow::ApiDef>, std::allocator<std::pair<std::string const, tensorflow::ApiDef> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)':
op_gen_lib.cc:(.text._ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_[_ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_]+0x98): undefined reference to `tensorflow::ApiDef::ApiDef()'
op_gen_lib.cc:(.text._ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_[_ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_]+0x13e): undefined reference to `tensorflow::ApiDef::~ApiDef()'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `std::_Hashtable<std::string, std::pair<std::string const, tensorflow::ApiDef>, std::allocator<std::pair<std::string const, tensorflow::ApiDef> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()':
op_gen_lib.cc:(.text._ZNSt10_HashtableISsSt4pairIKSsN10tensorflow6ApiDefEESaIS4_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS6_18_Mod_range_hashingENS6_20_Default_ranged_hashENS6_20_Prime_rehash_policyENS6_17_Hashtable_traitsILb1ELb0ELb1EEEED2Ev[_ZNSt10_HashtableISsSt4pairIKSsN10tensorflow6ApiDefEESaIS4_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS6_18_Mod_range_hashingENS6_20_Default_ranged_hashENS6_20_Prime_rehash_policyENS6_17_Hashtable_traitsILb1ELb0ELb1EEEED5Ev]+0x38): undefined reference to `tensorflow::ApiDef::~ApiDef()'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `tensorflow::ApiDefMap::ApiDefMap(tensorflow::OpList const&)':                                                                                                                                                                                 
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0xcc): undefined reference to `tensorflow::ApiDef::ApiDef()'                                                                                                                                                                                                  
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x13f): undefined reference to `tensorflow::ApiDef_Endpoint* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Endpoint>(google::protobuf::Arena*)'                                                                                              
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x1f3): undefined reference to `tensorflow::ApiDef_Arg* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Arg>(google::protobuf::Arena*)'                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x3cb): undefined reference to `tensorflow::ApiDef_Arg* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Arg>(google::protobuf::Arena*)'                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x563): undefined reference to `tensorflow::ApiDef_Attr* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Attr>(google::protobuf::Arena*)'                                                                                                      
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x76b): undefined reference to `tensorflow::ApiDef::CopyFrom(tensorflow::ApiDef const&)'                                                                                                                                                                      
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x777): undefined reference to `tensorflow::ApiDef::~ApiDef()'                                                                                                                                                                                                
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x9a4): undefined reference to `tensorflow::ApiDef::~ApiDef()'                                                                                                                                                                                                
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `tensorflow::ApiDefMap::LoadApiDef(std::string const&)':                                                                                                                                                                                       
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x56): undefined reference to `tensorflow::ApiDefs::ApiDefs()'                                                                                                                                                                                               
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x43a): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                                                  
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x6b1): undefined reference to `tensorflow::ApiDefs::~ApiDefs()'                                                                                                                                                                                             
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x725): undefined reference to `tensorflow::ApiDef_Endpoint::Clear()'                                                                                                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x789): undefined reference to `tensorflow::ApiDef_Endpoint* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Endpoint>(google::protobuf::Arena*)'                                                                                             
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x7a7): undefined reference to `tensorflow::ApiDef_Endpoint::CopyFrom(tensorflow::ApiDef_Endpoint const&)'
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x103d): undefined reference to `tensorflow::ApiDefs::~ApiDefs()'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::CopyFrom(stream_executor::dnn::AlgorithmProto const&)'                             
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::~ExecutionPlanProto()'                                                         
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::InternalSwap(stream_executor::dnn::ExecutionPlanProto*)'                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::AlgorithmProto(stream_executor::dnn::AlgorithmProto const&)'                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `tensorflow::ProtoDebugString(tensorflow::DeviceAttributes const&)'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::InternalSwap(stream_executor::dnn::TensorDescriptorProto*)'                 
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `tensorflow::ProtoShortDebugString(tensorflow::ConfigProto const&)'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::CopyFrom(stream_executor::dnn::ExecutionPlanProto const&)'                     
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::~AlgorithmProto()'                                                                 
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ConvolutionDescriptorProto::~ConvolutionDescriptorProto()'                                         
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::ExecutionPlanProto(stream_executor::dnn::ExecutionPlanProto const&)'           
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::~TensorDescriptorProto()'                                                   
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::clear_layout_oneof()'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ConvolutionDescriptorProto::ConvolutionDescriptorProto()'                                          
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::TensorDescriptorProto(stream_executor::dnn::TensorDescriptorProto const&)'  
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::TensorDescriptorProto()'                                                    
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::InternalSwap(stream_executor::dnn::AlgorithmProto*)'                               
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::AlgorithmProto()'                                                                  
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::ExecutionPlanProto()'                                                          
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::CopyFrom(stream_executor::dnn::TensorDescriptorProto const&)'               
collect2: error: ld returned 1 exit status

Please let us know how we can go ahead with the build process here.

Thanks in advance.
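
For triage, the undefined references in a trace like the one above can be grouped by the class or namespace that owns each missing symbol, which makes it easier to see which libraries are not being linked. The following is only a minimal sketch, not part of the build: the function names `undefined_symbols`, `owning_scope`, and `summarize` are hypothetical, and it assumes GNU ld's demangled message format as it appears in the trace.

```python
import re
from collections import Counter

# GNU ld quotes the missing symbol as `symbol' after "undefined reference to".
UNDEFINED_REF = re.compile(r"undefined reference to `([^']+)'")


def undefined_symbols(log_text):
    """Return every symbol reported as an undefined reference, in log order."""
    return UNDEFINED_REF.findall(log_text)


def owning_scope(symbol):
    """Return the namespace or class that owns a demangled symbol.

    Simplification (an assumption of this sketch): the name is cut at the
    first '(' or '<', so template arguments on the owning class are dropped.
    """
    cleaned = symbol.replace("(anonymous namespace)", "{anonymous}")
    name = re.split(r"[(<]", cleaned, maxsplit=1)[0]
    parts = name.split()
    # A leading return type (e.g. "Foo* ns::f") is discarded here.
    name = parts[-1] if parts else cleaned
    scope, sep, _ = name.rpartition("::")
    return scope if sep else name


def summarize(log_text):
    """Count undefined references per owning scope."""
    return Counter(owning_scope(s) for s in undefined_symbols(log_text))


if __name__ == "__main__":
    import sys

    # Usage: python summarize_ld.py < build.log
    for scope, count in summarize(sys.stdin.read()).most_common():
        print(f"{count:4d}  {scope}")
```

Feeding the saved build log to the script prints one line per owning scope, with the most frequent first.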

@nluehr
Contributor

nluehr commented Mar 23, 2023

Those linker errors don't point to any issue I'm familiar with. If you provide reproducer instructions similar to the Ubuntu build instructions here, I can take a look at what is going wrong.

@bergentruckung
Author

I see. If it helps to add some colour, this is the last portion of the build output (after the message above).

SUBCOMMAND: # //tensorflow/contrib/image:gen_image_ops_py_wrappers_cc [action 'Linking tensorflow/contrib/image/gen_image_ops_py_wrappers_cc [for host]', configuration: fc79f5a2b8c3ab837b6ff6617f003a26cc0d2bb81faca1c37ceff7494a54f2ff, execution platform: @local_config_platform//:host]
(cd /var/tmp/tf.aSaiDr/.cache/bazel/_bazel_bergentruckung/5c5196772a9ffda2bec0a4e93f813cd8/execroot/org_tensorflow && \                                                                 
  exec env - \                                                                                                                                                                    
    PATH=<redacted> \                                                          
    PWD=/proc/self/cwd \                                                                                                                                                          
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/contrib/image/gen_image_ops_py_wrappers_cc-2.params)          
Target //tensorflow/tools/pip_package:build_pip_package failed to build                                                                                                           
Use --verbose_failures to see the command lines of failed build steps.                                                                                                            
INFO: Elapsed time: 977.812s, Critical Path: 251.56s                                                                                                                              
INFO: 31971 processes: 15017 internal, 16954 local.                                                                                                                               
FAILED: Build did NOT complete successfully                                                                                                                                       

@bergentruckung
Author

A reproducer would look like this:

  1. Install the CUDA toolkit and newer drivers from NVIDIA's official page (follow the corresponding documentation there).
  2. Install the matching versions of cuDNN and NCCL from NVIDIA's official pages. I used the local rpm method and made sure to install both devel packages, since their header files are needed later on.
  3. Check out the r1.15.5+nv23.02 branch from github.com/NVIDIA/tensorflow, and v0.7.3 from github.com/NVIDIA/cudnn-frontend.
  4. Create a new Python 3.10 virtual environment.
  5. The remaining steps are similar to the doc you linked above: work out the required bazel version, set the configuration environment variables we need, run ./configure, and then run the bazel build commands.
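
Steps 3 to 5 can be written down as a dry-run command plan, so the exact sequence is easy to share and compare. This is a rough sketch under stated assumptions, not the exact commands that were run: the function name `reproducer_commands` and the `tf-build` directory layout are hypothetical, the branch name is the one given at the top of this issue, the build target is the one named in the failure output, and the configuration environment variables and any extra bazel flags are not listed in this thread, so they are left out rather than guessed.

```python
import shlex

# Values taken from the issue text.
TF_REPO = "https://github.com/NVIDIA/tensorflow"
TF_BRANCH = "r1.15.5+nv23.02"
FRONTEND_REPO = "https://github.com/NVIDIA/cudnn-frontend"
FRONTEND_TAG = "v0.7.3"
PIP_TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def reproducer_commands(workdir="tf-build"):
    """Return steps 3 to 5 as (cwd, argv) pairs; nothing is executed.

    The directory layout under `workdir` is an assumption of this sketch.
    """
    tf_dir = f"{workdir}/tensorflow"
    return [
        # Step 3: fetch both source trees at the versions named in the issue.
        (".", ["git", "clone", "--branch", TF_BRANCH, TF_REPO, tf_dir]),
        (".", ["git", "clone", "--branch", FRONTEND_TAG, FRONTEND_REPO,
               f"{workdir}/cudnn-frontend"]),
        # Step 4: create the Python 3.10 virtual environment.
        (".", ["python3.10", "-m", "venv", f"{workdir}/venv"]),
        # Step 5: configure, then build. The environment variables that
        # drive ./configure are not given in the issue and are omitted.
        (tf_dir, ["./configure"]),
        (tf_dir, ["bazel", "build", PIP_TARGET]),
    ]


if __name__ == "__main__":
    # Print the plan as shell-quoted lines for review.
    for cwd, argv in reproducer_commands():
        print(f"(cd {shlex.quote(cwd)} && {shlex.join(argv)})")
```

Keeping the plan as data rather than running it directly makes it possible to diff the RHEL 7 and RHEL 8 sequences before executing anything.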

@nluehr
Contributor

nluehr commented Mar 29, 2023

Have you tried a bazel clean --expunge as well as disabling any bazel caching / ccache options you might use?

@bergentruckung
Author

Sorry, I forgot to respond here. The build still fails the same way for me, even after running bazel clean --expunge.
