tf.load_op_library unable to load manylinux2010 repaired custom ops #31807

Closed
seanpmorgan opened this issue Aug 20, 2019 · 7 comments
Labels: comp:ops, TF 2.0, type:bug

Comments

@seanpmorgan
Member

seanpmorgan commented Aug 20, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No -- using https://github.com/tensorflow/custom-op (But it breaks for addons too)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu16.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): tf-nightly & tf-nightly-2.0-preview

Describe the current behavior
Currently, when I build a custom op in the tensorflow/tensorflow:custom-op-ubuntu16 docker image using the defined steps, I get an installable pip package: tensorflow_zero_out-0.0.1-cp27-cp27mu-linux_x86_64.whl

This works fine. However, if I repair that wheel to be manylinux2010 compliant, tf.load_op_library fails to find the custom op.

python -c "import tensorflow as tf; print(dir(tf.load_op_library('manylinux/tensorflow_zero_out/python/ops/_zero_out_ops.so')))"

['LIB_HANDLE', 'OP_LIST', 'ZeroOut', '_IS_TENSORFLOW_PLUGIN', 
'_InitOpDefLibrary', '__builtins__', '__doc__', '__name__', '__package__', 
'_collections', '_common_shapes', '_context', '_core', '_dispatch', '_doc_controls', 
'_dtypes', '_errors', '_execute', '_kwarg_only', '_op_def_lib', '_op_def_library', 
'_op_def_pb2', '_op_def_registry', '_ops', '_pywrap_tensorflow', '_six', 
'_tensor_shape', 'deprecated_endpoints', 'tf_export', 'zero_out',
 'zero_out_eager_fallback']
python -c "import tensorflow as tf;print(dir(tf.load_op_library('manylinux2010/tensorflow_zero_out/python/ops/_zero_out_ops.so')))"

['LIB_HANDLE', 'OP_LIST', '_IS_TENSORFLOW_PLUGIN', 
'_InitOpDefLibrary', '__builtins__', '__doc__', '__name__', '__package__', 
'_collections', '_common_shapes', '_context', '_core', 
'_dispatch', '_doc_controls', '_dtypes', '_errors', '_execute', '_kwarg_only', 
'_op_def_lib', '_op_def_library', '_op_def_pb2', '_op_def_registry', '_ops', 
'_pywrap_tensorflow', '_six', '_tensor_shape', 'deprecated_endpoints', 'tf_export']

Notice 'zero_out' & 'zero_out_eager_fallback' are not found in the loaded library for manylinux2010
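
The comparison of the two listings can be automated. Below is a minimal sketch, assuming only that the two `dir()` results are available as lists of names; `missing_symbols` is a hypothetical helper, and the lists are abbreviated copies of the output above. On those listings it also reports 'ZeroOut' as absent from the repaired library.

```python
def missing_symbols(reference, candidate):
    """Return names present in `reference` but absent from `candidate`, sorted."""
    return sorted(set(reference) - set(candidate))


# Abbreviated name lists copied from the two dir() outputs above.
linux_names = [
    'LIB_HANDLE', 'OP_LIST', 'ZeroOut', '_IS_TENSORFLOW_PLUGIN',
    'tf_export', 'zero_out', 'zero_out_eager_fallback',
]
manylinux2010_names = [
    'LIB_HANDLE', 'OP_LIST', '_IS_TENSORFLOW_PLUGIN', 'tf_export',
]

print(missing_symbols(linux_names, manylinux2010_names))
```

With TensorFlow installed, the same helper could be fed `dir(tf.load_op_library(path))` for each of the two .so paths.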

Code to reproduce the issue

git clone https://github.com/tensorflow/custom-op.git && cd custom-op
docker run -it --rm -v ${PWD}:/workspace -w /workspace tensorflow/tensorflow:custom-op-ubuntu16 /bin/bash

pip install tf-nightly
./configure.sh
bazel build build_pip_pkg
bazel-bin/build_pip_pkg artifacts

# Installed auditwheel is too old for manylinux2010
pip3 install --upgrade auditwheel

# Libtensorflow framework needs to be on LD path
export LD_LIBRARY_PATH="/usr/local/lib/python2.7/dist-packages/tensorflow_core"

# Repair logs look more or less okay
auditwheel -v repair --plat manylinux2010_x86_64 artifacts/tensorflow_zero_out-0.0.1-cp27-cp27mu-linux_x86_64.whl &> repair.txt

Other info / logs
Here are the auditwheel repair logs:
repair.txt

Here are the readelf inspections of the so files:
readelf.txt
readelf-manylinux2010.txt

Here are the so files:
so-files.zip

cc @perfinion @gunan @yifeif

--------------------------EDIT--------------------
Here are the extracted whl directories which will work with the python tf.load_op_library commands from above. (Manylinux2010 repair makes it so the custom op depends on a newly copied libtensorflow_framework.so which is part of the new whl):
custom-op-dirs.zip

@yongtang
Member

I remember I encountered an issue when there is a collision of names for added kernel ops. (used to be fine for 1.14, not with new tf-nightly) Wondering if there are multiple versions of zero_out kernel ops?

@seanpmorgan
Member Author

seanpmorgan commented Aug 20, 2019

> I remember I encountered an issue when there is a collision of names for added kernel ops. (used to be fine for 1.14, not with new tf-nightly) Wondering if there are multiple versions of zero_out kernel ops?

Thanks! Looking at the binaries' symbols I'm not seeing any duplication that isn't present in the .so before auditwheel repair though:
https://www.diffchecker.com/pfJbJX8g

Is there a way to increase the verbosity of the load_library call so we could see if there is a conflict or something else?

The only major difference I see is that the repaired binary requires the newly copied libtensorflow_framework-65610c2c.so.1 instead of the libtensorflow_framework.so.1 that would get picked up from the TF install. I'm not sure what the implications of that are, though, and without being able to step through the load_library function it's a bit tough to tell.
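
One way to confirm which shared libraries each .so requires is to read the NEEDED entries printed by `readelf -d`. The sketch below parses that text; `needed_libraries` is a hypothetical helper, and the sample string is illustrative rather than a copy of the attached readelf files (only the renamed framework library name comes from this thread).

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)  Shared library: [libfoo.so.1]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")


def needed_libraries(readelf_dynamic_output):
    """Extract DT_NEEDED library names from the text printed by `readelf -d`."""
    return NEEDED_RE.findall(readelf_dynamic_output)


# Illustrative sample, not the actual contents of readelf-manylinux2010.txt.
sample = """
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework-65610c2c.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x000000000000000e (SONAME)             Library soname: [_zero_out_ops.so]
"""

print(needed_libraries(sample))
```

If the repaired op library lists a renamed framework copy here while the TensorFlow install loads its own libtensorflow_framework.so.1, two copies of the framework may end up in the process. That could plausibly explain why the op is not visible to TensorFlow, though the thread does not confirm it.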

@yongtang
Member

My previous issue was with LMDBDataset. I initially implemented LMDBDataset (C++) in TF's core repo (tf.contrib) some time ago. Later on, as we tried to modularize, LMDBDataset was moved to tensorflow/io. So there are two copies if both tensorflow and tensorflow-io are loaded.

That used to be fine. However, very recently I noticed that LMDBDataset in tensorflow/io no longer works with tf-nightly (I can't remember which version, but it must be very recent), and I had to change the name in tensorflow/io to LMDBDatasetV2 to get around it.

Don't know if this could be related as well.

@yongtang
Member

yongtang commented Aug 20, 2019

Ah the libtensorflow_framework.so.1 is a known limitation of auditwheel.

I wrote a patch for auditwheel to get around the issue: tensorflow/io@02dcf4a

@seanpmorgan
Member Author

seanpmorgan commented Aug 20, 2019

@yongtang Amazing, thanks so much! Could you explain what that file edit does / why that patch works (I'm assuming it somehow tricks auditwheel into thinking the shared lib is a common one on all systems)?

We should probably describe this and include the patch in custom-op repo.

@seanpmorgan
Member Author

EDIT -- Found out which policy.json was being edited:
https://github.com/pypa/auditwheel/blob/master/auditwheel/policy/policy.json

Thanks again for the patch @yongtang!
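
For readers wondering what such a policy.json edit might look like, here is a rough sketch. It assumes the policy data is a list of policy objects, each with a `lib_whitelist` array of library names that auditwheel treats as provided by the system and therefore does not copy into the wheel. That layout and key name are assumptions about auditwheel at the time, `whitelist_library` is a hypothetical helper, and this is not a transcription of the linked commit.

```python
import copy


def whitelist_library(policies, library_name):
    """Return a copy of `policies` with `library_name` in every lib_whitelist."""
    patched = copy.deepcopy(policies)
    for policy in patched:
        # Assumed key name; create the list if a policy lacks one.
        whitelist = policy.setdefault("lib_whitelist", [])
        if library_name not in whitelist:
            whitelist.append(library_name)
    return patched


# Trimmed, illustrative entries; a real policy file has more fields and libraries.
policies = [
    {"name": "linux", "priority": 0, "lib_whitelist": []},
    {"name": "manylinux2010", "priority": 90, "lib_whitelist": ["libc.so.6", "libm.so.6"]},
]

patched = whitelist_library(policies, "libtensorflow_framework.so.1")
print(patched[1]["lib_whitelist"])
```

In practice the list would be loaded from and written back to the installed policy.json with the `json` module. The intended effect is that the repaired op library keeps depending on the libtensorflow_framework.so.1 shipped with TensorFlow rather than on a bundled copy.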

@njzjz
Contributor

njzjz commented Nov 8, 2022

In auditwheel 5.2.0, which was recently released, one can use the --exclude option instead of editing policy.json:

auditwheel repair --exclude libtensorflow_framework.so.2 --exclude libtensorflow_framework.so.1 --exclude libtensorflow_framework.so some_wheel.whl

If one uses cibuildwheel, add the following option to pyproject.toml.

[tool.cibuildwheel.linux]
repair-wheel-command = "auditwheel repair --exclude libtensorflow_framework.so.2 --exclude libtensorflow_framework.so.1 --exclude libtensorflow_framework.so -w {dest_dir} {wheel}"
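
When the repair step is driven from a Python build script instead, the same command line can be assembled programmatically. This is a minimal sketch: `build_repair_command` is a hypothetical helper, and the wheel path and output directory are placeholders. Only the `--exclude` and `-w` options shown above are used.

```python
def build_repair_command(wheel, dest_dir, excluded_libs):
    """Build an `auditwheel repair` argument list with one --exclude per library."""
    command = ["auditwheel", "repair"]
    for lib in excluded_libs:
        command += ["--exclude", lib]
    command += ["-w", dest_dir, wheel]
    return command


# Library names taken from the command above; paths are placeholders.
excluded = [
    "libtensorflow_framework.so.2",
    "libtensorflow_framework.so.1",
    "libtensorflow_framework.so",
]
command = build_repair_command("some_wheel.whl", "wheelhouse", excluded)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine with auditwheel 5.2.0 or newer installed.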
