Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python-snappy not found during execution of CSVExampleGen #6843

Open
sagnik-t opened this issue Jun 23, 2024 · 1 comment
Open

Python-snappy not found during execution of CSVExampleGen #6843

sagnik-t opened this issue Jun 23, 2024 · 1 comment

Comments

@sagnik-t
Copy link

sagnik-t commented Jun 23, 2024

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed: Zorin OS (Ubuntu 22.04)
    Interactive Notebook, Google Cloud, etc):
  • TensorFlow version: 2.15.1
  • TFX Version: 1.15.1
  • Python version: 3.10.14
  • Python dependencies (from pip freeze output):
    absl-py==1.4.0 annotated-types==0.7.0 anyio==4.4.0 apache-beam==2.56.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 array_record==0.5.1 arrow==1.3.0 asttokens==2.4.1 astunparse==1.6.3 async-lru==2.0.4 async-timeout==4.0.3 attrs==23.2.0 Babel==2.15.0 backcall==0.2.0 beautifulsoup4==4.12.3 bleach==6.1.0 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==2.2.1 colorama==0.4.6 comm==0.2.2 contourpy==1.2.1 cramjam==2.8.3 crcmod==1.7 cycler==0.12.1 debugpy==1.8.1 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.1.1 dm-tree==0.1.8 dnspython==2.6.1 docker==4.4.4 docopt==0.6.2 docstring_parser==0.16 etils==1.7.0 exceptiongroup==1.2.1 executing==2.0.1 facets-overview==1.1.1 fastavro==1.9.4 fasteners==0.19 fastjsonschema==2.20.0 flatbuffers==24.3.25 fonttools==4.53.0 fqdn==1.5.1 fsspec==2024.6.0 gast==0.5.4 google-api-core==2.19.0 google-api-python-client==1.12.11 google-apitools==0.5.31 google-auth==2.30.0 google-auth-httplib2==0.2.0 google-auth-oauthlib==1.2.0 google-cloud-aiplatform==1.56.0 google-cloud-bigquery==3.24.0 google-cloud-bigquery-storage==2.25.0 google-cloud-bigtable==2.24.0 google-cloud-core==2.4.1 google-cloud-dataproc==5.9.3 google-cloud-datastore==2.19.0 google-cloud-dlp==3.18.0 google-cloud-language==2.13.3 google-cloud-pubsub==2.21.3 google-cloud-pubsublite==1.10.0 google-cloud-recommendations-ai==0.10.10 google-cloud-resource-manager==1.12.3 google-cloud-spanner==3.47.0 google-cloud-storage==2.17.0 google-cloud-videointelligence==2.13.3 google-cloud-vision==3.7.2 google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.7.1 googleapis-common-protos==1.63.1 grpc-google-iam-v1==0.13.0 grpc-interceptor==0.15.4 grpcio==1.64.1 grpcio-status==1.48.2 h11==0.14.0 h5py==3.11.0 hdfs==2.7.3 httpcore==1.0.5 httplib2==0.22.0 httpx==0.27.0 idna==3.7 immutabledict==4.2.0 importlib_resources==6.4.0 ipykernel==6.29.4 ipython==8.25.0 ipython-genutils==0.2.0 ipywidgets==8.1.3 isoduration==20.11.0 jedi==0.19.1 Jinja2==3.1.4 joblib==1.4.2 Js2Py==0.74 json5==0.9.25 jsonpickle==3.2.1 jsonpointer==3.0.0 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==8.2.0 jupyter_core==5.7.2 jupyter_server==2.14.1 jupyter_server_terminals==0.5.3 jupyterlab==4.2.2 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.2 jupyterlab_widgets==3.0.11 keras==2.15.0 keras-tuner==1.4.7 kiwisolver==1.4.5 kt-legacy==1.0.5 kubernetes==12.0.1 libclang==18.1.1 lxml==5.2.2 Markdown==3.6 MarkupSafe==2.1.5 matplotlib==3.9.0 matplotlib-inline==0.1.7 mistune==3.0.2 ml-dtypes==0.3.2 ml-metadata==1.15.0 ml-pipelines-sdk==1.15.1 mplcyberpunk==0.7.1 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.6.0 nltk==3.8.1 notebook==7.2.1 notebook_shim==0.2.4 numpy==1.26.4 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.7.0 opt-einsum==3.3.0 orjson==3.10.5 overrides==7.7.0 packaging==24.1 pandas==1.5.3 pandocfilters==1.5.1 parso==0.8.4 pathlib==1.0.1 pexpect==4.9.0 pickleshare==0.7.5 pillow==10.3.0 platformdirs==4.2.2 portalocker==2.8.2 portpicker==1.6.0 prometheus_client==0.20.0 promise==2.3 prompt_toolkit==3.0.47 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.7.4 pydantic_core==2.18.4 pydot==1.4.2 pyfarmhash==0.3.2 Pygments==2.18.0 pyjsparser==2.7.1 pymongo==4.7.3 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-json-logger==2.0.7 python-snappy==0.7.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 redis==5.0.6 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rouge_score==0.1.2 rpds-py==0.18.1 rsa==4.9 sacrebleu==2.4.2 scipy==1.12.0 seaborn==0.13.2 Send2Trash==1.8.3 shapely==2.0.4 simple_parsing==0.1.5 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 sqlparse==0.5.0 stack-data==0.6.3 tabulate==0.9.0 tensorboard==2.15.2 tensorboard-data-server==0.7.2 tensorflow==2.15.1 tensorflow-data-validation==1.15.1 tensorflow-datasets==4.9.6 tensorflow-estimator==2.15.0 tensorflow-hub==0.15.0 tensorflow-io-gcs-filesystem==0.37.0 tensorflow-metadata==1.15.0 tensorflow-serving-api==2.15.1 tensorflow-transform==1.15.0 tensorflow_model_analysis==0.46.0 termcolor==2.4.0 terminado==0.18.1 tfx==1.15.1 tfx-bsl==1.15.1 timeloop==1.0.2 tinycss2==1.3.0 toml==0.10.2 tomli==2.0.1 tornado==6.4.1 tqdm==4.66.4 traitlets==5.14.3 types-python-dateutil==2.9.0.20240316 typing_extensions==4.12.2 tzlocal==5.2 uri-template==1.3.0 uritemplate==3.0.1 urllib3==2.2.2 wcwidth==0.2.13 webcolors==24.6.0 webencodings==0.5.1 websocket-client==1.8.0 Werkzeug==3.0.3 widgetsnbextension==4.0.11 wrapt==1.14.1 zipp==3.19.2 zstandard==0.22.0

Current Behavior

I have a simple pipeline consisting of only one component (CSVExampleGen) to ingest csv files and convert them to TFRecords.
However, upon running the pipeline I get the following warning:
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
I have already installed python-snappy and the corresponding C library using the following commands:

sudo apt-get install libsnappy-dev
pip install python-snappy

Expected behavior: The execution of this simple pipeline should be much faster and no such warnings should be produced.

Standalone code to reproduce the issue

Download any moderately sized csv file with numerical data and run the following code:

from tfx.proto import example_gen_pb2
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.orchestration.pipeline import Pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.orchestration import metadata

pipeline_root = 'artifacts'
data_dir = 'data'

input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name='data', pattern='data.csv')
    ]
)

output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=8),
            example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2)
        ]
    )
)

example_gen = CsvExampleGen(
    input_base=data_dir,
    input_config=input_config,
    output_config=output_config,
)

pipeline = Pipeline(
    pipeline_name='testing pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen
    ],
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        os.path.join(pipeline_root, 'metadata.sqlite')
    )
)

LocalDagRunner().run(pipeline)
@lego0901
Copy link
Member

Hi, sorry for responding this issue so late.

The warning appears when it fails to import python snappy, as per https://github.com/apache/beam/blob/v2.56.0/sdks/python/apache_beam/io/tfrecordio.py#L48.

Could you please test running python3 -c 'import snappy' if it is properly imported? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants