
Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker #6911

Open
adammkerr opened this issue Sep 4, 2024 · 10 comments

@adammkerr

adammkerr commented Sep 4, 2024

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed: Vertex AI
  • TensorFlow version: 2.15.1
  • TFX Version: 1.15.1
  • Python version: 3.10.14
  • Python dependencies (from pip freeze output):
absl-py==1.4.0
annotated-types==0.7.0
anyio==4.4.0
apache-beam==2.58.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astunparse==1.6.3
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
babel==2.16.0
backcall==0.2.0
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
comm==0.2.2
crcmod==1.7
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
dnspython==2.6.1
docker==4.4.4
docopt==0.6.2
docstring_parser==0.16
exceptiongroup==1.2.2
fastavro==1.9.5
fasteners==0.19
fastjsonschema==2.20.0
fire==0.6.0
flatbuffers==24.3.25
fqdn==1.5.1
gast==0.6.0
google-api-core==2.19.2
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.34.0
google-auth-httplib2==0.2.0
google-auth-oauthlib==1.2.1
google-cloud-aiplatform==1.64.0
google-cloud-bigquery==3.25.0
google-cloud-bigquery-storage==2.25.0
google-cloud-bigtable==2.26.0
google-cloud-core==2.4.1
google-cloud-datastore==2.20.1
google-cloud-dlp==3.22.0
google-cloud-language==2.14.0
google-cloud-pubsub==2.23.0
google-cloud-pubsublite==1.11.1
google-cloud-recommendations-ai==0.10.12
google-cloud-resource-manager==1.12.5
google-cloud-spanner==3.48.0
google-cloud-storage==2.18.2
google-cloud-videointelligence==2.13.5
google-cloud-vision==3.7.4
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
grpc-google-iam-v1==0.13.1
grpc-interceptor==0.15.4
grpcio==1.66.0
grpcio-status==1.48.2
h11==0.14.0
h5py==3.11.0
hdfs==2.7.3
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.2
idna==3.8
ipykernel==6.29.5
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.8.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
joblib==1.4.2
Js2Py==0.74
json5==0.9.25
jsonpickle==3.2.2
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==1.1.9
keras==2.15.0
keras-tuner==1.4.7
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kt-legacy==1.0.5
kubernetes==12.0.1
libclang==18.1.1
lxml==5.3.0
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
ml-dtypes==0.3.2
ml-metadata==1.15.0
ml-pipelines-sdk==1.15.1
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
nltk==3.9.1
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.26.4
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.7.0
opt-einsum==3.3.0
orjson==3.10.7
overrides==7.7.0
packaging==24.1
pandas==1.5.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.4.0
platformdirs==4.2.2
portalocker==2.10.1
portpicker==1.6.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pyarrow==10.0.1
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==1.10.18
pydantic_core==2.20.1
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.18.0
pyjsparser==2.7.1
pymongo==4.8.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
pytz==2024.1
PyYAML==6.0.2
pyzmq==26.2.0
redis==5.0.8
referencing==0.35.1
regex==2024.7.24
requests==2.31.0
requests-oauthlib==2.0.0
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.8.0
rouge-score==0.1.2
rpds-py==0.20.0
rsa==4.9
sacrebleu==2.4.3
scipy==1.12.0
Send2Trash==1.8.3
shapely==2.0.6
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
soupsieve==2.6
sqlparse==0.5.1
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.15.2
tensorboard-data-server==0.7.2
tensorflow==2.15.1
tensorflow-data-validation==1.15.1
tensorflow-estimator==2.15.0
tensorflow-hub==0.15.0
tensorflow-io-gcs-filesystem==0.37.1
tensorflow-metadata==1.15.0
tensorflow-serving-api==2.15.1
tensorflow-transform==1.15.0
tensorflow_model_analysis==0.46.0
termcolor==2.4.0
terminado==0.18.1
tfx==1.15.1
tfx-bsl==1.15.1
tinycss2==1.3.0
tomli==2.0.1
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
typer==0.12.5
types-python-dateutil==2.9.0.20240821
typing_extensions==4.12.2
tzlocal==5.2
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.19
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.0.4
widgetsnbextension==3.6.8
wrapt==1.14.1
zstandard==0.23.0

Describe the current behavior
As per the following links and issues:

I am using a custom docker container, extended from the corresponding version of the official tensorflow/tfx image, as my Pipeline default image and Dataflow sdk_container_image.

I have:

  1. Created a custom image using the methods defined in the links above
  2. Specified that image as my default_image in KubeflowV2DagRunnerConfig
  3. Added the following params to my beam_args for each component which leverages Beam:
    '--experiments=use_runner_v2', 
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',

The worker starts, but the worker container then hangs with the following error:

[screenshot: Dataflow worker log output]

I am not sure why Apache Beam is reported as not installed in the runtime environment. I can investigate the built image using:

docker run --rm -it \
    --name ${DOCKER_IMAGE_NAME} \
    --entrypoint=/bin/bash \
    ${DOCKER_IMAGE_URI}

and I can confirm that apache-beam exists within the container:

root@dd7dd38913f6:/pipeline# pip list --format=freeze
absl-py==1.4.0
aiofiles==22.1.0
aiohttp==3.9.3
aiohttp-cors==0.7.0
aiosignal==1.3.1
aiosqlite==0.20.0
anyio==4.3.0
apache-beam==2.56.0
archspec==0.2.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.5.1
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
Babel==2.14.0
backcall==0.2.0
backports.tarfile==1.1.0
beatrix_jupyterlab==2024.49.170251
beautifulsoup4==4.12.3
bleach==6.1.0
blessed==1.20.0
boltons==24.0.0
Brotli==1.1.0
brotlipy==0.7.0
cached-property==1.5.2
cachetools==4.2.4
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloud-tpu-client==0.10
cloud-tpu-profiler==2.4.0
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.6
comm==0.2.2
conda==24.3.0
conda-libmamba-solver==24.1.0
conda-package-handling==2.2.0
conda_package_streaming==0.9.0
contourpy==1.2.1
crcmod==1.7
cryptography==42.0.5
cycler==0.12.1
Cython==3.0.10
dacite==1.8.1
dataproc_jupyter_plugin==0.1.74
db-dtypes==1.2.0
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.8
distro==1.9.0
dm-tree==0.1.8
docker==4.4.4
docopt==0.6.2
docstring_parser==0.16
entrypoints==0.4
etils==1.7.0
exceptiongroup==1.2.0
executing==2.0.1
explainable-ai-sdk==1.3.3
Farama-Notifications==0.0.4
fastapi==0.110.2
fastavro==1.9.4
fasteners==0.19
fastjsonschema==2.19.1
filelock==3.13.4
fire==0.6.0
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
gast==0.5.4
gcsfs==2024.3.1
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.11.1
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.29.0
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.2.0
google-cloud-aiplatform==1.48.0
google-cloud-artifact-registry==1.11.3
google-cloud-bigquery==3.21.0
google-cloud-bigquery-storage==2.16.2
google-cloud-bigtable==2.23.1
google-cloud-core==2.4.1
google-cloud-datastore==1.15.5
google-cloud-dlp==3.16.0
google-cloud-jupyter-config==0.0.5
google-cloud-language==2.13.3
google-cloud-monitoring==2.21.0
google-cloud-pubsub==2.21.1
google-cloud-pubsublite==1.10.0
google-cloud-recommendations-ai==0.7.1
google-cloud-resource-manager==1.12.3
google-cloud-spanner==3.45.0
google-cloud-storage==2.14.0
google-cloud-videointelligence==2.13.3
google-cloud-vision==3.7.2
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
gpustat==1.0.0
greenlet==3.0.3
grpc-google-iam-v1==0.12.7
grpc-interceptor==0.15.4
grpcio==1.62.2
grpcio-status==1.48.1
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.11.0
hdfs==2.7.3
htmlmin==0.1.12
httplib2==0.21.0
httptools==0.6.1
idna==3.7
ImageHash==4.3.1
imageio==2.34.0
importlib-metadata==7.0.0
importlib_resources==6.4.0
ipykernel==6.29.3
ipython==7.34.0
ipython-genutils==0.2.0
ipython-sql==0.5.0
ipywidgets==7.8.1
isoduration==20.11.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.1
jax-jumpy==1.0.0
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.3
joblib==1.4.0
Js2Py==0.74
json5==0.9.25
jsonpatch==1.33
jsonpickle==3.0.4
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter_client==7.4.9
jupyter_core==5.7.2
jupyter-events==0.10.0
jupyter-http-over-ws==0.0.8
jupyter_server==2.14.0
jupyter_server_fileid==0.9.2
jupyter-server-mathjax==0.2.6
jupyter_server_proxy==4.1.2
jupyter_server_terminals==0.5.3
jupyter_server_ydoc==0.8.0
jupyter-ydoc==0.2.5
jupyterlab==3.6.6
jupyterlab_git==0.44.0
jupyterlab_pygments==0.3.0
jupyterlab_server==2.26.0
jupyterlab-widgets==1.1.7
jupytext==1.16.1
keras==2.15.0
keras-tuner==1.4.7
kernels-mixer==0.0.10
keyring==25.1.0
keyrings.google-artifactregistry-auth==1.1.2
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kiwisolver==1.4.5
kt-legacy==1.0.5
kubernetes==12.0.1
lazy_loader==0.4
libclang==18.1.1
libmambapy==1.5.8
linkify-it-py==2.0.3
llvmlite==0.41.1
lxml==5.2.2
lz4==4.3.3
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.0.1
matplotlib==3.7.3
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.0
mdurl==0.1.2
memray==1.12.0
menuinst==2.0.2
mistune==3.0.2
ml-dtypes==0.3.2
ml-metadata==1.15.0
ml-pipelines-sdk==1.15.1
mmh==2.2
more-itertools==10.2.0
msgpack==1.0.8
multidict==6.0.5
multimethod==1.11.2
nb_conda==2.2.1
nb_conda_kernels==2.3.1
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.3
nbdime==3.2.0
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
nltk==3.8.1
notebook==6.5.4
notebook_executor==0.2
notebook_shim==0.2.4
numba==0.58.1
numpy==1.24.4
nvidia-ml-py==11.495.46
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opencensus==0.11.4
opencensus-context==0.1.3
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
opt-einsum==3.3.0
orjson==3.10.1
overrides==7.7.0
packaging==24.0
pandas==1.5.3
pandas-profiling==3.6.6
pandocfilters==1.5.0
papermill==2.5.0
parso==0.8.4
patsy==0.5.6
pendulum==3.0.0
pexpect==4.9.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
pip==24.0
pkgutil_resolve_name==1.3.10
platformdirs==3.11.0
plotly==5.21.0
pluggy==1.5.0
portalocker==2.8.2
portpicker==1.6.0
prettytable==3.10.0
prometheus_client==0.20.0
promise==2.3
prompt-toolkit==3.0.42
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-spy==0.3.14
pyarrow==10.0.1
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycosat==0.6.6
pycparser==2.22
pydantic==1.10.15
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.17.2
pyjsparser==2.7.1
PyJWT==2.8.0
pymongo==3.13.0
pyOpenSSL==24.0.0
pyparsing==3.1.2
PySocks==1.7.1
python-dateutil==2.9.0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-snappy==0.5.4
pytz==2024.1
pyu2f==0.1.5
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==24.0.1
ray==2.11.0
ray-cpp==2.11.0
redis==5.0.4
referencing==0.34.0
regex==2024.4.16
requests==2.31.0
requests-oauthlib==2.0.0
requests-toolbelt==0.10.1
retrying==1.3.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rouge_score==0.1.2
rpds-py==0.18.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
ruamel-yaml-conda==0.15.100
sacrebleu==2.4.2
scikit-image==0.23.2
scikit-learn==1.4.2
scipy==1.11.4
seaborn==0.12.2
SecretStorage==3.3.3
Send2Trash==1.8.3
setuptools==69.5.1
shapely==2.0.4
shellingham==1.5.4
simpervisor==1.0.0
six==1.16.0
smart-open==7.0.4
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.29
sqlparse==0.5.0
stack-data==0.6.2
starlette==0.37.2
statsmodels==0.14.2
strip-hints==0.1.10
tabulate==0.9.0
tangled-up-in-unicode==0.2.0
tenacity==8.2.3
tensorboard==2.15.2
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
tensorflow==2.15.1
tensorflow-cloud==0.1.16
tensorflow-data-validation==1.15.1
tensorflow-datasets==4.9.4
tensorflow-estimator==2.15.0
tensorflow-hub==0.15.0
tensorflow-io==0.24.0
tensorflow-io-gcs-filesystem==0.24.0
tensorflow-metadata==1.15.0
tensorflow_model_analysis==0.46.0
tensorflow-probability==0.24.0
tensorflow-serving-api==2.15.1
tensorflow-transform==1.15.0
termcolor==2.4.0
terminado==0.18.1
textual==0.57.1
tf_keras==2.15.1
tfx==1.15.1
tfx-bsl==1.15.1
threadpoolctl==3.4.0
tifffile==2024.4.18
time-machine==2.14.1
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.4
tqdm==4.66.2
traitlets==5.9.0
truststore==0.8.0
typeguard==4.2.1
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.11.0
typing-utils==0.1.0
tzdata==2024.1
tzlocal==5.2
uc-micro-py==1.0.3
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.18
uvicorn==0.29.0
uvloop==0.19.0
virtualenv==20.21.0
visions==0.7.5
watchfiles==0.21.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
websockets==12.0
Werkzeug==2.1.2
wheel==0.43.0
widgetsnbextension==3.6.6
witwidget==1.8.1
wordcloud==1.9.3
wrapt==1.14.1
y-py==0.6.2
yarl==1.9.4
ydata-profiling==4.6.0
ypy-websocket==0.8.4
zipp==3.17.0
zstandard==0.22.0

Describe the expected behavior

The worker stages should start successfully, write worker logs, and complete. For example:

[screenshot: successful Dataflow worker logs]

Note the image above was achieved using the following beam args:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=northamerica-northeast1-docker.pkg.dev/prj-cxbi-dev-nane1-dsc-ttep/ml',
]
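For illustration, the --job_name construction above (lowercasing and swapping underscores for hyphens) can be wrapped in a small helper; Dataflow job names are restricted to lowercase letters, digits, and hyphens, so a defensive version might look like this (the helper name is mine, not from this thread):

```python
import re

def dataflow_job_name(pipeline_name: str, suffix: str) -> str:
    """Build a Dataflow-safe job name: lowercase letters, digits, hyphens only."""
    name = f"{pipeline_name}-{suffix}".lower().replace("_", "-")
    # Drop any remaining characters Dataflow would reject.
    return re.sub(r"[^-a-z0-9]", "", name)
```

For example, dataflow_job_name("My_Pipeline", "examplegen") yields "my-pipeline-examplegen", matching the expression used in the args list above.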

However this is not feasible, because I need to get my custom code into the image for components which reference it (Transform, Trainer, Evaluator). For example, I get the following error with the Transform component (as I should):

Error processing instruction process_bundle-789893474543504259-101. Original traceback is
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/apache_beam/internal/dill_pickler.py", line 418, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'models'
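For anyone wondering why the failure surfaces only at unpickle time: Beam serializes UDFs by reference (module path plus attribute name), so the defining module must be importable on the worker. A self-contained illustration, using a throwaway fake_models module rather than the actual pipeline code:

```python
import pickle
import sys
import types

# Stand-in for the pipeline's 'models' package (hypothetical name).
def preprocess(x):
    return x * 2

preprocess.__module__ = "fake_models"      # pretend it lives in fake_models
mod = types.ModuleType("fake_models")
mod.preprocess = preprocess
sys.modules["fake_models"] = mod

# Pickled by reference ("fake_models.preprocess"), not by value.
payload = pickle.dumps(preprocess)

# Simulate a worker whose container lacks the module.
del sys.modules["fake_models"]
try:
    pickle.loads(payload)
except ModuleNotFoundError as err:
    print("worker-side failure:", err)
```

This is the same failure mode as the traceback above: the pickle itself is fine, but the worker container has no 'models' on its PYTHONPATH.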

Standalone code to reproduce the issue

My files are arranged as such:

- my_pipeline (the name of my pipeline)
     - data
     - models
          - model
              -  __init__.py
              - constants.py
              - model.py
              - predict_extractor.py
              - tuner.py
          - __init__.py
          - features.py
          - preprocessing.py
          - query.sql
     - pipeline
          - __init__.py
          - configs.py
          - pipeline.py
     - __init__.py
     - Dockerfile
     - kubeflow_v2_runner.py
     - Makefile
     - requirements.txt 

Dockerfile:

FROM tensorflow/tfx:1.15.1
WORKDIR /pipeline
COPY ./ ./
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
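Given the layout and Dockerfile above, one quick sanity check is an import probe run inside the container (the module names come from the directory listing; the script itself is my sketch, not part of the pipeline):

```python
import importlib

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ModuleNotFoundError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    # Run inside the container; an empty list means PYTHONPATH is wired up.
    print(missing_modules(["models", "models.preprocessing", "pipeline.pipeline"]))
```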

Apache Beam Args:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',
]

KubeFlow Runner & Config:

runner_config = kubeflow_v2_dag_runner.KubeflowV2DagRunnerConfig(
        default_image=configs.PIPELINE_IMAGE)

dsl_pipeline = pipeline.create_pipeline(**args)

runner = kubeflow_v2_dag_runner.KubeflowV2DagRunner(config=runner_config)
runner.run(pipeline=dsl_pipeline)

Component init in pipeline.py:

components = []

example_gen = tfx.components.CsvExampleGen(input_base=data_path)

if example_gen_beam_args is not None:
    example_gen.with_beam_pipeline_args(example_gen_beam_args)

components.append(example_gen)


return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        # Change this value to control caching of execution results. Default value is `False`.
        enable_cache=True,
        metadata_connection_config=metadata_connection_config
)

@adammkerr adammkerr changed the title Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker. Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker Sep 5, 2024
@lego0901 lego0901 self-assigned this Sep 25, 2024
@pritamdodeja

I have a fully functional TFX pipeline that was running in Vertex a couple of months ago. The same pipeline no longer works; it does not even get past ExampleGen. I am also running into Dataflow-related issues. Is there a canonical pipeline that can be run to verify Vertex/Dataflow is working as expected? I am able to get that same pipeline to run locally via DirectRunner. How should one go about porting local TFX pipelines to the Vertex/Dataflow runner?

@IzakMaraisTAL
Contributor

stockUnpickler.load(self) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class return StockUnpickler.find_class(self, module, name) ModuleNotFoundError: No module named 'models'

It sounds like there is some path related problem. The Dockerfile looks reasonable.

Have you tried debugging this by running the container locally and then confirming that there is a module named models?

docker run --rm -it --entrypoint /bin/bash <your image>

@adammkerr
Author

adammkerr commented Oct 1, 2024

stockUnpickler.load(self) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class return StockUnpickler.find_class(self, module, name) ModuleNotFoundError: No module named 'models'

It sounds like there is some path related problem. The Dockerfile looks reasonable.

Have you tried debugging this by running the container locally and then confirming that there is a module named models?

docker run --rm -it --entrypoint /bin/bash <your image>

Sorry, my description above was a bit unclear; let me clarify.

Attempt 1: Extend the official Docker image and add my custom code
If I extend the official image and pre-build a container containing my custom code (as shown in my Dockerfile), the Dataflow worker hangs.

So if I use the following .with_beam_pipeline_args(example_gen_beam_args) to launch a Dataflow job, it hangs:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',
]

Attempt 2: use --prebuild_sdk_container_engine

So if I instead use the following .with_beam_pipeline_args(example_gen_beam_args) to launch a Dataflow job, the container does not hang:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=northamerica-northeast1-docker.pkg.dev/prj-cxbi-dev-nane1-dsc-ttep/ml',
]

However, this isn't a valid solution, because these flags pre-build the SDK container using Cloud Build, and that image does not contain any of the custom code required for downstream components (Transform, Trainer; hence the No module named 'models' error you referenced).

Attempt 3: use custom container, without Dataflow
What is even more puzzling is that if I use my custom container from attempt 1 and remove .with_beam_pipeline_args(example_gen_beam_args) from each component for which I am trying to leverage Dataflow, the pipeline completes.

Which tells me the bug isn't in how the TFX lib leverages Apache Beam itself, but in how it leverages Apache Beam with Dataflow.

@IzakMaraisTAL
Contributor

IzakMaraisTAL commented Oct 1, 2024

OK, that narrows it down a bit.

We also extend the TFX base image and successfully use the custom image with Dataflow. I don't have the time to try to reproduce your ticket (I am a user of the project, not a dev), but I can share our working config.

Here is what we pass into .with_beam_pipeline_args:

DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=n2-standard-2",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
]

Hope that helps.

@adammkerr
Author

adammkerr commented Oct 1, 2024

OK, that narrows it down a bit.

We also extend the TFX base image and successfully use the custom image with Dataflow. I don't have the time to try to reproduce your ticket (I am a user of the project, not a dev), but I can share our working config.

Here is what we pass into .with_beam_pipeline_args:

DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=n2-standard-2",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
]

Hope that helps.

Unfortunately no luck @IzakMaraisTAL; my Dataflow worker still emits the following logs/error in an endless loop:

[screenshot: repeating worker start-up log loop]

Thanks though for the suggestions! What TFX version are you using? If I roll back to 1.12.0, my exact same code works for Dataflow. Any version greater than 1.12.0 seems to hang with the same code/setup.

@IzakMaraisTAL
Contributor

Thanks though for the suggestions! What TFX version are you using? If I roll back to 1.12.0, my exact same code works for Dataflow. Any version greater than 1.12.0 seems to hang with the same code/setup.

Aah, another clue.

Might this be related to one of these tickets: #6368, #6386? We currently use TFX 1.14 with the following Dockerfile:

FROM tensorflow/tfx:1.14.0
WORKDIR /pipeline

COPY requirements.txt requirements.txt
# Fix pip in the tfx:1.14.0 base image to use the correct python environment (3.8, not 3.10)
RUN sed -i 's/python3/python/g' /usr/bin/pip

RUN pip install  -r requirements.txt

ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
# Fix dataflow job sometimes not starting 
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY version \
    vertex_runner.py \
    config.py ./
COPY ml_pipeline ./ml_pipeline

@adammkerr
Author

adammkerr commented Oct 2, 2024

Thanks though for the suggestions! What TFX version are you using? If I roll back to 1.12.0, my exact same code works for Dataflow. Any version greater than 1.12.0 seems to hang with the same code/setup.

Aah, another clue.

Might this be related to one of these tickets: #6368, #6386? We currently use TFX 1.14 with the following Dockerfile:

FROM tensorflow/tfx:1.14.0
WORKDIR /pipeline

COPY requirements.txt requirements.txt
# Fix pip in the tfx:1.14.0 base image to use the correct python environment (3.8, not 3.10)
RUN sed -i 's/python3/python/g' /usr/bin/pip

RUN pip install  -r requirements.txt

ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
# Fix dataflow job sometimes not starting 
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY version \
    vertex_runner.py \
    config.py ./
COPY ml_pipeline ./ml_pipeline

If I roll back to 1.14.0 this worked for me!

However, using the same Dockerfile but extending from 1.15.1 instead still gives me this endless loop in my Dataflow worker log:

[screenshot: repeating worker start-up log loop]

I think I am going to roll back to 1.14.0 for now since it seems stable; however, I am going to leave this issue open because there still seems to be a bug in 1.15.1.

@IzakMaraisTAL thank you for all your help! Really really appreciated :)

@pritamdodeja

How would one go about debugging something like this? Would the following be a possible/right approach?

  1. Get it to work with direct runner, verify correctness
  2. Execute with portable runner, see if vanilla beam container works (use gcs as filesystem)
  3. If 2 works, then it's something with Dataflow; if it doesn't, inject a Python debugger into the container. In this scenario, how do we enter the runtime? I understand the containers have a Go binary as entrypoint.

Thanks for sharing your knowledge!

@adammkerr
Author

How would one go about debugging something like this? Would the following be a possible/right approach?

  1. Get it to work with direct runner, verify correctness
  2. Execute with portable runner, see if vanilla beam container works (use gcs as filesystem)
  3. If 2 works, then it's something with Dataflow; if it doesn't, inject a Python debugger into the container. In this scenario, how do we enter the runtime? I understand the containers have a Go binary as entrypoint.

Thanks for sharing your knowledge!

That is a pretty solid approach in my opinion; my approach was:

  1. Develop the pipeline logic in a notebook using InteractiveContext: ensure that I can call each component with the specified UDFs.
  2. Test your code: write pytests for each UDF that you create (any module_path / module_file provided to any component).
  3. Launch the job with no beam args: remove .with_beam_pipeline_args from each component which leverages Beam. The default behavior will execute the Beam jobs in your specified container using DirectRunner.
  4. If a Beam-backed component fails, follow the error.
  5. If all Beam-backed components are successful, add .with_beam_pipeline_args back to each component which leverages Beam. Ensure that you specify beam args to leverage Dataflow (e.g. --runner="DataflowRunner").
  6. Launch the job, watch the logs, and if a Dataflow job fails, follow the error.

In this case, since the logs indicate we are stuck in an endless loop upon worker start-up, the idea would be to hop into the code and see how the respective component is launching the Dataflow job and what is going on, which is something I am currently struggling with (hence the bug report here).
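The numbered steps above amount to toggling Beam args per environment. A hedged sketch of that toggle (function names are illustrative, not from this thread; the only TFX fact assumed is that with_beam_pipeline_args returns the component):

```python
def beam_args(use_dataflow, dataflow_args):
    # Steps 3 vs 5 above: empty args keep the in-container DirectRunner default.
    return list(dataflow_args) if use_dataflow else []

def maybe_with_beam_args(component, args):
    # with_beam_pipeline_args returns the component, so chaining is safe.
    return component.with_beam_pipeline_args(args) if args else component
```

This makes it easy to run the whole pipeline once without Dataflow (step 3) and then flip a single flag to re-enable it (step 5).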

@pritamdodeja

I have a fully functional TFX pipeline that was running in Vertex a couple of months ago. The same pipeline no longer works; it does not even get past ExampleGen. I am also running into Dataflow-related issues. Is there a canonical pipeline that can be run to verify Vertex/Dataflow is working as expected? I am able to get that same pipeline to run locally via DirectRunner. How should one go about porting local TFX pipelines to the Vertex/Dataflow runner?

My pipeline issue is fixed. I solved it by doing what the error you're seeing suggests: the Beam SDK version (what we see in pip output) should match the "boot" version of Apache Beam (i.e. the entrypoint of that Docker container, the Go binary). I got there by porting my pipeline to a local version that uses PortableRunner, debugging there, and then porting back to GCP.
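A minimal check for the mismatch described above, comparing the pip-installed apache-beam version against the version the worker boot binary expects. Whether Dataflow tolerates a .dev suffix is my assumption; note that the freeze lists earlier in this issue show apache-beam==2.58.1 in the submitting environment but apache-beam==2.56.0 inside the tfx:1.15.1-based image, which is consistent with this diagnosis:

```python
def beam_versions_match(installed, boot_expected):
    """True when the pip-installed apache-beam version equals the version the
    worker boot binary expects, ignoring any .dev suffix (assumption)."""
    strip = lambda v: v.split(".dev")[0]
    return strip(installed) == strip(boot_expected)
```

For example, beam_versions_match("2.56.0", "2.58.1") is False, mirroring the hanging-worker setup reported here.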
