TFX1.14.0 causing Google Cloud Dataflow jobs to fail #6386
Comments
I came across a few troubleshooting steps to debug the issue further. Could you please follow the steps shown in Troubleshooting dataflow error and let us know what is causing this error? This will help us a lot in finding the root cause of the issue. Thank you! |
@coocloud could you please provide us with Dataflow Job ID? |
@AnuarTB This is the Dataflow Job ID that failed |
This issue was previously reported by one of the TFX users, and apologies for missing that. Dataflow struggles to fetch the TFX image because the image is large. Please try the above solution and let us know if you face any issues. Thank you! |
@singhniraj08 Dataflow job id: |
Can you share your Dataflow job id? |
Regarding |
It is not about the image size. If you check your
|
Any update on this? I am facing the exact same issue highlighted above. |
check my comment in #6386 (comment). If you see that error, you can ssh to your container like |
Thanks @liferoad: I got the exact same error, so I built a custom image, which basically runs TFX 1.14.0 but adds that ENV, and it all worked fine. |
@liferoad Thanks, my dataflow job seems to run successfully after adding that environment variable. |
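The custom-image workaround described in the two comments above can be sketched as follows. This is a minimal illustration, not the exact Dockerfile anyone in the thread used: the base image name `tensorflow/tfx:1.14.0` is an assumption, the helper `render_dockerfile` is hypothetical, and the variable name is taken from the Dockerfile fragment posted later in this thread.

```python
# Assumption: the public TFX image tagged for the release discussed in this issue.
BASE_IMAGE = "tensorflow/tfx:1.14.0"


def render_dockerfile(base_image: str = BASE_IMAGE) -> str:
    """Return Dockerfile text for a custom image that sets the workaround variable.

    Hypothetical helper for illustration only.
    """
    lines = [
        f"FROM {base_image}",
        # Variable name as it appears in the Dockerfile fragment later in the thread.
        "ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the Dockerfile so it can be redirected to a file and built.
    print(render_dockerfile(), end="")
```

The printed text would then be saved as a Dockerfile, built, and pushed to a registry that the Dataflow workers can pull from.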
Closing this issue, since it is resolved for you. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have queries on this. Thank you! |
@singhniraj08, should this environment variable not be added to the TFX base image before the issue is closed? Is the TFX base image not intended to be used to run TFX jobs (on Vertex AI or Kubeflow)? Those TFX jobs might reasonably include Dataflow components. |
@IzakMaraisTAL, yes, it will make more sense to add the environment variable to the TFX base image to avoid these issues in the future. I have to make sure that it doesn't break any other scenarios where the Dockerfile is used apart from Dataflow. |
I want to mention that Dataflow-side, it seems that when the job is cancelled after the 1-hour timeout, the
pyenv global 3.8.10 # issue also occurs on 3.10.8
mkdir /tmp/venv
python -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install 'tfx==1.14.0'
Which outputs logs such as
It has been doing so for the last 20+ minutes. I believe the Dataflow job fails because |
I tried to update the ENV variable in the TFX Dockerfile and build the image, but it takes forever to build because of #6468. TFX dependencies take a lot of time to install, which results in installation failure. Once that issue is fixed, I will be able to integrate the environment variable into the Dockerfile and test it. Thanks. @jonathan-lemos, thank you for bringing this up. This should be fixed once we fix issue #6468. |
Hello @singhniraj08 and @liferoad, I'm still encountering the same problem. Even after adding the "--experiments=disable_worker_container_image_prepull" and
My job id:
Here is my code for the Docker image:
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY requirementsfinal.txt requirements.txt
RUN sed -i 's/python3/python/g' /usr/bin/pip
COPY src/ src/
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
and worker logs |
@IzakMaraisTAL and @coocloud, can you share your config files or anything you have done differently to make it work, please? |
This also didn't work for me the first time I tried it. Then I realised you also need to make sure your custom image is used by Dataflow by adding |
It worked !! Thank you @IzakMaraisTAL ! |
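For readers hitting the same problem, the exchange above amounts to two things: build a custom image, and tell Dataflow to use it. The exact option named in the comment is not preserved in this thread. The sketch below assumes it is Apache Beam's `--sdk_container_image` pipeline option, which is one way to point Dataflow workers at a custom container. The helper `build_beam_pipeline_args` and all argument values are hypothetical placeholders; the `--experiments` flag is the one quoted earlier in the thread.

```python
from typing import List


def build_beam_pipeline_args(
    project: str,
    region: str,
    temp_location: str,
    custom_image: str,
) -> List[str]:
    """Assemble Beam pipeline arguments for a Dataflow run with a custom image.

    Hypothetical helper; every value passed in is a placeholder for your own setup.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Assumption: this is the option that makes Dataflow use the custom image.
        f"--sdk_container_image={custom_image}",
        # Flag quoted earlier in this thread.
        "--experiments=disable_worker_container_image_prepull",
    ]


if __name__ == "__main__":
    for arg in build_beam_pipeline_args(
        project="my-project",              # placeholder
        region="us-central1",              # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
        custom_image="gcr.io/my-project/tfx-custom:1.14.0",  # placeholder
    ):
        print(arg)
```

In a TFX pipeline definition, a list like this would typically be passed as `beam_pipeline_args`, so that Beam-based components such as BigQueryExampleGen pick it up.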
I hope your issue is resolved. If you still have an issue, please reopen this thread. |
System information
Describe the current behavior
When running the BigQueryExampleGen component on Google Cloud Dataflow using TFX 1.14.0, the Dataflow job gets stuck with the error:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Describe the expected behavior
It should not fail/get stuck.
Standalone code to reproduce the issue
Other info / logs
The job fails after 1hr, regardless of the machine type or query used.
Setting PIPELINE_IMAGE to TFX 1.13.0 still fails; it currently only works on TFX 1.12.0.