
TFX1.14.0 causing Google Cloud Dataflow jobs to fail #6386

Closed
coocloud opened this issue Oct 20, 2023 · 27 comments

@coocloud

coocloud commented Oct 20, 2023

System information

  • Have I specified the code to reproduce the issue: Yes
  • Environment in which the code is executed: Google Cloud Dataflow
  • TensorFlow version: 2.13.0
  • TFX Version: 1.14.0
  • Python version: 3.8.10
  • Python dependencies: Docker Image

Describe the current behavior
When running the BigQueryExampleGen component on Google Cloud Dataflow with TFX 1.14.0, the Dataflow job gets stuck with the error:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.

Describe the expected behavior
It should not fail/get stuck.

Standalone code to reproduce the issue

# Imports assumed for this snippet (not included in the original report):
from absl import logging
from tfx import v1 as tfx
from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow.kubeflow_dag_runner import get_default_pipeline_operator_funcs
from tfx.proto import example_gen_pb2

# GOOGLE_CLOUD_PROJECT, TEMP_LOCATION, GOOGLE_CLOUD_REGION, SUBNETWORK, GROUP, TEAM,
# PROJECT, PIPELINE_NAME and PIPELINE_ROOT are project-specific placeholders.
PIPELINE_IMAGE = "tensorflow/tfx:1.14.0"
DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=e2-standard-8",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"‑‑experiments=use_sibling_sdk_workers",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
    "--job_name=test-tfx-1-14",
]

def run():
    query = {
        "train": "SELECT 1 as one",
        "eval": "SELECT 1 as one",
        "test": "SELECT 1 as one",
    }

    input_config = example_gen_pb2.Input(
        splits=[
            example_gen_pb2.Input.Split(name="train", pattern=query["train"]),
            example_gen_pb2.Input.Split(name="eval", pattern=query["eval"]),
            example_gen_pb2.Input.Split(name="test", pattern=query["test"]),
        ]
    )

    BQ_BEAM_ARGS = [
        f"--project={GOOGLE_CLOUD_PROJECT}",
        f"--temp_location={TEMP_LOCATION}",
    ]

    example_gen = BigQueryExampleGen(
        input_config=input_config
    ).with_beam_pipeline_args(DATAFLOW_BEAM_PIPELINE_ARGS)

    metadata_config = (
        tfx.orchestration.experimental.get_default_kubeflow_metadata_config()
    )
    pipeline_operator_funcs = get_default_pipeline_operator_funcs()

    runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config,
        tfx_image=PIPELINE_IMAGE,
        pipeline_operator_funcs=pipeline_operator_funcs
    )

    pod_labels = {
        "add-pod-env": "true",
        tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: "tfx-template",
    }

    tfx.orchestration.experimental.KubeflowDagRunner(
        config=runner_config,
        pod_labels_to_attach=pod_labels
    ).run(
        pipeline=pipeline.Pipeline(
            pipeline_name=PIPELINE_NAME,
            pipeline_root=PIPELINE_ROOT,
            components=[example_gen],
            beam_pipeline_args=BQ_BEAM_ARGS
        )
    )


if __name__ == "__main__":
    logging.set_verbosity(logging.INFO)
    run()

Other info / logs
The job fails after 1 hour, regardless of the machine type or query used.
Setting PIPELINE_IMAGE to TFX 1.13.0 still fails; it currently only works with TFX 1.12.0.

@IzakMaraisTAL
Contributor

IzakMaraisTAL commented Oct 23, 2023

I also observed this going from TFX 1.12.0 to 1.14.0. My only Dataflow component is the Transform component, so it is the one that gets stuck.

Here are some metrics:

  • There is no throughput.
  • CPU usage has a cyclical pattern.
  • It is not a lack of memory: the job is only using a fraction of the available memory.

@singhniraj08 singhniraj08 self-assigned this Oct 25, 2023
@singhniraj08
Contributor

@coocloud,

I came across a few troubleshooting steps to debug the issue further. Could you please follow the steps shown in Troubleshooting Dataflow errors and let us know what is causing this error? This will help us a lot in finding the root cause of the issue. Thank you!

@coocloud
Author

@singhniraj08

These are the errors in the logs:

System logs:

  • ima: Can not allocate sha384 (reason: -2)

Kubelet logs:

  • "Error initializing dynamic plugin prober" err="error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system"
  • "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"

It also seems like the worker is struggling to download the container image.

@AnuarTB
Contributor

AnuarTB commented Nov 16, 2023

@coocloud could you please provide us with the Dataflow job ID?

@coocloud
Author

@AnuarTB This is the Dataflow job ID that failed:
2023-10-24_02_28_52-1220281879361432238

@singhniraj08
Contributor

@coocloud,

This issue was previously reported by another TFX user; apologies for missing that. Dataflow struggles to fetch the TFX image because the image is very large.
The suggested fix is to set the flag --experiments=disable_worker_container_image_prepull in the pipeline options. Ref: similar issue

Please try the above solution and let us know if you face any issues. Thank you!
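For illustration, the flag just goes into the existing Beam pipeline options, e.g. against the args from the original report (a sketch; other args unchanged):

DATAFLOW_BEAM_PIPELINE_ARGS = [
    # ... project, runner, region, subnetwork, etc. as in the reproduction above ...
    "--experiments=use_runner_v2",
    # Skip pre-pulling the large TFX container image during worker startup:
    "--experiments=disable_worker_container_image_prepull",
]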

@coocloud
Author

coocloud commented Nov 17, 2023


@singhniraj08
I have added that flag, but I am still getting the same error.

Dataflow job id: 2023-11-17_04_23_23-6315535713304255245

@liferoad

Can you share your Dataflow job id?

@jonathan-lemos

@coocloud

Regarding 2023-11-17_04_23_23-6315535713304255245, it looks like the image is so big that you ran out of disk space. Try bumping it up.
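For example, a sketch of that change against the args from the original report (100 GB is just an illustrative value):

DATAFLOW_BEAM_PIPELINE_ARGS = [
    # ... other args unchanged ...
    "--disk_size_gb=100",  # bumped from the original 50
]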

@coocloud
Author

Increasing the disk space also didn't seem to fix the issue.
Job id: 2023-11-20_05_04_45-14348954509617130059


The size of the image is around 11 GB, so surely the disk should be enough?

@liferoad

It is not about the image size. If you check your worker-startup log, this error occurred:

2023/11/20 13:11:49 failed to create a virtual environment. If running on Ubuntu systems, you might need to install `python3-venv` package. To run the SDK process in default environment instead, set the environment variable `RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1`. In custom Docker images, you can do that with an `ENV` statement. Encountered error: python venv initialization failed: exit status 1

@ee07dazn

Any update on this? I am facing the exact same issue highlighted above.

@liferoad

Check my comment in #6386 (comment). If you see that error, you can shell into your container with docker run --rm -it --entrypoint=/bin/bash YOUR_CONTAINER_IMAGE and check whether venv is installed. Or, when building your container, use an ENV statement to define RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1.

@ee07dazn

Thanks @liferoad: I got the exact same error, so I built a custom image that is basically TFX 1.14.0 with that ENV added, and it all worked fine.
Thank you!!!

@coocloud
Author

@liferoad Thanks, my Dataflow job seems to run successfully after adding that environment variable.

@singhniraj08
Contributor

@coocloud,

Closing this issue since it is resolved for you. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have queries on this. Thank you!


@IzakMaraisTAL
Contributor

@singhniraj08, should this environment variable not be added to the TFX base image before the issue is closed? Is the TFX base image not intended to be used to run TFX jobs (on Vertex AI or Kubeflow)? Those TFX jobs might reasonably include Dataflow components.

@singhniraj08
Contributor

@IzakMaraisTAL, yes, it makes more sense to add the environment variable to the TFX base image to avoid these issues in the future. I have to make sure it doesn't break any other scenarios where the Dockerfile is used apart from Dataflow.
Reopening this issue. We will update this thread. Thank you for bringing this up!

@singhniraj08 singhniraj08 reopened this Nov 28, 2023
@jonathan-lemos

I want to mention that, on the Dataflow side, it seems that when the job is cancelled after the 1-hour timeout, the pip install of the TFX package is still running with high CPU usage. I can reproduce this issue locally with:

pyenv global 3.8.10 # issue also occurs on 3.10.8
mkdir /tmp/venv
python -m venv /tmp/venv
source /tmp/venv/bin/activate

pip install 'tfx==1.14.0'

This outputs logs such as:

INFO: pip is looking at multiple versions of exceptiongroup to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of anyio to determine which version is compatible with other requirements. This could take a while.
Collecting anyio>=3.1.0
  Downloading anyio-4.0.0-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 2.3 MB/s 
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
  Using cached anyio-3.7.1-py3-none-any.whl (80 kB)
  Using cached anyio-3.7.0-py3-none-any.whl (80 kB)
  Using cached anyio-3.6.2-py3-none-any.whl (80 kB)
  Using cached anyio-3.6.1-py3-none-any.whl (80 kB)
  Using cached anyio-3.6.0-py3-none-any.whl (80 kB)
  Using cached anyio-3.5.0-py3-none-any.whl (79 kB)
INFO: pip is looking at multiple versions of anyio to determine which version is compatible with other requirements. This could take a while.
  Using cached anyio-3.4.0-py3-none-any.whl (78 kB)
  Using cached anyio-3.3.4-py3-none-any.whl (78 kB)
  Using cached anyio-3.3.3-py3-none-any.whl (78 kB)
  Using cached anyio-3.3.2-py3-none-any.whl (78 kB)
  Using cached anyio-3.3.1-py3-none-any.whl (77 kB)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
  Using cached anyio-3.3.0-py3-none-any.whl (77 kB)
  Using cached anyio-3.2.1-py3-none-any.whl (75 kB)
  Using cached anyio-3.2.0-py3-none-any.whl (75 kB)
  Using cached anyio-3.1.0-py3-none-any.whl (74 kB)
INFO: pip is looking at multiple versions of jupyter-server to determine which version is compatible with other requirements. This could take a while.
Collecting jupyter-server<3,>=2.4.0
  Using cached jupyter_server-2.10.1-py3-none-any.whl (378 kB)
...

It has been doing so for the last 20+ minutes.

I believe the Dataflow job fails because pip is unable to resolve the versions within the 1-hour time limit.

@singhniraj08
Contributor

@IzakMaraisTAL,

I tried updating the ENV variable in the TFX Dockerfile and building the image, but it takes forever to build because of #6468. The TFX dependencies take a lot of time to install and result in installation failure. Once that issue is fixed, I will be able to integrate the environment variable into the Dockerfile and test it. Thanks.

@jonathan-lemos, thank you for bringing this up. This should be fixed once we fix #6468.

@Saoussen-CH

Hello @singhniraj08 and @liferoad ,

I'm still encountering the same problem, even after adding "--experiments=disable_worker_container_image_prepull" and ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1.

My job ID: 2024-01-28_15_08_34-3324575780147226127

Here is my Dockerfile:

FROM gcr.io/tfx-oss-public/tfx:1.14.0

ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1

COPY requirementsfinal.txt requirements.txt

RUN sed -i 's/python3/python/g' /usr/bin/pip
RUN pip install -r requirements.txt

COPY src/ src/

ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
And the rest of the pipeline code is similar to this:

train_output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(
                name="train", hash_buckets=int(config.NUM_TRAIN_SPLITS)
            ),
            example_gen_pb2.SplitConfig.Split(
                name="eval", hash_buckets=int(config.NUM_EVAL_SPLITS)
            ),
        ]
    )
)

# Train example generation
train_example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query=train_sql_query,
    output_config=train_output_config,
    custom_config=json.dumps({})
).with_beam_pipeline_args(config.BEAM_DATAFLOW_PIPELINE_ARGS).with_id("TrainDataGen")

BEAM_DATAFLOW_PIPELINE_ARGS = [
    f"--project={PROJECT}",
    f"--temp_location={os.path.join(GCS_LOCATION, 'temp')}",
    "--region=us-east1",
    f"--runner={BEAM_RUNNER}",
    "--disk_size_gb=50",
    "--experiments=disable_worker_container_image_prepull",
]

For the runner:

runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(
        default_image=config.TFX_IMAGE_URI
    ),
    output_filename=PIPELINE_DEFINITION_FILE,
)

[Screenshots: the job logs and the worker logs]

@Saoussen-CH

@IzakMaraisTAL and @coocloud, can you please share your config files or anything you did differently to make it work?

@IzakMaraisTAL
Contributor

IzakMaraisTAL commented Jan 29, 2024

This also didn't work for me the first time I tried it. Then I realised you also need to make sure your custom image is used by Dataflow by adding f"--sdk_container_image={PIPELINE_IMAGE}" to BEAM_DATAFLOW_PIPELINE_ARGS.
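For reference, a sketch of the combined Beam args (using the custom image built from the Dockerfile shared above; config.TFX_IMAGE_URI as in the earlier runner config):

BEAM_DATAFLOW_PIPELINE_ARGS = [
    f"--project={PROJECT}",
    f"--temp_location={os.path.join(GCS_LOCATION, 'temp')}",
    "--region=us-east1",
    f"--runner={BEAM_RUNNER}",
    "--disk_size_gb=50",
    "--experiments=disable_worker_container_image_prepull",
    # Point Dataflow workers at the custom image that sets RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1:
    f"--sdk_container_image={config.TFX_IMAGE_URI}",
]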

@Saoussen-CH

It worked! Thank you @IzakMaraisTAL!

@AnuarTB
Contributor

AnuarTB commented Apr 30, 2024

I hope your issue is resolved. If you still have an issue, please reopen this thread.

@AnuarTB AnuarTB closed this as completed Apr 30, 2024
