[Bug]: grpcio versions from 1.59.0 to 1.62.1 can cause Beam Python pipelines to get stuck #30867
@tvalentyn FYI.
Thanks.
Double checking, is this still the case?
cc: @damccorm
Actually still happens on 1.60.0 for me. It just looked fixed for a brief moment.
Ok. Thank you very much for reporting the issue; please let us know if you have more information, as that might also help the grpcio maintainers.
@tvalentyn Some further investigation might point at a combination of the protobuf library version with grpcio: grpc/grpc#36256 (comment)
Can you share Dataflow Job IDs where you've seen this error?
Also, what are the exact errors you are seeing?
@tvalentyn Job ID: 2024-04-03_08_14_02-12227946365357908481 in europe-west4. It also seems to manifest in the Google Cloud Console Dataflow Job viewer UI locking up in the browser until the browser considers the tab unresponsive, while a fixed job stays responsive.
I am observing the pattern that jobs you start with the Beam 2.55.0 SDK have many of these errors, and they appear fairly early in pipeline execution. Dataflow workers serve the SDK status page on localhost:8081/sdk_status, and it can be queried manually. Would it be possible to take a closer look at the differences between your 2.55.0 and 2.53.0 setups to narrow down the exact change that increases instances of these errors? For example: upgrading/downgrading a dependency X and doing nothing else increases/decreases instances of this error. I'll also try to repro this issue myself.
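A status probe along these lines can be sketched as follows (the `fetch_sdk_status` helper is my naming; only the `localhost:8081/sdk_status` endpoint comes from the comment above):

```python
import urllib.request

def fetch_sdk_status(host="localhost", port=8081, timeout=5):
    """Fetch the thread dump served by the Beam SDK harness status page.

    Returns the status text, or None if the harness does not answer,
    which is the symptom described in this issue.
    """
    url = f"http://{host}:{port}/sdk_status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None
```

This is roughly equivalent to running `curl localhost:8081/sdk_status` on the worker VM.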
That is likely an unrelated UI issue.
One other thing I would try: use the Beam 2.55.0 setup but downgrade protobuf to "protobuf==3.20.3".
I see instances of "Unable to retrieve status info from SDK harness" in executions of some Beam integration tests, for example apache_beam.examples.cookbook.bigtableio_it_test.BigtableIOWriteTest; it might be reproducible in other Beam tests as well. We should be able to use these for a repro.
I was able to repro the issue with a couple of executions of an integration test.
Follow-up investigation: after SSHing into the VM, the status page is unresponsive. However, we can ssh into the docker container and use pystack, which reveals the stack traces summarized below.
TLDR is that a thread that executes bigtable/transports/grpc.create_channel() later calls into what is likely a Python extension, cygrpc.Channel(), which holds the GIL indefinitely, so other threads cannot run; hence the SDK is not responsive to /sdk_status RPC calls. https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact also explains this failure mode.
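For context, an in-process analogue of such a thread dump can be built on sys._current_frames(); the catch, per the TLDR above, is that a C extension holding the GIL prevents any in-process Python code (including a dumper thread) from running, which is exactly why an out-of-process tool like pystack is needed. A sketch (the function name is mine):

```python
import sys
import threading
import traceback

def dump_all_stacks():
    """Format the current stack of every Python thread, pystack-style.

    Only works while the GIL can be acquired; a thread stuck inside a
    C extension holding the GIL blocks this too, hence pystack.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for tid, frame in sys._current_frames().items():
        chunks.append(f"--- Thread {names.get(tid, '?')} ({tid}) ---")
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)
```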
@DerRidda would it be possible for you to re-run your job using the Beam 2.55 SDK, then find a stuck worker VM and retrieve stacktraces with pystack as I did in #30867 (comment)? Note: you might have to use Dataflow Classic instead of Dataflow Prime to be able to access Dataflow workers. To find a stuck VM, look for "Unable to retrieve status info from SDK harness." logs, then find which worker emits those logs by expanding the logging entry in Cloud Logging; the expanded entry names the worker. Then, SSH to that VM from the UI or via a gcloud command, log into the running python container in privileged mode, and run pystack.
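The procedure above might look like the following (instance, container, and process identifiers are placeholders; pystack's `remote` subcommand attaches to a running PID):

```shell
# 1. SSH to the worker VM named in the Cloud Logging entry (Dataflow Classic):
gcloud compute ssh <worker-vm-name> --zone=<zone> --project=<project>

# 2. On the VM, find the Python SDK harness container and enter it with
#    privileges sufficient for ptrace:
docker ps                                   # locate the SDK harness container
docker exec -it --privileged <container-id> /bin/bash

# 3. Inside the container, install pystack and dump the stacks of the
#    stuck SDK harness process:
pip install pystack
ps aux | grep python                        # find the harness PID
pystack remote <pid>
```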
@tvalentyn Sorry I didn't reply sooner, but I can't really siphon off the time to repro this issue more on Dataflow. Seeing your latest comments in the grpcio bug report, I assume you have what you need now, though?
I am not certain they are the same bugs. |
we can fix the bug we are currently investigating and then come back to your issue and see if you still reproduce it. |
There is a confirmed issue googleapis/python-bigtable#949 that affects the google-api-core and grpcio libraries, which caused a regression in Apache Beam 2.55.0. It will be fixed in the upcoming release of grpcio and mitigated in Beam 2.56.0. For affected 2.55.0 users, any of the following mitigations should help:
install any of the following dependency combinations in the Beam pipeline runtime environment (for example, you can use a --requirements_file pipeline option):
So, I removed all other dependency pins and just updated grpcio to 1.63.0rc1 (didn't even touch grpcio-status at the beginning) on SDK version 2.55.1, with this patch manually applied to rule it out as a potential reason for pipeline stalling. So far I can no longer repro the issue; my job performs normally under load. I have now also added the grpcio-status pin and updated in place; I will check once the job is back to expected full scaling, but I don't expect issues here.
@tvalentyn is the fix here basically just disallowing 1.59.0-1.62.1 in Line 368 in e59d313?
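A hypothetical exclusion of that range via PEP 440 specifiers could look like the sketch below (illustrative only; per the reply, Beam relied on a grpcio patch release rather than rolling back):

```python
# Hypothetical: exclude the affected grpcio releases in a setup.py-style
# dependency list. Not what Beam ultimately shipped.
install_requires = [
    "grpcio!=1.59.*,!=1.60.*,!=1.61.*,!=1.62.0,!=1.62.1",
]
```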
We have at least two changes in Beam that depend on new versions of GRPCIO, we'd have to roll them back. I am discussing with GRPC maintainers a possibility for a 1.62.2 patch release. Update: 1.62.2 has been released. I updated the issue description to explain the rootcause and mitigation options. |
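For affected pipelines, a requirements file passed via --requirements_file might simply pin the patched release (a sketch; 1.62.2 carries the fix per the root cause described in this issue, and the grpcio-status pin mirrors the earlier comment):

```
# requirements.txt (passed with --requirements_file requirements.txt)
grpcio>=1.62.2
grpcio-status>=1.62.2
```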
What happened?
A combination of software releases in the Beam dependency chain has surfaced a failure mode that can cause unexplained pipeline stuckness. The issue affects Apache Beam 2.55.0 and 2.55.1, but may potentially affect other SDKs when the pipeline runtime environment has google-api-core version 2.17.0 or above, AND a grpcio version in the range 1.59.0 <= grpcio <= 1.62.1.

Symptoms
Beam pipelines might get stuck. Dataflow jobs might have errors like:
- Unable to retrieve status info from SDK harness
- There are 10 consecutive failures obtaining SDK worker status info
- SDK worker appears to be permanently unresponsive. Aborting the SDK.
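A check for the affected dependency combination described under "What happened?" can be sketched as follows (helper names are mine; the toy parser ignores pre-release suffixes, so prefer packaging.version in real code):

```python
def _ver(s):
    """Toy version parse: '1.62.1' -> (1, 62, 1). Ignores rc/dev suffixes."""
    parts = []
    for piece in s.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)

def is_affected(grpcio_version, api_core_version):
    """True if the environment matches the affected combination:
    1.59.0 <= grpcio <= 1.62.1 AND google-api-core >= 2.17.0."""
    g = _ver(grpcio_version)
    return (_ver("1.59.0") <= g <= _ver("1.62.1")
            and _ver(api_core_version) >= _ver("2.17.0"))
```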
Mitigation
Upgrade to Apache Beam 2.56.0 or above once available; until then, install any of the following dependency combinations in the Beam pipeline runtime environment. You can define dependencies in the pipeline runtime environment using a --requirements_file pipeline option or other options outlined in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. Users of Apache Beam 2.55.0 might be able to avoid the issue by downgrading to apache-beam==2.54.0, since the default containers for the runtime environment have a set of dependencies that does not trigger the bug.

Rootcause
The issue was caused by a regression in grpcio==1.59.0 (grpc/grpc#36265), which has now been fixed in grpcio==1.62.2 and above. The regression triggered the failure mode when used with google-api-core==2.17.0 and above.

Description updated: 2024-04-23.
Original description:
Update of the Python grpcio dependency to version 1.62.1 caused Dataflow job stalling, with excessive waits for responses in gRPC multi-threaded rendezvous, probably somewhere in the SDK worker. Upstream issue exists here: grpc/grpc#36256

Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components