Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow #28142

Closed
2 of 15 tasks
boolangery opened this issue Aug 24, 2023 · 21 comments · Fixed by #30119
Closed
2 of 15 tasks

[Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow #28142

boolangery opened this issue Aug 24, 2023 · 21 comments · Fixed by #30119

Comments

@boolangery
Copy link

What happened?

Hi,

We updated a Pubsub streaming job on Dataflow from 2.46.0 to 2.49.0. See these memory diagrams:

2.46.0 memory utilisation:
Capture d’écran 2023-08-24 à 09 44 18

2.49.0 memory utilisation:
Capture d’écran 2023-08-24 à 09 44 04

We sent back on the 2.46.0 for this job as workers were running out of memory and a lot of lag was introduced.

Do you have an explanation? What changed on memory management between 2.46.0 and 2.49.0?

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Copy link
Contributor

Abacn commented Aug 24, 2023

Hi, could you please raise a customer issue to Dataflow, as jobId / job graph are needed for triaging ?

The symptom reported here is general and hard to find cause without the job info available

@Abacn
Copy link
Contributor

Abacn commented Aug 24, 2023

one thing at least could check is to see if 2.47 and 2.48 had the same symptom thus narrow down the issue

@boolangery
Copy link
Author

Hi, could you please raise a customer issue to Dataflow, as jobId / job graph are needed for triaging ?

The symptom reported here is general and hard to find cause without the job info available

Sure, where I can submit this? Can't find anything in GCP, do you have a link? Thanks

@scwhittle
Copy link
Contributor

This issue appears to occur in 2.48 as well with a pipeline just consuming from Cloud Pubsub.

    _ = pipeline | "Read pubsub" >> io.ReadFromPubSub(
        subscription=sub, with_attributes=True
    )

@liferoad
Copy link
Collaborator

Hi, could you please raise a customer issue to Dataflow, as jobId / job graph are needed for triaging ?
The symptom reported here is general and hard to find cause without the job info available

Sure, where I can submit this? Can't find anything in GCP, do you have a link? Thanks

Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests

@tvalentyn tvalentyn added P1 and removed P2 labels Aug 28, 2023
@tvalentyn
Copy link
Contributor

@boolangery to confirm, was this a Go or Python pipeline?

@chleech
Copy link

chleech commented Aug 28, 2023

I’ve been experiencing the same issue. To validate, we also set up a pipeline that only reads from a single sub, ran it for 2 weeks and the mem is constantly increasing.

Got a response from DF team and their suggestion was to try 2.46.0. Will update here once we manage to test.

IMG_1092

@boolangery
Copy link
Author

@boolangery to confirm, was this a Go or Python pipeline?

A Go one

@boolangery
Copy link
Author

Issue has been created: https://issuetracker.google.com/issues/297918533

@lostluck
Copy link
Contributor

lostluck commented Aug 29, 2023

Adding the following service option when starting the job will let you get / provide CPU and HEAP profiles of the SDK worker in dataflow:

--dataflow_service_options=enable_google_cloud_profiler

From
https://cloud.google.com/dataflow/docs/guides/profiling-a-pipeline#enable_for_pipelines

@tvalentyn
Copy link
Contributor

FYI, we have observed a memory leak in Python SDK, which we correlated with a protobuf dependency upgrade: #28246. This issue may or may not be similar in nature.

@tvalentyn tvalentyn changed the title [Bug]: Memory seems to be leaking on 2.49.0 with Dataflow [Bug]: [Go] Memory seems to be leaking on 2.49.0 with Dataflow Aug 31, 2023
@tvalentyn tvalentyn changed the title [Bug]: [Go] Memory seems to be leaking on 2.49.0 with Dataflow [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow Aug 31, 2023
@kennknowles
Copy link
Member

If this makes the Go SDK unusable in 2.49.0 and beyond then per https://beam.apache.org/contribute/issue-priorities/ I would agree with P1. If it is usable in some cases then P2 is appropriate.

@kennknowles
Copy link
Member

And if P1 it should not be unassigned and should have ~daily updates and block releases.

@boolangery
Copy link
Author

boolangery commented Jan 25, 2024

This issue is still here in 2.53

@lostluck
Copy link
Contributor

@boolangery where does the heap profile show the memory is being held? The heap profile can be collected as described in the earlier comment:

#28142 (comment)

Otherwise, additional information would be useful for me to replicate the issue. A rough throughput, and message size would be very useful.

@lostluck lostluck self-assigned this Jan 25, 2024
@lostluck
Copy link
Contributor

Ah I see #28142 (comment) has been updated with profiles! Thank you.

@lostluck
Copy link
Contributor

The allocation is in makeChannels, which likely means it's the map from instruction/bundle ids to element channels. Something isn't getting cleaned up for some reason.

I believe it's a quick fix, and as the 2.54.0 release manager, I'm going to cherry pick it in once I've got it, since we're still in the "stabilization" phase of the new release. Thank you for your patience and cooperation.

@lostluck
Copy link
Contributor

I've successfully locally reproduced the issue locally using a lightly adjusted local prism runner, executing the pipeline in loopback mode and pprof, and narrowed down the leak to the channel cache in the read loop. It's not as aware of finished instructions as it should be.

Very localized as a fix at least.

@lostluck
Copy link
Contributor

lostluck commented Jan 25, 2024

The root cause is a subtle thing from the design of the Beam FnAPI protocol, but otherwise going to be on an SDK to SDK basis.

Essentially, the data channel and the control channel are coordinated. But they are independant. The data could come in before the bundle that processes that data, but we need to hold onto it. Similarly, the ProcessBundle request could come in earlier, and it needs to wait until the data is ready. Or any particular interleaving of the two.

The leak in the code is from that former case, where we're able to pull in all the data before ProcessBundle even starts up. Unfortunately, the Data channel doesn't know if it may close the Go Channel (elementsChan in the code) that sends the elements to the execution layer, until it knows what BundleDescriptor is being used so it can see if the Bundle uses Timers or not, and if so, how many transforms. In practice, there's likely to only be 2 streams, One for data, and the other for timers, but per the protocol, it could be arbitrary, so the SDK can't make an assumptions.

So the flow causing the leak is:
See Data for an unseen instruction.
Create and cache a elementChan in the read loop.
Get all the data.
Marks off how many "is-last" signals we see. (Once we have all the IsLasts, the read loop never sees a reference to that instruction ever again).
Receives the ProcessBundle request.
Know we have everything, close the channel, so the ProcessBundle can terminate.

But leak is because the read loop never "learns" that the data is complete and it can evict that reader from its cache, since the read loop never sees the instructionID again.

PubSub ends up triggering this behavior because outside of backlog catch up, each bundle is for a single element, so this causes a great deal of readers in the cache.

I should have a PR shortly.

@boolangery
Copy link
Author

Thank you for the explanation and the fix!

@lostluck
Copy link
Contributor

FYI, 2.54.0 is now available. While I'm pretty sure this issue is now resolved, it's good to get confirmation from affected users too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants