[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs. #28246
Comments
Users of Beam 2.50.0 SDK should additionally follow the mitigation options for #28318 (also mentioned in the description).
Remaining work: upgrade the protobuf lower bound once their fixes are released.
I'm trying to install protobuf version 4.24.3, which contains the fix, based on
However, apache beam 2.50.0 depends on protobuf (>=3.20.3,<4.24.0). Is this comment meant to address that? I just looked at the PR in detail. Will there be a patch release to include that change, or is it only going to get released in 2.51.0? If so, when is 2.51.0 going to be released?
You should be able to force-install and use the newer version of protobuf without adverse effects in this case, even though it doesn't fit the restriction. The Beam community produces a release roughly every 6 weeks. Re the comment: I was hoping to set a restriction of protobuf>=4.24.3, but it is a bit more involved.
@chleech note that you also need to install the new version of protobuf in the runtime environment.
Got it, thank you! Actually, is it possible to only install it in the runtime environment and not the build-time one?
I'm getting a dependency conflict when trying to build a runtime image with
It's failing with
Is this a known issue?
Unfortunately, I am still seeing a leak on
Must be side effects from the poetry installation. See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies for tips on using constraint files that might help.
Same here! I let the pipeline run for 3 days and still got this plot. I'm glad that I am not the only one. My setup is
@chleech This is being actively investigated. I encourage you to try the other mitigations above in the meantime.
Were you able to get this to work with Beam 2.48.0? The last time I tried, it didn't change anything.
Yes, I tried that a couple of times, and it has an effect in the pipelines I run: the memory growth decreases significantly. Make sure you are specifying the custom image via
What is the status of this? The release branch is cut, but we can cherry-pick a fix if it would otherwise make 2.51.0 unusable.
It's not yet fixed, and unfortunately we don't have a cherry-pick yet; the leak will likely carry over to 2.51.0.
You mean 2.52.0?
I meant the leak might carry over to 2.51.0 unless I find a fix before the release and cherry-pick it.
Hey @tvalentyn, any luck fixing the memory issue in 2.51.0?
2.51.0 does not have the fix yet. I can confirm with fairly high confidence that memory is leaking during execution metrics collection in sdks/python/apache_beam/runners/worker/bundle_processor.py, lines 1188 to 1192 at commit 104c10b.
@tvalentyn that sounds really promising. Thank you for your hard work, can't wait!
The memory appears to be lost when creating references here: https://github.com/apache/beam/blob/47d0fd566f86aaad35d26709c52ee555381823a4/sdks/python/apache_beam/runners/worker/bundle_processor.py#L1189C1-L1190C32 , even if we don't collect any metrics later. Filed protocolbuffers/protobuf#14571 with a repro for the protobuf folks to take a further look.
@tvalentyn which version of Apache Beam should we use to get the memory fix?
It will be in version 2.52.0, which should be released in the next few weeks.
I'd like to add more info about the investigation process for future reference. Edit: see also https://cwiki.apache.org/confluence/display/BEAM/Investigating+Memory+Leaks
Initially, I inspected whether the leaking memory was occupied by objects allocated on the Python heap. That turned out not to be the case, but there are a couple of ways the heap can be inspected:
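For illustration, here is a minimal sketch (not the exact tooling used during this investigation) of two standard-library ways to inspect the Python heap; the helper names and the `top_n` parameter are arbitrary:

```python
# Sketch: two ways to check whether memory growth comes from Python-heap objects.
import gc
import tracemalloc
from collections import Counter


def count_live_objects(top_n=20):
    """Count live Python objects by type; repeated calls reveal unbounded growth."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(top_n)


def top_allocation_sites(top_n=20):
    """Attribute Python-level allocations to source lines with tracemalloc.

    Tracing should be started early (e.g. via PYTHONTRACEMALLOC=1) so that
    allocations made before this call are captured.
    """
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics("lineno")[:top_n]


if __name__ == "__main__":
    for type_name, count in count_live_objects():
        print(type_name, count)
    for stat in top_allocation_sites():
        print(stat)
```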
Then, the suspicion was that the leak might happen when C/C++ memory allocations are not released. Such a leak could be caused by Python extensions used by the Beam SDK or its dependencies. Such leaks might not be visible when inspecting objects that live in the Python interpreter heap, but might be visible when inspecting allocations performed by the Python process using a memory profiler that tracks native allocations. I experimented with substituting the memory allocator library with tcmalloc. It helped to confirm the presence of the leak and attribute it to native allocations rather than Python objects. Substituting the allocator can be done in a custom container, as shown in the sketch below.
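A hedged sketch of such a container follows; the base image tag, package names, library path, and profile path are assumptions that may need adjusting for your Beam version and base distribution:

```dockerfile
# Sketch: custom SDK container that preloads tcmalloc with heap profiling enabled.
FROM apache/beam_python3.10_sdk:2.50.0

# gperftools provides libtcmalloc and the heap profiler (package names assumed).
RUN apt-get update \
    && apt-get install -y --no-install-recommends google-perftools libgoogle-perftools-dev \
    && rm -rf /var/lib/apt/lists/*

# Route malloc/free through tcmalloc and periodically dump heap profiles.
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
ENV HEAPPROFILE=/tmp/beam_heap_profile
```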
Analyzing the profile needs to happen in the same or an identical environment to the one where the profiled binary runs, with access to symbols from the shared libraries used by the profiled binary. To access the Dataflow worker environment, one can SSH to the VM and run commands in the running Docker container, for example:
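A sketch of the access commands; the VM name, zone, project, and container id are placeholders:

```sh
# SSH into the Dataflow worker VM.
gcloud compute ssh <worker-vm-name> --zone=<zone> --project=<project>

# On the VM: locate the Python SDK harness container and open a shell in it.
docker ps
docker exec -it <sdk-harness-container-id> /bin/bash
```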
For information on analyzing heap dumps collected with tcmalloc, see: https://gperftools.github.io/gperftools/heapprofile.html
I tried several other profilers and had the most success with memray (https://pypi.org/project/memray/).
Instrumenting the Beam SDK container to use memray required changing the Beam container and its entrypoint. However, rebuilding the container from scratch is a bit slow, so to shorten the feedback loop we can rebuild only the boot entrypoint and include the updated entrypoint in a preexisting image, for example:
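A hypothetical sketch of such an image follows; the base image tag is an example, and the `/opt/apache/beam/boot` entrypoint path reflects the usual Beam container layout but should be verified for your image:

```dockerfile
# Sketch: layer memray and a locally rebuilt boot entrypoint onto a prebuilt SDK image.
FROM apache/beam_python3.10_sdk:2.50.0

# Install memray inside the container so the harness process can be profiled.
RUN pip install --no-cache-dir memray

# Overwrite the container entrypoint with a locally rebuilt boot binary
# (e.g. built from sdks/python/container/boot.go with the profiling changes).
COPY boot /opt/apache/beam/boot
```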
Retrieving the profile and creating a report required SSHing to the running worker and creating a memray report in the running SDK harness container:
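For example (a sketch; the container id, profile path, and output path are placeholders):

```sh
# On the worker VM: open a shell in the running SDK harness container ...
docker exec -it <sdk-harness-container-id> /bin/bash

# ... and render the collected memray profile as an HTML table report.
python -m memray table -o /tmp/table.html /tmp/<profile>.bin
```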
I found the Memray Table reporter most convenient during my debugging, but other reporters can also be useful. As a reminder, creating a report from a profile needs to happen in the same or an identical environment to the one where the profile was created. An identical environment might be a container started from the same image, but I created my reports on the running worker. It should be possible to simplify the process of collecting and analyzing profiles, and we'll track improvements in #20298.
The table reporter attributed most of the leaked usage to a line in bundle_processor.py. With that info, I reproduced the leak in DirectRunner with a much simpler pipeline and a very simple setup: a small pipeline file run under memray, sketched below.
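The exact file and command are not reproduced here; a hypothetical stand-in that exercises many bundles under DirectRunner might look like this (the iteration counts, file names, and output paths are arbitrary):

```python
# repro.py - hypothetical sketch: run many small DirectRunner pipelines in one
# process so that per-bundle state (and any leak tied to it) accumulates.
import apache_beam as beam


def run(iterations=200):
    for _ in range(iterations):
        with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create(range(1000))
                | beam.Map(lambda x: x + 1)
            )


if __name__ == "__main__":
    run()
```

It can then be profiled with, e.g., `python -m memray run -o profile.bin repro.py` followed by `python -m memray table -o table.html profile.bin`.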
The leak is visible in table.html after sorting the table by the Size column, and it increases with the number of iterations. The leak was later attributed to a protobuf regression in protocolbuffers/protobuf#14571.
What happened?
We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to `protobuf==4.x.x`. We root-caused this leak to protocolbuffers/protobuf#14571, and it has been remediated in Beam 2.52.0.

[Update 2023-12-19]: Due to another issue related to the protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0, or switch to Beam 2.53.0 once available.
Mitigation
Until Beam 2.52.0 is released, consider any of the following workarounds:

- Use `apache-beam==2.46.0` or below.
- Install protobuf 3.x in the submission and runtime environment. For example, you can use a `--requirements_file` pipeline option with a file that includes the pin shown after this list. For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
- Use the `python` implementation of protobuf by setting a `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` environment variable in the runtime environment. This might degrade performance, since the python implementation is less efficient. For example, you could create a custom Beam SDK container from a `Dockerfile` like the sketch after this list. For more information, see: https://beam.apache.org/documentation/runtime/environments/
- Install `protobuf==4.25.0` or newer in the submission and runtime environment.
- Users of the Beam 2.50.0 SDK should additionally follow the mitigation options for #28318.
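For illustration, the requirements file for the protobuf 3.x option could contain a single pin; the exact 3.x version below is an example:

```
protobuf==3.20.3
```

And a custom SDK container for the pure-python implementation option might look like the following sketch; the base image tag is an example and should match your Beam and Python versions:

```dockerfile
FROM apache/beam_python3.10_sdk:2.50.0
# Use the pure-Python protobuf implementation instead of the default upb one.
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```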
Additional details
The leak can be reproduced by a pipeline:
Dataflow pipeline options for the above pipeline:
--max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32
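The original repro pipeline code is not included above; a hypothetical stand-in that keeps a worker busy long enough to observe the growth might look like this (element counts and transform names are arbitrary):

```python
# Hypothetical sketch of a long-running pipeline; not the original repro code.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            | "Generate" >> beam.Create(range(10_000_000))
            | "Process" >> beam.Map(lambda x: x * 2)
        )


if __name__ == "__main__":
    run()
```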
The leak was triggered by Beam switching the default `protobuf` package version from 3.19.x to 4.22.x in #24599. The new versions of `protobuf` also switched the default protobuf implementation to the `upb` implementation. The `upb` implementation had two known leaks that have since been mitigated by the protobuf team in protocolbuffers/protobuf#10088 and https://github.com/protocolbuffers/upb/issues/1243. The latest available `protobuf==4.24.4` does not yet have the fix, but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.

Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components