[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs. #28246
Comments
Users of Beam 2.50.0 SDK should additionally follow the mitigation options for #28318 (also mentioned in the description).
Remaining work: upgrade the protobuf lower bound once their fixes are released.
I'm trying to install protobuf version 4.24.3, which contains the fix, based on
However, apache beam 2.50.0 depends on protobuf (>=3.20.3,<4.24.0). Is this comment meant to address that? I just looked at the PR in detail. Will there be a patch release to include that change, or is it only going to get released in 2.51.0? If so, when is 2.51.0 going to be released?
You should be able to force-install and use the newer version of protobuf without adverse effects in this case, even though it doesn't fit the restriction. The Beam community produces a release roughly every 6 weeks. Re the comment: I was hoping to set a restriction of protobuf>=4.24.3, but it is a bit more involved.
@chleech note that you also need to install the new version of protobuf in the runtime environment.
Got it, thank you! Actually, is it possible to only install it in the runtime environment and not the build-time one?
I'm getting a dependency conflict when trying to build a runtime image with
It's failing with
Is this a known issue?
Unfortunately, I am still seeing a leak on
Must be side effects from the poetry installation. See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies for tips on using constraint files that might help.
Same here! I let the pipeline run for 3 days and still got this plot. I'm glad that I am not the only one. My setup is
@chleech This is being actively investigated. I encourage you to try the other mitigations above in the meantime.
Were you able to get this to work with Beam 2.48.0? The last time I tried, it didn't change anything.
Yes, I tried that a couple of times, and it has an effect in the pipelines I run: the memory growth decreases significantly. Make sure you are specifying the custom image via
What is the status of this? The release branch is cut, but we can cherry-pick a fix if it would otherwise make 2.51.0 unusable.
It's not yet fixed, and unfortunately we don't have a cherry-pick yet; the leak will likely carry over to 2.51.0.
You mean 2.52.0?
I meant the leak might carry over to 2.51.0 unless I find a fix before the release and cherry-pick it.
Hey @tvalentyn, any luck fixing the memory issue in 2.51.0?
2.51.0 does not have the fix yet. I can confirm with fairly high confidence that memory is leaking during execution metrics collection in sdks/python/apache_beam/runners/worker/bundle_processor.py, lines 1188 to 1192 at commit 104c10b.
@tvalentyn that sounds really promising. Thank you for your hard work, can't wait!
The memory appears to be lost when creating references here: https://github.com/apache/beam/blob/47d0fd566f86aaad35d26709c52ee555381823a4/sdks/python/apache_beam/runners/worker/bundle_processor.py#L1189C1-L1190C32 , even if we don't collect any metrics later. Filed protocolbuffers/protobuf#14571 with a repro for the protobuf folks to take a further look.
@tvalentyn which version of Apache Beam should we use to get the memory fix?
It will be in version 2.52.0, which should be released in the next few weeks.
I'd like to add more info about the investigation process for future reference. Edit: see also https://cwiki.apache.org/confluence/display/BEAM/Investigating+Memory+Leaks
Initially, I inspected whether the leaking memory was occupied by objects allocated on the Python heap. That turned out not to be the case, but there are a couple of ways the heap can be inspected:
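For illustration, here is a minimal sketch (not the exact tooling used during this investigation) of two standard-library ways to inspect the Python heap; the helper names and the `top_n` parameter are arbitrary:

```python
# Sketch: two ways to check whether memory growth comes from Python-heap objects.
import gc
import tracemalloc
from collections import Counter


def count_live_objects(top_n=20):
    """Count live Python objects by type; repeated calls reveal unbounded growth."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(top_n)


def top_allocation_sites(top_n=20):
    """Attribute Python-level allocations to source lines with tracemalloc.

    Tracing should be started early (e.g. via PYTHONTRACEMALLOC=1) so that
    allocations made before this call are captured.
    """
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics("lineno")[:top_n]


if __name__ == "__main__":
    for type_name, count in count_live_objects():
        print(type_name, count)
    for stat in top_allocation_sites():
        print(stat)
```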
Then, the suspicion was that the leak might happen when C/C++ memory allocations are not released. Such a leak could be caused by Python extensions used by the Beam SDK or its dependencies. Such leaks might not be visible when inspecting objects that live in the Python interpreter heap, but might be visible when inspecting allocations performed by the Python process using a memory profiler that tracks native allocations. I experimented with substituting the memory allocator library with tcmalloc. It helped to confirm the presence of the leak and attribute it to native allocations rather than Python objects. Substituting the allocator can be done in a custom container, as shown in the sketch below.
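A hedged sketch of such a container follows; the base image tag, package names, library path, and profile path are assumptions that may need adjusting for your Beam version and base distribution:

```dockerfile
# Sketch: custom SDK container that preloads tcmalloc with heap profiling enabled.
FROM apache/beam_python3.10_sdk:2.50.0

# gperftools provides libtcmalloc and the heap profiler (package names assumed).
RUN apt-get update \
    && apt-get install -y --no-install-recommends google-perftools libgoogle-perftools-dev \
    && rm -rf /var/lib/apt/lists/*

# Route malloc/free through tcmalloc and periodically dump heap profiles.
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
ENV HEAPPROFILE=/tmp/beam_heap_profile
```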
Analyzing the profile needs to happen in the same or an identical environment to the one where the profiled binary runs, with access to symbols from the shared libraries used by the profiled binary. To access the Dataflow worker environment, one can SSH to the VM and run commands in the running Docker container, for example:
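A sketch of the access commands; the VM name, zone, project, and container id are placeholders:

```sh
# SSH into the Dataflow worker VM.
gcloud compute ssh <worker-vm-name> --zone=<zone> --project=<project>

# On the VM: locate the Python SDK harness container and open a shell in it.
docker ps
docker exec -it <sdk-harness-container-id> /bin/bash
```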
For information on analyzing heap dumps collected with tcmalloc, see: https://gperftools.github.io/gperftools/heapprofile.html
I tried several other profilers and had the most success with memray (https://pypi.org/project/memray/).
Instrumenting the Beam SDK container to use memray required changing the Beam container and its entrypoint. However, rebuilding the container from scratch is a bit slow, so to shorten the feedback loop we can rebuild only the boot entrypoint and include the updated entrypoint in a preexisting image, for example:
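A hypothetical sketch of such an image follows; the base image tag is an example, and the `/opt/apache/beam/boot` entrypoint path reflects the usual Beam container layout but should be verified for your image:

```dockerfile
# Sketch: layer memray and a locally rebuilt boot entrypoint onto a prebuilt SDK image.
FROM apache/beam_python3.10_sdk:2.50.0

# Install memray inside the container so the harness process can be profiled.
RUN pip install --no-cache-dir memray

# Overwrite the container entrypoint with a locally rebuilt boot binary
# (e.g. built from sdks/python/container/boot.go with the profiling changes).
COPY boot /opt/apache/beam/boot
```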
Retrieving the profile and creating a report required SSHing to the running worker and creating a memray report in the running SDK harness container:
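For example (a sketch; the container id, profile path, and output path are placeholders):

```sh
# On the worker VM: open a shell in the running SDK harness container ...
docker exec -it <sdk-harness-container-id> /bin/bash

# ... and render the collected memray profile as an HTML table report.
python -m memray table -o /tmp/table.html /tmp/<profile>.bin
```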
I found the Memray Table reporter most convenient during my debugging, but other reporters can also be useful. As a reminder, creating a report from a profile needs to happen in the same or an identical environment to the one where the profile was created. An identical environment might be a container started from the same image, but I created my reports on the running worker. It should be possible to simplify the process of collecting and analyzing profiles, and we'll track improvements in #20298.
The table reporter attributed most of the leaked usage to a line in bundle_processor.py. With that info, I reproduced the leak in DirectRunner with a much simpler pipeline and a very simple setup: a small pipeline file run under memray, sketched below.
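The exact file and command are not reproduced here; a hypothetical stand-in that exercises many bundles under DirectRunner might look like this (the iteration counts, file names, and output paths are arbitrary):

```python
# repro.py - hypothetical sketch: run many small DirectRunner pipelines in one
# process so that per-bundle state (and any leak tied to it) accumulates.
import apache_beam as beam


def run(iterations=200):
    for _ in range(iterations):
        with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create(range(1000))
                | beam.Map(lambda x: x + 1)
            )


if __name__ == "__main__":
    run()
```

It can then be profiled with, e.g., `python -m memray run -o profile.bin repro.py` followed by `python -m memray table -o table.html profile.bin`.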
The leak is visible in table.html after sorting the table by the Size column, and it increases with the number of iterations. The leak was later attributed to a protobuf regression in protocolbuffers/protobuf#14571.
What happened?
We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to `protobuf==4.x.x`. We root-caused this leak to protocolbuffers/protobuf#14571, and it has been remediated in Beam 2.52.0.

[Update 2023-12-19]: Due to another issue related to the protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0, or switch to Beam 2.53.0 once available.
Mitigation
Until Beam 2.52.0 is released, consider any of the following workarounds:

- Use `apache-beam==2.46.0` or below.
- Install protobuf 3.x in the submission and runtime environment. For example, you can use a `--requirements_file` pipeline option with a file that includes the pin shown after this list. For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
- Use the `python` implementation of protobuf by setting a `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` environment variable in the runtime environment. This might degrade performance, since the python implementation is less efficient. For example, you could create a custom Beam SDK container from a `Dockerfile` like the sketch after this list. For more information, see: https://beam.apache.org/documentation/runtime/environments/
- Install `protobuf==4.25.0` or newer in the submission and runtime environment.
- Users of the Beam 2.50.0 SDK should additionally follow the mitigation options for #28318.
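For illustration, the requirements file for the protobuf 3.x option could contain a single pin; the exact 3.x version below is an example:

```
protobuf==3.20.3
```

And a custom SDK container for the pure-python implementation option might look like the following sketch; the base image tag is an example and should match your Beam and Python versions:

```dockerfile
FROM apache/beam_python3.10_sdk:2.50.0
# Use the pure-Python protobuf implementation instead of the default upb one.
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```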
Additional details
The leak can be reproduced by a pipeline:
Dataflow pipeline options for the above pipeline:
--max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32
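The original repro pipeline code is not included above; a hypothetical stand-in that keeps a worker busy long enough to observe the growth might look like this (element counts and transform names are arbitrary):

```python
# Hypothetical sketch of a long-running pipeline; not the original repro code.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            | "Generate" >> beam.Create(range(10_000_000))
            | "Process" >> beam.Map(lambda x: x * 2)
        )


if __name__ == "__main__":
    run()
```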
The leak was triggered by Beam switching the default `protobuf` package version from 3.19.x to 4.22.x in #24599. The new versions of `protobuf` also switched the default protobuf implementation to the `upb` implementation. The `upb` implementation had two known leaks that have since been mitigated by the protobuf team in protocolbuffers/protobuf#10088 and https://github.com/protocolbuffers/upb/issues/1243. The latest available `protobuf==4.24.4` does not yet have the fix, but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.

Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components