[Python]Enable state cache to 100 MB #28781

AnandInguva · 2023-10-03T00:07:33Z

Enable a default state_cache_size of 100 MB for the Python SDK.
Fixes: #28770

The cache size can be set with --state_cache_size=<X>MB. The value must be given in megabytes.

EDIT:

From the doc (https://docs.google.com/document/u/1/d/1gllYsIFqKt4TWAxQmXU_-sw7SLnur2Q69d__N0XBMdE/edit?usp=drive_open&ouid=102749919556839394679), the consensus is to add a pipeline option named max_cache_memory_usage_mb and to explain in the Beam docs and runner docs what this option is and how it works.
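
As a rough illustration of how such an option could be declared and read, here is a minimal, self-contained sketch that uses the standard-library argparse instead of Beam's own option classes. The flag name and the default of 100 mirror what is discussed in this PR; the helper names build_parser and parse_cache_mb are hypothetical and exist only for this example.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that declares only the cache-size option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--max_cache_memory_usage_mb',
        dest='max_cache_memory_usage_mb',
        type=int,
        default=100,  # assumed default, matching the 100 MB discussed here
        help='Size of the SDK harness cache for user state and side '
        'inputs, in MB.')
    return parser


def parse_cache_mb(argv: List[str]) -> int:
    """Returns the configured cache size in MB, ignoring unrelated flags."""
    # parse_known_args tolerates other pipeline flags in argv.
    args, _unknown = build_parser().parse_known_args(argv)
    return args.max_cache_memory_usage_mb
```

With this sketch, passing --max_cache_memory_usage_mb=256 alongside other flags yields 256, and omitting the flag falls back to the default.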

codecov · 2023-10-03T00:31:41Z

Codecov Report

Merging #28781 (aa94309) into master (a93fa51) will decrease coverage by 0.05%.
Report is 24 commits behind head on master.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master   #28781      +/-   ##
==========================================
- Coverage   38.36%   38.32%   -0.05%     
==========================================
  Files         687      688       +1     
  Lines      101745   101833      +88     
==========================================
- Hits        39037    39027      -10     
- Misses      61129    61230     +101     
+ Partials     1579     1576       -3

Flag	Coverage Δ
python	`29.94% <85.71%> (-0.04%)`	⬇️

Files	Coverage Δ
...dks/python/apache_beam/options/pipeline_options.py	`64.91% <100.00%> (+0.07%)`	⬆️
...apache_beam/runners/portability/portable_runner.py	`28.27% <ø> (ø)`
...thon/apache_beam/runners/worker/sdk_worker_main.py	`66.66% <83.33%> (+0.77%)`	⬆️

... and 8 files with indirect coverage changes

AnandInguva · 2023-10-03T19:02:52Z

R: @tvalentyn

github-actions · 2023-10-03T19:11:02Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

AnandInguva · 2023-10-04T18:36:00Z

beam/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java

Line 247 in f30f6c5

Integer getWorkerCacheMb();

Java has a default of 100 MB.

Do we want a flag similar to workerCacheMb? That is, should state_cache_size be renamed to state_cache_size_mb or worker_cache_mb?

tvalentyn · 2023-10-04T19:15:46Z

state is a technical term that's not very user-friendly; it is good to have consistent naming. @lostluck does Go have state cache as well?

lostluck · 2023-10-04T21:29:16Z

> state is a technical term that's not very user-friendly; it is good to have consistent naming. @lostluck does Go have state cache as well?

It does, but it's not memory aware, just element sized.

Technically, it's used for cross bundle applications across the State API which side inputs also use.

Unless the cache value also applies to the Combiner Lifting cache, it wouldn't be a true "worker" cache, vs a "state" cache.

tvalentyn · 2023-10-06T20:30:06Z

sdks/python/apache_beam/runners/worker/sdk_worker_main.py

-  return 0
+  if not state_cache_size:
+    # to maintain backward compatibility
+    for experiment in experiments:


I'm pretty sure there is already a helper that does this parsing.

tvalentyn · 2023-10-30T17:57:35Z

sdks/python/apache_beam/options/pipeline_options.py

-        '--state_cache_size_mb',
-        dest='state_cache_size',
+        '--max_cache_memory_usage_mb',
+        dest='max_cache_memory_usage_mb',
        type=int,
        default=None,


Any concerns about defining the 100 MB default here?

Current flow: if the value is None here, that gives us an opportunity to look in --experiments for state_cache_size.

If the value is defined here as 100 MB and the user also passes --experiments=state_cache_size, the value from the experiment should override the 100 MB default.

I don't see any concerns with setting the default here. It might require some code changes, though.
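
The precedence described above can be sketched as follows. This is not the code from sdk_worker_main.py; it is a simplified, self-contained illustration. The function name resolve_cache_size_bytes, the constant DEFAULT_CACHE_MB, and the accepted experiment format (state_cache_size=<X> with an optional MB suffix) are assumptions made for this example.

```python
import re
from typing import Optional, Sequence

DEFAULT_CACHE_MB = 100  # assumed default, per the discussion in this PR

# Assumed experiment format: "state_cache_size=<X>" or "state_cache_size=<X>MB".
_EXPERIMENT_RE = re.compile(r'state_cache_size=(\d+)(?:MB)?')


def resolve_cache_size_bytes(
    option_mb: Optional[int], experiments: Sequence[str] = ()) -> int:
    """Picks the cache size, letting the legacy experiment take precedence."""
    for experiment in experiments:
        match = _EXPERIMENT_RE.fullmatch(experiment.strip())
        if match:
            # Backward compatibility: the experiment overrides the option.
            return int(match.group(1)) << 20  # MB -> bytes
    # No experiment found: use the option, or the default if it is unset.
    mb = DEFAULT_CACHE_MB if option_mb is None else option_mb
    return mb << 20
```

Checking the experiment first means the result is the same whether the option defaults to None or to 100, which is why moving the default into the option definition should not change behavior for users of the legacy experiment.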

damccorm · 2023-10-31T14:29:33Z

CHANGES.md

@@ -74,6 +74,7 @@ should handle this. ([#25252](https://github.com/apache/beam/issues/25252)).
  jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed
  using the DataSet API.
 * `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK ([PR#28621](https://github.com/apache/beam/pull/28621).
+* state cache has been enabled to a default of 100 MB. Use `--max_cache_memory_usage_mb=X` to provide cache size. (Python) ([#28770](https://github.com/apache/beam/issues/28770)).


Could you add a short description of what this cache is and link to https://beam.apache.org/releases/pydoc/2.50.0/apache_beam.options.pipeline_options.html#module-apache_beam.options.pipeline_options?

In particular, we should make it clear here that this impacts both user state and side inputs.

damccorm · 2023-10-31T18:19:35Z

Run Python_PVR_Flink PreCommit

tvalentyn · 2023-10-31T19:10:49Z

sdks/python/apache_beam/options/pipeline_options.py

+        type=int,
+        default=100,
+        help=(
+            'Size of the SdkHarness cache to store user state and side inputs '


nit: consider the following wording:

'Size of the SDK Harness cache to store user state and side inputs '
'in MB. Default is 100MB. If the cache is full, least recently '
'used elements will be evicted. This cache is per '
'each SDK Harness instance. SDK Harness is a component responsible '
'for executing the user code and communicating with the runner. '
'Depending on the runner, '
'there may be more than one SDK Harness process running on the same worker node. '
'Increasing cache size might improve performance of some pipelines, but can lead to an increase '
'in memory consumption and OOM errors if workers are not appropriately provisioned.'
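
To make the eviction behavior in the suggested help text concrete, here is a minimal sketch of a size-bounded cache with least-recently-used eviction. It is not the SDK harness cache implementation; the class name LruByteCache and the use of sys.getsizeof as the default size estimate are assumptions chosen to keep the example short.

```python
import sys
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LruByteCache:
    """Hypothetical cache bounded by total estimated size, evicting LRU."""

    def __init__(
        self, max_bytes: int, sizeof: Callable[[Any], int] = sys.getsizeof):
        self._max_bytes = max_bytes
        self._sizeof = sizeof
        self._entries = OrderedDict()  # key -> (value, size), oldest first
        self.used_bytes = 0

    def get(self, key: Hashable, default: Any = None) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key: Hashable, value: Any) -> None:
        size = self._sizeof(value)
        old = self._entries.pop(key, None)
        if old is not None:
            self.used_bytes -= old[1]
        if size > self._max_bytes:
            return  # a value larger than the whole cache is never stored
        self._entries[key] = (value, size)
        self.used_bytes += size
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self._max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size

    def __len__(self) -> int:
        return len(self._entries)
```

A larger max_bytes keeps more entries resident and can avoid refetching, at the cost of more memory per cache instance, which is the trade-off the help text warns about.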

AnandInguva · 2023-10-31T22:21:09Z

Merging this since tests pass.

github-actions bot added the python label Oct 3, 2023

AnandInguva force-pushed the enable_state_cache branch from 4578acf to 49f37c7 Compare October 3, 2023 00:09

AnandInguva force-pushed the enable_state_cache branch from 49f37c7 to ba96506 Compare October 3, 2023 16:56

AnandInguva changed the title ~~[WIP][Python]Enable state cache to 100 MB~~ [Python]Enable state cache to 100 MB Oct 3, 2023

AnandInguva force-pushed the enable_state_cache branch from ba96506 to 22abdc1 Compare October 3, 2023 16:58

AnandInguva marked this pull request as ready for review October 4, 2023 18:37

tvalentyn reviewed Oct 4, 2023

change state_cache_mb to 100 MB as default

b1ab7a3

AnandInguva force-pushed the enable_state_cache branch from 22abdc1 to b1ab7a3 Compare October 4, 2023 21:08

tvalentyn requested changes Oct 6, 2023

AnandInguva mentioned this pull request Oct 6, 2023

[Python]Update state cache size to 100 MB #28877

Merged

3 tasks

Address comments

dbea4fd

AnandInguva commented Oct 30, 2023

AnandInguva added 4 commits October 30, 2023 10:15

Reword help of pipeline option

9a4572c

Reword help of pipeline option

dae15e3

Merge remote-tracking branch 'origin/master' into enable_state_cache

5e0464c

Merge remote-tracking branch 'AnandInguva/enable_state_cache' into enable_state_cache

b91dd61

tvalentyn reviewed Oct 30, 2023

AnandInguva added 3 commits October 30, 2023 14:26

Fix doc string

3995385

reword docstring

2ada878

Set default in the pipeline options

b84f53b

AnandInguva requested review from tvalentyn and damccorm October 31, 2023 14:00

damccorm reviewed Oct 31, 2023

AnandInguva added 2 commits October 31, 2023 11:02

Reword documentation

2518fa7

Add warning

62ed478

tvalentyn approved these changes Oct 31, 2023

tvalentyn reviewed Oct 31, 2023

AnandInguva added 2 commits October 31, 2023 15:29

Address comments

3f7ab0d

Update CHANGES.md

aa94309

AnandInguva merged commit d329a7e into apache:master Oct 31, 2023
85 checks passed
