[Data] Concurrency Cap Backpressure tuning #58163

srinathk10 · 2025-10-25T07:17:18Z

Description

[Data] Concurrency Cap Backpressure tuning

Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: level.
Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: dev = EWMA(|q - level_prev|).
Define the deadband: [lower, upper] = [level - K_DEV * dev, level + K_DEV * dev].
If q > upper -> target cap = running - BACKOFF_FACTOR (back off)
If q < lower -> target cap = running + RAMPUP_FACTOR (ramp up)
Else -> target cap = running (hold)
Clamp the target cap to [1, configured_cap]; admit iff running < target cap.
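
The steps above can be sketched as a small standalone controller. This is an illustrative reading of the description, not the code in this PR: the class name, the smoothing weights `alpha_up` and `alpha_down`, the default constants, and the choice to seed `level` with the first sample are all assumptions. "Asymmetric" is taken here to mean that the EWMA uses one weight when the sample is above the current estimate and another when it is not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadbandCapController:
    """Sketch of a deadband concurrency controller (all defaults are assumed)."""

    configured_cap: int
    k_dev: float = 1.0         # deadband half-width in units of dev (assumed)
    backoff_factor: int = 1    # tasks removed when q is above the band (assumed)
    rampup_factor: int = 1     # tasks added when q is below the band (assumed)
    alpha_up: float = 0.25     # EWMA weight when sample > estimate (assumed)
    alpha_down: float = 0.125  # EWMA weight otherwise (assumed)
    level: Optional[float] = None
    dev: float = 0.0

    def _ewma(self, prev: float, sample: float) -> float:
        # Asymmetric EWMA: the weight depends on the direction of the move.
        alpha = self.alpha_up if sample > prev else self.alpha_down
        return prev + alpha * (sample - prev)

    def target_cap(self, q: float, running: int) -> int:
        """Update level/dev with queued bytes q and return the clamped target cap."""
        if self.level is None:
            self.level = q  # seed with the first sample (assumed)
        level_prev = self.level
        self.level = self._ewma(level_prev, q)
        # dev tracks the absolute residual against the previous level.
        self.dev = self._ewma(self.dev, abs(q - level_prev))

        lower = self.level - self.k_dev * self.dev
        upper = self.level + self.k_dev * self.dev
        if q > upper:
            target = running - self.backoff_factor  # back off
        elif q < lower:
            target = running + self.rampup_factor   # ramp up
        else:
            target = running                        # hold
        return max(1, min(self.configured_cap, target))


ctl = DeadbandCapController(configured_cap=8)
for q, running in [(100, 4), (500, 4), (0, 3)]:
    target = ctl.target_cap(q, running)
    print(f"q={q} running={running} target={target} admit={running < target}")
```

Note that under the "admit iff running < target cap" rule, a hold decision (target equal to running) admits nothing new; only a ramp-up step opens a slot.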

Signed-off-by: Srinath Krishnamachari <[email protected]>

srinathk10 · 2025-10-27T05:17:47Z

Benchmark

Baseline vs After

+-------------------------------------------------------+--------+----------+
| Benchmark                                             | After  | Baseline |
+-------------------------------------------------------+--------+----------+
| skip_training.parquet                                 | 11257  | 10924    |
| skip_training.parquet.preserve_order                  | 10754  | 10611    |
| skip_training.jpeg                                    | 3325   | 3445     |
| skip_training.jpeg.preserve_order                     | 1584   | 1775     |
| skip_training.jpeg.local_fs                           | 2042   | 2027     |
| skip_training.jpeg.local_fs.preserve_order            | 1987   | 1975     |
| skip_training.jpeg.local_fs_multi_gpus                | 2623   | 2658     |
| skip_training.jpeg.local_fs_multi_gpus.preserve_order | 2592   | 2615     |
+-------------------------------------------------------+--------+----------+
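
To make the table easier to compare row by row, the relative change of After against Baseline can be computed with a few lines of Python. The numbers are copied from the table; the helper name `pct_change` is introduced here for illustration. The table does not state the metric or its unit, so the snippet reports only the signed change and does not say which direction is better.

```python
# (after, baseline) pairs copied from the benchmark table above.
results = {
    "skip_training.parquet": (11257, 10924),
    "skip_training.parquet.preserve_order": (10754, 10611),
    "skip_training.jpeg": (3325, 3445),
    "skip_training.jpeg.preserve_order": (1584, 1775),
    "skip_training.jpeg.local_fs": (2042, 2027),
    "skip_training.jpeg.local_fs.preserve_order": (1987, 1975),
    "skip_training.jpeg.local_fs_multi_gpus": (2623, 2658),
    "skip_training.jpeg.local_fs_multi_gpus.preserve_order": (2592, 2615),
}


def pct_change(after: float, baseline: float) -> float:
    """Signed relative change of `after` versus `baseline`, in percent."""
    return 100.0 * (after - baseline) / baseline


for name, (after, baseline) in results.items():
    print(f"{name}: {pct_change(after, baseline):+.2f}%")
```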

alexeykudinkin · 2025-10-30T19:48:04Z

python/ray/data/_internal/execution/backpressure_policy/concurrency_cap_backpressure_policy.py

+                op_budget.object_store_memory / op_usage.object_store_memory
+                > self.OBJECT_STORE_USAGE_RATIO
+            ):
+                return running < self._concurrency_caps[op]


Hold on, what's this for?

srinathk10 · 2025-11-07T02:59:28Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the ConcurrencyCapBackpressurePolicy to use a simpler deadband-based algorithm for adjusting concurrency, which improves clarity and maintainability. The changes also make the policy's parameters configurable via environment variables.

My review focuses on the correctness of the new algorithm and its implementation. I've found a high-severity issue where the object store memory pressure check appears to be logically inverted, which could lead to incorrect backpressure behavior. I've also pointed out a related confusing comment and a minor typo in the docstring. Additionally, the corresponding unit test for the memory pressure check will need to be updated once the main logic is fixed.

Overall, this is a positive change that simplifies the system, and with the suggested fixes, it will be a solid improvement.
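
The review summary mentions that the policy's parameters are configurable via environment variables. The thread does not show the variable names or defaults, so the following is only a minimal sketch of that pattern: the helper `env_number`, the `EXAMPLE_BACKPRESSURE_*` names, and the default values are all hypothetical and are not Ray's.

```python
import os


def env_number(name: str, default: float) -> float:
    """Read a numeric tunable from the environment, falling back to `default`."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be a number, got {raw!r}") from exc


# Hypothetical variable names and defaults, chosen for illustration only.
K_DEV = env_number("EXAMPLE_BACKPRESSURE_K_DEV", 2.0)
BACKOFF_FACTOR = int(env_number("EXAMPLE_BACKPRESSURE_BACKOFF_FACTOR", 1))
RAMPUP_FACTOR = int(env_number("EXAMPLE_BACKPRESSURE_RAMPUP_FACTOR", 1))
```

Reading the values once at import time keeps the hot path free of environment lookups, at the cost of requiring a process restart to pick up a changed value.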

raulchen

Approving as this change per se makes sense.
But I think we need to revisit the whole policy again, after we fix all the accounting issues, e.g., prefetched data.
Because when that is done, this should no longer be needed to prevent spilling.
But this may still be useful for smoothing out sudden jumps.
So I'd suggest keeping it experimental for now and doing more experiments.
Also, we should separate out the smoothing logic as a standalone backpressure policy.

[Data] Concurrency Cap Backpressure tuning

353e60c

Signed-off-by: Srinath Krishnamachari <[email protected]>

srinathk10 added the "go" label (add ONLY when ready to merge, run all tests) Oct 25, 2025

Cleanup

41de1c3

Signed-off-by: Srinath Krishnamachari <[email protected]>

srinathk10 and others added 2 commits October 29, 2025 12:38

Merge branch 'master' into srinathk10/concurrency_cap_tuning

82ea7c7

Fix up

c32c68e

Signed-off-by: Srinath Krishnamachari <[email protected]>

srinathk10 marked this pull request as ready for review October 29, 2025 19:41

srinathk10 requested a review from a team as a code owner October 29, 2025 19:41

ray-gardener bot added the "data" label (Ray Data-related issues) Oct 30, 2025

srinathk10 and others added 3 commits October 30, 2025 11:19

Merge branch 'master' into srinathk10/concurrency_cap_tuning

467280f

Lint

f697f68

Signed-off-by: Srinath Krishnamachari <[email protected]>

Fix test

41f664a

Signed-off-by: Srinath Krishnamachari <[email protected]>

alexeykudinkin reviewed Oct 30, 2025

srinathk10 and others added 4 commits October 30, 2025 20:55

Address comments

b13a054

Signed-off-by: Srinath Krishnamachari <[email protected]>

Lint

5a2d7f6

Signed-off-by: Srinath Krishnamachari <[email protected]>

Enable by default

f02bf87

Signed-off-by: Srinath Krishnamachari <[email protected]>

Merge branch 'master' into srinathk10/concurrency_cap_tuning

e1a6342

cursor bot reviewed Nov 6, 2025

srinathk10 and others added 3 commits November 6, 2025 12:25

Update concurrency_cap_backpressure_policy.py

4a37600

Signed-off-by: Srinath Krishnamachari <[email protected]>

Fix tests

921d589

Signed-off-by: Srinath Krishnamachari <[email protected]>

Fix ups

61b8f49

Signed-off-by: Srinath Krishnamachari <[email protected]>

gemini-code-assist bot reviewed Nov 7, 2025

srinathk10 and others added 2 commits November 6, 2025 20:22

Merge branch 'master' into srinathk10/concurrency_cap_tuning

bbb82dd

alexeykudinkin reviewed Nov 7, 2025

srinathk10 and others added 2 commits November 7, 2025 19:52

Address comments

ad8b180

Signed-off-by: Srinath Krishnamachari <[email protected]>

Merge branch 'master' into srinathk10/concurrency_cap_tuning

b6fd802

srinathk10 commented Nov 7, 2025

raulchen approved these changes Nov 7, 2025

Addressed comments

42b28da

Signed-off-by: Srinath Krishnamachari <[email protected]>

raulchen merged commit fe5cd57 into master Nov 8, 2025
6 checks passed

raulchen deleted the srinathk10/concurrency_cap_tuning branch November 8, 2025 00:39
