@srinathk10
Contributor

Description

[Data] Concurrency Cap Backpressure tuning

  • Maintain an asymmetric EWMA of the total queued bytes (this op + downstream) as the typical level: level.
  • Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: dev = EWMA(|q - level_prev|), where q is the current total queued bytes.
  • Define the deadband: [lower, upper] = [level - K_DEV * dev, level + K_DEV * dev].
    If q > upper -> target cap = running - BACKOFF_FACTOR (back off)
    If q < lower -> target cap = running + RAMPUP_FACTOR (ramp up)
    Else -> target cap = running (hold)
  • Clamp the target cap to [1, configured_cap]; admit iff running < target cap.
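
For readers who want to see the steps above end to end, here is a minimal, self-contained sketch. It is not the code from this PR: the class name `DeadbandCap`, the helper `asym_ewma`, and every constant value (`K_DEV`, `BACKOFF_FACTOR`, `RAMPUP_FACTOR`, `ALPHA_UP`, `ALPHA_DOWN`) are assumptions chosen for illustration. It also assumes that "asymmetric" means a larger smoothing factor for rising samples than for falling ones, and that the deadband is evaluated after the state update.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values; the PR description names these knobs but not their values.
K_DEV = 2.0
BACKOFF_FACTOR = 1
RAMPUP_FACTOR = 1
ALPHA_UP = 0.25     # assumed smoothing factor when the sample is above the average
ALPHA_DOWN = 0.125  # assumed smoothing factor when the sample is at or below it


def asym_ewma(prev: float, sample: float) -> float:
    """EWMA that reacts faster to increases than to decreases."""
    alpha = ALPHA_UP if sample > prev else ALPHA_DOWN
    return prev + alpha * (sample - prev)


@dataclass
class DeadbandCap:
    configured_cap: int
    level: Optional[float] = None  # typical queued bytes
    dev: float = 0.0               # typical absolute residual

    def target_cap(self, q: float, running: int) -> int:
        if self.level is None:
            # The first sample seeds the level; there is no residual yet.
            self.level = float(q)
        else:
            # Residual is measured against the previous level, then the level moves.
            self.dev = asym_ewma(self.dev, abs(q - self.level))
            self.level = asym_ewma(self.level, q)
        lower = self.level - K_DEV * self.dev
        upper = self.level + K_DEV * self.dev
        if q > upper:
            target = running - BACKOFF_FACTOR  # back off
        elif q < lower:
            target = running + RAMPUP_FACTOR   # ramp up
        else:
            target = running                   # hold
        return max(1, min(self.configured_cap, target))

    def can_admit(self, q: float, running: int) -> bool:
        return running < self.target_cap(q, running)


if __name__ == "__main__":
    policy = DeadbandCap(configured_cap=8)
    for q, running in [(1024, 4), (2048, 4), (0, 3)]:
        print(q, running, policy.target_cap(q, running))
```

One property worth noting in this sketch: while the queue stays inside the deadband, the target equals `running`, so no new task is admitted until the queue drops below `lower` or a running task finishes.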

Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10 added the "go" label (add ONLY when ready to merge, run all tests) on Oct 25, 2025
Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10
Contributor Author

Benchmark

Baseline vs After

+-------------------------------------------------------+-------+----------+
| Benchmark                                             | After | Baseline |
+-------------------------------------------------------+-------+----------+
| skip_training.parquet                                 | 11257 | 10924    |
| skip_training.parquet.preserve_order                  | 10754 | 10611    |
| skip_training.jpeg                                    | 3325  | 3445     |
| skip_training.jpeg.preserve_order                     | 1584  | 1775     |
| skip_training.jpeg.local_fs                           | 2042  | 2027     |
| skip_training.jpeg.local_fs.preserve_order            | 1987  | 1975     |
| skip_training.jpeg.local_fs_multi_gpus                | 2623  | 2658     |
| skip_training.jpeg.local_fs_multi_gpus.preserve_order | 2592  | 2615     |
+-------------------------------------------------------+-------+----------+

@srinathk10 marked this pull request as ready for review on October 29, 2025 19:41
@srinathk10 requested a review from a team as a code owner on October 29, 2025 19:41
@ray-gardener (bot) added the "data" label (Ray Data-related issues) on Oct 30, 2025
srinathk10 and others added 3 commits October 30, 2025 11:19
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
    op_budget.object_store_memory / op_usage.object_store_memory
    > self.OBJECT_STORE_USAGE_RATIO
):
    return running < self._concurrency_caps[op]

Hold on, what's this for?

srinathk10 and others added 4 commits October 30, 2025 20:55
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
srinathk10 and others added 3 commits November 6, 2025 12:25
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10
Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the ConcurrencyCapBackpressurePolicy to use a simpler deadband-based algorithm for adjusting concurrency, which improves clarity and maintainability. The changes also make the policy's parameters configurable via environment variables.

My review focuses on the correctness of the new algorithm and its implementation. I've found a high-severity issue where the object store memory pressure check appears to be logically inverted, which could lead to incorrect backpressure behavior. I've also pointed out a related confusing comment and a minor typo in the docstring. Additionally, the corresponding unit test for the memory pressure check will need to be updated once the main logic is fixed.

Overall, this is a positive change that simplifies the system, and with the suggested fixes, it will be a solid improvement.
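
As an illustration of the environment-variable configurability mentioned in the review, the sketch below shows one common way to read a numeric tunable with a default. The helper `env_float` and the variable name `EXAMPLE_BACKPRESSURE_K_DEV` are hypothetical and are not taken from this PR or from Ray; the policy's real variable names and defaults live in the changed file.

```python
import os
from typing import Mapping


def env_float(name: str, default: float, env: Mapping[str, str] = os.environ) -> float:
    """Read a float tunable from the environment, falling back to a default.

    Unset, blank, or unparsable values all yield the default (an assumption
    of this sketch; a real policy might prefer to raise on bad input).
    """
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        return default


if __name__ == "__main__":
    # Hypothetical variable name, used only for this example.
    k_dev = env_float("EXAMPLE_BACKPRESSURE_K_DEV", 2.0)
    print(f"K_DEV = {k_dev}")
```

Passing the mapping as a parameter keeps the helper easy to unit test without mutating the process environment.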

srinathk10 and others added 2 commits November 6, 2025 20:22
…rency_cap_backpressure_policy.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Srinath Krishnamachari <[email protected]>

@raulchen left a comment

Approving, as this change makes sense in itself.
But I think we need to revisit the whole policy after we fix all the accounting issues, e.g., prefetched data.
Once that is done, this should no longer be needed to prevent spilling.
It may still be useful for smoothing out sudden jumps, though.
So I'd suggest keeping it experimental for now and running more experiments.
Also, we should separate out the smoothing logic as a standalone backpressure policy.

Signed-off-by: Srinath Krishnamachari <[email protected]>
@raulchen merged commit fe5cd57 into master on Nov 8, 2025
6 checks passed
@raulchen deleted the srinathk10/concurrency_cap_tuning branch on November 8, 2025 00:39
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025