[Data] Make test_dataset_throughput deterministic and refactor throughput stats (#58693)
This PR makes three improvements to Ray Data's throughput statistics:
1. **Makes `test_dataset_throughput` deterministic**: The original test was flaky because it relied on actual task execution timing. This PR rewrites it as unit tests (`test_dataset_throughput_calculation` and `test_operator_throughput_calculation`) that use mocked `BlockStats` objects, making the tests fast and reliable (see the first sketch after this list).
2. **Removes "Estimated single node throughput" from Dataset-level stats**: This metric was misleading at the dataset level because it summed wall times across all operators, which doesn't accurately represent single-node performance when operators run concurrently. The "Ray Data throughput" metric (total rows / total wall time) remains and provides the meaningful dataset-level throughput (illustrated in the second sketch after this list).
3. **Renames "Estimated single node throughput" to "Estimated single task throughput"**: At the operator level, this metric divides total rows by the sum of task wall times. The new name more accurately reflects what it measures: the throughput if all work were done by a single task serially.
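
Here is a minimal sketch of the mocked-stats pattern item 1 describes. `FakeBlockStats`, its fields, and the `operator_throughput` helper are illustrative stand-ins for this description, not Ray Data's actual internal `BlockStats` API or the code in this PR:

```python
from dataclasses import dataclass


@dataclass
class FakeBlockStats:
    """Stand-in for a mocked block-stats object (hypothetical fields)."""
    num_rows: int
    wall_time_s: float


def operator_throughput(block_stats):
    """Total rows divided by the summed per-task wall times."""
    total_rows = sum(s.num_rows for s in block_stats)
    total_task_time_s = sum(s.wall_time_s for s in block_stats)
    return total_rows / total_task_time_s


def test_operator_throughput_calculation():
    # Two tasks, each reporting 100 rows over 2.0 s of wall time.
    stats = [FakeBlockStats(num_rows=100, wall_time_s=2.0) for _ in range(2)]
    # 200 rows / 4.0 s of summed task time == 50 rows/s. Nothing here
    # depends on real task execution timing, so the test is deterministic.
    assert operator_throughput(stats) == 50.0
```

Because the stats are constructed rather than measured, the assertion is exact and the test can't flake on scheduler load or machine speed.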
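
To make the contrast in items 2 and 3 concrete, here is an illustrative calculation with made-up numbers; the variable names are hypothetical and not part of Ray Data's API:

```python
# One dataset whose two operators each process 1000 rows; their tasks
# overlap in time.
total_rows = 1000                  # rows in the dataset's output
task_wall_times_s = [4.0, 6.0]     # summed task wall time per operator
dataset_wall_time_s = 5.0          # end-to-end wall time (tasks overlap)

# Dataset-level "Ray Data throughput": total rows / total wall time.
ray_data_throughput = total_rows / dataset_wall_time_s          # 200 rows/s

# The removed metric summed wall times across operators (4.0 + 6.0 = 10.0 s),
# yielding 100 rows/s, which misstates single-node performance because the
# operators' tasks ran concurrently.

# Operator-level "Estimated single task throughput" for the first operator:
# its rows divided by its summed task wall time, i.e. the rate if a single
# task did all of that operator's work serially.
single_task_throughput = total_rows / task_wall_times_s[0]      # 250 rows/s
```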
---------
Signed-off-by: dancingactor <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>