Conversation

@dancingactor
Contributor

@dancingactor dancingactor commented Nov 17, 2025

This PR makes three improvements to Ray Data's throughput statistics:

  1. Makes test_dataset_throughput deterministic: The original test was flaky because it relied on actual task
    execution timing. This PR rewrites it as unit tests (test_dataset_throughput_calculation and
    test_operator_throughput_calculation) using mocked BlockStats objects, making the tests fast and reliable.

  2. Removes "Estimated single node throughput" from Dataset-level stats: This metric was misleading at the
    dataset level since it summed wall times across all operators, which doesn't accurately represent single-node
    performance. The "Ray Data throughput" metric (total rows / total wall time) remains and provides the meaningful
    dataset-level throughput.

  3. Renames "Estimated single node throughput" to "Estimated single task throughput": At the operator level,
    this metric divides total rows by the sum of task wall times. The new name more accurately reflects what it
    measures—the throughput if all work were done by a single task serially.
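
For concreteness, the two formulas behind these metrics can be sketched in plain Python. This is only an illustration of the arithmetic described above; the helper names and the `(start, end, rows)` tuple format are assumptions for the example, not Ray Data's actual API:

```python
# Minimal sketch of the two throughput formulas described above.
# Each task is modeled as a (start_time_s, end_time_s, num_rows) tuple;
# these names are illustrative, not Ray Data's real data structures.

def ray_data_throughput(tasks):
    """Total rows / wall-clock duration (max end - min start).

    Reflects the observed parallel throughput of the dataset or operator.
    """
    total_rows = sum(rows for _, _, rows in tasks)
    duration = max(end for _, end, _ in tasks) - min(start for start, _, _ in tasks)
    return total_rows / duration

def estimated_single_task_throughput(tasks):
    """Total rows / sum of per-task wall times.

    Estimates the throughput if all work ran serially in one task, which is
    why "single task" is a more accurate name than "single node".
    """
    total_rows = sum(rows for _, _, rows in tasks)
    total_task_time = sum(end - start for start, end, _ in tasks)
    return total_rows / total_task_time

# Three overlapping tasks, 100 rows each:
tasks = [(0.0, 2.0, 100), (0.5, 2.5, 100), (1.0, 3.0, 100)]
print(ray_data_throughput(tasks))               # 300 rows / 3.0 s -> 100.0
print(estimated_single_task_throughput(tasks))  # 300 rows / 6.0 s -> 50.0
```

Note how summing wall times double-counts the overlap, which is exactly why the second number is misleading as a dataset-level "single node" figure.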

@dancingactor dancingactor requested a review from a team as a code owner November 17, 2025 04:10
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to make test_dataset_throughput deterministic by increasing the workload and introducing a tolerance for the throughput assertions. The changes look good and should help improve the test's stability. I've added a couple of minor style suggestions to align a new variable name with PEP 8 conventions.

@dancingactor dancingactor force-pushed the master branch 2 times, most recently from 08f8c78 to 7f86e78 Compare November 17, 2025 05:17
ray.init(num_cpus=2)

f = dummy_map_batches_sleep(0.01)
f = dummy_map_batches_sleep(0.03)
Contributor Author

@dancingactor dancingactor Nov 17, 2025

A shorter sleep time is better because it reduces the execution time. However, we chose 0.03 instead of 0.02 because 0.02 resulted in 1 failure across 20 test runs.

@dancingactor
Contributor Author

@owenowenisme @bveeramani PTAL, thanks!

Member

@bveeramani bveeramani left a comment

This PR decreases the likelihood that this test fails, but ultimately, the test still relies on nondeterministic timing. It's also brittle because it uses regexes that can break with minor formatting changes.

Rather than adjusting the parameters, could you rewrite this test as a unit test?

Separately, I realized that the "per node" throughputs actually represent the per-task throughput. Based on this, I think we should:

  1. Remove the "per node throughput" for the "Dataset throughput" section, because the average per-task throughput across all operators isn't really useful, and
  2. Rename "per node throughput" to "per task throughput" in the "Operator throughput" sections

The tests could look something like this:

def test_dataset_throughput_calculation():
    """Test throughput calculations using mock block stats."""
    from ray.data._internal.stats import DatasetStats
    from ray.data.block import BlockStats, BlockExecStats

    # Helper to create minimal block stats with only timing fields
    def create_block_stats(start_time, end_time, num_rows):
        exec_stats = BlockExecStats()
        exec_stats.start_time_s = start_time
        exec_stats.end_time_s = end_time
        exec_stats.wall_time_s = end_time - start_time

        return BlockStats(
            num_rows=num_rows,
            size_bytes=None,
            exec_stats=exec_stats
        )

    # Simulate 3 overlapping blocks
    blocks = [
        create_block_stats(0.0, 2.0, 100),
        create_block_stats(0.5, 2.5, 100),
        create_block_stats(1.0, 3.0, 100),
    ]

    stats = DatasetStats(metadata={"Map": blocks}, parent=None)
    summary = stats.to_summary()

    # Throughput: total rows / total execution duration
    # Total rows = 300
    # Duration = max end_time - min start_time = 3.0s
    # 300 rows / 3s = 100 rows/s
    # TODO: You'll need to expose this as a property so that it's testable.
    assert summary.num_rows_per_s == 100

def test_operator_throughput_calculation():
    ...  # A similar unit test. You might need to do some refactoring.

    # summary is an OperatorStatsSummary here, not a DatasetStatsSummary
    # TODO: You'll need to similarly expose this property.
    assert summary.num_rows_per_s == 100
    assert summary.num_rows_per_task_s == 100
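
As a self-contained illustration of the same test idea for readers without a Ray checkout, the overlap-aware duration math can be exercised with simple stand-in objects. FakeExecStats, dataset_rows_per_s, and rows_per_task_s are hypothetical names for this sketch, not Ray's real classes:

```python
# Standalone sketch of the unit-test idea above, using a stand-in record
# instead of Ray's real BlockStats/BlockExecStats classes.
from dataclasses import dataclass

@dataclass
class FakeExecStats:  # stand-in for BlockExecStats (assumed shape)
    start_time_s: float
    end_time_s: float

    @property
    def wall_time_s(self) -> float:
        return self.end_time_s - self.start_time_s

def dataset_rows_per_s(blocks):
    """Observed throughput: rows / (max end - min start)."""
    total_rows = sum(rows for rows, _ in blocks)
    start = min(s.start_time_s for _, s in blocks)
    end = max(s.end_time_s for _, s in blocks)
    return total_rows / (end - start)

def rows_per_task_s(blocks):
    """Serial estimate: rows / sum of per-task wall times."""
    total_rows = sum(rows for rows, _ in blocks)
    return total_rows / sum(s.wall_time_s for _, s in blocks)

# Three overlapping blocks, 100 rows each, mirroring the sketch above:
blocks = [
    (100, FakeExecStats(0.0, 2.0)),
    (100, FakeExecStats(0.5, 2.5)),
    (100, FakeExecStats(1.0, 3.0)),
]
assert dataset_rows_per_s(blocks) == 100  # 300 rows over a 3.0 s window
assert rows_per_task_s(blocks) == 50      # 300 rows over 6.0 s of task time
```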

@bveeramani
Member

@dancingactor lemme know if you have any questions.

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 17, 2025
@dancingactor
Contributor Author

dancingactor commented Nov 17, 2025

Thanks for your detailed feedback! I have two questions:

1.

My understanding is that we should remove the original test_dataset_throughput performance test, and instead add two unit tests, test_dataset_throughput_calculation and test_operator_throughput_calculation, to verify the correctness of the dataset and operator throughput calculations.

2.

Separately, I realized that the "per node" throughputs actually represent the per-task throughput. Based on this, I think we should:

  1. Remove the "per node throughput" for the "Dataset throughput" section, because the average per-task throughput across all operators isn't really useful, and
  2. Rename "per node throughput" to "per task throughput" in the "Operator throughput" sections

May I confirm that this means we should modify the current ds.stats() output as follows:

Operator 1 Map(f): 4 tasks executed, 4 blocks produced in 2.23s
 ...
* Operator throughput:
        * Total input num rows: 0 rows
        * Total output num rows: 100 rows
        * Ray Data throughput: 44.8881328745759 rows/s
        * Estimated single node throughput: 32.05117589203472 rows/s    <-- change node to task

Dataset throughput:
        * Ray Data throughput: 24.66899248263124 rows/s
        * Estimated single node throughput: 16.076964501040045 rows/s   <-- remove this line
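
The rename and removal could look roughly like this in the formatting code. This is only a hypothetical sketch of how the output strings might be assembled; the function names and layout are assumptions, not Ray's actual stats.py implementation:

```python
# Hypothetical sketch (not Ray's actual stats.py) of the renamed
# operator-level line and the removed dataset-level line.
def format_operator_throughput(total_rows, wall_clock_s, total_task_time_s):
    out = "* Operator throughput:\n"
    out += f"        * Ray Data throughput: {total_rows / wall_clock_s} rows/s\n"
    # Renamed from "Estimated single node throughput":
    out += (
        "        * Estimated single task throughput: "
        f"{total_rows / total_task_time_s} rows/s\n"
    )
    return out

def format_dataset_throughput(total_rows, wall_clock_s):
    # The "Estimated single node throughput" line is dropped at this level.
    out = "Dataset throughput:\n"
    out += f"        * Ray Data throughput: {total_rows / wall_clock_s} rows/s\n"
    return out
```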

@bveeramani
Member

@dancingactor that's right!

@dancingactor
Contributor Author

dancingactor commented Nov 18, 2025

Just to confirm, I need to do the following things in this PR:

  1. Remove test_dataset_throughput test, add test_dataset_throughput_calculation and test_operator_throughput_calculation tests
  2. Modify stats.py
  3. Modify other tests in test_stats.py that are related to the change in stats.py

@bveeramani
Member

Just to confirm, I need to do the following things in this PR:

  1. Remove test_dataset_throughput test, add test_dataset_throughput_calculation and test_operator_throughput_calculation tests
  2. Modify stats.py
  3. Modify other tests in test_stats.py that are related to the change in stats.py

That sounds right.

One note of warning -- test_stats.py is extremely brittle!

@bveeramani
Member

Hey @dancingactor, just following up here. Lemme know if I can provide any info or help to move this along!

@dancingactor
Contributor Author

Hi @bveeramani, since I am new to Ray, I spent some time understanding the context and the codebase. I have almost completed test_dataset_throughput_calculation and test_operator_throughput_calculation, and will work on the ds.stats() output very soon.

@dancingactor dancingactor force-pushed the master branch 2 times, most recently from 6f974a2 to fabe20e Compare November 20, 2025 14:17
@dancingactor
Contributor Author

dancingactor commented Nov 20, 2025

Hi @bveeramani, could you please advise on how to correctly test the new test_stats.py? 🙏

Currently, when I directly execute pytest /ray/python/ray/data/tests/test_stats.py, I run into an error during environment setup. The error message looks like this:

2025-11-21 00:50:44,720 INFO worker.py:2023 -- Started a local Ray instance.
                                                                                                                        6% ▋         

―――――――――――――――――――――――――――――――――――― ERROR at setup of test_large_args_scheduling_strategy[True] ――――――――――――――――――――――――――――――――――――

request = <SubRequest 'ray_start_regular_shared' for <Function test_streaming_split_stats>>

    @pytest.fixture(scope="module")
    def ray_start_regular_shared(request):
        param = getattr(request, "param", {})
>       with _ray_start(**param) as res:

python/ray/tests/conftest.py:615: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py:117: in __enter__
    return next(self.gen)
python/ray/tests/conftest.py:547: in _ray_start
    address_info = ray.init("local", **init_kwargs)
python/ray/_private/client_mode_hook.py:104: in wrapper
    return func(*args, **kwargs)
python/ray/_private/worker.py:2025: in init
    connect(
python/ray/_private/worker.py:1163: in wrapper
    return func(*args, **kwargs)
python/ray/_private/worker.py:2662: in connect
    worker.core_worker = ray._raylet.CoreWorker(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   self._gc_thread = PythonGCThread()
E   AttributeError: module 'ray._private.ray_constants' has no attribute 'RAY_GC_MIN_COLLECT_INTERVAL'

python/ray/_raylet.pyx:2709: AttributeError

@bveeramani
Member

Ah, this sounds like your Ray Core version is out-of-date.

Are you building Ray Core from source, or using the setup-dev.py script? I think you might need to either rebuild Ray (if building from source) or reinstall the Ray nightly wheel (if using setup-dev.py).

@dancingactor
Contributor Author

Thanks! I will try the setup-dev.py approach

@bveeramani
Member

Thanks! I will try the setup-dev.py approach

Awesome! Lemme know if you run into any problems. Happy to help you out

@dancingactor
Contributor Author

dancingactor commented Nov 23, 2025

Hi @bveeramani,
I have verified that the modified code works. PTAL!

tests % pytest /Users/ryanchen/github/ray/python/ray/data/tests/test_stats.py         
Test session starts (platform: darwin, Python 3.10.19, pytest 7.4.4, pytest-sugar 0.9.5)
rootdir: /Users/ryanchen/github/ray
configfile: pytest.ini
plugins: docker-tools-3.1.3, sphinx-0.5.1.dev0, forked-1.4.0, anyio-4.11.0, asyncio-0.17.2, sugar-0.9.5, timeout-2.1.0, shutil-1.8.1, lazy-fixtures-1.1.2, rerunfailures-11.1.2, pytest_httpserver-1.1.3, virtualenv-1.8.1, mock-3.14.0, aiohttp-1.1.0
asyncio: mode=auto
timeout: 180.0s
timeout method: signal
timeout func_only: False
collecting ... 
 python/ray/data/tests/test_stats.py ss✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓s✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓                                                                100% ██████████

Results (729.00s):
      44 passed
       3 skipped

@dancingactor dancingactor force-pushed the master branch 2 times, most recently from 7f7199b to 7efd4f6 Compare November 24, 2025 04:30
@dancingactor dancingactor force-pushed the master branch 3 times, most recently from 963260c to da9e36b Compare November 24, 2025 06:58
Member

@bveeramani bveeramani left a comment

Thanks for the PR! Left a few comments

out += "\nDataset memory:\n"
out += "* Spilled to disk: {}MB\n".format(dataset_mb_spilled)

# For throughput, we compute both an observed Ray Data dataset throughput
Contributor Author

@dancingactor dancingactor Nov 26, 2025

These comments were moved to https://github.com/ray-project/ray/pull/58693/files#diff-4dba40d789c60bfba4ae769f109b39979aa7d6977390329e7e2bb0e666569009R1221-R1226

The comment for "estimated single node" was removed, since we removed that part from the Dataset class.

node_count_stats["count"],
)
if output_num_rows_stats and self.time_total_s and wall_time_stats:
# For throughput, we compute both an observed Ray Data operator throughput
Contributor Author

…as unit tests

2. Remove the "per node throughput" for the "Dataset throughput" section
3. Rename "per node throughput" to "per task throughput" in the "Operator throughput" sections

Signed-off-by: dancingactor <[email protected]>
Member

@bveeramani bveeramani left a comment

LGTM! 🚢

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani changed the title [Data] Make test_dataset_throughput deterministic by increasing workload and applying tolerance [Data] Make test_dataset_throughput deterministic and refactor throughput stats Nov 26, 2025
@bveeramani bveeramani enabled auto-merge (squash) November 26, 2025 07:24
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 26, 2025
@dancingactor
Contributor Author

Thanks! Really appreciate your time and guidance for this issue!

@bveeramani bveeramani merged commit ec254d0 into ray-project:master Nov 26, 2025
7 of 8 checks passed
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…oughput stats (ray-project#58693)

Signed-off-by: dancingactor <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>