[data] Refractor progress bar clearer metrics #57094

iamjustinhsu · 2025-10-01T16:47:48Z

~~Before:~~
~~https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~
~~In Before, the progress bar won't update until the first task finishes.~~

~~After:~~
~~https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~

In After, the progress bar won't update until the first task generates output. If a task generates 10 blocks, we will update the progress bar while it's generating blocks, even if the task hasn't finished. Once the task finishes, we default back to the way it was before.

~~This is better because the very 1st progress bar update will occur sooner, and won't feel abrupt to the user.~~

Refactoring the progress bar estimates to use known metrics.

Why are these changes needed?

Currently we use the number of finished tasks. This is OK, but because we use streaming generators, one task can produce thousands of blocks. This is troublesome for the additional split factor (split blocks) in read parquet.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu · 2025-10-01T16:48:07Z

python/ray/data/_internal/execution/operators/input_data_buffer.py

                object to use injestion.
            input_data: The list of bundles to output from this operator.
            input_data_factory: The factory to get input data, if input_data is None.
-            num_output_blocks: The number of output blocks. If not specified, progress


not being used anywhere

iamjustinhsu · 2025-10-15T19:08:43Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

        estimated_output_num_rows = round(
            estimated_num_tasks
-            * metrics.rows_task_outputs_generated
+            * metrics.rows_outputs_of_finished_tasks


bveeramani

@iamjustinhsu I felt confused after reading the PR description about what the change is, and why it's an improvement. Could you update the description to make it clearer?

Is this understanding correct?

Before: Ray Data doesn't render a progress bar if no tasks have finished because it can't estimate how many rows each task will produce, and therefore can't estimate the total number of rows the operator will produce.
After: Ray Data uses the number of rows already outputted as an estimate of the total number of rows that will be produced by the operator.

And this is better because it appears smoother (?)
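
To make the Before/After reading above concrete, here is a minimal sketch of the two estimates. It is not Ray Data's implementation: the function names and arguments are hypothetical, and the extrapolation formula only mirrors the `estimated_num_tasks * rows / num_tasks_finished` shape visible in the diffs in this thread.

```python
from typing import Optional


def estimate_total_rows(
    estimated_num_tasks: int,
    rows_from_finished_tasks: int,
    num_tasks_finished: int,
) -> Optional[int]:
    """Extrapolate total output rows from the tasks that have finished."""
    if num_tasks_finished == 0:
        # "Before": no finished task means no per-task average to scale up.
        return None
    rows_per_task = rows_from_finished_tasks / num_tasks_finished
    return round(estimated_num_tasks * rows_per_task)


def estimate_total_rows_with_fallback(
    estimated_num_tasks: int,
    rows_from_finished_tasks: int,
    num_tasks_finished: int,
    rows_generated_so_far: int,
) -> int:
    """"After": use rows generated so far until a task has finished."""
    estimate = estimate_total_rows(
        estimated_num_tasks, rows_from_finished_tasks, num_tasks_finished
    )
    if estimate is None:
        return rows_generated_so_far
    return estimate


# No task finished yet: the first variant has nothing to show.
no_estimate = estimate_total_rows(10, 0, 0)
# Two finished tasks produced 300 rows, with 10 tasks expected in total.
extrapolated = estimate_total_rows(10, 300, 2)
# Fallback while the first task is still running and has emitted 120 rows.
fallback = estimate_total_rows_with_fallback(10, 0, 0, 120)
```

The fallback gives the bar something to display early, at the cost of an estimate that is known to be too low until the first task finishes.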

bveeramani · 2025-10-16T06:19:43Z

@iamjustinhsu what's the code you used for the example recordings?

iamjustinhsu · 2025-10-16T20:48:18Z

The code is above the sample recordings:

import time

import ray


def random_blocks(x):
    # Number of list elements in each yielded batch (not literally bits).
    bits_to_allocate = 1 * 1024 * 1024
    for i in range(5):
        time.sleep(1)
        # After the first yield, the progress bar will update.
        yield {'item': [0] * bits_to_allocate}


# Target a maximum block size of 1 MiB.
ray.data.DataContext.get_current().target_max_block_size = 1 * 1024 * 1024
ray.data.range(100, override_num_blocks=2).map_batches(random_blocks, concurrency=1).materialize()

bveeramani · 2025-10-20T16:31:52Z

Gotcha.

I'm worried that some users might feel confused if they see the progress bar stuck at 100% for a while before it abruptly updates to <100%. Might be clearer to show no progress bar than an inaccurate one, though I can understand the argument that always showing a progress bar might look nicer.

@alexeykudinkin what's your take?
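
The concern above can be illustrated with a small simulation. This is a sketch under assumptions, not the PR's code: `displayed_progress` and the snapshot tuples are hypothetical, and the bar is assumed to show rows generated divided by the estimated total.

```python
from typing import List, Tuple

# Each snapshot: (rows_generated, rows_from_finished_tasks, num_tasks_finished).
Snapshot = Tuple[int, int, int]


def displayed_progress(
    snapshots: List[Snapshot], estimated_num_tasks: int
) -> List[float]:
    """Return the fraction a progress bar would show at each snapshot."""
    fractions = []
    for rows_generated, rows_finished, tasks_finished in snapshots:
        if tasks_finished == 0:
            # Fallback: treat rows generated so far as the total.
            total = rows_generated
        else:
            # Extrapolate from the average rows per finished task.
            total = round(estimated_num_tasks * rows_finished / tasks_finished)
        fractions.append(rows_generated / total if total else 0.0)
    return fractions


snapshots = [
    (0, 0, 0),      # nothing generated yet
    (20, 0, 0),     # first task is still running
    (60, 0, 0),
    (100, 100, 1),  # first task finished with 100 rows
    (150, 100, 1),  # second task is running
]
fractions = displayed_progress(snapshots, estimated_num_tasks=4)
```

With four expected tasks, the displayed fraction sits at 1.0 while the first task runs and then falls to 0.25 once that task finishes, which is the abrupt jump described above.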

iamjustinhsu · 2025-10-20T17:50:05Z

@bveeramani for more context, this was mainly annoying because of ReadParquet->SplitBlocks(N), because you don't get any updates until after the 1st task finishes. If N is large (which I have seen to be in the several hundreds), nothing is updating since SplitBlocks is in a single task.

bveeramani · 2025-10-20T18:32:54Z

> @bveeramani for more context, this was mainly annoying because of ReadParquet->SplitBlocks(N), because you don't get any updates until after the 1st task finishes. If N is large (which I have seen to be in the several hundreds), nothing is updating since SplitBlocks is in a single task.

Yeah, this can happen whenever tasks produce multiple outputs, which can be especially common with something like split blocks
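
As a rough illustration of a single task producing many outputs, the sketch below uses a plain Python generator in place of a Ray streaming task. `split_block` and `run_task` are hypothetical helpers, not Ray Data operators.

```python
from typing import Iterator, List, Tuple


def split_block(rows: List[int], num_splits: int) -> Iterator[List[int]]:
    """Yield roughly `num_splits` equal slices of `rows`, one at a time."""
    chunk = -(-len(rows) // num_splits)  # ceiling division
    for start in range(0, len(rows), chunk):
        yield rows[start:start + chunk]


def run_task(rows: List[int], num_splits: int) -> List[Tuple[str, int]]:
    """Record what an observer sees: one 'output' per slice, then 'finished'."""
    events = []
    for piece in split_block(rows, num_splits):
        events.append(("output", len(piece)))
    events.append(("finished", len(rows)))
    return events


events = run_task(list(range(1000)), num_splits=200)
outputs_before_finish = sum(1 for kind, _ in events if kind == "output")
```

An estimate keyed only on the `finished` event sees nothing until all 200 outputs have been emitted, whereas one keyed on `output` events can update after each slice.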

alexeykudinkin · 2025-10-21T22:25:28Z

python/ray/data/_internal/execution/operators/map_operator.py

+                self._estimated_num_output_bundles = (
+                    self._metrics.num_task_outputs_generated
+                )
+                self._estimated_output_num_rows = (
+                    self._metrics.rows_task_outputs_generated
+                )


Hold on, these are what we estimate the total # of rows/bundles to be, not what has been generated so far, right?

iamjustinhsu · 2025-10-21

_estimated_num_output_bundles and _estimated_output_num_rows estimate the total number. But as a crude heuristic, I wanted these to update sooner rather than later. These variables are mainly used for user-facing progress indicators (like the progress bar). After the 1st task finishes, the progress bar will behave as before.

alexeykudinkin · 2025-10-23

But these values don't make sense, right?

As a reader of the code, I'm scratching my head, since we're now muddling their semantics.

iamjustinhsu · 2025-10-23

I understand. If a variable is named as an estimate, its semantics imply that it's a guess rather than an exact value, and any estimate can turn out to be wrong. IMO a crude estimate is still better than None. Incidentally, I had originally thought of doing something like this:

if self._metrics.num_tasks_finished == 0:
    estimated_num_tasks = (
        self.upstream_op_num_outputs
        / metrics.num_inputs_received
        * num_tasks_submitted
    )
    ratio = estimated_num_tasks / self._metrics.num_tasks_running
    self._estimated_num_output_bundles = (
        self._metrics.num_task_outputs_generated * ratio
    )
    self._estimated_output_num_rows = (
        self._metrics.rows_task_outputs_generated * ratio
    )

but at this point the complexity of the estimate outweighs its usefulness, since it only applies temporarily while num_tasks_finished == 0.

Regardless, I'm going to change the intent of the PR to a refactor instead, because there is still stuff in this PR that needs to change.
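
For readers who want to try the ratio heuristic from this comment with concrete numbers, here is a standalone sketch. It was not merged: `MetricsSnapshot` and `early_estimate` are hypothetical names, the field names are borrowed from the snippet in the comment, and the zero guards are an added assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MetricsSnapshot:
    """Hypothetical stand-in for the operator metrics named in the comment."""

    num_inputs_received: int
    num_tasks_submitted: int
    num_tasks_running: int
    num_tasks_finished: int
    num_task_outputs_generated: int
    rows_task_outputs_generated: int


def early_estimate(
    upstream_op_num_outputs: int, m: MetricsSnapshot
) -> Optional[Tuple[int, int]]:
    """Scale outputs of running tasks up to the expected number of tasks.

    Returns (estimated_bundles, estimated_rows), or None when the heuristic
    does not apply.
    """
    if m.num_tasks_finished != 0:
        return None  # the regular per-finished-task estimate takes over
    if m.num_inputs_received == 0 or m.num_tasks_running == 0:
        return None  # avoid dividing by zero
    estimated_num_tasks = (
        upstream_op_num_outputs / m.num_inputs_received * m.num_tasks_submitted
    )
    ratio = estimated_num_tasks / m.num_tasks_running
    return (
        round(m.num_task_outputs_generated * ratio),
        round(m.rows_task_outputs_generated * ratio),
    )


snapshot = MetricsSnapshot(
    num_inputs_received=10,
    num_tasks_submitted=2,
    num_tasks_running=2,
    num_tasks_finished=0,
    num_task_outputs_generated=6,
    rows_task_outputs_generated=600,
)
estimate = early_estimate(upstream_op_num_outputs=100, m=snapshot)
```

Here 100 upstream outputs at 5 inputs per task imply 20 tasks, so the outputs of the 2 running tasks are scaled by 10.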

bveeramani · 2025-10-23T20:33:49Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

+                upstream_op_num_outputs / metrics.average_num_inputs_per_task
            )

        estimated_num_output_bundles = round(
-            estimated_num_tasks
-            * metrics.num_outputs_of_finished_tasks
-            / metrics.num_tasks_finished
+            estimated_num_tasks * metrics.average_num_outputs_per_task
        )
        estimated_output_num_rows = round(
-            estimated_num_tasks
-            * metrics.rows_task_outputs_generated
-            / metrics.num_tasks_finished
+            estimated_num_tasks * metrics.average_rows_outputs_per_task


If these metrics are None, this code will raise a TypeError:

- average_num_inputs_per_task
- average_num_outputs_per_task
- average_rows_outputs_per_task

This can't happen right now because we check num_tasks_finished > 0, and the three metrics aren't None when num_tasks_finished > 0. But, that's an implementation detail, and isn't guaranteed by their interface.

This code would be more robust if you replaced this check:

and metrics.num_inputs_received > 0 and metrics.num_tasks_finished > 0

with explicit checks that the metrics aren't None.
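
A minimal sketch of what such explicit checks could look like follows. It is an illustration rather than the merged code: `TaskAverages` and `estimate_outputs` are hypothetical, only the three metric names come from the diff above, and the extra zero check is an assumption prompted by the later "check for None or 0" commit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TaskAverages:
    """Hypothetical container; each average is None until it can be computed."""

    average_num_inputs_per_task: Optional[float]
    average_num_outputs_per_task: Optional[float]
    average_rows_outputs_per_task: Optional[float]


def estimate_outputs(
    upstream_op_num_outputs: int, averages: TaskAverages
) -> Optional[Tuple[int, int]]:
    """Return (estimated_bundles, estimated_rows), or None if not computable."""
    inputs_per_task = averages.average_num_inputs_per_task
    outputs_per_task = averages.average_num_outputs_per_task
    rows_per_task = averages.average_rows_outputs_per_task

    # Check the values actually used, not a proxy such as num_tasks_finished.
    if inputs_per_task is None or outputs_per_task is None or rows_per_task is None:
        return None
    if inputs_per_task == 0:
        return None  # avoid dividing by zero

    estimated_num_tasks = upstream_op_num_outputs / inputs_per_task
    return (
        round(estimated_num_tasks * outputs_per_task),
        round(estimated_num_tasks * rows_per_task),
    )


missing = estimate_outputs(100, TaskAverages(None, None, None))
ready = estimate_outputs(100, TaskAverages(2.0, 3.0, 50.0))
```

Guarding on the values themselves keeps the arithmetic safe even if the relationship between `num_tasks_finished` and the averages changes later.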

[data] Refine estimate for total_num_rows in progress bars

e8a6d08

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu commented Oct 1, 2025

iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Oct 1, 2025

iamjustinhsu added 2 commits October 6, 2025 17:32

keep finished + ongoing tasks

a909536

Signed-off-by: iamjustinhsu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into jhsu…

9690830

…/use-outputs-generated-for-pg

iamjustinhsu marked this pull request as ready for review October 9, 2025 21:19

iamjustinhsu requested a review from a team as a code owner October 9, 2025 21:19

iamjustinhsu added 2 commits October 9, 2025 14:29

move estimation in output generation

648e5b5

Signed-off-by: iamjustinhsu <[email protected]>

remove print

e3f5554

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu marked this pull request as draft October 9, 2025 21:45

more accurate

aa2a6b8

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu commented Oct 9, 2025

iamjustinhsu marked this pull request as ready for review October 9, 2025 21:58

ray-gardener bot added the "data" label (Ray Data-related issues) Oct 10, 2025

iamjustinhsu added 3 commits October 10, 2025 13:28

case for num_tasks_finished == 0

f63c0c1

Signed-off-by: iamjustinhsu <[email protected]>

remove code

444cebb

Signed-off-by: iamjustinhsu <[email protected]>

comments

8a84a06

Signed-off-by: iamjustinhsu <[email protected]>

goutamvenkat-anyscale approved these changes Oct 14, 2025

iamjustinhsu commented Oct 15, 2025

bveeramani reviewed Oct 16, 2025

alexeykudinkin reviewed Oct 21, 2025

Merge branch 'master' into jhsu/use-outputs-generated-for-pg

e8278df

iamjustinhsu added 2 commits October 22, 2025 22:27

fix

f3d854c

Signed-off-by: iamjustinhsu <[email protected]>

lint

f8ba30a

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu changed the title ~~[data] Refine estimate for total_num_rows in progress bars~~ [data] Refractor progress bar clearer metrics Oct 23, 2025

bveeramani approved these changes Oct 23, 2025

check for None or 0

37bd144

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu force-pushed the jhsu/use-outputs-generated-for-pg branch from c85111d to 37bd144 October 23, 2025 22:11

alexeykudinkin merged commit 77a96d9 into ray-project:master Oct 24, 2025
6 checks passed

iamjustinhsu deleted the jhsu/use-outputs-generated-for-pg branch October 24, 2025 19:40
