Conversation

@400Ping
Contributor

@400Ping 400Ping commented Nov 17, 2025

Description

One of the PyTorch examples intermittently fails with an error like this:

AssertionError: Expected Internal Input Queue for MapBatches(ResnetModel) to be empty, but found 1 bundles

This assertion can fail because of two bugs:

  1. We don't call done_adding_bundles before clearing the block ref bundler
  2. We don't clear the actor pool's bundle queue

This PR fixes those bugs and deflakes the test.

Related issues

Closes #58546

Additional information

@400Ping 400Ping requested a review from a team as a code owner November 17, 2025 04:31
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a flaky test by fixing a race condition in ActorPoolMapOperator. The change adds a call to _dispatch_tasks() in all_inputs_done() to ensure any queued bundles are processed when no more inputs are expected. This is a solid fix for the described problem. I've added one minor suggestion to clean up some redundant code in the same method for improved maintainability.
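
For reference, a minimal sketch of the change described here, with the base class stubbed so the snippet runs standalone. The method names (all_inputs_done, _dispatch_tasks) are taken from this comment; this is not the exact diff.

    # Sketch only, not the actual Ray Data code; MapOperator is a stand-in stub.
    class MapOperator:
        def all_inputs_done(self):
            pass

        def _dispatch_tasks(self):
            pass


    class ActorPoolMapOperator(MapOperator):
        def all_inputs_done(self):
            super().all_inputs_done()
            # No more inputs will arrive, so flush any queued bundles now and
            # leave the internal input queue empty when the operator finishes.
            self._dispatch_tasks()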

@400Ping
Contributor Author

400Ping commented Nov 17, 2025

cc @owenowenisme

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 17, 2025
@iamjustinhsu
Contributor

Hi @400Ping, thanks for your contribution! We really appreciate the effort.

I think @owenowenisme might be right here; all_inputs_done should already be called. My hunch is that the BlockRefBundler class is not correctly finalizing its outputs. From here:

            if (
                output_buffer_size < self._min_rows_per_bundle
                or output_buffer_size == 0
            ):

it looks like we should be doing

            if (
                output_buffer_size < self._min_rows_per_bundle
                or output_buffer_size == 0
                or self._finalized
            ):

so that there are no more remainders. Can you try that and report back?
cc: @bveeramani
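
For context, here is a rough reconstruction of the surrounding loop from the fragments quoted in this thread (the helper name, output_buffer, and the return value are assumed; this is not the actual BlockRefBundler source). It shows how the suggested self._finalized condition, together with the break discussed further down, would keep trailing bundles out of the remainder once the bundler is finalized:

    # Reconstruction from fragments in this thread, not the real source.
    def _split_bundle_buffer(self):  # hypothetical helper name
        remainder = []
        output_buffer = []
        output_buffer_size = 0
        for idx, bundle in enumerate(self._bundle_buffer):
            bundle_size = bundle.num_rows()
            if (
                output_buffer_size < self._min_rows_per_bundle
                or output_buffer_size == 0
                or self._finalized  # suggested: once finalized, hold nothing back
            ):
                output_buffer.append(bundle)
                output_buffer_size += bundle_size
            else:
                # Bundles from idx onward stay buffered for the next output.
                remainder = self._bundle_buffer[idx:]
                break  # suggested: stop so `remainder` is not reassigned
        return output_buffer, remainder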

@bveeramani
Member

Hey @400Ping, would you mind helping me understand the root cause of the assertion error?

Also, did the repro script fail before the changes?

@400Ping
Contributor Author

400Ping commented Nov 19, 2025

> Hey @400Ping, would you mind helping me understand the root cause of the assertion error?
>
> Also, did the repro script fail before the changes?

Ok, will try to find it.

@iamjustinhsu iamjustinhsu self-requested a review November 19, 2025 19:06
@bveeramani bveeramani self-assigned this Nov 19, 2025
                output_buffer_size += bundle_size
            else:
                remainder = self._bundle_buffer[idx:]
                break
Contributor


@bveeramani it feels like this break should have been there from the beginning, right?

@400Ping 400Ping requested a review from iamjustinhsu November 22, 2025 03:36
@400Ping
Contributor Author

400Ping commented Nov 23, 2025

cc @bveeramani PTAL

@400Ping
Contributor Author

400Ping commented Nov 23, 2025

Not very sure if this is the right way to solve this.

            if (
                output_buffer_size < self._min_rows_per_bundle
                or output_buffer_size == 0
                or self._finalized
Contributor

@iamjustinhsu iamjustinhsu Nov 24, 2025


@bveeramani my impression is that when self._finalized=True (i.e., when an operator is completed()), it is possible for this for loop to enter the else branch below, populating remainder with non-empty ref bundles.

I also think the break statement is necessary; otherwise, remainder keeps getting reassigned.

@bveeramani
Member

@400Ping @iamjustinhsu I was able to create a minimal repro of this issue (at least, I think this is the same issue)

import ray


class Fn:

    def __call__(self, batch):
        return batch


ds = ray.data.range(100, override_num_blocks=100).map_batches(Fn, batch_size=10).limit(1)
for _ in ds.iter_internal_ref_bundles():
    pass
Traceback (most recent call last):                                                                                                                                      
  File "/Users/balaji/ray/1.py", line 12, in <module>
    for _ in ds.iter_internal_ref_bundles():                                                                                                                            
  File "/Users/balaji/ray/python/ray/data/_internal/execution/interfaces/executor.py", line 34, in __next__
    return self.get_next()
  File "/Users/balaji/ray/python/ray/data/_internal/execution/legacy_compat.py", line 76, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
  File "/Users/balaji/ray/python/ray/data/_internal/execution/streaming_executor.py", line 786, in get_next
    bundle = state.get_output_blocking(output_split_idx)
  File "/Users/balaji/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 454, in get_output_blocking
    raise self._exception
  File "/Users/balaji/ray/python/ray/data/_internal/execution/streaming_executor.py", line 356, in run
    continue_sched = self._scheduling_loop_step(self._topology)
  File "/Users/balaji/ray/python/ray/data/_internal/execution/streaming_executor.py", line 529, in _scheduling_loop_step
    self._validate_operator_queues_empty(op, state)
  File "/Users/balaji/ray/python/ray/data/_internal/execution/streaming_executor.py", line 571, in _validate_operator_queues_empty
    assert op.internal_input_queue_num_blocks() == 0, error_msg.format(
AssertionError: Expected Internal Input Queue for MapBatches(Fn) to be empty, but found 8 bundles

@iamjustinhsu would a reasonable fix be to just clear the internal queues when the operator is manually marked finished?

    def mark_execution_finished(self):
        # Discard remaining bundles in the internal bundle queue.
        self._bundle_queue.clear()

        # Discard remaining bundles in the block ref bundler.
        self._block_ref_bundler.done_adding_bundles()
        while self._block_ref_bundler.has_bundle():
            self._block_ref_bundler.get_next_bundle()

        super().mark_execution_finished()

@iamjustinhsu
Contributor

@bveeramani oh, I see now. We actually do this for all InternalQueueOperatorMixin operators. However, I implemented this for MapOperator, thinking ActorPoolMapOperator uses the same underlying implementation. It actually doesn't; it contains an additional queue called _bundle_queue. Here is my PR for reference: #58441. What we should do is also clear self._bundle_queue in ActorPoolMapOperator. This can be done by having separate implementations for TaskPoolMapOperator and ActorPoolMapOperator.

@bveeramani
Member

Ah, okay.

@400Ping would you mind refactoring the PR or opening a new PR to do the following:

  • Delete MapOperator.clear_internal_output_queue and MapOperator.clear_internal_input_queue
  • Add TaskPoolMapOperator.clear_internal_output_queue and TaskPoolMapOperator.clear_internal_input_queue
  • Add ActorPoolMapOperator.clear_internal_output_queue and ActorPoolMapOperator.clear_internal_input_queue

I think this is all that we need to do to solve the flaky issue.
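
For illustration, a rough sketch of the per-operator split (class, attribute, and method names come from this thread, the base classes are elided, and only the input-queue side shown above is sketched; the real implementations may differ):

    # Sketch only, not the actual diff; base classes are elided.
    class TaskPoolMapOperator:
        def clear_internal_input_queue(self) -> None:
            # Task-pool operators only buffer inputs in the block ref bundler.
            self._block_ref_bundler.done_adding_bundles()
            while self._block_ref_bundler.has_bundle():
                self._block_ref_bundler.get_next_bundle()


    class ActorPoolMapOperator:
        def clear_internal_input_queue(self) -> None:
            # Actor-pool operators also keep their own _bundle_queue, which
            # must be drained as well.
            self._bundle_queue.clear()
            self._block_ref_bundler.done_adding_bundles()
            while self._block_ref_bundler.has_bundle():
                self._block_ref_bundler.get_next_bundle()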

Member

@bveeramani bveeramani left a comment


The implementations for clear_internal_input_queue and clear_internal_output_queue LGTM.

@400Ping to keep the git history clear, would you mind reverting all of the changes unrelated to those methods (e.g., formatting, removing _inputs_done, type annotations)? Once we've reverted the unrelated changes, I'll approve the PR

@400Ping
Contributor Author

400Ping commented Nov 26, 2025

> The implementations for clear_internal_input_queue and clear_internal_output_queue LGTM.
>
> @400Ping to keep the git history clear, would you mind reverting all of the changes unrelated to those methods (e.g., formatting, removing _inputs_done, type annotations)? Once we've reverted the unrelated changes, I'll approve the PR

Ok.

Signed-off-by: 400Ping <[email protected]>
@400Ping 400Ping force-pushed the data/fix-pytorch_resnet_batch_prediction-flaky branch from f47460e to 04c6b1d on November 26, 2025 06:39
@400Ping 400Ping requested a review from bveeramani November 26, 2025 06:41
Member

@bveeramani bveeramani left a comment


LGTM! 🚢

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Nov 26, 2025
Signed-off-by: Balaji Veeramani <[email protected]>
@400Ping
Contributor Author

400Ping commented Nov 26, 2025

Thanks for the fix!

@bveeramani bveeramani merged commit 6a14c93 into ray-project:master Nov 26, 2025
6 checks passed
@bveeramani
Member

@400Ping Just merged! ty for the contribution!

@400Ping
Contributor Author

400Ping commented Nov 26, 2025

> @400Ping Just merged! ty for the contribution!

Thank you as well; I'm a newbie in this area 😓.

@bveeramani bveeramani changed the title [Data][Flaky] pytorch_resnet_batch_prediction is flaky [Data][Flaky] Ensure ActorPoolMapOperator clears all queues on completion Nov 26, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…8694)

## Description
The test fails intermittently with an assertion error indicating that
the internal input queue for a MapBatches operator is not empty when
it's expected to be. This suggests a race condition or timing issue in
the streaming executor's queue management.

## Related issues
Closes ray-project#58546 

## Additional information

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>