[Data] Speed up checkpoint filter and reduce memory usage #60294
wxwmd wants to merge 1 commit into ray-project:master
Conversation
Code Review
This pull request refactors the checkpoint filtering mechanism to use a global actor that holds the checkpointed IDs as a NumPy array. This significantly improves memory efficiency and performance by avoiding passing the large checkpoint data to each task. The changes to the build and CI scripts seem appropriate for the internal environment.
My review identified a critical issue where the batch-based filtering path (filter_rows_for_batch and its caller) was not updated to align with the new actor-based architecture, which will lead to runtime errors. I also found a medium-severity performance issue due to using ray.get() inside a loop and a minor issue with a leftover debug print statement.
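On the `ray.get()`-in-a-loop point: the usual fix is to submit all remote calls first and fetch the results with a single batched `ray.get()`. A minimal sketch of that pattern (`filter_block` and the toy data are placeholders, not identifiers from this PR):

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def filter_block(block):
    # Placeholder for the real per-block filtering work.
    return [x for x in block if x % 2 == 0]

blocks = [list(range(i, i + 10)) for i in range(0, 100, 10)]

# Anti-pattern: calling ray.get() inside the loop blocks on each task in turn.
# results = [ray.get(filter_block.remote(b)) for b in blocks]

# Better: submit every task first, then fetch all results with one ray.get().
refs = [filter_block.remote(b) for b in blocks]
results = ray.get(refs)
```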
Great job! I have a few questions:
Maybe we could turn the actor into a sharded design (multiple actors, partitioned by …)?
@owenowenisme @raulchen please check this when you have time 😊
job_id = ray.get_runtime_context().get_job_id()
self._ckpt_filter = BatchBasedCheckpointFilter.options(
    name=f"checkpoint_filter_{job_id}",
    lifetime="detached",
Do we have to make the actor detached?
I will test the lifecycle of this actor.
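For reference, a minimal sketch of how a detached, named actor behaves; the class and actor name here are illustrative, not the PR's code. The main cost of `lifetime="detached"` is that the actor outlives the creating driver and must be cleaned up explicitly:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class CheckpointFilterDemo:  # illustrative stand-in, not the PR's class
    def ping(self):
        return "ok"

# A detached, named actor is not tied to the creating driver's lifetime ...
actor = CheckpointFilterDemo.options(
    name="checkpoint_filter_demo",
    lifetime="detached",
    get_if_exists=True,
).remote()

# ... it can be looked up by name later, e.g. from another driver in the same namespace ...
same_actor = ray.get_actor("checkpoint_filter_demo")
assert ray.get(same_actor.ping.remote()) == "ok"

# ... and it keeps running until it is killed explicitly.
ray.kill(actor)
```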
job_id = ray.get_runtime_context().get_job_id()
self._ckpt_filter = BatchBasedCheckpointFilter.options(
    name=f"checkpoint_filter_{job_id}",
    lifetime="detached",
Since you're using an actor now and the ThreadPoolExecutor is removed, can we use max_concurrency here?
You mean using a thread pool to perform the filtering? I think that is a good idea, I will implement it.
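A sketch of what that could look like: setting max_concurrency on the actor runs its method calls on an internal thread pool, so several blocks can be filtered at once as long as the shared state stays read-only. The class name, sizes, and mask logic below are assumptions for illustration, not the PR's implementation:

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class ThreadedCheckpointFilter:  # illustrative name, not the PR's class
    def __init__(self, checkpoint_ids: np.ndarray):
        # Sorted once and read-only afterwards, so concurrent calls are safe.
        self._ids = np.sort(checkpoint_ids)

    def filter(self, batch_ids: np.ndarray) -> np.ndarray:
        # Keep only ids that are NOT already checkpointed.
        pos = np.clip(np.searchsorted(self._ids, batch_ids), 0, len(self._ids) - 1)
        return batch_ids[self._ids[pos] != batch_ids]

# max_concurrency=8 lets up to 8 filter() calls run concurrently on this actor.
flt = ThreadedCheckpointFilter.options(max_concurrency=8).remote(np.arange(0, 1_000_000, 2))
print(ray.get(flt.filter.remote(np.array([1, 2, 3, 4]))))  # -> [1 3]
```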
Sorry for reviewing this so late, and thanks for the beautiful diagram!
When I was reviewing your global actor approach, I had another idea. The actor introduces a serial bottleneck: every read task has to ship its block to the single actor for filtering and wait for the result. Without max_concurrency, calls are processed one at a time, which could be a significant throughput regression compared to the old design, where each worker filtered locally in parallel.
Instead, what if we keep the filtering local in each worker but broadcast the checkpoint IDs as a numpy array via the object store?
The approach would be (a rough sketch follows after the list of benefits below):
- Load the checkpoint data and convert it to a sorted numpy array (your PR already does this in `_postprocess_block`; nice work on that part!)
- Use a remote task to do the heavy conversion, then `ray.put()` the numpy array into the object store
- Pass the ObjectRef to each read task via `add_map_task_kwargs_fn` (the old mechanism)
- Each worker calls `ray.get(ref)` to get a zero-copy read-only view from the local object store, then runs `searchsorted` locally
This gives us:
- Parallelism: filtering is parallel across all workers, no bottleneck
- Memory efficiency: Ray's object store stores one copy per node in shared memory, and all workers on the same node share it via zero-copy
- Reduced re-computation: the conversion from Arrow blocks to numpy happens only once, not in every task
- Simplicity: No actor needed
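A rough sketch of that broadcast approach, using only core Ray APIs (the function name and data here are illustrative; the real integration point would be Ray Data's `add_map_task_kwargs_fn`):

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

# Convert the checkpoint ids to a sorted numpy array once and put it into the
# object store; every node ends up with at most one shared-memory copy.
ckpt_ids = np.sort(np.random.randint(0, 10_000_000, size=1_000_000))
ckpt_ref = ray.put(ckpt_ids)

@ray.remote
def read_and_filter(block_ids: np.ndarray, ckpt_ref_holder) -> np.ndarray:
    # The ref is wrapped in a list so Ray does not resolve it automatically;
    # ray.get() returns a zero-copy, read-only view from the local object store.
    ids = ray.get(ckpt_ref_holder[0])
    pos = np.clip(np.searchsorted(ids, block_ids), 0, len(ids) - 1)
    already_done = ids[pos] == block_ids
    return block_ids[~already_done]

blocks = [np.random.randint(0, 10_000_000, size=4096) for _ in range(8)]
remaining = ray.get([read_and_filter.remote(b, [ckpt_ref]) for b in blocks])
```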
Hi Youcheng, thanks for reviewing! When I first tackled this problem, I had the same idea as you: perform the filtering locally in each worker. I implemented and tested this approach, and it has one issue: the NumPy array is too large. Having each worker keep a ~10 GB object in memory is unacceptable. Our cluster has about 1,000 nodes, which means roughly 10,000 GB of memory would be used only for the checkpoint.
Got it, I think this is valid. One problem is that we should avoid this actor becoming a bottleneck.
Yes, I will use max_concurrency to enhance the actor in this PR.
@owenowenisme what do you think about implementing a checkpoint actor pool? I think this would solve the single-actor bottleneck.
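One possible shape for that, sketched with `ray.util.ActorPool`: here every actor holds the full id array and blocks are spread across the pool purely to relieve the single-actor bottleneck (a sharded variant would partition the ids instead). All names and sizes below are illustrative:

```python
import numpy as np
import ray
from ray.util import ActorPool

ray.init(ignore_reinit_error=True)

@ray.remote
class CheckpointFilterWorker:  # illustrative, not the PR's class
    def __init__(self, checkpoint_ids: np.ndarray):
        self._ids = np.sort(checkpoint_ids)

    def filter(self, batch_ids: np.ndarray) -> np.ndarray:
        pos = np.clip(np.searchsorted(self._ids, batch_ids), 0, len(self._ids) - 1)
        return batch_ids[self._ids[pos] != batch_ids]

ids = np.arange(0, 1_000_000, 2)
pool = ActorPool([CheckpointFilterWorker.remote(ids) for _ in range(4)])

blocks = [np.random.randint(0, 1_000_000, size=4096) for _ in range(16)]
# ActorPool.map schedules each block on the next free actor and yields results in order.
filtered = list(pool.map(lambda actor, block: actor.filter.remote(block), blocks))
```

Replicating the full array across a small pool trades some memory for throughput; a sharded design would keep memory flat but needs a partitioning key.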

Current checkpoint:
The current implementation has two issues: the checkpoint filter is slow, and the full checkpoint data is passed to every read task, which consumes a large amount of memory.
Improved Checkpoint:
Maintain a global `checkpoint_filter` actor that holds the `checkpoint_ids` array; this actor is responsible for filtering all input blocks. There are two advantages to this approach:
- Memory: the checkpoint data is not copied into every read task; only the `checkpoint_filter` actor holds it.
- Speed: the checkpoint IDs are prepared (sorted) once inside the actor, so blocks are filtered without repeating that work in every task.
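Putting the diff snippets and this description together, a minimal end-to-end sketch of the design (`BatchBasedCheckpointFilter` and the actor name come from the diff above; the mask-based API, the max_concurrency value, and the data are assumptions for illustration):

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class BatchBasedCheckpointFilter:
    def __init__(self, checkpoint_ids: np.ndarray):
        # The only full copy of the checkpoint ids lives inside this actor.
        self._ids = np.sort(checkpoint_ids)

    def filter_block(self, id_column: np.ndarray) -> np.ndarray:
        # Boolean mask of rows that are NOT yet checkpointed.
        pos = np.clip(np.searchsorted(self._ids, id_column), 0, len(self._ids) - 1)
        return self._ids[pos] != id_column

job_id = ray.get_runtime_context().get_job_id()
ckpt_filter = BatchBasedCheckpointFilter.options(
    name=f"checkpoint_filter_{job_id}",
    lifetime="detached",
    max_concurrency=8,  # per the review discussion above
    get_if_exists=True,
).remote(np.arange(0, 1_000_000, 2))

@ray.remote
def read_task(block_id_column: np.ndarray) -> np.ndarray:
    # Each read task ships only its id column to the global filter actor.
    name = f"checkpoint_filter_{ray.get_runtime_context().get_job_id()}"
    mask = ray.get(ray.get_actor(name).filter_block.remote(block_id_column))
    return block_id_column[mask]

blocks = [np.random.randint(0, 1_000_000, size=4096) for _ in range(8)]
remaining = ray.get([read_task.remote(b) for b in blocks])

ray.kill(ckpt_filter)  # detached actors must be cleaned up explicitly
```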
Performance test
Test code:
Node: 16 cores with 64 GB of memory (make sure you have at least 16 GB of memory available to avoid OOM).
Original Ray:
Speedup:
Test Result
Original: 680 s
Speedup: 190 s
The overall running time of the job drops by roughly 3.6x, from 680 s to 190 s.
Memory
If we delete this row:
The original Ray OOMs, while the fixed Ray passes. This demonstrates that this PR also improves stability.