@lee1258561
This reverts commit 460bce4.

Description

The reverted change likely reintroduces the issue that a previous PR tried to solve: #55978

Specifically, it adds the full list of paths to self. For FileBasedDatasource, self is captured by every read_task_fn during serialization, so the path list is duplicated in each read task and causes excessive spilling.

We see a similar warning and the same spilling behavior:

The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'.
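
The capture mechanism described above can be illustrated with a minimal sketch. Everything in it is hypothetical and is not Ray's code: `ToyDatasource`, `captured_bytes`, and the example paths are stand-ins. Ray serializes read functions with cloudpickle; this stdlib-only approximation just pickles the values each closure captures, which is enough to show why referencing `self` drags the whole path list into every task.

```python
import pickle


def captured_bytes(fn):
    """Approximate the pickled size of the values a closure captures.

    Hypothetical helper: pickles each captured value, or its attribute
    dict for plain objects, instead of serializing the function itself.
    """
    total = 0
    for cell in fn.__closure__ or ():
        value = cell.cell_contents
        # For plain objects, measure their attributes rather than the class.
        state = vars(value) if hasattr(value, "__dict__") else value
        total += len(pickle.dumps(state))
    return total


class ToyDatasource:
    """Hypothetical stand-in for a file-based datasource."""

    def __init__(self, paths):
        # The full path list lives on self, as in the reverted change.
        self._source_paths = list(paths)

    def make_read_fn_capturing_self(self, chunk):
        def read_task_fn():
            # Referencing self pulls the whole instance into the closure.
            return [p for p in chunk if p in self._source_paths]

        return read_task_fn

    def make_read_fn_capturing_chunk(self, chunk):
        def read_task_fn():
            # Only the per-task chunk of paths is captured.
            return list(chunk)

        return read_task_fn


# Made-up paths, purely for illustration.
paths = [f"s3://bucket/part-{i:06d}.parquet" for i in range(10_000)]
ds = ToyDatasource(paths)
chunk = paths[:10]

heavy = captured_bytes(ds.make_read_fn_capturing_self(chunk))
light = captured_bytes(ds.make_read_fn_capturing_chunk(chunk))
print(f"captures self: {heavy} bytes, captures chunk only: {light} bytes")
```

In this sketch the closure that touches `self` carries the full path list, so its captured state grows with the total number of files rather than with the size of the task's own chunk. That cost is paid once per read task.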

Related issues

#55978


@lee1258561 requested a review from a team as a code owner on January 9, 2026, 08:30

@gemini-code-assist (bot) left a comment


Code Review

This pull request reverts a change that added _source_paths to FileBasedDatasource and ParquetDatasource for lineage tracking. As you've noted in the description, this change introduced a significant performance regression due to the increased serialization size of read tasks, leading to excessive object spilling. The revert correctly removes the problematic _source_paths attribute and the corresponding test assertions. This is a necessary fix for the performance issue, and I approve of this change. Given that this re-introduces the issue with lineage tracking, it would be beneficial to create a follow-up ticket to explore a more performant solution for tracking source paths.

@ray-gardener (bot) added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Jan 9, 2026
@lee1258561 added the go (add ONLY when ready to merge, run all tests) label on Jan 9, 2026
