Proposal to Introduce Ray SQL into DataFusion Python #872

austin362667 · 2024-09-15T18:46:53Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.

Ray SQL, with only 1.7k lines of Rust, is significantly simpler compared to Ballista’s 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, showcasing its robustness and simplicity. This reduction in complexity could ease maintenance, lower the learning curve for contributors, and provide a functional distributed SQL solution that is much simpler than Ballista because Ray provides so much – there is no need for us to build scheduler and executor processes – we can simply execute Python tasks in the Ray cluster.

If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.

Describe the solution you'd like

Following options are suggested by @andygrove , all feedbacks are welcome!

Bringing the Ray SQL prototype into the DataFusion Python project

Describe alternatives you've considered

Building a new version inspired by the Ray SQL code

Additional context

We might need to go through the Apache IP clearance process for importing the external Ray SQL codebase in datafusion-contrib which is not part of Apache.

The text was updated successfully, but these errors were encountered:

andygrove · 2024-09-15T18:50:25Z

Ray SQL project: https://github.com/datafusion-contrib/ray-sql

andygrove · 2024-09-15T18:54:17Z

@franklsf95 fyi

jordandakota · 2024-09-15T19:08:00Z

Things I'm thinking of that should be showcased or discussed:

Showcase complexity/simplicity
- Setup and testing of ballista on tpc-h
- Setup and testing of raysql on tpc-h
Benchmark differences ballista/raysql
What do we gain/lose by moving towards adopting raysql
What does raysql provide that Ballista currently does, what would need to be added/maintained by the integration
- Need to assess covers all Ballista functionalities and if not document gaps
- Identify effort requires to close any gaps
Discuss learning curve differences
- At present it's quite complex setting up and running a Ballista cluster while you can start a Ray cluster in a few lines of code reliably
- Potential to attract contributors due to ease of use

edmondop · 2024-09-15T19:12:36Z

Is the complexity of Ballista coming from cluster management or query planning?
If Ray SQL can achieve similar or better performance than Ballista with only 1.7k loc by delegating cluster management to Ray and only keeping the distributed query planning responsibility, that's pretty amazing !

andygrove · 2024-09-15T19:51:10Z

Is the complexity of Ballista coming from cluster management or query planning?

Ballista and Ray SQL use the same query planning code. The complexity comes from cluster management/communication.

andygrove · 2024-09-15T22:12:18Z

Setup and testing of ballista on tpc-h

Since upgrading to a more recent DataFusion version, Ballista is no longer capable of running the TPC-H suite. It works fine in the 0.12.0 release.

ozankabak · 2024-09-16T06:11:22Z

I am curious about the right boundary here. If we go through with this, should the code become a part of datafusion-python, or should the former stay as a single node thing focusing on excellent bindings to core, and we should have a "datafusion-ray" that uses datafusion-python as a library?

austin362667 · 2024-09-16T07:41:10Z

Thanks @ozankabak I'm curious on the boundary, too.
One of the goals in Ray SQL is "Drive requirements for DataFusion's Python bindings". So my guess is that if we wanna use datafusion-python as a library for separate datafusion-ray, we might need to ensure APIs of datafusion-ray is a subset of datafusion-python APIs. Otherwise it'll not be able to easily expose complete bindings to python side via PyO3.

ozankabak · 2024-09-16T09:23:31Z

Indeed. Do you have an idea of how the subset would look like?

andygrove · 2024-09-16T13:34:20Z

To expand on this, there are several options for adopting the Ray SQL code as part of the DataFusion project.

Create new repo datafusion-ray-sql (possibly renamed to datafusion-ray or something else)
Replace the current Ballista code with Ray SQL (Ballista is already known as the distributed version of DataFusion)
Add as an optional module of the DataFusion Python bindings

vakarisbk · 2024-09-16T13:58:40Z

100% yes. If Ray SQL can handle all TPC-H queries with just 1.7k lines of code, it’s sounds like a no-brainer. This actually makes a production-ready distributed DataFusion sound achievable and it should attract more contributors.
Can't think of a single reason not to go this route.

alamb · 2024-09-17T00:31:59Z

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

jordandakota · 2024-09-17T00:59:20Z

Docs/Readme.md probably needs a quick update for anyone joining the catchup. Distributed shuffle using rays object store was implemented in #33. Get Outlook for Android<https://aka.ms/AAb9ysg>

…

________________________________ From: Andrew Lamb ***@***.***> Sent: Monday, September 16, 2024 6:32:22 PM To: apache/datafusion-python ***@***.***> Cc: Jordan Fox ***@***.***>; Comment ***@***.***> Subject: Re: [apache/datafusion-python] Proposal to Introduce Ray SQL into DataFusion Python (Issue #872) Caution: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe. One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray — Reply to this email directly, view it on GitHub<#872 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AHXVCFKTRUINKJ527GAANVDZW52BNAVCNFSM6AAAAABOIBWHDCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUGI3DMMJUGI>. You are receiving this because you commented.Message ID: ***@***.***>

andygrove · 2024-09-17T16:25:51Z

Here is the PR in ray-sql that added support for distributed shuffle using the Ray object store:

datafusion-contrib/ray-sql#33

andygrove · 2024-09-17T16:42:08Z

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

That's a great point. If the project were to become part of the Apache DataFusion project then I would certainly put time into maintaining it and helping build community around the project. I am not able to contribute in its current location.

I have recently been attempting to maintain Ballista by upgrading to more recent versions of DataFusion, but the project is large and complex and the original contributors of much of this code are no longer available to help, so it is challenging.

I believe that DataFusion + Ray is an opportunity to start fresh on a solution for distributed DataFusion as a much lighter weight alternative to Ballista and the project is small enough (~40 commits) that it will be easier for new contributors to follow along.

These are the initial tasks that I would plan on working on (with the community, hopefully) if we were to move forward with this proposal.

Make sure that the current code still runs with recent versions of Python + Ray
Update README to reflect that distributed execution is now supported
Run fresh benchmarks and compare to Ballista
Fix the outstanding bug Example produces incorrect results datafusion-contrib/ray-sql#44
Upgrade to more recent versions of DataFusion (one PR per version)

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

alamb · 2024-09-17T18:18:46Z

Given that I am +1 on this proposal 🚀

alamb · 2024-09-17T18:19:41Z

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

I think I have admin rights to the https://github.com/datafusion-contrib org and so can add / remove people from the project as necessary. Just let me know

franklsf95 · 2024-09-18T11:12:42Z

Really excited to see this happening! I contributed some code to Ray SQL last year (most notably, making distributed shuffle work using Ray's distributed object store), and can help answer any question regarding Ray (I work with people who build Ray, Ray Data, etc.)

On a high level, by building on top of Ray, you get a distributed execution substrate for free. Ray handles managing the cluster, scheduling tasks, managing distributed memory, fault tolerance, to name a few. This would mean basically to use DataFusion as a single-node query execution engine and build all the distributed stuff in Python on top of Ray. If this is in line with the goal of this project (or the DataFusion project), then I think it would be a good way to go.

andygrove · 2024-09-18T13:57:46Z

I've been thinking some more about where this code should live.

I am now leaning towards putting the code into a new standalone datafusion-ray repository.

My reasoning for this:

We have only just gotten to the point where were have multiple people regularly maintaining the datafusion-python project and I don't think that we should put an additional burden on these maintainers
As @ozankabak mentioned, perhaps it is better to keep datafusion-python as a single node thing focusing on excellent bindings to core and then have datafusion-ray depend on datafusion-python. This will be an excellent way to ensure that datafusion-python has sufficient extension points to enable distributed use cases.
If the project fails to gain traction, then we haven't polluted the datafusion-python repo with unmaintained code

timsaucer · 2024-09-18T14:07:38Z

I have been thinking a lot about this and how it would fit into or use datafusion-python. This is the third case I've come across of needing some kind of way to extend datafusion-python. The other two are delta-rs and a personal project where I could really benefit from adding a custom ExecutionPlan. I keep going back and forth between two ideas

Make some kind of requirement that the version dependency is exactly the same and compiler versions are identical. I think we can do this for datafusion-ray but other projects like delta-rs don't want such a requirement.
Create a FFI interface for extending datafusion so we can use that as a safe intermediary between the projects. I have a couple of minimal examples of doing this in a branch, but as you start getting into anything that touches the SessionContext the requirements for what is included in the FFI can explode.

Regarding the question of where it should sit, I would recommend a separate repo datafusion-ray which will force the issue on making datafusion-python more extensible.

austin362667 · 2024-09-19T11:13:03Z

Seems like everyone is on the same page, preferring DataFusion + Ray over Ballista, we just haven't decide how to do that (so many options).

Based on @andygrove's reasonings, sounds a good idea starting a separate datafusion-ray project under Apache DataFusion, aimed at replacing Ballista. This will keep things clean and avoid mixing datafusion-python with datafusion-ray, preventing a messy, hodgepodge integration. Look forward to it!

Following this way, this proposal is not relevant in datafusion-python anymore. What's the next actionable items? Would love to help~

austin362667 added the enhancement New feature or request label Sep 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to Introduce Ray SQL into DataFusion Python #872

Proposal to Introduce Ray SQL into DataFusion Python #872

austin362667 commented Sep 15, 2024

andygrove commented Sep 15, 2024

andygrove commented Sep 15, 2024

jordandakota commented Sep 15, 2024 •

edited

Loading

edmondop commented Sep 15, 2024

andygrove commented Sep 15, 2024

andygrove commented Sep 15, 2024

ozankabak commented Sep 16, 2024

austin362667 commented Sep 16, 2024

ozankabak commented Sep 16, 2024

andygrove commented Sep 16, 2024

vakarisbk commented Sep 16, 2024

alamb commented Sep 17, 2024

jordandakota commented Sep 17, 2024 via email

andygrove commented Sep 17, 2024

andygrove commented Sep 17, 2024

alamb commented Sep 17, 2024

alamb commented Sep 17, 2024

franklsf95 commented Sep 18, 2024

andygrove commented Sep 18, 2024

timsaucer commented Sep 18, 2024

austin362667 commented Sep 19, 2024 •

edited

Loading

Proposal to Introduce Ray SQL into DataFusion Python #872

Proposal to Introduce Ray SQL into DataFusion Python #872

Comments

austin362667 commented Sep 15, 2024

andygrove commented Sep 15, 2024

andygrove commented Sep 15, 2024

jordandakota commented Sep 15, 2024 • edited Loading

edmondop commented Sep 15, 2024

andygrove commented Sep 15, 2024

andygrove commented Sep 15, 2024

ozankabak commented Sep 16, 2024

austin362667 commented Sep 16, 2024

ozankabak commented Sep 16, 2024

andygrove commented Sep 16, 2024

vakarisbk commented Sep 16, 2024

alamb commented Sep 17, 2024

jordandakota commented Sep 17, 2024 via email

andygrove commented Sep 17, 2024

andygrove commented Sep 17, 2024

alamb commented Sep 17, 2024

alamb commented Sep 17, 2024

franklsf95 commented Sep 18, 2024

andygrove commented Sep 18, 2024

timsaucer commented Sep 18, 2024

austin362667 commented Sep 19, 2024 • edited Loading

jordandakota commented Sep 15, 2024 •

edited

Loading

austin362667 commented Sep 19, 2024 •

edited

Loading