Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to Introduce Ray SQL into DataFusion Python #872

Open
austin362667 opened this issue Sep 15, 2024 · 21 comments
Open

Proposal to Introduce Ray SQL into DataFusion Python #872

austin362667 opened this issue Sep 15, 2024 · 21 comments
Labels
enhancement New feature or request

Comments

@austin362667
Copy link

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.

Ray SQL, with only 1.7k lines of Rust, is significantly simpler compared to Ballista’s 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, showcasing its robustness and simplicity. This reduction in complexity could ease maintenance, lower the learning curve for contributors, and provide a functional distributed SQL solution that is much simpler than Ballista because Ray provides so much – there is no need for us to build scheduler and executor processes – we can simply execute Python tasks in the Ray cluster.

If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.

Describe the solution you'd like

Following options are suggested by @andygrove , all feedbacks are welcome!

Bringing the Ray SQL prototype into the DataFusion Python project

Describe alternatives you've considered

Building a new version inspired by the Ray SQL code

Additional context

We might need to go through the Apache IP clearance process for importing the external Ray SQL codebase in datafusion-contrib which is not part of Apache.

@austin362667 austin362667 added the enhancement New feature or request label Sep 15, 2024
@andygrove
Copy link
Member

Ray SQL project: https://github.com/datafusion-contrib/ray-sql

@andygrove
Copy link
Member

@franklsf95 fyi

@jordandakota
Copy link

jordandakota commented Sep 15, 2024

Things I'm thinking of that should be showcased or discussed:

  • Showcase complexity/simplicity
    • Setup and testing of ballista on tpc-h
    • Setup and testing of raysql on tpc-h
  • Benchmark differences ballista/raysql
  • What do we gain/lose by moving towards adopting raysql
  • What does raysql provide that Ballista currently does, what would need to be added/maintained by the integration
    • Need to assess covers all Ballista functionalities and if not document gaps
    • Identify effort requires to close any gaps
  • Discuss learning curve differences
    • At present it's quite complex setting up and running a Ballista cluster while you can start a Ray cluster in a few lines of code reliably
    • Potential to attract contributors due to ease of use

@edmondop
Copy link

Is the complexity of Ballista coming from cluster management or query planning?
If Ray SQL can achieve similar or better performance than Ballista with only 1.7k loc by delegating cluster management to Ray and only keeping the distributed query planning responsibility, that's pretty amazing !

@andygrove
Copy link
Member

Is the complexity of Ballista coming from cluster management or query planning?

Ballista and Ray SQL use the same query planning code. The complexity comes from cluster management/communication.

@andygrove
Copy link
Member

  • Setup and testing of ballista on tpc-h

Since upgrading to a more recent DataFusion version, Ballista is no longer capable of running the TPC-H suite. It works fine in the 0.12.0 release.

@ozankabak
Copy link

I am curious about the right boundary here. If we go through with this, should the code become a part of datafusion-python, or should the former stay as a single node thing focusing on excellent bindings to core, and we should have a "datafusion-ray" that uses datafusion-python as a library?

@austin362667
Copy link
Author

Thanks @ozankabak I'm curious on the boundary, too.
One of the goals in Ray SQL is "Drive requirements for DataFusion's Python bindings". So my guess is that if we wanna use datafusion-python as a library for separate datafusion-ray, we might need to ensure APIs of datafusion-ray is a subset of datafusion-python APIs. Otherwise it'll not be able to easily expose complete bindings to python side via PyO3.

@ozankabak
Copy link

Indeed. Do you have an idea of how the subset would look like?

@andygrove
Copy link
Member

To expand on this, there are several options for adopting the Ray SQL code as part of the DataFusion project.

  1. Create new repo datafusion-ray-sql (possibly renamed to datafusion-ray or something else)
  2. Replace the current Ballista code with Ray SQL (Ballista is already known as the distributed version of DataFusion)
  3. Add as an optional module of the DataFusion Python bindings

@vakarisbk
Copy link

100% yes. If Ray SQL can handle all TPC-H queries with just 1.7k lines of code, it’s sounds like a no-brainer. This actually makes a production-ready distributed DataFusion sound achievable and it should attract more contributors.
Can't think of a single reason not to go this route.

@alamb
Copy link

alamb commented Sep 17, 2024

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

@jordandakota
Copy link

jordandakota commented Sep 17, 2024 via email

@andygrove
Copy link
Member

Here is the PR in ray-sql that added support for distributed shuffle using the Ray object store:

datafusion-contrib/ray-sql#33

@andygrove
Copy link
Member

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

That's a great point. If the project were to become part of the Apache DataFusion project then I would certainly put time into maintaining it and helping build community around the project. I am not able to contribute in its current location.

I have recently been attempting to maintain Ballista by upgrading to more recent versions of DataFusion, but the project is large and complex and the original contributors of much of this code are no longer available to help, so it is challenging.

I believe that DataFusion + Ray is an opportunity to start fresh on a solution for distributed DataFusion as a much lighter weight alternative to Ballista and the project is small enough (~40 commits) that it will be easier for new contributors to follow along.

These are the initial tasks that I would plan on working on (with the community, hopefully) if we were to move forward with this proposal.

  • Make sure that the current code still runs with recent versions of Python + Ray
  • Update README to reflect that distributed execution is now supported
  • Run fresh benchmarks and compare to Ballista
  • Fix the outstanding bug Example produces incorrect results datafusion-contrib/ray-sql#44
  • Upgrade to more recent versions of DataFusion (one PR per version)

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

@alamb
Copy link

alamb commented Sep 17, 2024

Given that I am +1 on this proposal 🚀

@alamb
Copy link

alamb commented Sep 17, 2024

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

I think I have admin rights to the https://github.com/datafusion-contrib org and so can add / remove people from the project as necessary. Just let me know

@franklsf95
Copy link

Really excited to see this happening! I contributed some code to Ray SQL last year (most notably, making distributed shuffle work using Ray's distributed object store), and can help answer any question regarding Ray (I work with people who build Ray, Ray Data, etc.)

On a high level, by building on top of Ray, you get a distributed execution substrate for free. Ray handles managing the cluster, scheduling tasks, managing distributed memory, fault tolerance, to name a few. This would mean basically to use DataFusion as a single-node query execution engine and build all the distributed stuff in Python on top of Ray. If this is in line with the goal of this project (or the DataFusion project), then I think it would be a good way to go.

@andygrove
Copy link
Member

I've been thinking some more about where this code should live.

I am now leaning towards putting the code into a new standalone datafusion-ray repository.

My reasoning for this:

  • We have only just gotten to the point where were have multiple people regularly maintaining the datafusion-python project and I don't think that we should put an additional burden on these maintainers
  • As @ozankabak mentioned, perhaps it is better to keep datafusion-python as a single node thing focusing on excellent bindings to core and then have datafusion-ray depend on datafusion-python. This will be an excellent way to ensure that datafusion-python has sufficient extension points to enable distributed use cases.
  • If the project fails to gain traction, then we haven't polluted the datafusion-python repo with unmaintained code

@timsaucer
Copy link
Contributor

I have been thinking a lot about this and how it would fit into or use datafusion-python. This is the third case I've come across of needing some kind of way to extend datafusion-python. The other two are delta-rs and a personal project where I could really benefit from adding a custom ExecutionPlan. I keep going back and forth between two ideas

  • Make some kind of requirement that the version dependency is exactly the same and compiler versions are identical. I think we can do this for datafusion-ray but other projects like delta-rs don't want such a requirement.
  • Create a FFI interface for extending datafusion so we can use that as a safe intermediary between the projects. I have a couple of minimal examples of doing this in a branch, but as you start getting into anything that touches the SessionContext the requirements for what is included in the FFI can explode.

Regarding the question of where it should sit, I would recommend a separate repo datafusion-ray which will force the issue on making datafusion-python more extensible.

@austin362667
Copy link
Author

austin362667 commented Sep 19, 2024

Seems like everyone is on the same page, preferring DataFusion + Ray over Ballista, we just haven't decide how to do that (so many options).

Based on @andygrove's reasonings, sounds a good idea starting a separate datafusion-ray project under Apache DataFusion, aimed at replacing Ballista. This will keep things clean and avoid mixing datafusion-python with datafusion-ray, preventing a messy, hodgepodge integration. Look forward to it!

Following this way, this proposal is not relevant in datafusion-python anymore. What's the next actionable items? Would love to help~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

9 participants