Docs: Update roadmap to point at EPIC's, clarify project goals #6639

Merged
merged 5 commits into from
Jun 15, 2023
8 changes: 7 additions & 1 deletion datafusion/core/src/lib.rs
@@ -132,7 +132,13 @@
//!
//! ## Customization and Extension
//!
//! DataFusion supports extension at many points:
//! DataFusion is designed to be a "disaggregated" query engine. This
Contributor Author

This is trying to address @boazberman 's comments in #6441 (comment)

//! means that developers can mix and extend the parts of DataFusion
//! they need for their use case. For example, just the
//! [`ExecutionPlan`] operators, or the [`SqlToRel`] SQL planner and
//! optimizer.
//!
//! In order to achieve this, DataFusion supports extension at many points:
//!
//! * read from any datasource ([`TableProvider`])
//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
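
To make the extension-point idea above concrete, here is a minimal, std-only sketch of trait-based extension. It is not DataFusion's API: `ToyTableProvider`, `InMemoryTable`, and `ToyCatalog` are hypothetical names invented for illustration, and the real [`TableProvider`] and [`CatalogProvider`] traits are considerably richer (asynchronous, Arrow-based, and streaming).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an extension trait.
/// This is NOT DataFusion's real `TableProvider` trait.
trait ToyTableProvider {
    /// Column names exposed by this source.
    fn schema(&self) -> Vec<String>;
    /// Return every row; a real engine would stream batches instead.
    fn scan(&self) -> Vec<Vec<i64>>;
}

/// A user-defined data source backed by an in-memory vector.
struct InMemoryTable {
    columns: Vec<String>,
    rows: Vec<Vec<i64>>,
}

impl ToyTableProvider for InMemoryTable {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }

    fn scan(&self) -> Vec<Vec<i64>> {
        self.rows.clone()
    }
}

/// Toy catalog mapping table names to providers (hypothetical; it only
/// loosely mirrors the role a catalog plays in a query engine).
struct ToyCatalog {
    tables: HashMap<String, Box<dyn ToyTableProvider>>,
}

impl ToyCatalog {
    fn new() -> Self {
        Self { tables: HashMap::new() }
    }

    fn register(&mut self, name: &str, table: Box<dyn ToyTableProvider>) {
        self.tables.insert(name.to_string(), table);
    }

    /// Sum one column of a registered table without knowing how the
    /// table is stored. Assumes every row has one value per column.
    fn sum_column(&self, table: &str, column: &str) -> Option<i64> {
        let provider = self.tables.get(table)?;
        let idx = provider.schema().iter().position(|c| c == column)?;
        Some(provider.scan().iter().map(|row| row[idx]).sum())
    }
}
```

The "engine" code in `sum_column` depends only on the trait, so any storage backend that implements it can be registered without changing the engine.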
121 changes: 24 additions & 97 deletions docs/source/contributor-guide/roadmap.md
@@ -19,100 +19,27 @@ under the License.

# Roadmap

This document describes high level goals of the DataFusion and
Ballista development community. It is not meant to restrict
possibilities, but rather help newcomers understand the broader
context of where the community is headed, and inspire
additional contributions.

DataFusion and Ballista are part of the [Apache
Arrow](https://arrow.apache.org/) project and governed by the Apache
Software Foundation governance model. These projects are entirely
driven by volunteers, and we welcome contributions for items not on
this roadmap. However, before submitting a large PR, we strongly
suggest you start a conversation using a github issue or the
[email protected] mailing list to make review efficient and avoid
surprises.

## DataFusion

DataFusion's goal is to become the embedded query engine of choice
for new analytic applications, by leveraging the unique features of
[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
to provide:

1. Best-in-class single node query performance
Contributor Author

These goals are largely redundant with the introduction, so I figured it would be better to leave a link and direct people back there rather than partially replicate the content

2. A Declarative SQL query interface compatible with PostgreSQL
3. A Dataframe API, similar to those offered by Pandas and Spark
4. A Procedural API for programmatically creating and running execution plans
5. High performance, data race free, ergonomic extensibility points at every layer

### Additional SQL Language Features

- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/main/README.md#status)
- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194)
- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533)
- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)

### Query Optimizer

- More sophisticated cost based optimizer for join ordering
- Implement advanced query optimization framework (Tokomak) [#440](https://github.com/apache/arrow-datafusion/issues/440)
- Finer optimizations for group by and aggregate functions

### Datasources

- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
- Improve performance of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)

### Runtime / Infrastructure

- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)

### Resource Management

- Finer grain control and limit of runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64)

### Python Interface

TBD

### DataFusion CLI (`datafusion-cli`)

Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).

- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
- Connect to the `Statistics` subsystem and have the cli print out more stats for query debugging, etc.
- Improved error handling for interactive use and shell scripting usage
- publishing to apt, brew, and possible NuGet registry so that people can use it more easily
- adopt a shorter name, like dfcli?

## Ballista

Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
in the cluster.

Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
remain language-agnostic so that executors can be built in languages other than Rust.

### Ballista Roadmap

### Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
the entire query at once but breaks it down into a directed acyclic graph (DAG) of stages and executes a
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
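
As a rough illustration of the staged execution described above, the following std-only sketch groups stages of a DAG into "waves" of bounded size. It is an assumption-laden simplification, not Ballista's scheduler: `Stage` and `plan_waves` are hypothetical names, and a real scheduler would likely start a stage as soon as a slot frees up rather than waiting for a whole wave to finish.

```rust
use std::collections::HashSet;

/// Hypothetical stage description: an id plus the ids of the stages
/// whose output it consumes.
struct Stage {
    id: usize,
    depends_on: Vec<usize>,
}

/// Group stages into waves. Each wave holds at most `max_concurrent`
/// stages, all of whose dependencies finished in earlier waves.
/// Returns `None` if no progress can be made (a cycle, an unknown
/// dependency, or `max_concurrent == 0`).
fn plan_waves(stages: &[Stage], max_concurrent: usize) -> Option<Vec<Vec<usize>>> {
    if max_concurrent == 0 {
        return None;
    }
    let mut done: HashSet<usize> = HashSet::new();
    let mut waves = Vec::new();
    while done.len() < stages.len() {
        // A stage is ready once every stage it depends on has completed.
        let wave: Vec<usize> = stages
            .iter()
            .filter(|s| !done.contains(&s.id))
            .filter(|s| s.depends_on.iter().all(|d| done.contains(d)))
            .map(|s| s.id)
            .take(max_concurrent)
            .collect();
        if wave.is_empty() {
            return None;
        }
        done.extend(wave.iter().copied());
        waves.push(wave);
    }
    Some(waves)
}
```

Because the concurrency limit is just a parameter, the same planning logic could in principle be driven by a core count in-process or by the number of executors in a cluster.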

### Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
be smaller until execution time.
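
The hash join example above can be sketched as follows. This is a minimal illustration under stated assumptions: `StageStats`, `BuildSide`, and `choose_build_side` are hypothetical names, and the only statistic consulted is an exact byte count reported once each input stage has finished.

```rust
/// Which join input is loaded into the in-memory hash table.
#[derive(Debug, PartialEq)]
enum BuildSide {
    Left,
    Right,
}

/// Hypothetical statistics collected after a query stage completes.
struct StageStats {
    total_bytes: u64,
}

/// Pick the smaller input as the build side, using statistics that are
/// only known after the input stages have executed. Ties go to the left.
fn choose_build_side(left: &StageStats, right: &StageStats) -> BuildSide {
    if left.total_bytes <= right.total_bytes {
        BuildSide::Left
    } else {
        BuildSide::Right
    }
}
```

A planner working from estimates alone might guess wrong here; deferring the choice until both inputs have been materialized removes the guesswork at the cost of a scheduling barrier.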
The [project introduction](../user-guide/introduction) explains the
overview and goals of DataFusion, and our development efforts largely
align to that vision.

## Planning `EPIC`s

DataFusion uses [GitHub
Contributor Author

I began this PR by trying to summarize the outstanding work and to do so I looked at the EPICs -- pretty soon I found that I was just replicating https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic in a markdown document that would end up out of date

While a more free form version of the roadmap in text (rather than a github issue list) is probably easier to consume, unless we have a volunteer to commit to doing, keeping our efforts focused on keeping github updated seemed better.

issues](https://github.com/apache/arrow-datafusion/issues) to track
planned work. We collect related tickets using tracking issues labeled
with `[EPIC]` which contain discussion and links to more detailed items.

Epics offer a high level roadmap of what the DataFusion
community is thinking about. The epics are not meant to restrict
possibilities, but rather help the community see where development is
headed, align our work, and inspire additional contributions.

As this project is entirely driven by volunteers, we welcome
contributions for items not currently covered by epics. However,
before submitting a large PR, we strongly suggest and request you
start a conversation using a github issue or the
[[email protected]](mailto:[email protected]) mailing list to
make review efficient and avoid surprises.

[The current list of `EPIC`s can be found here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic).
58 changes: 40 additions & 18 deletions docs/source/user-guide/introduction.md
@@ -22,8 +22,20 @@
DataFusion is a very fast, extensible query engine for building
high-quality data-centric systems in [Rust](http://rustlang.org),
using the [Apache Arrow](https://arrow.apache.org) in-memory format.
DataFusion is part of the [Apache Arrow](https://arrow.apache.org/)
project.

DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, [python bindings], extensive customization, a great community, and more.

[python bindings]: https://github.com/apache/arrow-datafusion-python

## Project Goals

DataFusion aims to be the query engine of choice for new, fast
data centric systems such as databases, dataframe libraries, machine
learning and streaming applications by leveraging the unique features
of [Rust](https://www.rust-lang.org/) and [Apache
Arrow](https://arrow.apache.org/).

## Features

@@ -34,37 +46,47 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
`ObjectStore` trait.
Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
`ObjectStore` trait).
- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
[welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
automatic join reordering, expression coercion, and more.
- Permissive Apache 2.0 License, Apache Software Foundation governance
- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
productivity similar to Java or Golang, the performance of C++, and
[loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
with other projects, and to pass plans across language boundaries.
- A state of the art query optimizer with expression coercion and
simplification, projection and filter pushdown, sort and distribution
aware optimizations, automatic join reordering, and more.
- Permissive Apache 2.0 License, predictable and well understood
[Apache Software Foundation](https://www.apache.org/) governance.
- Implementation in [Rust](https://www.rust-lang.org/), a modern
system language with development productivity similar to Java or
Golang, the performance of C++, and [loved by programmers
everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
- Support for [Substrait](https://substrait.io/) query plans, to
easily pass plans across language and system boundaries.

## Use Cases

DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
building new systems. Here are some examples of systems built using DataFusion:
building new systems.

While most current usecases are "analytic" or (throughput) some
Contributor Author

This is trying to channel @avantgardnerio 's suggestion on #6441 (comment) though I am not sure how faithfully I have done so

Contributor

I'm not sure I could say it any better.

Contributor

I don't think you need the or after "analytic" though?

Contributor Author

Suggested change
While most current usecases are "analytic" or (throughput) some
While most current usecases are "analytic" (throughput) some

Nice catch -- 🦅 👁️

components of DataFusion such as the plan representations, are
suitable for "streaming" and "transaction" style systems (low
latency).

Here are some example systems built using DataFusion:

- Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark-like systems such as [Ballista].
- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
- Research platform for new Database Systems, such as [Flock]
- SQL support to another library, such as [dask sql]
- Streaming data platforms such as [Synnada]
- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
- A faster Spark runtime replacement [Blaze]
- Native Spark runtime replacement such as [Blaze]

By using DataFusion, the projects are freed to focus on their specific
By using DataFusion, projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
execution plans, file format support, etc.
parallelized streaming execution plans, file format support, etc.

## Known Users

@@ -119,7 +141,7 @@ Here are some of the projects known to use DataFusion:
## Integrations and Extensions

There are a number of community projects that extend DataFusion or
provide integrations with other systems.
provide integrations with other systems, some of which are described below:

### Language Bindings

@@ -137,5 +159,5 @@ provide integrations with other systems.

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
- _Easy to Embed_: Allowing extension at almost any point in its design, and published regularly as a crate on [crates.io](http://crates.io), DataFusion can be integrated and tailored for your specific use case.
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.