Commit
Docs: Update roadmap to point at EPIC's, clarify project goals (#6639)
* Docs: Update Roadmap to point at github epics, update project goals
* improvements
* update
* Fix doc link
Showing 3 changed files with 71 additions and 116 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,100 +19,27 @@ under the License. | |
|
||
# Roadmap | ||
|
||
This document describes high level goals of the DataFusion and | ||
Ballista development community. It is not meant to restrict | ||
possibilities, but rather help newcomers understand the broader | ||
context of where the community is headed, and inspire | ||
additional contributions. | ||
|
||
DataFusion and Ballista are part of the [Apache | ||
Arrow](https://arrow.apache.org/) project and governed by the Apache | ||
Software Foundation governance model. These projects are entirely | ||
driven by volunteers, and we welcome contributions for items not on | ||
this roadmap. However, before submitting a large PR, we strongly | ||
suggest you start a conversation using a github issue or the | ||
[email protected] mailing list to make review efficient and avoid | ||
surprises. | ||
|
||
## DataFusion | ||
|
||
DataFusion's goal is to become the embedded query engine of choice | ||
for new analytic applications, by leveraging the unique features of | ||
[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/) | ||
to provide: | ||
|
||
1. Best-in-class single node query performance | ||
2. A Declarative SQL query interface compatible with PostgreSQL | ||
3. A Dataframe API, similar to those offered by Pandas and Spark | ||
4. A Procedural API for programmatically creating and running execution plans | ||
5. High performance, data race free, ergonomic extensibility points at every layer | |
|
||
### Additional SQL Language Features | ||
|
||
- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122) | ||
- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/main/README.md#status) | ||
- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194) | ||
- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533) | ||
- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119) | ||
- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details) | ||
|
||
### Query Optimizer | ||
|
||
- More sophisticated cost based optimizer for join ordering | ||
- Implement advanced query optimization framework (Tokomak) [#440](https://github.com/apache/arrow-datafusion/issues/440) | ||
- Finer optimizations for group by and aggregate functions | ||
|
||
### Datasources | ||
|
||
- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060) | ||
- Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...) | ||
|
||
### Runtime / Infrastructure | ||
|
||
- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details) | ||
- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147) | ||
- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348) | ||
|
||
### Resource Management | ||
|
||
- Finer grain control and limit of runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64) | ||
|
||
### Python Interface | ||
|
||
TBD | ||
|
||
### DataFusion CLI (`datafusion-cli`) | ||
|
||
Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770). | ||
|
||
- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly | ||
- Connect to the `Statistics` subsystem and have the cli print out more stats for query debugging, etc. | ||
- Improved error handling for interactive use and shell scripting usage | ||
- publishing to apt, brew, and possible NuGet registry so that people can use it more easily | ||
- adopt a shorter name, like dfcli? | ||
|
||
## Ballista | ||
|
||
Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that | ||
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors | ||
in the cluster. | ||
|
||
Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed | ||
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they | ||
remain language-agnostic so that executors can be built in languages other than Rust. | ||
|
||
### Ballista Roadmap | ||
|
||
### Move query scheduler into DataFusion | ||
|
||
The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute | ||
the entire query at once but breaks it down into a directed acyclic graph (DAG) of stages and executes a | |
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to | ||
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster. | ||
|
||
### Implement execution-time cost-based optimizations based on statistics | ||
|
||
After the execution of a query stage, accurate statistics are available for the resulting data. These statistics | ||
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join | ||
it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will | ||
be smaller until execution time. | ||
The [project introduction](../user-guide/introduction) explains the | ||
overview and goals of DataFusion, and our development efforts largely | ||
align to that vision. | ||
|
||
## Planning `EPIC`s | ||
|
||
DataFusion uses [GitHub | ||
issues](https://github.com/apache/arrow-datafusion/issues) to track | ||
planned work. We collect related tickets using tracking issues labeled | ||
with `[EPIC]` which contain discussion and links to more detailed items. | ||
|
||
Epics offer a high level roadmap of what the DataFusion | ||
community is thinking about. The epics are not meant to restrict | ||
possibilities, but rather help the community see where development is | ||
headed, align our work, and inspire additional contributions. | ||
|
||
As this project is entirely driven by volunteers, we welcome | ||
contributions for items not currently covered by epics. However, | ||
before submitting a large PR, we strongly suggest and request you | ||
start a conversation using a github issue or the | ||
[[email protected]](mailto:[email protected]) mailing list to | ||
make review efficient and avoid surprises. | ||
|
||
[The current list of `EPIC`s can be found here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic). |