From 133128840ca3dbea200dcfe84050cb7b82bf94a8 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Tue, 16 Jul 2024 07:19:25 -0400
Subject: [PATCH] Docs: Document creating new extension APIs (#11425)

* Docs: Document creating new extension APIs

* fix

* Add clarification about extension APIs. Thanks @ozankabak

* Apply suggestions from code review

Co-authored-by: Mehmet Ozan Kabak

* Add a paragraph on datafusion-contrib

* prettier

---------

Co-authored-by: Mehmet Ozan Kabak
---
 datafusion/core/src/lib.rs                    |  2 +-
 docs/source/contributor-guide/architecture.md | 74 +++++++++++++++++++
 2 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/datafusion/core/src/lib.rs b/datafusion/core/src/lib.rs
index 63dbe824c231..81c1c4629a3a 100644
--- a/datafusion/core/src/lib.rs
+++ b/datafusion/core/src/lib.rs
@@ -174,7 +174,7 @@
 //!
 //! DataFusion is designed to be highly extensible, so you can
 //! start with a working, full featured engine, and then
-//! specialize any behavior for their usecase. For example,
+//! specialize any behavior for your usecase. For example,
 //! some projects may add custom [`ExecutionPlan`] operators, or create their own
 //! query language that directly creates [`LogicalPlan`] rather than using the
 //! built in SQL planner, [`SqlToRel`].
diff --git a/docs/source/contributor-guide/architecture.md b/docs/source/contributor-guide/architecture.md
index 68541f877768..55c8a1d980df 100644
--- a/docs/source/contributor-guide/architecture.md
+++ b/docs/source/contributor-guide/architecture.md
@@ -25,3 +25,77 @@ possible. You can find the most up to date version in the [source code].
 
 [crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
 [source code]: https://github.com/apache/datafusion/blob/main/datafusion/core/src/lib.rs
+
+## Forks vs Extension APIs
+
+DataFusion is a fast moving project, which results in frequent internal changes.
+This benefits DataFusion by allowing it to evolve and respond quickly to
+requests, but also means that maintaining a fork with major modifications
+sometimes requires nontrivial work.
+
+The public API (what is accessible if you use the DataFusion releases from
+crates.io) is typically much more stable (though it does change from release to
+release as well).
+
+Thus, rather than forks, we recommend using one of the many extension APIs (such
+as `TableProvider`, `OptimizerRule`, or `ExecutionPlan`) to customize
+DataFusion. If you cannot do what you want with the existing APIs, we would
+welcome you working with us to add new APIs to enable your use case, as
+described in the next section.
+
+## `datafusion-contrib`
+
+While DataFusion comes with enough features "out of the box" to quickly start
+with a working system, it can't include every useful feature (e.g.
+`TableProvider`s for all data formats). The [`datafusion-contrib`] project
+contains a collection of community maintained extensions that are not part of
+the core DataFusion project, and not under Apache Software Foundation governance,
+but may be useful to others in the community. If you are interested in adding a
+feature to DataFusion, a new extension in `datafusion-contrib` is likely a good
+place to start. Please [contact] us via GitHub issue, Slack, or Discord and
+we'll gladly set up a new repository for your extension.
+
+[`datafusion-contrib`]: https://github.com/datafusion-contrib
+[contact]: ../contributor-guide/communication.md
+
+## Creating new Extension APIs
+
+DataFusion aims to be a general-purpose query engine, and thus the core crates
+contain features that are useful for a wide range of use cases. Use case specific
+functionality (such as very specific time series or stream processing features)
+is typically implemented using the extension APIs.
+
+If you have a use case that is not covered by the existing APIs, we would love to
+work with you to design a new general purpose API. There are often others who are
+interested in similar extensions, and the act of defining the API often improves
+the code overall for everyone.
+
+Extension APIs that provide "safe" default behaviors are more likely to be
+suitable for inclusion in DataFusion, while APIs that require major changes to
+built-in operators are less likely. For example, it might make less sense
+to add an API to support a stream processing feature if that would result in
+slower performance for built-in operators. It may still make sense to add
+extension APIs for such features, but leave implementation of such operators in
+downstream projects.
+
+The process to create a new extension API is typically:
+
+- Look for an existing issue describing what you want to do, and file one if it
+  doesn't yet exist.
+- Discuss what the API would look like. Feel free to ask contributors (via `@`
+  mentions) for feedback (you can find such people by looking at the most
+  recently changed PRs and issues).
+- Prototype the new API, typically by adding an example (in
+  `datafusion-examples` or refactoring existing code) to show how it would work.
+- Create a PR with the new API, and work with the community to get it merged.
+
+Some benefits of using an example based approach are:
+
+- Any future API changes will also keep your example working, ensuring no
+  regression in functionality
+- There will be a blueprint of any needed changes to your code if the APIs do change
+  (just look at what changed in your example)
+
+An example of this process was [creating a SQL Extension Planning API].
+
+[creating a sql extension planning api]: https://github.com/apache/datafusion/issues/11207