diff --git a/docs/source/contributor-guide/getting_started.md b/docs/source/contributor-guide/getting_started.md
new file mode 100644
index 000000000000..64d5a0d43d5d
--- /dev/null
+++ b/docs/source/contributor-guide/getting_started.md
@@ -0,0 +1,87 @@
+
+
+# Getting Started
+
+This section describes how to get started developing DataFusion.
+
+## Windows setup
+
+```shell
+wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip
+choco install -y git rustup.install visualcpp-build-tools
+git-bash.exe
+cargo build
+```
+
+## Protoc Installation
+
+Compiling DataFusion from source requires an installed version of the protobuf compiler, `protoc`.
+
+On most platforms this can be installed from your system's package manager:
+
+```
+# Ubuntu
+$ sudo apt install -y protobuf-compiler
+
+# Fedora
+$ dnf install -y protobuf-devel
+
+# Arch Linux
+$ pacman -S protobuf
+
+# macOS
+$ brew install protobuf
+```
+
+You will want to verify that the installed version is `3.12` or greater, which introduced support for explicit [field presence](https://github.com/protocolbuffers/protobuf/blob/v3.12.0/docs/field_presence.md). Older versions may fail to compile.
+
+```shell
+$ protoc --version
+libprotoc 3.12.4
+```
+
+Alternatively, a binary release can be downloaded from the [Release Page](https://github.com/protocolbuffers/protobuf/releases) or [built from source](https://github.com/protocolbuffers/protobuf/blob/main/src/README.md).
+
+## Bootstrap environment
+
+DataFusion is written in Rust and uses a standard Rust toolkit:
+
+- `cargo build`
+- `cargo fmt` to format the code
+- `cargo test` to test
+- etc.
+
+Note that running `cargo test` requires significant memory resources, because cargo runs many tests in parallel by default. If you run into issues with slow tests or system lockups, you can significantly reduce the memory required by instead running `cargo test -- --test-threads=1`. For more information see [this issue](https://github.com/apache/datafusion/issues/5347).
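+
+For example, a minimal first build-and-test session might look like the following (a sketch, assuming the repository is already cloned and a stable Rust toolchain is installed):
+
+```shell
+# Fetch the test data submodules (see the testing setup below)
+git submodule init
+git submodule update
+
+# Build, then run the test suite single-threaded to reduce peak memory usage
+cargo build
+cargo test -- --test-threads=1
+```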
+
+Testing setup:
+
+- `rustup update stable`: DataFusion uses the latest stable release of Rust
+- `git submodule init`
+- `git submodule update`
+
+Formatting instructions:
+
+- [ci/scripts/rust_fmt.sh](../../../ci/scripts/rust_fmt.sh)
+- [ci/scripts/rust_clippy.sh](../../../ci/scripts/rust_clippy.sh)
+- [ci/scripts/rust_toml_fmt.sh](../../../ci/scripts/rust_toml_fmt.sh)
+
+or run them all at once:
+
+- [dev/rust_lint.sh](../../../dev/rust_lint.sh)
diff --git a/docs/source/contributor-guide/howtos.md b/docs/source/contributor-guide/howtos.md
new file mode 100644
index 000000000000..254b1de6521e
--- /dev/null
+++ b/docs/source/contributor-guide/howtos.md
@@ -0,0 +1,129 @@
+
+
+# HOWTOs
+
+## How to add a new scalar function
+
+Below is a checklist of what you need to do to add a new scalar function to DataFusion:
+
+- Add the actual implementation of the function to a new module file within:
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions-array) for array functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/crypto) for crypto functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/datetime) for datetime functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/encoding) for encoding functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/math) for math functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/regex) for regex functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/string) for string functions
+  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/unicode) for unicode functions
+  - create a new module [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/) for other functions.
+- New function modules, for example a `vector` module, should use a [Rust feature](https://doc.rust-lang.org/cargo/reference/features.html) (for example `vector_expressions`) to allow DataFusion
+  users to enable or disable the new module as desired.
+- Implement the `ScalarUDFImpl` trait for the function struct.
+  - See [advanced_udf.rs] for an example implementation
+  - Add tests for the new function
+- To connect the implementation of the function, add to the `mod.rs` file:
+  - a `mod xyz;` where `xyz` is the new module file
+  - a call to `make_udf_function!(..);`
+  - an item in `export_functions!(..);`
+- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well-known data and returns the expected result.
+  - Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
+- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md)
+
+[advanced_udf.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs
+[sqllogictest/test_files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
+
+## How to add a new aggregate function
+
+Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
+
+- Add the actual implementation of an `Accumulator` and `AggregateExpr`.
+- In [datafusion/expr/src](../../../datafusion/expr/src/aggregate_function.rs), add:
+  - a new variant to `AggregateFunction`
+  - a new entry to `FromStr` with the name of the function as called by SQL
+  - a new line in `return_type` with the expected return type of the function, given an incoming type
+  - a new line in `signature` with the signature of the function (number and types of its arguments)
+  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
+  - tests for the function.
+- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well-known data and returns the expected result.
+  - Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
+- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md)
+
+## How to display plans graphically
+
+The query plans represented by `LogicalPlan` nodes can be graphically
+rendered using [Graphviz](https://www.graphviz.org/).
+
+To do so, save the output of the `display_graphviz` function to a file:
+
+```rust
+use std::fs::File;
+use std::io::Write;
+
+// Create plan somehow...
+let mut output = File::create("/tmp/plan.dot")?;
+write!(output, "{}", plan.display_graphviz())?;
+```
+
+Then, use the `dot` command line tool to render it into a file that
+can be displayed. For example, the following command creates a
+`/tmp/plan.pdf` file:
+
+```bash
+dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
+```
+
+## How to format `.md` documents
+
+We use `prettier` to format `.md` files.
+
+You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` requires a working Node.js environment. Upgrading to the latest prettier is recommended (for example, via `npm i -g prettier@latest`).
+
+```bash
+$ prettier --version
+2.3.0
+```
+
+After you've confirmed your prettier version, you can format all the `.md` files:
+
+```bash
+prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
+```
+
+## How to format `.toml` files
+
+We use `taplo` to format `.toml` files.
+
+For Rust developers, you can install it via:
+
+```sh
+cargo install taplo-cli --locked
+```
+
+> Refer to the [Installation section][doc] for other ways to install it.
+> +> [doc]: https://taplo.tamasfe.dev/cli/installation/binary.html + +```bash +$ taplo --version +taplo 0.9.0 +``` + +After you've confirmed your `taplo` version, you can format all the `.toml` files: + +```bash +taplo fmt +``` diff --git a/docs/source/contributor-guide/index.md b/docs/source/contributor-guide/index.md index 5705737206da..9aaa8b045388 100644 --- a/docs/source/contributor-guide/index.md +++ b/docs/source/contributor-guide/index.md @@ -113,232 +113,6 @@ The good thing about open code and open development is that any issues in one ch Pull requests will be marked with a `stale` label after 60 days of inactivity and then closed 7 days after that. Commenting on the PR will remove the `stale` label. -## Getting Started - -This section describes how you can get started at developing DataFusion. - -### Windows setup - -```shell -wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip -choco install -y git rustup.install visualcpp-build-tools -git-bash.exe -cargo build -``` - -### Protoc Installation - -Compiling DataFusion from sources requires an installed version of the protobuf compiler, `protoc`. - -On most platforms this can be installed from your system's package manager - -``` -# Ubuntu -$ sudo apt install -y protobuf-compiler - -# Fedora -$ dnf install -y protobuf-devel - -# Arch Linux -$ pacman -S protobuf - -# macOS -$ brew install protobuf -``` - -You will want to verify the version installed is `3.12` or greater, which introduced support for explicit [field presence](https://github.com/protocolbuffers/protobuf/blob/v3.12.0/docs/field_presence.md). Older versions may fail to compile. - -```shell -$ protoc --version -libprotoc 3.12.4 -``` - -Alternatively a binary release can be downloaded from the [Release Page](https://github.com/protocolbuffers/protobuf/releases) or [built from source](https://github.com/protocolbuffers/protobuf/blob/main/src/README.md). - -### Bootstrap environment - -DataFusion is written in Rust and it uses a standard rust toolkit: - -- `cargo build` -- `cargo fmt` to format the code -- `cargo test` to test -- etc. - -Note that running `cargo test` requires significant memory resources, due to cargo running many tests in parallel by default. If you run into issues with slow tests or system lock ups, you can significantly reduce the memory required by instead running `cargo test -- --test-threads=1`. For more information see [this issue](https://github.com/apache/datafusion/issues/5347). - -Testing setup: - -- `rustup update stable` DataFusion uses the latest stable release of rust -- `git submodule init` -- `git submodule update` - -Formatting instructions: - -- [ci/scripts/rust_fmt.sh](../../../ci/scripts/rust_fmt.sh) -- [ci/scripts/rust_clippy.sh](../../../ci/scripts/rust_clippy.sh) -- [ci/scripts/rust_toml_fmt.sh](../../../ci/scripts/rust_toml_fmt.sh) - -or run them all at once: - -- [dev/rust_lint.sh](../../../dev/rust_lint.sh) - -## Testing - -Tests are critical to ensure that DataFusion is working properly and -is not accidentally broken during refactorings. All new features -should have test coverage. - -DataFusion has several levels of tests in its [Test -Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html) -and tries to follow the Rust standard [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) in the The Book. 
- -### Unit tests - -Tests for code in an individual module are defined in the same source file with a `test` module, following Rust convention. - -### sqllogictests Tests - -DataFusion's SQL implementation is tested using [sqllogictest](https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest) which are run like any other Rust test using `cargo test --test sqllogictests`. - -`sqllogictests` tests may be less convenient for new contributors who are familiar with writing `.rs` tests as they require learning another tool. However, `sqllogictest` based tests are much easier to develop and maintain as they 1) do not require a slow recompile/link cycle and 2) can be automatically updated via `cargo test --test sqllogictests -- --complete`. - -Like similar systems such as [DuckDB](https://duckdb.org/dev/testing), DataFusion has chosen to trade off a slightly higher barrier to contribution for longer term maintainability. - -### Rust Integration Tests - -There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/datafusion/tree/main/datafusion/core/tests) directory. - -You can run these tests individually using `cargo` as normal command such as - -```shell -cargo test -p datafusion --test parquet_exec -``` - -## Benchmarks - -### Criterion Benchmarks - -[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion. - -Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html) and a given benchmark can be run with - -``` -cargo bench --bench BENCHMARK_NAME -``` - -A full list of benchmarks can be found [here](https://github.com/apache/datafusion/tree/main/datafusion/core/benches). - -_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._ - -### Parquet SQL Benchmarks - -The parquet SQL benchmarks can be run with - -``` - cargo bench --bench parquet_query_sql -``` - -These randomly generate a parquet file, and then benchmark queries sourced from [parquet_query_sql.sql](../../../datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths. - -If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset. - -The benchmark will automatically remove any generated parquet file on exit, however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or preserving it to use with `PARQUET_FILE` in subsequent runs. - -### Comparing Baselines - -By default, Criterion.rs will compare the measurements against the previous run (if any). Sometimes it's useful to keep a set of measurements around for several runs. For example, you might want to make multiple changes to the code while comparing against the master branch. For this situation, Criterion.rs supports custom baselines. 
- -``` - git checkout main - cargo bench --bench sql_planner -- --save-baseline main - git checkout YOUR_BRANCH - cargo bench --bench sql_planner -- --baseline main -``` - -Note: For MacOS it may be required to run `cargo bench` with `sudo` - -``` -sudo cargo bench ... -``` - -More information on [Baselines](https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines) - -### Upstream Benchmark Suites - -Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](https://github.com/apache/datafusion/tree/main/benchmarks). - -These are valuable for comparative evaluation against alternative Arrow implementations and query engines. - -## HOWTOs - -### How to add a new scalar function - -Below is a checklist of what you need to do to add a new scalar function to DataFusion: - -- Add the actual implementation of the function to a new module file within: - - [here](../../../datafusion/functions-array/src) for array functions - - [here](../../../datafusion/functions/src/crypto) for crypto functions - - [here](../../../datafusion/functions/src/datetime) for datetime functions - - [here](../../../datafusion/functions/src/encoding) for encoding functions - - [here](../../../datafusion/functions/src/math) for math functions - - [here](../../../datafusion/functions/src/regex) for regex functions - - [here](../../../datafusion/functions/src/string) for string functions - - [here](../../../datafusion/functions/src/unicode) for unicode functions - - create a new module [here](../../../datafusion/functions/src) for other functions. -- New function modules - for example a `vector` module, should use a [rust feature](https://doc.rust-lang.org/cargo/reference/features.html) (for example `vector_expressions`) to allow DataFusion - users to enable or disable the new module as desired. -- The implementation of the function is done via implementing `ScalarUDFImpl` trait for the function struct. - - See the [advanced_udf.rs](../../../datafusion-examples/examples/advanced_udf.rs) example for an example implementation - - Add tests for the new function -- To connect the implementation of the function add to the mod.rs file: - - a `mod xyz;` where xyz is the new module file - - a call to `make_udf_function!(..);` - - an item in `export_functions!(..);` -- In [sqllogictest/test_files](../../../datafusion/sqllogictest/test_files), add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result. 
- - Documentation for `sqllogictest` [here](../../../datafusion/sqllogictest/README.md) -- Add SQL reference documentation [here](../../../docs/source/user-guide/sql/scalar_functions.md) - -### How to add a new aggregate function - -Below is a checklist of what you need to do to add a new aggregate function to DataFusion: - -- Add the actual implementation of an `Accumulator` and `AggregateExpr`: - - [here](../../../datafusion/physical-expr/src/string_expressions.rs) for string functions - - [here](../../../datafusion/physical-expr/src/math_expressions.rs) for math functions - - [here](../../../datafusion/functions/src/datetime/mod.rs) for datetime functions - - create a new module [here](../../../datafusion/physical-expr/src) for other functions -- In [datafusion/expr/src](../../../datafusion/expr/src/aggregate_function.rs), add: - - a new variant to `AggregateFunction` - - a new entry to `FromStr` with the name of the function as called by SQL - - a new line in `return_type` with the expected return type of the function, given an incoming type - - a new line in `signature` with the signature of the function (number and types of its arguments) - - a new line in `create_aggregate_expr` mapping the built-in to the implementation - - tests to the function. -- In [sqllogictest/test_files](../../../datafusion/sqllogictest/test_files), add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result. - - Documentation for `sqllogictest` [here](../../../datafusion/sqllogictest/README.md) -- Add SQL reference documentation [here](../../../docs/source/user-guide/sql/aggregate_functions.md) - -### How to display plans graphically - -The query plans represented by `LogicalPlan` nodes can be graphically -rendered using [Graphviz](https://www.graphviz.org/). - -To do so, save the output of the `display_graphviz` function to a file.: - -```rust -// Create plan somehow... -let mut output = File::create("/tmp/plan.dot")?; -write!(output, "{}", plan.display_graphviz()); -``` - -Then, use the `dot` command line tool to render it into a file that -can be displayed. For example, the following command creates a -`/tmp/plan.pdf` file: - -```bash -dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf -``` - ## Specifications We formalize some DataFusion semantics and behaviors through specification @@ -354,45 +128,3 @@ Here is the list current active specifications: - [Invariants](https://datafusion.apache.org/contributor-guide/specification/invariants.html) All specifications are stored in the `docs/source/specification` folder. - -## How to format `.md` document - -We are using `prettier` to format `.md` files. - -You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` required a working node environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command). - -```bash -$ prettier --version -2.3.0 -``` - -After you've confirmed your prettier version, you can format all the `.md` files: - -```bash -prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md -``` - -## How to format `.toml` files - -We use `taplo` to format `.toml` files. - -For Rust developers, you can install it via: - -```sh -cargo install taplo-cli --locked -``` - -> Refer to the [Installation section][doc] on other ways to install it. 
-> -> [doc]: https://taplo.tamasfe.dev/cli/installation/binary.html - -```bash -$ taplo --version -taplo 0.9.0 -``` - -After you've confirmed your `taplo` version, you can format all the `.toml` files: - -```bash -taplo fmt -``` diff --git a/docs/source/contributor-guide/testing.md b/docs/source/contributor-guide/testing.md new file mode 100644 index 000000000000..11f53bcb2a2d --- /dev/null +++ b/docs/source/contributor-guide/testing.md @@ -0,0 +1,105 @@ + + +# Testing + +Tests are critical to ensure that DataFusion is working properly and +is not accidentally broken during refactorings. All new features +should have test coverage. + +DataFusion has several levels of tests in its [Test +Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html) +and tries to follow the Rust standard [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) in the The Book. + +## Unit tests + +Tests for code in an individual module are defined in the same source file with a `test` module, following Rust convention. + +## sqllogictests Tests + +DataFusion's SQL implementation is tested using [sqllogictest](https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest) which are run like any other Rust test using `cargo test --test sqllogictests`. + +`sqllogictests` tests may be less convenient for new contributors who are familiar with writing `.rs` tests as they require learning another tool. However, `sqllogictest` based tests are much easier to develop and maintain as they 1) do not require a slow recompile/link cycle and 2) can be automatically updated via `cargo test --test sqllogictests -- --complete`. + +Like similar systems such as [DuckDB](https://duckdb.org/dev/testing), DataFusion has chosen to trade off a slightly higher barrier to contribution for longer term maintainability. + +### Rust Integration Tests + +There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/datafusion/tree/main/datafusion/core/tests) directory. + +You can run these tests individually using `cargo` as normal command such as + +```shell +cargo test -p datafusion --test parquet_exec +``` + +## Benchmarks + +### Criterion Benchmarks + +[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion. + +Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html) and a given benchmark can be run with + +``` +cargo bench --bench BENCHMARK_NAME +``` + +A full list of benchmarks can be found [here](https://github.com/apache/datafusion/tree/main/datafusion/core/benches). + +_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._ + +### Parquet SQL Benchmarks + +The parquet SQL benchmarks can be run with + +``` + cargo bench --bench parquet_query_sql +``` + +These randomly generate a parquet file, and then benchmark queries sourced from [parquet_query_sql.sql](../../../datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths. + +If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. 
+This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.
+
+The benchmark will automatically remove any generated parquet file on exit; however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or for preserving it to use with `PARQUET_FILE` in subsequent runs.
+
+### Comparing Baselines
+
+By default, Criterion.rs will compare the measurements against the previous run (if any). Sometimes it's useful to keep a set of measurements around for several runs. For example, you might want to make multiple changes to the code while comparing against the main branch. For this situation, Criterion.rs supports custom baselines.
+
+```
+ git checkout main
+ cargo bench --bench sql_planner -- --save-baseline main
+ git checkout YOUR_BRANCH
+ cargo bench --bench sql_planner -- --baseline main
+```
+
+Note: on macOS it may be necessary to run `cargo bench` with `sudo`:
+
+```
+sudo cargo bench ...
+```
+
+More information on baselines can be found in the [Criterion.rs documentation](https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines).
+
+### Upstream Benchmark Suites
+
+Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](https://github.com/apache/datafusion/tree/main/benchmarks).
+
+These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 5d6dcd3f87a2..77412e716271 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -111,13 +111,16 @@ Please see the `developer’s guide`_ for contributing and `communication`_ for
 
    contributor-guide/index
    contributor-guide/communication
+   contributor-guide/getting_started
    contributor-guide/architecture
+   contributor-guide/testing
+   contributor-guide/howtos
    contributor-guide/roadmap
    contributor-guide/quarterly_roadmap
    contributor-guide/governance
    contributor-guide/specification/index
 
-.. _toc.contributor-guide:
+.. _toc.subprojects:
 
 .. toctree::
    :maxdepth: 1