diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..82cdcb02 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,53 @@ +# Contributing to the Substrait validator + +Contributions are welcome! Here are some ways you can help: + + - Try to break it. Get it to emit an error or warning for something you believe to be completely valid, or get it to emit nothing or only a warning for something you believe to be invalid. The validator is currently still considered to be experimental. + - Help with reviewing PRs and keeping up with Substrait. + - Improve the built-in plan documentation. The validator tries to emit plain English descriptions of various nodes in the tree to explain what they do, but this is far from complete. + - Submit additional integration tests, for more and better queries. Currently about half of TPC-H exists, but the quality of the plans is very poor (no optimization, partially handwritten, may be wholly incorrect, mostly restricted to the set of things Isthmus generates). + - Look through the issues to see if anything needs solving. + - Or maybe we're missing something we haven't even thought of yet! + +## Licensing + +All contributions should be licensed under Apache 2.0. We use SPDX license headers to minimize clutter. CI will ensure that all files have such a header. + +## Code style + +Code style is strictly enforced for all languages using CI and (to some extent) [pre-commit](https://pre-commit.com/) via: + + - Rust: [rustfmt](https://github.com/rust-lang/rustfmt) and [clippy](https://github.com/rust-lang/rust-clippy). + - At the time of writing, stable clippy (1.60 and 1.61) panics on the codebase; hopefully this will be fixed soon. In CI the toolchain for linting is pinned to 1.59 for this reason. You can use Rust 1.59 to run clippy locally as well (`rustup install 1.59.0`, `cargo +1.59.0 clippy`) or you can leave it to CI, but either way you'll have to silence pre-commit (run it, see if there are no violations other than the clippy panic, then commit using `--no-verify`). + - Python: [black](https://github.com/psf/black) and [flake8](https://flake8.pycqa.org/en/4.0.1/). + - C: [clang-format](https://clang.llvm.org/docs/ClangFormat.html). + - protobuf: [buf format](https://buf.build/blog/introducing-buf-format). + - YAML: [yamllint](https://github.com/adrienverge/yamllint). + +## Development dependencies + +Here's a (probably non-exhaustive) list of things you may want to have installed to develop for the validator. + + - [Rust stable](https://rustup.rs/) + - VSCode with [rust-analyzer](https://rust-analyzer.github.io/) (or some other IDE with Rust support) + - Python 3.x (all non-EOL versions should be supported) + - The toolchain should be able to compile libprotobuf for you if you don't already have it, but it's probably a good idea to have a reasonably recent version installed system-wide as well + - [pre-commit](https://pre-commit.com/), so you don't have to rely on CI to catch all your errors, and to help format your code + - git (obviously) + - for the C bindings: [CMake](https://cmake.org/) and a C compiler (gcc, clang, and MSVC should all work; the bindings are very lightweight) + +Note: this list is probably non-exhaustive; if you find something missing from this list, please add it! + +## Contributor FAQ + +### What if I've never written Rust before? + +Probably not a problem! Getting a Rust dev environment up and running is [very](https://rustup.rs/) [easy](https://code.visualstudio.com/docs/languages/rust), and the actual validation part of the validator is very structured once you know the patterns, and hides much of the nitty-gritty details. Unless you're planning on working on the various internal support libraries, you shouldn't run into many surprises. + +### There's lots of code here. Where do I begin? + +After getting all the dependencies set up and playing around with the validator [command line](https://github.com/substrait-io/substrait-validator/tree/main/py) for a bit, run `cargo doc --all-features --document-private-items --open` to generate and open the internal documentation. The toplevel `substrait_validator` crate documentation includes the overview of the validator project as a whole (i.e. including some pointers for the foreign language bindings). + +### What if I'm still really confused after that? + +[Ask us](https://substrait.io/community/)! diff --git a/rs/Cargo.toml b/rs/Cargo.toml index 6dbf6a81..7adf394a 100644 --- a/rs/Cargo.toml +++ b/rs/Cargo.toml @@ -9,6 +9,13 @@ edition = "2021" license = "Apache-2.0" include = ["src", "build.rs", "README.md"] +[features] + +# Enable to get some extra developer documentation from cargo doc. Be sure to +# use this with --document-private-items as well. Enabling this doesn't +# functionally change anything. +private_docs = [] + [dependencies] # Prost is used to deal with protobuf serialization and deserialization. diff --git a/rs/src/lib.rs b/rs/src/lib.rs index 023f4359..7f3cae80 100644 --- a/rs/src/lib.rs +++ b/rs/src/lib.rs @@ -2,6 +2,8 @@ //! Crate for validating [Substrait](https://substrait.io/). //! +//! # Usage +//! //! The usage pattern is roughly as follows. //! //! 1) Build a [`Config`] structure to configure the validator. You can also @@ -22,6 +24,139 @@ //! the crate that was used for protobuf deserialization. If you're looking for //! a library (or CLI) that supports more human-friendly input, check out the //! Python bindings. +#![cfg_attr( + feature = "private_docs", + allow(rustdoc::private_intra_doc_links), + doc = " +# Internal workings + +*Very* roughly speaking, the validation process boils down to a conversion from +[one type of tree structure](input::proto::substrait::Plan) (including +expansions of any referenced YAML files) to [another](output::tree::Node), +using the facilities provided by the [parse module](mod@parse). This process is +documented in much more detail [here](mod@parse). Once constructed, the +resulting tree can then be [further converted](export) to a few export formats, +or the validation [diagnostics](output::diagnostic) can simply be +[extracted](ParseResult::iter_diagnostics()). + +This crate only supports the binary protobuf serialization format as input; +that conversion is ultimately done [here](parse::traversal::parse_proto()) +using a combination of [prost] and some unfortunate magic in +[substrait_validator_derive]. That is to say: it does *NOT* support JSON format +or variations thereof. This is because support for protobuf JSON is flaky +beyond the official bindings, likely in no small part due to all the case +conversion magic and special cases crammed into that format. Since there are no +official protobuf bindings for Rust, there is no way to do this from within the +crate without reinventing the wheel as a square. + +Instead, the Python bindings, generated using +[maturin](https://github.com/PyO3/maturin), include the user-facing logic for +this. This is also the primary reason why the CLI is written in Python, rather +than in Rust. When a format other than binary protobuf is passed to the Python +package, it uses the official protobuf bindings for Python to (re)serialize to +the binary format, before handing control to the Rust crate. For the return +trip, the protobuf export format (using the message tree defined in the +[substrait.validator](https://github.com/substrait-io/substrait-validator/blob\ +/main/proto/substrait/validator/validator.proto) protobuf namespace) is used to +pass the parse result to Python. + +C bindings also exist. These are of the not-very-user-friendly sort, however; +they exist primarily to allow the validator to be used from within the testing +frameworks of whatever language you want, provided they support calling into +C-like libraries. + +## Testing strategy + +Currently, this crate has (almost) no test cases of its own. This is primarily +to do with the fact that validating only part of a plan would require complex +context setup and that, ideally, the (bits of) plan for the test cases are +written in either JSON or a yet-more user-friendly variant thereof. For the +reasons given above, this can't really be done from within Rust. + +Instead, tests are run using the [test-runner crate](https://github.com/\ + substrait-io/substrait-validator/tree/main/tests) and its associated Python +frontend. The Python frontend pre-parses YAML test description files into a +JSON file that's easy to read from within Rust via serde-json, after which the +Rust crate takes over to run the test cases. The pre-parsing involves +converting the JSON-as-YAML protobuf tree into the binary serialization format, +but also allows diagnostic presence checks to be inserted in the plan where +they are expected (rather than having to link up the tree paths manually) and +allows YAML extensions to be specified inline (they'll be extracted and +replaced with a special URI that the test runner understands). + +The APIs for the bindings on top of the Rust crate are tested using +[pytest](https://docs.pytest.org/) (Python) and +[googletest](https://google.github.io/googletest/) (C). + +## Resolving extension URIs + +URI resolution deserves an honorable mention here, because it unfortunately +can't easily be hidden away in some private module: anything that uses HTTPS +must either link into the operating system's certificate store or ship its own +root certificates. The latter is sure to be a security issue, so let's restrict +ourselves to the former solution. + +The problem with this is that it pollutes the Rust crate with runtime linking +shenanigans that are not at all compatible from one system to another. In +particular, we can't build universal Python packages around crates that do +this. Since we rely on Python for the CLI, this is a bit of an issue. + +For this reason, URI resolution is guarded behind the `curl` feature. When the +feature is enabled, `libcurl` will be used to resolve URIs, using the system's +certificate store for HTTPS. When disabled, the crate will fall back to +resolving only `file://` URIs, unless a more capable resolver is +[installed](Config::add_uri_resolver()). The Python bindings will do just that: +they install a resolver based on Python's own +[urllib](https://docs.python.org/3/library/urllib.html). + +## Build process + +The build process for the crates and Python module also involves some +not-so-obvious magic, to do with shipping the Substrait protobuf and YAML +schema as appropriate. The problem is that Cargo and Python's packaging logic +require that all files shipped with the package be located within the package +source tree, which is not the case here due to the common submodule and proto +directories. + +### Rust + +If the [`in-git-repo` file](https://github.com/substrait-io/\ +substrait-validator/blob/main/rs/in-git-repo) exists, the +[build.rs file for this crate](https://github.com/substrait-io/\ +substrait-validator/blob/main/rs/build.rs) will copy the proto and schema files +from their respective source locations into `src/resources`, thus keeping them +in sync. The `in-git-repo` file is not included in the crate manifest, so this +step is skipped when the crate is compiled after being downloaded from +crates.io. Note however, that in order to release this crate, it must always +first be built: the only time during the packaging process when build.rs is +called is already on the user's machine, so the resource files won't be +synchronized by `cargo package`. + +### Python + +The process for Python is much the same, but handled by a +[wrapper around maturin](https://github.com/substrait-io/substrait-validator/\ +blob/main/py/substrait_validator_build/__init__.py), as maturin does not expose +pre-build hooks of its own. The `in-git-repo` file isn't necessary here; we can +use the `local_dependencies` file that will be generated by the packaging tools +as part of a source distribution as a marker. + +Here, too, it's important that the synchronization logic is run manually prior +to various release-like operations. This can be done by running +[prepare_build.py](https://github.com/substrait-io/substrait-validator/blob/\ +main/py/prepare_build.py). + +### Protobuf + +In order to rely on as few external dependencies as possible, all protoc +invocations by the various parts of the build invoke the `protoc` executable +as found/compiled and exposed by [prost-build](prost-build). That is: this +protoc is also abused to generate Python bindings. Unfortunately, prost-build +is planning to [remove](https://github.com/tokio-rs/prost/pull/620) the build +logic for protoc at the time of writing (and who can blame them), so this will +need to be done differently in the future. + " +)] #[macro_use] pub mod output;