Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 53 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# Contributing to the Substrait validator

Contributions are welcome! Here are some ways you can help:

- Try to break it. Get it to emit an error or warning for something you believe to be completely valid, or get it to emit nothing or only a warning for something you believe to be invalid. The validator is currently still considered to be experimental.
- Help with reviewing PRs and keeping up with Substrait.
- Improve the built-in plan documentation. The validator tries to emit plain English descriptions of various nodes in the tree to explain what they do, but this is far from complete.
- Submit additional integration tests, for more and better queries. Currently about half of TPC-H exists, but the quality of the plans is very poor (no optimization, partially handwritten, may be wholly incorrect, mostly restricted to the set of things Isthmus generates).
- Look through the issues to see if anything needs solving.
- Or maybe we're missing something we haven't even thought of yet!

## Licensing

All contributions should be licensed under Apache 2.0. We use SPDX license headers to minimize clutter. CI will ensure that all files have such a header.

## Code style

Code style is strictly enforced for all languages using CI and (to some extent) [pre-commit](https://pre-commit.com/) via:

- Rust: [rustfmt](https://github.com/rust-lang/rustfmt) and [clippy](https://github.com/rust-lang/rust-clippy).
- At the time of writing, stable clippy (1.60 and 1.61) panics on the codebase; hopefully this will be fixed soon. In CI the toolchain for linting is pinned to 1.59 for this reason. You can use Rust 1.59 to run clippy locally as well (`rustup install 1.59.0`, `cargo +1.59.0 clippy`) or you can leave it to CI, but either way you'll have to silence pre-commit (run it, see if there are no violations other than the clippy panic, then commit using `--no-verify`).
- Python: [black](https://github.com/psf/black) and [flake8](https://flake8.pycqa.org/en/4.0.1/).
- C: [clang-format](https://clang.llvm.org/docs/ClangFormat.html).
- protobuf: [buf format](https://buf.build/blog/introducing-buf-format).
- YAML: [yamllint](https://github.com/adrienverge/yamllint).

## Development dependencies

Here's a (probably non-exhaustive) list of things you may want to have installed to develop for the validator.

- [Rust stable](https://rustup.rs/)
- VSCode with [rust-analyzer](https://rust-analyzer.github.io/) (or some other IDE with Rust support)
- Python 3.x (all non-EOL versions should be supported)
- The toolchain should be able to compile libprotobuf for you if you don't already have it, but it's probably a good idea to have a reasonably recent version installed system-wide as well
- [pre-commit](https://pre-commit.com/), so you don't have to rely on CI to catch all your errors, and to help format your code
- git (obviously)
- for the C bindings: [CMake](https://cmake.org/) and a C compiler (gcc, clang, and MSVC should all work; the bindings are very lightweight)

Note: this list is probably non-exhaustive; if you find something missing from this list, please add it!

## Contributor FAQ

### What if I've never written Rust before?

Probably not a problem! Getting a Rust dev environment up and running is [very](https://rustup.rs/) [easy](https://code.visualstudio.com/docs/languages/rust), and the actual validation part of the validator is very structured once you know the patterns, and hides much of the nitty-gritty details. Unless you're planning on working on the various internal support libraries, you shouldn't run into many surprises.

### There's lots of code here. Where do I begin?

After getting all the dependencies set up and playing around with the validator [command line](https://github.com/substrait-io/substrait-validator/tree/main/py) for a bit, run `cargo doc --all-features --document-private-items --open` to generate and open the internal documentation. The toplevel `substrait_validator` crate documentation includes the overview of the validator project as a whole (i.e. including some pointers for the foreign language bindings).

### What if I'm still really confused after that?

[Ask us](https://substrait.io/community/)!
7 changes: 7 additions & 0 deletions rs/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,13 @@ edition = "2021"
license = "Apache-2.0"
include = ["src", "build.rs", "README.md"]

[features]

# Enable to get some extra developer documentation from cargo doc. Be sure to
# use this with --document-private-items as well. Enabling this doesn't
# functionally change anything.
private_docs = []

[dependencies]

# Prost is used to deal with protobuf serialization and deserialization.
Expand Down
135 changes: 135 additions & 0 deletions rs/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@

//! Crate for validating [Substrait](https://substrait.io/).
//!
//! # Usage
//!
//! The usage pattern is roughly as follows.
//!
//! 1) Build a [`Config`] structure to configure the validator. You can also
Expand All @@ -22,6 +24,139 @@
//! the crate that was used for protobuf deserialization. If you're looking for
//! a library (or CLI) that supports more human-friendly input, check out the
//! Python bindings.
#![cfg_attr(
feature = "private_docs",
allow(rustdoc::private_intra_doc_links),
doc = "
# Internal workings

*Very* roughly speaking, the validation process boils down to a conversion from
[one type of tree structure](input::proto::substrait::Plan) (including
expansions of any referenced YAML files) to [another](output::tree::Node),
using the facilities provided by the [parse module](mod@parse). This process is
documented in much more detail [here](mod@parse). Once constructed, the
resulting tree can then be [further converted](export) to a few export formats,
or the validation [diagnostics](output::diagnostic) can simply be
[extracted](ParseResult::iter_diagnostics()).

This crate only supports the binary protobuf serialization format as input;
that conversion is ultimately done [here](parse::traversal::parse_proto())
using a combination of [prost] and some unfortunate magic in
[substrait_validator_derive]. That is to say: it does *NOT* support JSON format
or variations thereof. This is because support for protobuf JSON is flaky
beyond the official bindings, likely in no small part due to all the case
conversion magic and special cases crammed into that format. Since there are no
official protobuf bindings for Rust, there is no way to do this from within the
crate without reinventing the wheel as a square.

Instead, the Python bindings, generated using
[maturin](https://github.com/PyO3/maturin), include the user-facing logic for
this. This is also the primary reason why the CLI is written in Python, rather
than in Rust. When a format other than binary protobuf is passed to the Python
package, it uses the official protobuf bindings for Python to (re)serialize to
the binary format, before handing control to the Rust crate. For the return
trip, the protobuf export format (using the message tree defined in the
[substrait.validator](https://github.com/substrait-io/substrait-validator/blob\
/main/proto/substrait/validator/validator.proto) protobuf namespace) is used to
pass the parse result to Python.

C bindings also exist. These are of the not-very-user-friendly sort, however;
they exist primarily to allow the validator to be used from within the testing
frameworks of whatever language you want, provided they support calling into
C-like libraries.

## Testing strategy

Currently, this crate has (almost) no test cases of its own. This is primarily
to do with the fact that validating only part of a plan would require complex
context setup and that, ideally, the (bits of) plan for the test cases are
written in either JSON or a yet-more user-friendly variant thereof. For the
reasons given above, this can't really be done from within Rust.

Instead, tests are run using the [test-runner crate](https://github.com/\
substrait-io/substrait-validator/tree/main/tests) and its associated Python
frontend. The Python frontend pre-parses YAML test description files into a
JSON file that's easy to read from within Rust via serde-json, after which the
Rust crate takes over to run the test cases. The pre-parsing involves
converting the JSON-as-YAML protobuf tree into the binary serialization format,
but also allows diagnostic presence checks to be inserted in the plan where
they are expected (rather than having to link up the tree paths manually) and
allows YAML extensions to be specified inline (they'll be extracted and
replaced with a special URI that the test runner understands).

The APIs for the bindings on top of the Rust crate are tested using
[pytest](https://docs.pytest.org/) (Python) and
[googletest](https://google.github.io/googletest/) (C).

## Resolving extension URIs

URI resolution deserves an honorable mention here, because it unfortunately
can't easily be hidden away in some private module: anything that uses HTTPS
must either link into the operating system's certificate store or ship its own
root certificates. The latter is sure to be a security issue, so let's restrict
ourselves to the former solution.

The problem with this is that it pollutes the Rust crate with runtime linking
shenanigans that are not at all compatible from one system to another. In
particular, we can't build universal Python packages around crates that do
this. Since we rely on Python for the CLI, this is a bit of an issue.

For this reason, URI resolution is guarded behind the `curl` feature. When the
feature is enabled, `libcurl` will be used to resolve URIs, using the system's
certificate store for HTTPS. When disabled, the crate will fall back to
resolving only `file://` URIs, unless a more capable resolver is
[installed](Config::add_uri_resolver()). The Python bindings will do just that:
they install a resolver based on Python's own
[urllib](https://docs.python.org/3/library/urllib.html).

## Build process

The build process for the crates and Python module also involves some
not-so-obvious magic, to do with shipping the Substrait protobuf and YAML
schema as appropriate. The problem is that Cargo and Python's packaging logic
require that all files shipped with the package be located within the package
source tree, which is not the case here due to the common submodule and proto
directories.

### Rust

If the [`in-git-repo` file](https://github.com/substrait-io/\
substrait-validator/blob/main/rs/in-git-repo) exists, the
[build.rs file for this crate](https://github.com/substrait-io/\
substrait-validator/blob/main/rs/build.rs) will copy the proto and schema files
from their respective source locations into `src/resources`, thus keeping them
in sync. The `in-git-repo` file is not included in the crate manifest, so this
step is skipped when the crate is compiled after being downloaded from
crates.io. Note however, that in order to release this crate, it must always
first be built: the only time during the packaging process when build.rs is
called is already on the user's machine, so the resource files won't be
synchronized by `cargo package`.

### Python

The process for Python is much the same, but handled by a
[wrapper around maturin](https://github.com/substrait-io/substrait-validator/\
blob/main/py/substrait_validator_build/__init__.py), as maturin does not expose
pre-build hooks of its own. The `in-git-repo` file isn't necessary here; we can
use the `local_dependencies` file that will be generated by the packaging tools
as part of a source distribution as a marker.

Here, too, it's important that the synchronization logic is run manually prior
to various release-like operations. This can be done by running
[prepare_build.py](https://github.com/substrait-io/substrait-validator/blob/\
main/py/prepare_build.py).

### Protobuf

In order to rely on as few external dependencies as possible, all protoc
invocations by the various parts of the build invoke the `protoc` executable
as found/compiled and exposed by [prost-build](prost-build). That is: this
protoc is also abused to generate Python bindings. Unfortunately, prost-build
is planning to [remove](https://github.com/tokio-rs/prost/pull/620) the build
logic for protoc at the time of writing (and who can blame them), so this will
need to be done differently in the future.
"
)]

#[macro_use]
pub mod output;
Expand Down