Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of helper library for Rust UDFs #1

Merged
merged 2 commits into from
Feb 15, 2023
Merged

Conversation

wmitros
Copy link
Collaborator

@wmitros wmitros commented Dec 1, 2022

No description provided.

bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general tips on how the code structure could be improved:

The scylla_bindgen function is very complex, has lots of variables and operates directly on AST which is a bit hard to parse. It definitely needs to be broken down into smaller components. It's not really clear at the first glance what is the general idea behind the transformation it performs.

In order to improve readability and maintainability, I suggest introducing an intermediate representation - an abstract model that would describe the user-supplied function in such a way that generating code later is easy.

For example (I'm making this up, might not compile):

struct UserDefinedFunctionDesc {
	arguments: Vec<ArgumentDesc>,
	return_type: ReturnTypeDesc,

	original_function_name: syn::Ident,
	original_body: syn::Block,
}

struct ArgumentDesc {
	name: syn::Ident,
	typ: syn::Type,
}

struct ReturnType {
	typ: syn::Type,
}

First, you could build the model:

// Purposefully avoiding error handling here for simplicity
impl UserDefinedFunctionDesc {
	fn new(attrs: TokenStream, input: TokenStream) -> Self {
		let func = parse::<ItemFn>(item).unwrap();
		let arguments = func.sig.inputs.iter().map(ArgumentDesc::new).collect();
		let return_type = ReturnType::new(func.sig.output);

		// ... more stuff ...

		Self {
			arguments,
			return_type,

			original_function_name,
			original_body,
		}
	}
}

...and then synthesize the code based on it:

impl UserDefinedFunctionDesc {
	fn generate(&self) -> TokenStream {
		let exported_function = self.generate_exported_function();
		let inner_function = self.generate_inner_function();

		quote! {
			#exported_function
			#inner_function
		}
	}

	fn generate_exported_function(&self) -> syn::ItemFn { /* ... */ }
	fn generate_inner_function(&self) -> syn::ItemFn { /* ... */ }
}

impl ArgumentDesc {
	fn exported_arg_type(&self) -> syn::Argument { /* ... */ }
	fn inner_arg_type(&self) -> syn::Argument { /* ... */ }
}

// etc. etc.

scylla-cql/src/frame/response/result.rs Outdated Show resolved Hide resolved
scylla-cql/src/frame/response/cql_to_rust.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
examples/topn_row.rs Outdated Show resolved Hide resolved
examples/topn_row.rs Outdated Show resolved Hide resolved
serialization/src/lib.rs Outdated Show resolved Hide resolved
serialization/src/lib.rs Outdated Show resolved Hide resolved
serialization/src/lib.rs Outdated Show resolved Hide resolved
@wmitros
Copy link
Collaborator Author

wmitros commented Jan 20, 2023

I addressed most of the comments (though not all yet) and added a few more features:

  • restructuring the entire code, it should look more like a proper crate now
  • allowing any pattern as a function parameter
  • adding a trait for types that can be an arguments of the Rust UDF and using it to get the correlating types for the WASM UDF instead of matching strings - this allows for example using aliases
  • reexporting everything that we're using in macros in _macro_internal and using the full path in the macros (didn't add renaming of the scylla_bindgen crate yet)
  • adding a macro for newtype idiom, so that the "newtypes" can be used as parameters in the UDFs if the underlying type can be used there

Copy link
Collaborator

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better now, thanks. It's easier to see what is going on.

Apart from the comments, there are some other considerations:

  • All public items should be documented: traits (FromCQL, AsCQL, ToColumnType, WasmConvertible, ...), types (SizePointer), maybe more.
    If you want to hide some items from the users but you still need to make them public for the purpose of the procedural macros, then you can use the #[doc(hidden)] attribute. You can also add docstrings to those items that will state that those types are not a part of the public API and may change in minor releases.
  • Some basic CI would be nice: format check, compilation, clippy check and tests. You can use scylla-rust-driver's CI definition as a base (you should slim it down however, e.g. no need to setup a scylla instance, at least not yet).
  • I wrote it already in one of the comments, but I'll repeat here: please add tests!
  • Dependency on the local scylla-cql crate should be dropped, after all changes are upstreamed there.

bindgen/src/lib.rs Outdated Show resolved Hide resolved
bindgen/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 61 to 62
static mut DECLARED_ABI: std::sync::atomic::AtomicBool =
std::sync::atomic::AtomicBool::new(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this approach really work? Imagine that you are having multiple dependency crates that define some UDFs. Most likely, separate crates will be compiled by separate processes, so they will have separate instances of this flag and each will emit its own _scylla_abi. I would expect that you will get into trouble if you try to link everything together.

Maybe we could have a separate (declarative?) macro that defines the _scylla_abi symbol? You would only have to define it in the root crate.

Alternatively, maybe there are some linker attributes that can be used to make it possible to define the symbol multiple times (it would be deduplicated)? There is a risk that such a solution could be non-portable, though.

Copy link
Collaborator Author

@wmitros wmitros Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Not up to date, see answer below

I tried a different approach in the rebase - the _scylla_abi static is now exported directly in the crate, and not in the macro.

I exported it with #[no_mangle] and #[used] attributes, but in theory it may not be enough to make it end up in the final wasm binary. This would be achieved using the #[used(linker)] unstable feature, but if I'm not mistaken, #[used] guarantees that it reaches the linking stage, and we can make sure it doesn't get stripped during linking by referencing it in a public function.
We could reference it in the UDF, but that would make it (slightly) slower. We could also create a new function that does reference it - that makes the binary (slightly) bigger - I went with that for now, because for a very simple function that only performs some calculations the cost of accessing the global seems proportionally higher than the bigger executable, which will always be quite sizeable because it contains the implementation of malloc.

Note that, for now, the function is actually not needed - for non-ELF targets the #[used] attribute does all that we want (in fact, even #[no_mangle] is enough), but in theory it's not guaranteed.
I'm also not against adding a reference to the original generated function implementation, if that would be good enough

Copy link
Collaborator Author

@wmitros wmitros Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that neither #[used] nor pub have any useful effect in terms of exporting the symbol in the final binary. At least in the current Rust version, we have to tag an item with #[no_mangle] or #[export_name = ...] and it will be exported regardless of whether it was placed in the root crate or whether it was even pub.
However, this behavior is not guaranteed to continue, but I found no immediate plans of changing it.
Because of that, I suggest that we depend on it for now. It has 2 main benefits: we don't have to make implementation hacks to make it work, and we don't have to require users to be extra careful in terms of placement of the items exported to Scylla and their visibility setting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in the current Rust version, we have to tag a item with #[no_mangle] or #[export_name = ...] and it will be exported regardless of whether it was placed in the root crate or whether it was even pub. However, this behavior is not guaranteed to continue, but I found no immediate plans of changing it.

Could you point me to the relevant documentation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior here isn't well documented. We can see that the items survive until the linking phase here, but the fact that they are also present in the final binary is not guaranteed in general.
This behavior is actually causing some issues because there is no way to prevent exporting an item tagged with #[no_mangle] in the final library. The discussion however, is more about how to make it possible to hide such items, and not about changing this default behavior

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about it recently on the meeting and it looks like we should prioritize ease of use. I'm OK with the current approach, especially if it doesn't seem that it will be changed soon.

bindgen-macros/src/udf.rs Outdated Show resolved Hide resolved
bindgen-macros/src/udf.rs Outdated Show resolved Hide resolved
Comment on lines 59 to 60
// We have to make the struct public, otherwise we can't use "<$struct_name as WasmConvertible>::WasmType"
// in the exported function signature even if WasmType is public.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a bit surprising for the user to find out that their struct is made public, even though they didn't mark it that way.

Is the error message clear in case the struct is not pub, i.e. tells that the problem is caused by the struct having wrong visibility and explains why it is needed? If so then I think it's better to inform the user and let them make the fix themselves.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the rebase, the exported UDF function doesn't even have to be pub so we also don't have to concern ourselves with this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that there is still a similar issue in the export_newtype macro.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just forgot to remove the comment there.

bindgen-serialization/src/wasmconvertible.rs Outdated Show resolved Hide resolved
bindgen-serialization/src/wasmconvertible.rs Outdated Show resolved Hide resolved
bindgen-macros/Cargo.toml Outdated Show resolved Hide resolved
examples/topn_final.rs Outdated Show resolved Hide resolved
@wmitros
Copy link
Collaborator Author

wmitros commented Jan 23, 2023

There are still some comments I haven't addressed in the last rebase, but at least all the comments from the previous review are done now.

@wmitros wmitros force-pushed the initials branch 2 times, most recently from 76d89d9 to 39b96ec Compare January 26, 2023 17:57
@wmitros
Copy link
Collaborator Author

wmitros commented Jan 26, 2023

The last rebase fixed a few comments and again changed the structure of the project - this time mostly due to a lot of renames, but also the bindgen-serialization package was merged with the original bindgen which was practically empty before. Additionally, Cargo.tomls have been slightly upgraded.

The things that weren't fixed in the last rebase are: CI and tests, so I think we can start reviewing this part already.

All public items should be documented: traits (FromCQL, AsCQL, ToColumnType, WasmConvertible, ...), types (SizePointer), maybe more.

I'm not actually exposing anything other than the macros and some cql re-exports from scylla_cql for now, because I'm not sure the crate is ready for user-defined (de)serialization of types yet, and the traits and types from this crate are only useful for that. I think we will want to make this possible in the future, but I don't think it is needed in the first version.
I'm also not sure if I have to document the scylla_cql re-exports

The other thing that I have to add is a README, but it can also wait until after the code is fixed, and the ability to rename this crate (will be practically the same as what we've added to scylla_cql)

@wmitros wmitros force-pushed the initials branch 2 times, most recently from 154351b to b142867 Compare January 27, 2023 12:53
@piodul
Copy link
Collaborator

piodul commented Jan 30, 2023

All public items should be documented: traits (FromCQL, AsCQL, ToColumnType, WasmConvertible, ...), types (SizePointer), maybe more.

I'm not actually exposing anything other than the macros and some cql re-exports from scylla_cql for now, because I'm not sure the crate is ready for user-defined (de)serialization of types yet, and the traits and types from this crate are only useful for that. I think we will want to make this possible in the future, but I don't think it is needed in the first version.

Ok. It looks like those items are not exported because their modules are not exported, so it's fine.

I'm also not sure if I have to document the scylla_cql re-exports

No need, the docstrings in the generated documentation should be inherited from scylla-cql.

scylla-bindgen/src/wasmptr.rs Outdated Show resolved Hide resolved
scylla-bindgen/src/abi_exports.rs Outdated Show resolved Hide resolved
examples/topn_final.rs Outdated Show resolved Hide resolved
examples/return_input.rs Outdated Show resolved Hide resolved
scylla-bindgen-macros/src/export_newtype.rs Outdated Show resolved Hide resolved
@wmitros wmitros force-pushed the initials branch 2 times, most recently from 737cd83 to 0b93f96 Compare February 6, 2023 16:33
@wmitros
Copy link
Collaborator Author

wmitros commented Feb 6, 2023

Looks like I'll have to split your attention again @piodul, as we decided we'll try to finish this before UDF permissions after all.
I addressed all comments (the CI is at https://github.com/wmitros/scylla-rust-udf/pull/2) and added some tests and a small readme.

@wmitros
Copy link
Collaborator Author

wmitros commented Feb 7, 2023

One additional consideration before publishing the crate is whether we want to change the repo name.
I ended up with the crate name scylla_bindgen, but the repo's called scylla-rust-udf. I'm fine with keeping the name as is but maybe it's worth to consider something like scylla-rust-bindgen (it might be confusing because we also have rust bindings in the core scylla repo) or scylla-rust-udf-bindgen (which is very long)

@piodul
Copy link
Collaborator

piodul commented Feb 9, 2023

One additional consideration before publishing the crate is whether we want to change the repo name. I ended up with the crate name scylla_bindgen, but the repo's called scylla-rust-udf. I'm fine with keeping the name as is but maybe it's worth to consider something like scylla-rust-bindgen (it might be confusing because we also have rust bindings in the core scylla repo) or scylla-rust-udf-bindgen (which is very long)

The crates that you are developing are supposed to help with developing UDFs, so I think they should have udf in their name. Having rust in the crate name is redundant (but it's OK to have it in the repo name).

I think that scylla-udf should be OK - or scylla-udf-bindgen if you want to be super explicit about the bindgen-like functionality, but IMO that's unnecessarily verbose.

Copy link
Collaborator

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wasm_conversion.rs file is empty? I see that in one of the force-pushes you moved its contents out to another file, but perhaps forgot to delete the old file?

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
scylla-bindgen-macros/src/path.rs Outdated Show resolved Hide resolved
scylla-bindgen/src/wasmptr.rs Outdated Show resolved Hide resolved
scylla-bindgen/src/wasmptr.rs Outdated Show resolved Hide resolved
scylla-bindgen/src/wasm_convertible.rs Outdated Show resolved Hide resolved
@wmitros wmitros force-pushed the initials branch 2 times, most recently from 82d01af to 87be2a3 Compare February 9, 2023 16:55
@wmitros
Copy link
Collaborator Author

wmitros commented Feb 9, 2023

The wasm_conversion.rs file is empty? I see that in one of the force-pushes you moved its contents out to another file, but perhaps forgot to delete the old file?

Correct, it should be deleted in the rebase

README.md Outdated
| BLOB | Vec\<u8\> |
| BOOLEAN | bool |
| COUNTER | scylla_udf::Counter |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: some rows in that table are mis-formatted, please fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@piodul
Copy link
Collaborator

piodul commented Feb 10, 2023

I can't get the crate to compile correctly:

[piodul@scyllapc scylla-rust-udf]$ cargo test
error: failed to load manifest for workspace member `/home/piodul/code/scylla-rust-udf/examples`

Caused by:
  failed to parse manifest at `/home/piodul/code/scylla-rust-udf/examples/Cargo.toml`

Caused by:
  error reading `dependencies.scylla-bindgen` from workspace root manifest's `workspace.dependencies.scylla-bindgen`

Caused by:
  `dependency.scylla-bindgen` was not found in `workspace.dependencies`

I think you missed some places when renaming the crates. Besides, the directory names for the crates are still scylla-bindgen and scylla-bindgen-macros - please rename them too, as it will be quite confusing if the crate names don't match their directories.

I recommend rebasing the PR on top of main/master, so that the CI has a chance to run.

@wmitros wmitros force-pushed the initials branch 4 times, most recently from 975c18c to 1df7487 Compare February 10, 2023 10:50
@wmitros
Copy link
Collaborator Author

wmitros commented Feb 10, 2023

All the cases of scylla-bindgen should be now correctly replaced with scylla-udf. I also made some fixes for rustfmt, clippy and rustdoc, so that CI passes

@piodul
Copy link
Collaborator

piodul commented Feb 10, 2023

I think it looks OK. I have some final requests before we can merge:

  • We should document how to run tests. Currently, they don't work when running them natively, and it's better to run them on the target architecture (WASM) anyway. This information will be useful for future contributors.
  • Please run some manual tests with the examples in the repo, i.e. compile and optimize them according to the instructions in the README, upload them to a Scylla node, run some queries and verify that they work as expected. Please verify if Scylla correctly serializes/deserializes the types - I think that the types that are passed directly without allocating require the most scrutiny, but please also test that e.g. collections work (if two types have similar implementation then testing only one should be OK). Perhaps we will have some integration tests in the future in this repo, but for a manual check should suffice if we don't plan any significant changes in a followup.
  • Please verify that all necessary changes were upstreamed to scylla-cql and scylla-macros (try to compile scylla-udf with that branch).

@wmitros
Copy link
Collaborator Author

wmitros commented Feb 14, 2023

I tested examples with using Scylla's cql-pytest suite, the exact test case can be found at wmitros/scylla@91448f1

The tests uncovered that one example can't be used with scylla - the combine.rs example took both a counter and a cql duration as arguments, but they cannot be columns of the same table (all non-counter columns in a counter table have to be parts of the partition key, and duration cannot be in a partition key).

The topn.rs has also been updated, because the current implementation produced a bit unexpected results - it did return the n longest strings without repetitions, but different strings with the same length were also counted as repetitions. Also, the previous implementation of the row function didn't handle nulls, so if a null row was encountered in the UDA execution, the UDA returned null as well.

I also noticed that the AS $$ .. $$ workaround described in README doesn't really work, because the WAT code can also have $$ substrings, so I recommended just using AS ' .. ' and doubling the ' that appear in the WAT code.

@wmitros
Copy link
Collaborator Author

wmitros commented Feb 15, 2023

@piodul is it ok if we merge this before transferring the repo to Scylla?

@piodul
Copy link
Collaborator

piodul commented Feb 15, 2023

@wmitros please squash all the commits (apart from replace the imported scylla-cql with a git dependency) and I think we are good to merge.

@wmitros
Copy link
Collaborator Author

wmitros commented Feb 15, 2023

@wmitros please squash all the commits (apart from replace the imported scylla-cql with a git dependency) and I think we are good to merge.

Ok, everything's squashed (I merged depend on a released scylla-cql with replace the imported scylla-cql with a git dependency because it seemed more correlated)

@piodul piodul merged commit dcd5a08 into main Feb 15, 2023
@wmitros wmitros deleted the initials branch February 15, 2023 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants