ARROW-15582: [C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions) #13285

Closed
sanjibansg wants to merge 17 commits into apache:main from sanjibansg:substrait/compute_functions
@sanjibansg
Contributor

[WIP]
This PR adds function mappings for compute functions from Substrait to Arrow and vice versa. It introduces a FunctionMapping class that registers and stores the mappings and supplies them when required. Registering a function means encoding its options and arguments in the definition of the corresponding mapping function.

github-actions bot commented Jun 1, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace self-requested a review June 2, 2022 01:57

Member

@westonpace left a comment

I think we can make a pass at simplifying some things. A lot of these lambdas seem to follow a consistent pattern. This is a good start, however! Excited to see it.

Member

Should we return an invalid status as an else clause here?

Contributor Author

Made the change.

Contributor Author

Returning an AlreadyExists status.

Member

Same, perhaps the else should be an invalid status.

Contributor Author

Returning an AlreadyExists status here; would that be better?

@sanjibansg sanjibansg force-pushed the substrait/compute_functions branch from a56a556 to b28e5b1 Compare June 3, 2022 10:31
@sanjibansg sanjibansg force-pushed the substrait/compute_functions branch from 41fc8fe to 7e27a7f Compare June 9, 2022 21:13
@westonpace self-requested a review June 16, 2022 15:47

Member

@westonpace left a comment

Some suggestions from a quick scan.

return compute::call(func_name, std::move(arguments), std::move(cast_options));
}
case substrait::Expression::kEnum: {
auto enum_expr = expr.enum_();
Member

Does this convert to the string value of the enum? Could you add a short comment here explaining that?

}

Status FunctionMapping::AddArrowToSubstrait(std::string arrow_function_name, ArrowToSubstrait conversion_func){
if (arrow_to_substrait.find(arrow_function_name) != arrow_to_substrait.end()){
Member

This logic seems backwards to me...wouldn't umap.find(...) != umap.end() mean the item already existed?
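The point about `find(...) != end()` can be illustrated with a minimal sketch. It does not use Arrow: `StatusCode`, `Conversion`, and `FunctionMappingSketch` are hypothetical stand-ins for `arrow::Status`, the PR's conversion callbacks, and `FunctionMapping`. The assumption is that registration should succeed only when the key is absent and should otherwise report that the entry already exists.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for arrow::Status, reduced to what this sketch needs.
enum class StatusCode { OK, AlreadyExists, KeyError };

// Hypothetical stand-in for the PR's conversion callbacks.
using Conversion = std::function<int(int)>;

class FunctionMappingSketch {
 public:
  // Registers `func` under `name` and refuses to overwrite an existing entry.
  StatusCode Add(const std::string& name, Conversion func) {
    // emplace inserts only when the key is absent; `.second` reports whether
    // an insertion actually happened.
    bool inserted = map_.emplace(name, std::move(func)).second;
    return inserted ? StatusCode::OK : StatusCode::AlreadyExists;
  }

  // Looks up `name`; a missing key is reported rather than silently ignored.
  StatusCode Get(const std::string& name, Conversion* out) const {
    auto it = map_.find(name);
    if (it == map_.end()) return StatusCode::KeyError;
    *out = it->second;
    return StatusCode::OK;
  }

 private:
  std::unordered_map<std::string, Conversion> map_;
};
```

Using the result of `emplace` avoids a separate `find` call and makes it hard to invert the condition by accident.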

}

Status FunctionMapping::AddSubstraitToArrow(std::string substrait_function_name, SubstraitToArrow conversion_func){
if (substrait_to_arrow.find(substrait_function_name) != substrait_to_arrow.end()){
Member

Same as above. This seems backwards (but maybe I'm just not thinking right)

}
}

std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
Member

Suggested change
- std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
+ std::vector<arrow::compute::Expression> ConvertSubstraitArguments(const substrait::Expression::ScalarFunction& call){


std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
substrait::Expression value;
ExtensionSet ext_set_;
Member

Suggested change
- ExtensionSet ext_set_;
+ ExtensionSet ext_set;

Member

This seems strange. Wouldn't this function take in an extension set as an argument?


substrait::Expression::ScalarFunction arrow_convert_enum_arguments(const arrow::compute::Expression::Call& call, substrait::Expression::ScalarFunction& substrait_call, ExtensionSet* ext_set_, std::string overflow_handling){
substrait::Expression::Enum options;
options.set_specified(overflow_handling);
Member

overflow_handling seems like an odd name given this is a generic function

return arrow::compute::call("abs", substrait_convert_arguments(call));
};

ArrowToSubstrait arrow_add_to_substrait = [] (const arrow::compute::Expression::Call& call, ExtensionSet* ext_set_) -> Result<substrait::Expression::ScalarFunction> {
Member

There are a lot of places where you have ext_set_ and it should probably be ext_set. For the sake of brevity I'm not going to mark them all.

Comment on lines 668 to 670
substrait::Expression::ScalarFunction substrait_call;
ARROW_ASSIGN_OR_RAISE(auto function_reference, ext_set_->EncodeFunction("extract"));
substrait_call.set_function_reference(function_reference);
Member

All of these calls to EncodeFunction seem pretty repetitive. Is there any way we can move this into the part that calls GetArrowToSubstrait? Also, I don't see anything today that calls GetArrowToSubstrait
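One way to read this suggestion is sketched below, under stated assumptions. None of these types are Arrow's: `ExtensionSetSketch`, `ScalarFunctionSketch`, `MappingEntry`, and `ConvertCall` are hypothetical stand-ins. Each mapping declares only its target Substrait name and how to fill in arguments, while the single caller encodes the function reference once for every mapping.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the extension set: hands out one anchor per name.
struct ExtensionSetSketch {
  std::vector<std::string> names;

  // Returns the existing anchor for `name`, or adds the name on first use.
  uint32_t EncodeFunction(const std::string& name) {
    for (uint32_t i = 0; i < names.size(); ++i) {
      if (names[i] == name) return i;
    }
    names.push_back(name);
    return static_cast<uint32_t>(names.size() - 1);
  }
};

// Hypothetical stand-in for substrait::Expression::ScalarFunction.
struct ScalarFunctionSketch {
  uint32_t function_reference = 0;
  std::vector<std::string> args;
};

// A mapping entry no longer touches the extension set: it names the target
// function and supplies a callback that only fills in arguments.
struct MappingEntry {
  std::string substrait_name;
  std::function<void(const std::vector<std::string>&, ScalarFunctionSketch*)>
      fill_args;
};

// The one place that encodes the function reference, shared by all mappings.
// Returns false when no mapping is registered for `arrow_name`.
bool ConvertCall(const std::unordered_map<std::string, MappingEntry>& mappings,
                 const std::string& arrow_name,
                 const std::vector<std::string>& arrow_args,
                 ExtensionSetSketch* ext_set, ScalarFunctionSketch* out) {
  auto it = mappings.find(arrow_name);
  if (it == mappings.end()) return false;
  out->function_reference = ext_set->EncodeFunction(it->second.substrait_name);
  it->second.fill_args(arrow_args, out);
  return true;
}
```

With this shape, the repeated `EncodeFunction` and `set_function_reference` lines would disappear from the individual lambdas, which would then differ only in how they translate options and arguments.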

}
};

ArrowToSubstrait arrow_year_to_arrow = [] (const arrow::compute::Expression::Call& call, ExtensionSet* ext_set_) -> Result<substrait::Expression::ScalarFunction> {
Member

arrow_...to_arrow?

Comment on lines +812 to +813
DCHECK_OK(functions_map.AddSubstraitToArrow(id.name.to_string(), conversion_func));
return RegisterFunction(id, id.name.to_string());
Member

It seems a little odd that we need two maps. What happens if two functions exist with the same name but different URIs? Thinking on this longer, maybe substrait_to_arrow should replace the map in the extension id registry (that gets updated by the call to RegisterFunction?)
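The name-collision concern can be made concrete with a small sketch. It assumes that a function is identified by its extension URI together with its name, and it keys a single map on that pair. `FunctionId` and `UriAwareRegistrySketch` are hypothetical names, not part of Arrow, and the URIs in any usage are placeholders.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical identifier: a function name qualified by its extension URI.
using FunctionId = std::pair<std::string, std::string>;  // (uri, name)

class UriAwareRegistrySketch {
 public:
  // Registers a mapping for (uri, name). Returns false if that exact pair is
  // already registered; the same name under a different URI is accepted.
  bool Register(const std::string& uri, const std::string& name,
                const std::string& arrow_function_name) {
    return map_.emplace(FunctionId{uri, name}, arrow_function_name).second;
  }

  // Returns the mapped Arrow function name, or nullptr for an unknown pair.
  const std::string* Find(const std::string& uri,
                          const std::string& name) const {
    auto it = map_.find(FunctionId{uri, name});
    return it == map_.end() ? nullptr : &it->second;
  }

 private:
  std::map<FunctionId, std::string> map_;
};
```

A map keyed on the bare name would have to reject or overwrite the second of two same-named functions, whereas the pair key keeps both and leaves a single source of truth.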

westonpace added a commit that referenced this pull request Aug 10, 2022
…ctions (#13613)

This picks up where #13285 left off. It mostly focuses on the Substrait->Arrow direction at the moment. In addition, basic support is added for named tables. This makes it possible to create unit tests that read from in-memory tables instead of requiring unit tests to do a scan.

The PR creates some utilities in `test_plan_builder.h` which allow for the construction of simple Substrait plans programmatically.  This is used to create unit tests for the function mapping.

The PR extracts id "ownership" out of the `ExtensionIdRegistry` and into its own `IdStorage` class.

The PR gets rid of `NestedExtensionIdRegistryImpl` and instead makes `ExtensionIdRegistryImpl` nested if `parent_ != nullptr`.
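The idea of a single registry class that behaves as nested when `parent_ != nullptr` might look roughly like the sketch below. `RegistrySketch` is a hypothetical class, not the actual `ExtensionIdRegistryImpl`. The lookup order (local first, then parent) and the rule that a child may not shadow a name visible through its parent are assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

class RegistrySketch {
 public:
  // A null parent makes this a root registry; otherwise it is nested.
  explicit RegistrySketch(const RegistrySketch* parent = nullptr)
      : parent_(parent) {}

  // Registers locally. Assumption: names already visible (locally or through
  // the parent chain) cannot be registered again.
  bool Register(const std::string& name, std::string target) {
    if (Find(name) != nullptr) return false;
    local_.emplace(name, std::move(target));
    return true;
  }

  // Checks the local entries first, then falls back to the parent chain.
  const std::string* Find(const std::string& name) const {
    auto it = local_.find(name);
    if (it != local_.end()) return &it->second;
    return parent_ != nullptr ? parent_->Find(name) : nullptr;
  }

 private:
  const RegistrySketch* parent_;  // not owned; must outlive this registry
  std::unordered_map<std::string, std::string> local_;
};
```

Folding the nested behaviour into one class means the root and nested cases share the same lookup code, with the parent check as the only branch.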



Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions bot added the Status: stale-warning label Nov 18, 2025
@github-actions bot closed this Dec 5, 2025