[Velox To Substrait] Remove hard-coded function references#2496
[Velox To Substrait] Remove hard-coded function references#2496chaojun-zhang wants to merge 19 commits intofacebookincubator:mainfrom chaojun-zhang:remove_hardcode_funcRef
Conversation
✅ Deploy Preview for meta-velox canceled.
|
mbasmanova
left a comment
There was a problem hiding this comment.
@chaojun-zhang This PR has lots of changes and it difficult to review. Can you break it up into smaller pieces? Also, looks like there are no additional tests. How is the new functionality verified?
velox/substrait/TypeUtils.h
Outdated
| TypePtr toVeloxType(const std::string& typeName); | ||
|
|
||
| /// Return the Substrait type signature according to the type | ||
| std::string substraitSignature(const TypePtr& type); |
There was a problem hiding this comment.
Is this the opposite of the toVeloxType function above? If so, let's rename toSubstraitType.
To make this PR smaller and easier to remove, please, extract this new function and its test into a separate PR.
There was a problem hiding this comment.
No, it only convert to substrait type signature which can't be convert backed to velox type, for more detail information please refer to substrait signature here
| const TypePtr& type, | ||
| const std::string& expectedType) { | ||
| SCOPED_TRACE(type->toString()); | ||
| auto substraitType = substraitSignature(type); |
There was a problem hiding this comment.
Do we have tests for toVeloxType function? If not, let's add these. Perhaps, write a round trip test.
There was a problem hiding this comment.
Already have test for toVeloxType function here
toSubstraitSignature is not an opposite of toVeloxType
|
|
||
| std::shared_ptr<const core::ITypedExpr> | ||
| SubstraitVeloxExprConverter::toVeloxExpr( | ||
| const ::substrait::Expression_IfThen& substraitIfThen, |
There was a problem hiding this comment.
Is this an IF expression or a SWITCH expression?
Can you extract this change into a separate PR and add a test?
| typeSignature.emplace_back(substraitSignature(arg)); | ||
| } | ||
| std::string signature = funcName + ":" + folly::join("_", typeSignature); | ||
| return {"", signature}; |
There was a problem hiding this comment.
Currently we don't introduce any substrait extension YAML files instead we treat all velox function signature as custom substrait function variant, so leave it empty.
Done
This Test already covered this new functionality |
mbasmanova
left a comment
There was a problem hiding this comment.
@chaojun-zhang Thank you for iterating on this PR. Some follow-up comments.
| } | ||
| }; | ||
|
|
||
| /// A ExtensionIdResolver is intend to used to resolve functionId or typeId |
There was a problem hiding this comment.
There are multiple typos and the description could be streamlined.
Given a function name and argument types returns a matching FunctionId.
However, I'm confused about the term "resolver". I don't see the implementation resolving anything. Instead, the implementation simply converts argument types into an underscore-delimited string using base type names and prepend it with function name and a colon.
Maybe we could have a free function toFunctionId(functionName, arguments) instead? Or add a static method to FunctionId struct.
static FunctionId FunctionId::create(functionName, arguments).
| /// A ExtensionIdResolver is intend to used to resolve functionId or typeId | ||
| class ExtensionIdResolver { | ||
| public: | ||
| /// resolve functionId by given velox function name and function |
There was a problem hiding this comment.
If you update the class comment as suggested above, this comment can be removed as it will just repeat the class comment.
|
|
||
| using ExtensionIdResolverPtr = std::shared_ptr<const ExtensionIdResolver>; | ||
|
|
||
| /// Maintains a mapping for function and function reference |
There was a problem hiding this comment.
Comments must end with a period. Please, fix throughout the PR.
Perhaps, clarify this comment a bit. "Assigns unique IDs to function signatures using FunctionId."
| class SubstraitExtensionCollector { | ||
| public: | ||
| SubstraitExtensionCollector(); | ||
| /// get function reference by function name and arguments. |
There was a problem hiding this comment.
Add empty line between methods. Fix throughout.
| /// Maintains a mapping for function and function reference | ||
| class SubstraitExtensionCollector { | ||
| public: | ||
| SubstraitExtensionCollector(); |
There was a problem hiding this comment.
This constructor is not needed. Member variables can be initialized inline.
There was a problem hiding this comment.
I try to remove it, but it report the bi-direction map not initialized properly
| }; | ||
|
|
||
| TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) { | ||
| assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0); |
There was a problem hiding this comment.
This test would be more readable if function name is shortened and assertion is moved to the caller.
ASSERT_EQ(getFunctionReference("plus", "{INTEGER(), INTEGER()}", 0);
|
|
||
| TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) { | ||
| assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0); | ||
| assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0); |
There was a problem hiding this comment.
Instead of repeating the call and verifying the result is the same, consider, modifying the helper function to always repeat the call and assert the same result. This way we'll get more test coverage with fewer lines of test code.
| assertGetFunctionReference("divide", {INTEGER(), INTEGER()}, 1); | ||
| } | ||
|
|
||
| TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlanTest) { |
There was a problem hiding this comment.
naming: same as above; drop Test suffix from the test name.
| TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) { | ||
| assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0); | ||
| assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0); | ||
| assertGetFunctionReference("divide", {INTEGER(), INTEGER()}, 1); |
There was a problem hiding this comment.
Let's add tests for functions taking complex types, for example, cardinality, array_sum, transform_keys. Let's also add tests for aggregate functions: sum, count. Let's make sure to tests both partial and final aggregate function calls.
| } | ||
|
|
||
| TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlanTest) { | ||
| extensionCollector_->getFunctionReference("plus", {INTEGER(), INTEGER()}); |
There was a problem hiding this comment.
Let's add a few more functions. Let's add functions with complex type inputs as well as aggregate functions. Let's also add "duplicate" functions to make sure these are properly de-duplicated.
mbasmanova
left a comment
There was a problem hiding this comment.
@chaojun-zhang Thank you for making further improvements. VeloxSubstraitSignature has a good amount of logic and can be reviewed standalone. Consider extracting it and the corresponding test into a separate PR.
| /// signature. | ||
| static std::string toVeloxSignature( | ||
| const std::string& functionName, | ||
| const std::vector<facebook::velox::TypePtr>& inputs); |
There was a problem hiding this comment.
drop facebook::velox:: prefix
There was a problem hiding this comment.
OK, will have a further PR.
| /// Given a collection of function signature, return the velox function | ||
| /// signatures join by the ',' delimiter. | ||
| static std::string toVeloxSignature( | ||
| const std::vector<const facebook::velox::exec::FunctionSignature*>& |
There was a problem hiding this comment.
drop facebook::velox:: prefix
|
|
||
| using namespace facebook::velox; | ||
| using namespace facebook::velox::substrait; | ||
| #include "velox/functions/prestosql/registration/RegistrationFunctions.h" |
There was a problem hiding this comment.
Put all #includes together before "using namespace" directives.
| Test::SetUp(); | ||
| functions::prestosql::registerAllScalarFunctions(); | ||
| } | ||
| static std::string toSubstraitSignature(const TypePtr& type) { |
| } | ||
| }; | ||
|
|
||
| TEST_F(VeloxSubstraitSignatureTest, toSubstraitSignature_with_type) { |
There was a problem hiding this comment.
naming: toSubstraitSignature_with_type -> toSubstraitSignatureWithType
Please fix throughout.
|
|
||
| namespace facebook::velox::substrait::test { | ||
|
|
||
| class VeloxSubstraitSignatureTest : public ::testing::Test { |
There was a problem hiding this comment.
Can we add tests for resolveFunction API?
|
|
||
| /// Given a velox function name and argument types, return a matching function | ||
| /// signature, throw if no match found. | ||
| static const exec::FunctionSignature& resolveFunction( |
There was a problem hiding this comment.
This function handles both scalar and aggregate functions and includes a hack for aggregate functions that assumes that single argument means aggregate function runs in final aggregation stage. It would be cleaner to handle scalar and aggregate functions separately. When resolving aggregate function, the caller needs to specify the aggregation step in addition to function name and input types.
There was a problem hiding this comment.
Ok, move to further PR.
|
|
||
| std::string VeloxSubstraitSignature::toSubstraitSignature( | ||
| const exec::TypeSignature& typeSignature) { | ||
| if ("T" == typeSignature.baseType() || |
There was a problem hiding this comment.
Would it be possible to look at the type variables specified in the function signature to avoid hard-coding some of these?
I assume this won't work for functions like map_keys which use K and V type variables.
There was a problem hiding this comment.
Yes, will do in further PR.
| if (functionName == "and" || functionName == "or" || functionName == "xor") { | ||
| return functionName + ":bool_bool"; | ||
| } | ||
| if (functionName == "not") { |
There was a problem hiding this comment.
Why do we need to hard-code "not" function. Is should be present in the registry. Is it not?
There was a problem hiding this comment.
'not' function not present in the registry currently, do you think it should be present in registry?
There was a problem hiding this comment.
Yes, I expect it to be registered.
velox/functions/prestosql/registration/ArithmeticFunctionsRegistration.cpp
void registerArithmeticFunctions() {
registerSimpleFunctions();
VELOX_REGISTER_VECTOR_FUNCTION(udf_not, "not");
VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_add, "plus");
VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_sub, "minus");
VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_mul, "multiply");
}
add test case to cover the logical functions, compare functions and aggregate functions. etc.
…cRef # Conflicts: # velox/substrait/TypeUtils.cpp # velox/substrait/VeloxToSubstraitExpr.cpp
Ok, I will raise a new PR for this. |
…cRef # Conflicts: # velox/substrait/VeloxToSubstraitPlan.cpp
mbasmanova
left a comment
There was a problem hiding this comment.
@chaojun-zhang Thank you for iterating on the PR. I still think it would be easier to review if the PR was split into smaller pieces. For example VeloxSubstraitSignature class and its test could be extracted into a separate PR.
I noticed that there are some limitations re: lambda functions (these are not supported) and aggregate functions (the step is not used, but the implications are not clear to me). Would you update PR description to call out these explicitly?
| } else { | ||
| VELOX_NYI("Couldn't find the aggregate function '{}' ", funName); | ||
| } | ||
| auto referenceId = extensionCollector_->getReferenceNumber( |
There was a problem hiding this comment.
nit: referenceId -> referenceNumber
| class VeloxSubstraitSignature { | ||
| public: | ||
| /// Given a velox type kind, return the Substrait type signature, throw if no | ||
| /// match found, Substrait signature used in the function extension |
There was a problem hiding this comment.
Most of this comment seems to belong to the other method. Please, update.
| private: | ||
| /// A bi-direction hash map to keep the relation between reference number and | ||
| /// either function or type signature. | ||
| /// @T ExtensionFunctionId |
| template <class T> | ||
| class BiDirectionHashMap { | ||
| public: | ||
| /// If the forwardMap_ does not contain the key, then the key and value will |
There was a problem hiding this comment.
This behavior seems difficult to work with. For the bi-map, users would expect that the map enforces uniqueness of both keys and values, no?
Consider, what happens when we add the following:
- 1, 10 => {1 -> 10}, {10 ->1}
- 1, 20 => {1 -> 20}, {10 ->1, 20 -> 1}
- 2, 20 => {1 -> 20, 2 -> 20}, {10 -> 1, 20 -> 2}
| /// otherwise the key will be overwritten. | ||
| void put(const int& key, const T& value); | ||
|
|
||
| std::map<int, T>& forwardMap() { |
There was a problem hiding this comment.
These methods should be const and should return const &. Also, consider replacing these with lookups by key and value, i.e.
std::optional<int> findKey(value) const;
std::optional<T> findValue(key) const;
There was a problem hiding this comment.
change to const is ok, but if replace with lookup methods that won't work for addExtensionsToPlan that need to iterate over the forwardMap
There was a problem hiding this comment.
I see. I guess we still can make this method const and have it return const &. No need to provide write access to forwardMap_, right?
| const std::vector<TypePtr>& arguments) { | ||
| const auto& substraitFunctionSignature = | ||
| VeloxSubstraitSignature::toSubstraitSignature(functionName, arguments); | ||
| /// TODO: Currently we treat all velox registry based function signatures as |
There was a problem hiding this comment.
Use // for comments inside functions.
| const std::string& functionName, | ||
| const std::vector<TypePtr>& arguments, | ||
| const core::AggregationNode::Step aggregationStep) { | ||
| // TODO: Ignore aggregationStep for now, will refactor when introduce velox |
There was a problem hiding this comment.
Why is it OK to ignore the step? There is already a registry of aggregate functions. What's missing?
There was a problem hiding this comment.
Because we make an agreement that separate the velox registry to a new PR.
| int getReferenceNumber( | ||
| const std::string& functionName, | ||
| std::vector<TypePtr>&& arguments) { | ||
| int functionReferenceId1 = |
There was a problem hiding this comment.
nit: functionReferenceId1 -> referenceNumber1
| int getReferenceNumber( | ||
| const std::string& functionName, | ||
| std::vector<TypePtr>&& arguments, | ||
| const core::AggregationNode::Step step) { |
| } | ||
|
|
||
| TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlan) { | ||
| // Arrange |
There was a problem hiding this comment.
What does this comment mean? Please, make sure all comments end with a period.
There was a problem hiding this comment.
It's a unit test pattern (Arrange->Act->Assert), have a look at this link
There was a problem hiding this comment.
Thank you for the pointer. I didn't know about this acronym. I do agree though with the author of that blog post that // Arrange and similar comments are redundant. The rest of the code base doesn't use these, hence, let's not add them here.
Test code, in the same way as production code, is supposed to be clean and self-explanatory. Preempting what you are going to do just adds unnecessary clutter. You can still have a clear distinction of the three steps by using spaces and indentation. Which are ultimately the key features the compiler provides you to structure your code.
It does look silly having a comment preceding each paragraph in this text, it tires the reader, as it has to constantly go through the obvious (or skip it). So why would you do that with your code?
On a separate note, this test would be easier to read if assertions were refactored to use a helper function.
auto getFunctionName = [&] (auto id) { return substraitPlan->extensions(0).extension_function().name(); };
ASSERT_EQ(getFunctionName(0), "plus:i32_i32");
ASSERT_EQ(getFunctionName(1), "divide:i32_i32");
...
If the VeloxSubstraitSignature extract to a separate PR then this PR won't be complete functional, we always need the logic toSubstraitSignature , if you think it won't belong to class of 'VeloxSubstraitSignature' , then it must be put into other place which can't be remove. |
My thinking is that this PR will depend on the PR introducing VeloxSubstraitSignature. I.e. we introduce VeloxSubstraitSignature in PR 1, then start using it in PR 2. |
Modify forwardMap_ type to use unordered_map.
| void SubstraitExtensionCollector::BiDirectionHashMap<T>::putIfAbsent( | ||
| const int& key, | ||
| const T& value) { | ||
| if (forwardMap_.find(key) == forwardMap_.end()) { |
There was a problem hiding this comment.
Shouldn't we check both maps first, then insert a (key, value) only if forwardMap_ doesn't have key and reverseMap_ doesn't have value?
|
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
mbasmanova
left a comment
There was a problem hiding this comment.
@chaojun-zhang Thank you for the contribution.
I'll fix putIfAbsent and update PR description before landing.
template <typename T>
bool SubstraitExtensionCollector::BiDirectionHashMap<T>::putIfAbsent(
const int& key,
const T& value) {
if (forwardMap_.find(key) == forwardMap_.end() &&
reverseMap_.find(value) == reverseMap_.end()) {
forwardMap_[key] = value;
reverseMap_[value] = key;
return true;
}
return false;
}
Lambda functions are not supported. Support for aggregated functions is limited.
Replace hard-coded function reference mapping with the one built on the fly.
Assign unique sequential reference numbers to scalar and aggregate functions
used in the plan.