Skip to content

[Velox To Substrait] Remove hard-coded function references#2496

Closed
chaojun-zhang wants to merge 19 commits intofacebookincubator:mainfrom
chaojun-zhang:remove_hardcode_funcRef
Closed

[Velox To Substrait] Remove hard-coded function references#2496
chaojun-zhang wants to merge 19 commits intofacebookincubator:mainfrom
chaojun-zhang:remove_hardcode_funcRef

Conversation

@chaojun-zhang
Copy link
Contributor

@chaojun-zhang chaojun-zhang commented Sep 9, 2022

Replace hard-coded function reference mapping with the one built on the fly.
Assign unique sequential reference numbers to scalar and aggregate functions
used in the plan.

@netlify
Copy link

netlify bot commented Sep 9, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 137f3cf
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/63291261f3d88700095125a0

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 9, 2022
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaojun-zhang This PR has lots of changes and it difficult to review. Can you break it up into smaller pieces? Also, looks like there are no additional tests. How is the new functionality verified?

TypePtr toVeloxType(const std::string& typeName);

/// Return the Substrait type signature according to the type
std::string substraitSignature(const TypePtr& type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the opposite of the toVeloxType function above? If so, let's rename toSubstraitType.

To make this PR smaller and easier to remove, please, extract this new function and its test into a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it only convert to substrait type signature which can't be convert backed to velox type, for more detail information please refer to substrait signature here

const TypePtr& type,
const std::string& expectedType) {
SCOPED_TRACE(type->toString());
auto substraitType = substraitSignature(type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests for toVeloxType function? If not, let's add these. Perhaps, write a round trip test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already have test for toVeloxType function here

toSubstraitSignature is not an opposite of toVeloxType


std::shared_ptr<const core::ITypedExpr>
SubstraitVeloxExprConverter::toVeloxExpr(
const ::substrait::Expression_IfThen& substraitIfThen,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an IF expression or a SWITCH expression?

Can you extract this change into a separate PR and add a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

typeSignature.emplace_back(substraitSignature(arg));
}
std::string signature = funcName + ":" + folly::join("_", typeSignature);
return {"", signature};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is uri empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don't introduce any substrait extension YAML files instead we treat all velox function signature as custom substrait function variant, so leave it empty.

@chaojun-zhang
Copy link
Contributor Author

chaojun-zhang commented Sep 11, 2022

This PR has lots of changes and it difficult to review. Can you break it up into smaller pieces

Done

Also, looks like there are no additional tests. How is the new functionality verified

This Test already covered this new functionality

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaojun-zhang Thank you for iterating on this PR. Some follow-up comments.

}
};

/// A ExtensionIdResolver is intend to used to resolve functionId or typeId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple typos and the description could be streamlined.

Given a function name and argument types returns a matching FunctionId.

However, I'm confused about the term "resolver". I don't see the implementation resolving anything. Instead, the implementation simply converts argument types into an underscore-delimited string using base type names and prepend it with function name and a colon.

Maybe we could have a free function toFunctionId(functionName, arguments) instead? Or add a static method to FunctionId struct.

static FunctionId FunctionId::create(functionName, arguments).

/// A ExtensionIdResolver is intend to used to resolve functionId or typeId
class ExtensionIdResolver {
public:
/// resolve functionId by given velox function name and function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you update the class comment as suggested above, this comment can be removed as it will just repeat the class comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


using ExtensionIdResolverPtr = std::shared_ptr<const ExtensionIdResolver>;

/// Maintains a mapping for function and function reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments must end with a period. Please, fix throughout the PR.

Perhaps, clarify this comment a bit. "Assigns unique IDs to function signatures using FunctionId."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

class SubstraitExtensionCollector {
public:
SubstraitExtensionCollector();
/// get function reference by function name and arguments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add empty line between methods. Fix throughout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/// Maintains a mapping for function and function reference
class SubstraitExtensionCollector {
public:
SubstraitExtensionCollector();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor is not needed. Member variables can be initialized inline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to remove it, but it report the bi-direction map not initialized properly

};

TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) {
assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would be more readable if function name is shortened and assertion is moved to the caller.

ASSERT_EQ(getFunctionReference("plus", "{INTEGER(), INTEGER()}", 0);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated


TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) {
assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0);
assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating the call and verifying the result is the same, consider, modifying the helper function to always repeat the call and assert the same result. This way we'll get more test coverage with fewer lines of test code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

assertGetFunctionReference("divide", {INTEGER(), INTEGER()}, 1);
}

TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlanTest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: same as above; drop Test suffix from the test name.

TEST_F(SubstraitExtensionCollectorTest, getFunctionReferenceTest) {
assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0);
assertGetFunctionReference("plus", {INTEGER(), INTEGER()}, 0);
assertGetFunctionReference("divide", {INTEGER(), INTEGER()}, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests for functions taking complex types, for example, cardinality, array_sum, transform_keys. Let's also add tests for aggregate functions: sum, count. Let's make sure to tests both partial and final aggregate function calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

}

TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlanTest) {
extensionCollector_->getFunctionReference("plus", {INTEGER(), INTEGER()});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a few more functions. Let's add functions with complex type inputs as well as aggregate functions. Let's also add "duplicate" functions to make sure these are properly de-duplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@mbasmanova mbasmanova changed the title [Velox To Substrait] remove hard code function reference mapping [Velox To Substrait] Remove hard-coded function references Sep 13, 2022
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaojun-zhang Thank you for making further improvements. VeloxSubstraitSignature has a good amount of logic and can be reviewed standalone. Consider extracting it and the corresponding test into a separate PR.

/// signature.
static std::string toVeloxSignature(
const std::string& functionName,
const std::vector<facebook::velox::TypePtr>& inputs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop facebook::velox:: prefix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will have a further PR.

/// Given a collection of function signature, return the velox function
/// signatures join by the ',' delimiter.
static std::string toVeloxSignature(
const std::vector<const facebook::velox::exec::FunctionSignature*>&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop facebook::velox:: prefix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


using namespace facebook::velox;
using namespace facebook::velox::substrait;
#include "velox/functions/prestosql/registration/RegistrationFunctions.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put all #includes together before "using namespace" directives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Test::SetUp();
functions::prestosql::registerAllScalarFunctions();
}
static std::string toSubstraitSignature(const TypePtr& type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
};

TEST_F(VeloxSubstraitSignatureTest, toSubstraitSignature_with_type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: toSubstraitSignature_with_type -> toSubstraitSignatureWithType

Please fix throughout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


namespace facebook::velox::substrait::test {

class VeloxSubstraitSignatureTest : public ::testing::Test {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests for resolveFunction API?


/// Given a velox function name and argument types, return a matching function
/// signature, throw if no match found.
static const exec::FunctionSignature& resolveFunction(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function handles both scalar and aggregate functions and includes a hack for aggregate functions that assumes that single argument means aggregate function runs in final aggregation stage. It would be cleaner to handle scalar and aggregate functions separately. When resolving aggregate function, the caller needs to specify the aggregation step in addition to function name and input types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, move to further PR.


std::string VeloxSubstraitSignature::toSubstraitSignature(
const exec::TypeSignature& typeSignature) {
if ("T" == typeSignature.baseType() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to look at the type variables specified in the function signature to avoid hard-coding some of these?

I assume this won't work for functions like map_keys which use K and V type variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do in further PR.

if (functionName == "and" || functionName == "or" || functionName == "xor") {
return functionName + ":bool_bool";
}
if (functionName == "not") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to hard-code "not" function. Is should be present in the registry. Is it not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'not' function not present in the registry currently, do you think it should be present in registry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I expect it to be registered.

velox/functions/prestosql/registration/ArithmeticFunctionsRegistration.cpp

void registerArithmeticFunctions() {
  registerSimpleFunctions();
  VELOX_REGISTER_VECTOR_FUNCTION(udf_not, "not");
  VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_add, "plus");
  VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_sub, "minus");
  VELOX_REGISTER_VECTOR_FUNCTION(udf_decimal_mul, "multiply");
}

@chaojun-zhang
Copy link
Contributor Author

@chaojun-zhang Thank you for making further improvements. VeloxSubstraitSignature has a good amount of logic and can be reviewed standalone. Consider extracting it and the corresponding test into a separate PR.

Ok, I will raise a new PR for this.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaojun-zhang Thank you for iterating on the PR. I still think it would be easier to review if the PR was split into smaller pieces. For example VeloxSubstraitSignature class and its test could be extracted into a separate PR.

I noticed that there are some limitations re: lambda functions (these are not supported) and aggregate functions (the step is not used, but the implications are not clear to me). Would you update PR description to call out these explicitly?

} else {
VELOX_NYI("Couldn't find the aggregate function '{}' ", funName);
}
auto referenceId = extensionCollector_->getReferenceNumber(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: referenceId -> referenceNumber

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

class VeloxSubstraitSignature {
public:
/// Given a velox type kind, return the Substrait type signature, throw if no
/// match found, Substrait signature used in the function extension
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this comment seems to belong to the other method. Please, update.

private:
/// A bi-direction hash map to keep the relation between reference number and
/// either function or type signature.
/// @T ExtensionFunctionId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use @tparam

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

template <class T>
class BiDirectionHashMap {
public:
/// If the forwardMap_ does not contain the key, then the key and value will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior seems difficult to work with. For the bi-map, users would expect that the map enforces uniqueness of both keys and values, no?

Consider, what happens when we add the following:

  1. 1, 10 => {1 -> 10}, {10 ->1}
  2. 1, 20 => {1 -> 20}, {10 ->1, 20 -> 1}
  3. 2, 20 => {1 -> 20, 2 -> 20}, {10 -> 1, 20 -> 2}

/// otherwise the key will be overwritten.
void put(const int& key, const T& value);

std::map<int, T>& forwardMap() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods should be const and should return const &. Also, consider replacing these with lookups by key and value, i.e.

std::optional<int> findKey(value) const;
std::optional<T> findValue(key) const;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to const is ok, but if replace with lookup methods that won't work for addExtensionsToPlan that need to iterate over the forwardMap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess we still can make this method const and have it return const &. No need to provide write access to forwardMap_, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated

const std::vector<TypePtr>& arguments) {
const auto& substraitFunctionSignature =
VeloxSubstraitSignature::toSubstraitSignature(functionName, arguments);
/// TODO: Currently we treat all velox registry based function signatures as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use // for comments inside functions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const std::string& functionName,
const std::vector<TypePtr>& arguments,
const core::AggregationNode::Step aggregationStep) {
// TODO: Ignore aggregationStep for now, will refactor when introduce velox
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it OK to ignore the step? There is already a registry of aggregate functions. What's missing?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we make an agreement that separate the velox registry to a new PR.

int getReferenceNumber(
const std::string& functionName,
std::vector<TypePtr>&& arguments) {
int functionReferenceId1 =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: functionReferenceId1 -> referenceNumber1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

int getReferenceNumber(
const std::string& functionName,
std::vector<TypePtr>&& arguments,
const core::AggregationNode::Step step) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop const

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

TEST_F(SubstraitExtensionCollectorTest, addExtensionsToPlan) {
// Arrange
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean? Please, make sure all comments end with a period.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a unit test pattern (Arrange->Act->Assert), have a look at this link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pointer. I didn't know about this acronym. I do agree though with the author of that blog post that // Arrange and similar comments are redundant. The rest of the code base doesn't use these, hence, let's not add them here.

Test code, in the same way as production code, is supposed to be clean and self-explanatory. Preempting what you are going to do just adds unnecessary clutter. You can still have a clear distinction of the three steps by using spaces and indentation. Which are ultimately the key features the compiler provides you to structure your code.

It does look silly having a comment preceding each paragraph in this text, it tires the reader, as it has to constantly go through the obvious (or skip it). So why would you do that with your code?

On a separate note, this test would be easier to read if assertions were refactored to use a helper function.

auto getFunctionName = [&] (auto id) { return substraitPlan->extensions(0).extension_function().name(); };

ASSERT_EQ(getFunctionName(0), "plus:i32_i32");
ASSERT_EQ(getFunctionName(1), "divide:i32_i32");
...

@chaojunzhang
Copy link

@chaojun-zhang Thank you for iterating on the PR. I still think it would be easier to review if the PR was split into smaller pieces. For example VeloxSubstraitSignature class and its test could be extracted into a separate PR.

I noticed that there are some limitations re: lambda functions (these are not supported) and aggregate functions (the step is not used, but the implications are not clear to me). Would you update PR description to call out these explicitly?

If the VeloxSubstraitSignature extract to a separate PR then this PR won't be complete functional, we always need the logic toSubstraitSignature , if you think it won't belong to class of 'VeloxSubstraitSignature' , then it must be put into other place which can't be remove.

@mbasmanova
Copy link
Contributor

If the VeloxSubstraitSignature extract to a separate PR then this PR won't be complete functional, we always need the logic toSubstraitSignature

My thinking is that this PR will depend on the PR introducing VeloxSubstraitSignature. I.e. we introduce VeloxSubstraitSignature in PR 1, then start using it in PR 2.

 Modify forwardMap_ type to use unordered_map.
void SubstraitExtensionCollector::BiDirectionHashMap<T>::putIfAbsent(
const int& key,
const T& value) {
if (forwardMap_.find(key) == forwardMap_.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check both maps first, then insert a (key, value) only if forwardMap_ doesn't have key and reverseMap_ doesn't have value?

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaojun-zhang Thank you for the contribution.

I'll fix putIfAbsent and update PR description before landing.

template <typename T>
bool SubstraitExtensionCollector::BiDirectionHashMap<T>::putIfAbsent(
    const int& key,
    const T& value) {
  if (forwardMap_.find(key) == forwardMap_.end() &&
      reverseMap_.find(value) == reverseMap_.end()) {
    forwardMap_[key] = value;
    reverseMap_[value] = key;
    return true;
  }
  return false;
}

Lambda functions are not supported. Support for aggregated functions is limited.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants