Skip to content

Add TypeSignature Parser using Flex and Bison#7590

Closed
majetideepak wants to merge 4 commits intofacebookincubator:mainfrom
majetideepak:signatureParser
Closed

Add TypeSignature Parser using Flex and Bison#7590
majetideepak wants to merge 4 commits intofacebookincubator:mainfrom
majetideepak:signatureParser

Conversation

@majetideepak
Copy link
Collaborator

No description provided.

@netlify
Copy link

netlify bot commented Nov 15, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 376b1c0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65720329b1810400086be50f

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2023
@majetideepak majetideepak force-pushed the signatureParser branch 2 times, most recently from e632be9 to 5dad298 Compare November 15, 2023 18:06
@majetideepak
Copy link
Collaborator Author

This PR works on MacOS. I will fix the build problem on Ubuntu after #7568 lands.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak Deepak, this parser is specific to Presto, right? It can parse type strings generated by Presto and supports Presto-specific types like TIMESTAMP WITH TIME ZONE and JSON, right? If that's the case, perhaps, it is better placed in Presto-specific directory. Maybe somewhere under functions/prestosql.

CC: @pedroerp

@majetideepak
Copy link
Collaborator Author

majetideepak commented Nov 16, 2023

@mbasmanova This PR is to parse TypeSignatures which is invoked when binding function signatures in Velox. This extends the Presto-specific parser you are referring to which is here #7568

@majetideepak
Copy link
Collaborator Author

Currently, only the flex file is re-used. But we could decouple both and move #7568 under functions/prestosql.

@majetideepak majetideepak force-pushed the signatureParser branch 2 times, most recently from addc80e to 3982990 Compare November 22, 2023 14:17
@majetideepak majetideepak marked this pull request as ready for review November 22, 2023 14:17
@majetideepak majetideepak marked this pull request as draft November 22, 2023 17:12
@majetideepak majetideepak force-pushed the signatureParser branch 2 times, most recently from 0871c5a to cf02cd4 Compare November 23, 2023 01:41
@majetideepak majetideepak marked this pull request as ready for review November 23, 2023 01:43
@majetideepak majetideepak force-pushed the signatureParser branch 2 times, most recently from 2e8eaa1 to 13eeea0 Compare November 24, 2023 09:53

namespace facebook::velox::exec {

void toAppend(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used? should this be in an anonymous namespace?

Copy link
Collaborator Author

@majetideepak majetideepak Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required by folly::join. I moved this as is from FunctionSignature.*. I will try an anonymous namespace.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an anonymous namespace does not work here.

namespace facebook::velox::exec {

TypeSignaturePtr inferTypeWithSpaces(
std::vector<std::string>& words,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

std::vector<std::string>& words,
bool cannotHaveFieldName) {
VELOX_CHECK_GE(words.size(), 2);
std::string fieldName = words[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto&

/// the remaining words are a Velox type.
TypeSignaturePtr inferTypeWithSpaces(
std::vector<std::string>& words,
bool cannotHaveFieldName = true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we flip the flag to canHaveFieldName instead? Makes it a little easier to understand that the double negative.


#include "velox/expression/signature_parser/SignatureParser.h"

facebook::velox::exec::TypeSignaturePtr facebook::velox::exec::parseTypeSignature(const std::string& signatureText)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scanner takes a string_view. Could we also take a string_view here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::istringstream is(signatureText); below does not accept a string_view.


%%

type_spec : type { scanner->setTypeSignature($1); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bummer we need to repeat the type grammar here. Is there no way refactor and include that part?

Copy link
Collaborator Author

@majetideepak majetideepak Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each rule returns a TypeSignature here vs. a TypePtr in the Type Parser.
To unify both, we have to make each rule return strings and finally return a parse tree of strings. Then each parser has to recurse the tree again to convert the string to a TypeSignature or TypePtr.
I am planning to use the TypeParser in Prestissimo to replace Antlr here.
I feel the double-pass approach would be inefficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this further.
flex/bison are not designed to reuse the grammar.
We could use a union/std::variant to encapsulate TypeSignature TypePtr into a single variable and make this work via virtual functions.
I started this and it turned out to be more complex. Some observations below:

  1. The lex file (.ll) cannot be reused since it implements the parseType or parseSignature APIs. This cannot be moved out. This means only the Bison parser file (.yy) can be reused if at all.
  2. There are some differences in the decimal grammar. eg. DECIMAL(a_precision, b_precision) is a valid function signature but not a valid type.
  3. Row fields layout is different for TypeSignature vs RowType. This will introduce more customization using virtual functions.

The parsing grammar is not going to change in the future. So duplication of the Bison file should be okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for looking into this!

ASSERT_EQ(signature.baseName(), "row");
ASSERT_EQ(signature.parameters().size(), 1);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the missing feature from the existing code was to support quoted named row fields. Could you add a few tests to ensure that works now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for this.

@kgpai
Copy link
Contributor

kgpai commented Nov 28, 2023

FYI - The signature check job seems to be failing because 'DECIMAL' changed to 'decimal' and right now that is taken as a different type :) . I will make the signature check job type insensitive.

@majetideepak majetideepak force-pushed the signatureParser branch 2 times, most recently from 46e8f57 to d088ca0 Compare December 4, 2023 04:28
@majetideepak
Copy link
Collaborator Author

@kgpai thank you for looking. I resolved the case insensitive issue in my PR.

Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Just a small last comment but looks like it's almost ready to go :)

const auto& fieldName = words[0];
auto typeName = words[1];
for (int i = 2; i < words.size(); ++i) {
typeName = fmt::format("{} {}", typeName, words[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this will have quadratic time. Could we use folly::join() instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

typeName = fmt::format("{} {}", typeName, words[i]);
}
const auto allWords = fmt::format("{} {}", fieldName, typeName);
if (hasType(allWords) || !canHaveFieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you just concatenate all of them, and use a std::string_view to skip the first token? (assuming hasToken() can take a string_view)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this for both TypeParser and SignatureParser. For TypeParser, we will need a substring if there is a field name.


%%

type_spec : type { scanner->setTypeSignature($1); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for looking into this!

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in 8948b9e.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 8948b9e9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@majetideepak majetideepak deleted the signatureParser branch January 3, 2024 13:56
makagonov pushed a commit to makagonov/velox that referenced this pull request May 27, 2025
Summary: Pull Request resolved: facebookincubator#7590

Reviewed By: kgpai

Differential Revision: D51989804

Pulled By: pedroerp

fbshipit-source-id: 82ac7a26fa8619ba13288c2fc15d70a6781c0911
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants