fix: Replace boost::regex with RE2 in parse_duration function#15124

Closed
abhinavmuk04 wants to merge 1 commit into main from export-D84391289

Conversation

@abhinavmuk04
Contributor

Summary:
Some function implementations in Velox rely on boost::regex as their regular expression engine. That is not recommended; RE2 is the preferred engine.

boost::regex relies on backtracking, which can make it significantly slower than RE2 on some inputs. For a fixed pattern, RE2 matches a string of length n in O(n) time because it uses finite-state machines.

This change replaces boost::regex with RE2 in the ParseDurationFunction in DateTimeFunctions.h. The implementation now uses RE2::FullMatch instead of boost::regex_search, and directly captures the value and unit strings using RE2's API, simplifying the code by eliminating the need for boost::smatch.

Differential Revision: D84391289
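
As an illustration of the linear-time claim in the summary above, here is a minimal sketch of a single-pass scanner. It assumes a duration pattern of the form `^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$`. This is not how RE2 is implemented and not code from this PR; `scanDuration` is a hypothetical name. It only shows that such a pattern can be matched, with both captures extracted, while visiting each character a bounded number of times and never backtracking.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical helper (not part of Velox or RE2): matches
//   ^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$
// in one left-to-right pass. Returns the (value, unit) captures as views
// into `input`, or std::nullopt when the input does not match.
inline std::optional<std::pair<std::string_view, std::string_view>>
scanDuration(std::string_view input) {
  // Same character set as RE2's \s: tab, newline, form feed, CR, space.
  auto isSpace = [](char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
  };
  auto isDigit = [](char c) { return c >= '0' && c <= '9'; };
  auto isAlpha = [](char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  };

  const std::size_t n = input.size();
  std::size_t i = 0;

  while (i < n && isSpace(input[i])) ++i;  // leading \s*

  const std::size_t valueBegin = i;
  while (i < n && isDigit(input[i])) ++i;  // \d+
  if (i == valueBegin) return std::nullopt;

  if (i < n && input[i] == '.') {  // (?:\.\d+)?
    std::size_t j = i + 1;
    while (j < n && isDigit(input[j])) ++j;
    // A '.' without digits cannot be matched by any later part of the
    // pattern, so the whole match fails.
    if (j == i + 1) return std::nullopt;
    i = j;
  }
  const std::string_view value = input.substr(valueBegin, i - valueBegin);

  while (i < n && isSpace(input[i])) ++i;  // \s*

  const std::size_t unitBegin = i;
  while (i < n && isAlpha(input[i])) ++i;  // [a-zA-Z]+
  if (i == unitBegin) return std::nullopt;
  const std::string_view unit = input.substr(unitBegin, i - unitBegin);

  while (i < n && isSpace(input[i])) ++i;  // trailing \s*
  if (i != n) return std::nullopt;         // $

  return std::make_pair(value, unit);
}
```

For example, `scanDuration(" 3.5 h ")` yields the captures `3.5` and `h`, while `scanDuration("1.h")` yields no match. The returned views point into the caller's buffer, so no strings are copied.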

@meta-codesync

meta-codesync bot commented Oct 10, 2025

@abhinavmuk04 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84391289.

@netlify

netlify bot commented Oct 10, 2025

Deploy Preview for meta-velox canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 99cdebd |
| 🔍 Latest deploy log | https://app.netlify.com/projects/meta-velox/deploys/68f7f22063388c0008ea16b7 |

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 10, 2025
@abhinavmuk04 abhinavmuk04 changed the title Replace boost::regex with RE2 in parse_duration function fix: Replace boost::regex with RE2 in parse_duration function Oct 10, 2025
@MBkkt
Collaborator

MBkkt commented Oct 11, 2025

@abhinavmuk04 Hi, nice changes!

I have some suggestions:

  1. We can use a raw string literal to keep the regex pattern free of escaped backslashes
  2. We can avoid the std::string copies for the matched strings (valueStr and unit)
  3. Better to use std::from_chars, because it is faster and reports failure through its return value instead of throwing an exception

```cpp
template <typename T>
struct ParseDurationFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  std::unique_ptr<RE2> durationRegex_;

  FOLLY_ALWAYS_INLINE void initialize(
      const std::vector<TypePtr>& /*inputTypes*/,
      const core::QueryConfig& /*config*/,
      const arg_type<Varchar>* /*amountUnit*/) {
    durationRegex_ =
        std::make_unique<RE2>(R"(^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$)");
  }

  FOLLY_ALWAYS_INLINE void call(
      out_type<IntervalDayTime>& result,
      const arg_type<Varchar>& amountUnit) {
    std::string_view valueStr;
    std::string_view unit;
    if (!RE2::FullMatch(
            std::string_view{amountUnit}, *durationRegex_, &valueStr, &unit)) {
      VELOX_USER_FAIL(
          "Input duration is not a valid data duration string: {}", amountUnit);
    }

    double value{};
    auto [_, error] = std::from_chars(
        valueStr.data(), valueStr.data() + valueStr.size(), value);
    if (error != std::errc{}) {
      VELOX_USER_FAIL(
          "Input duration value is not a valid number: {}", valueStr);
    }

    result = valueOfTimeUnitToMillis(value, unit);
  }
};
```
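
To make the third suggestion concrete, here is a minimal standalone sketch of non-throwing number parsing with `std::from_chars`. The helper name `parseDouble` is hypothetical and not part of Velox. The sketch assumes a standard library that implements the floating-point overloads of `std::from_chars` (for example a recent libstdc++ or MSVC).

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses the whole view as a double without throwing
// and without copying the characters into a std::string.
inline std::optional<double> parseDouble(std::string_view text) {
  double value{};
  const char* const end = text.data() + text.size();
  const auto [ptr, ec] = std::from_chars(text.data(), end, value);
  // ec reports invalid input or an out-of-range value; ptr != end means
  // trailing characters were left unparsed.
  if (ec != std::errc{} || ptr != end) {
    return std::nullopt;
  }
  return value;
}
```

Unlike the suggestion above, this sketch also rejects trailing characters by checking the returned pointer. In the suggested function that check is arguably redundant, because the regex has already constrained the captured text to digits with an optional fractional part.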

@MBkkt
Collaborator

MBkkt commented Oct 11, 2025

@abhinavmuk04 will you check #15134 ?

@abhinavmuk04
Contributor Author

@abhinavmuk04 will you check #15134 ?

@MBkkt I see in this PR you seem to be handling the parse_duration function. Is this something you are working on? If so, I can let go of this PR

@abhinavmuk04
Contributor Author

@abhinavmuk04 will you check #15134 ?

@MBkkt I see in this PR you seem to be handling the parse_duration function. Is this something you are working on? If so, I can let go of this PR

@MBkkt wanted to follow up on this

@MBkkt
Collaborator

MBkkt commented Oct 15, 2025

@abhinavmuk04 Not really. If you merge your PR, I will just rebase. If you don't merge this PR, I will wait for a review of mine, merge it, and get rid of boost::regex that way.

Some explanation:

I noticed your PR and thought it would be nice to do 3 things:

  1. Give some review suggestions on your PR (avoid string copies, use a faster string-to-double conversion, construct the automaton only once, etc.)
  2. Remove the second usage of boost::regex
  3. Remove boost::regex from CMake

Contributor

@kevinwilfong kevinwilfong left a comment

LGTM Thanks @MBkkt for all the suggestions!

Contributor

@kevinwilfong kevinwilfong left a comment

Sorry, I noticed you didn't apply the changes from @MBkkt

@kevinwilfong
Contributor

It looks like you had them in a previous commit (the one I was initially looking at) and accidentally reverted them.

Summary:

Some implementations of functions in Velox rely on boost::regex as regular expression engine. That is not recommended. RE2 is the preferred engine.

boost relies on backtracking and it can lead to significantly slower performance compared to RE2. RE2 has O(n) complexity for a given string of length n, as it uses finite-state machines.

This change replaces boost::regex with RE2 in the ParseDurationFunction in DateTimeFunctions.h. The implementation now uses RE2::FullMatch instead of boost::regex_search, and directly captures the value and unit strings using RE2's API, simplifying the code by eliminating the need for boost::smatch.

Reviewed By: kevinwilfong

Differential Revision: D84391289
@meta-codesync

meta-codesync bot commented Oct 21, 2025

This pull request has been merged in ec15078.

mbasmanova pushed a commit to kou/velox that referenced this pull request Oct 27, 2025
`RE2::FullMatch()` uses `absl::string_view` not `std::string_view`:

https://github.com/google/re2/blob/61c4644171ee6b480540bf9e569cba06d9090b4b/re2/re2.h#L411

`absl::string_view` may not be an alias of `std::string_view`. In
that case, the following error is reported:

```text
In file included from velox/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp:18:
velox/functions/prestosql/DateTimeFunctions.h:1939:10: error: no matching function for call to 'FullMatch'
 1939 |     if (!RE2::FullMatch(
      |          ^~~~~~~~~~~~~~
/include/re2/re2.h:411:15: note: candidate function template not viable: no known conversion from 'std::string_view' (aka 'basic_string_view<char>') to 'absl::string_view' for 1st argument
  411 |   static bool FullMatch(absl::string_view text, const RE2& re, A&&... a) {
      |               ^         ~~~~~~~~~~~~~~~~~~~~~~
```

Related: facebookincubatorGH-15124
meta-codesync bot pushed a commit that referenced this pull request Oct 27, 2025
Summary:
`RE2::FullMatch()` uses `absl::string_view` not `std::string_view`:

https://github.com/google/re2/blob/61c4644171ee6b480540bf9e569cba06d9090b4b/re2/re2.h#L411

`absl::string_view` may not be an alias of `std::string_view`. In that case, the following error is reported:

```text
In file included from velox/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp:18:
velox/functions/prestosql/DateTimeFunctions.h:1939:10: error: no matching function for call to 'FullMatch'
 1939 |     if (!RE2::FullMatch(
      |          ^~~~~~~~~~~~~~
/include/re2/re2.h:411:15: note: candidate function template not viable: no known conversion from 'std::string_view' (aka 'basic_string_view<char>') to 'absl::string_view' for 1st argument
  411 |   static bool FullMatch(absl::string_view text, const RE2& re, A&&... a) {
      |               ^         ~~~~~~~~~~~~~~~~~~~~~~
```

The old RE2 provided by CentOS Stream 9 doesn't accept `absl::string_view`.

Old RE2 uses `re2::StringPiece` for `RE2::FullMatch()`, and new RE2 provides `re2::StringPiece` as an alias of `absl::string_view`. So we can use `re2::StringPiece` with both old and new RE2.

We could drop support for old RE2 and always use `absl::string_view`, but we use `re2::StringPiece` for now. It seems that RE2 will eventually use `std::string_view` instead of `absl::string_view`. For example, google/re2@2a029e2 is a commit that migrates from `absl::optional` to `std::optional`.

We can revisit this after RE2 migrates to `std::string_view`.

Related: GH-15124

Pull Request resolved: #15259

Reviewed By: kevinwilfong

Differential Revision: D85539525

Pulled By: mbasmanova

fbshipit-source-id: 1dde1c47d7a337d220488aa64b5efa3408876d1e
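
The type mismatch described in the commit message above can be reproduced without RE2. The sketch below uses stand-in types only: `LegacyPiece`, `lengthOf`, and `toPiece` are hypothetical names, not RE2 or Velox APIs, and the actual fix in #15259 may look different. It shows why a call that passes `std::string_view` fails to compile when the library's view type is a distinct class, and how an explicit (pointer, length) construction sidesteps the problem.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// Stand-in for a library-specific view type that is NOT an alias of
// std::string_view and offers no converting constructor from it.
class LegacyPiece {
 public:
  LegacyPiece(const char* data, std::size_t size)
      : data_(data), size_(size) {}
  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const char* data_;
  std::size_t size_;
};

// Stand-in for an API that accepts only the library's own view type.
inline std::size_t lengthOf(LegacyPiece text) {
  return text.size();
}

// No implicit conversion exists, so lengthOf(std::string_view{...}) would
// be rejected with a "no matching function" style error.
static_assert(!std::is_convertible_v<std::string_view, LegacyPiece>);

// Building the library type from (pointer, length) works whether or not
// the library type happens to alias std::string_view.
inline LegacyPiece toPiece(std::string_view sv) {
  return LegacyPiece(sv.data(), sv.size());
}
```

With this adapter, `lengthOf(toPiece(std::string_view{"10 ms"}))` compiles and returns 5, whereas passing the `std::string_view` directly does not compile.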
mhaseeb123 pushed a commit to mhaseeb123/velox that referenced this pull request Oct 27, 2025
mhaseeb123 pushed a commit to mhaseeb123/velox that referenced this pull request Oct 27, 2025
MBkkt added a commit to serenedb/velox that referenced this pull request Oct 28, 2025
meta-codesync bot pushed a commit that referenced this pull request Oct 29, 2025
Summary:
RE2 works better, and this also decreases the binary size and the number of dependencies.

The first commit was inspired by PR #15124.

Pull Request resolved: #15134

Reviewed By: kgpai

Differential Revision: D85479675

Pulled By: kevinwilfong

fbshipit-source-id: e5534aad5c8843ef707ab3b5d55925c3bca6f7cb
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged meta-exported
