Skip to content

refactor: Replace boost regex with re2#15134

Closed
MBkkt wants to merge 8 commits intofacebookincubator:mainfrom
serenedb:mbkkt/remove-boost-regex
Closed

refactor: Replace boost regex with re2#15134
MBkkt wants to merge 8 commits intofacebookincubator:mainfrom
serenedb:mbkkt/remove-boost-regex

Conversation

@MBkkt
Copy link
Copy Markdown
Collaborator

@MBkkt MBkkt commented Oct 11, 2025

re2 works better, we will also decrease binary size and count of dependencies

First commit inspired by this PR #15124

@netlify
Copy link
Copy Markdown

netlify bot commented Oct 11, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 736cf47
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/690141de1904720008fdb6da

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 11, 2025
@MBkkt MBkkt force-pushed the mbkkt/remove-boost-regex branch 4 times, most recently from 3279d93 to 3d1cb00 Compare October 12, 2025 14:46
}
// For nested case, e.g., for "{"my": {"hello": 10}}",
// "$.my" will return an object type.
case simdjson::ondemand::json_type::object:
Copy link
Copy Markdown
Collaborator

@rui-mo rui-mo Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @philo-he Would you like to take a look at the change to get_json_object? Thanks.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, change here, it was same code in two cases

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense. Thanks.

@MBkkt MBkkt force-pushed the mbkkt/remove-boost-regex branch from 3d1cb00 to 1d6f8e8 Compare October 22, 2025 11:22
@MBkkt
Copy link
Copy Markdown
Collaborator Author

MBkkt commented Oct 22, 2025

@kevinwilfong Can you review this changes, please?

@MBkkt MBkkt force-pushed the mbkkt/remove-boost-regex branch from 1d6f8e8 to a365fdc Compare October 24, 2025 07:37
@majetideepak majetideepak requested review from czentgr and kgpai October 24, 2025 13:52
@majetideepak
Copy link
Copy Markdown
Collaborator

I will let Meta folks chime in as well.
CC: @kgpai

Copy link
Copy Markdown
Contributor

@kevinwilfong kevinwilfong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks looks good to me

static const LazyRE2 kDurationRegex{
R"(^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$)"};
// TODO: Remove re2::StringPiece != std::string_view hacks.
// It's needed because system is old, update please.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be more specific either here in the the PR comments about what "the system" is, are you referring to CI?

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated comment.

Yes, in some system in CI, system abseil was build with absl::string_view != std::string_view and re2 was build with re2::StringPiece != absl::string_view

It's old unsupported version of abseil, because it's true only for C++14, but abseil drops C++14 support already.

@kevinwilfong
Copy link
Copy Markdown
Contributor

@MBkkt it looks like there are conflicts, could you rebase?

@kevinwilfong kevinwilfong self-assigned this Oct 24, 2025
@MBkkt MBkkt force-pushed the mbkkt/remove-boost-regex branch from a365fdc to 48fc86b Compare October 24, 2025 23:29
@MBkkt
Copy link
Copy Markdown
Collaborator Author

MBkkt commented Oct 24, 2025

@kevinwilfong rebased

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Oct 24, 2025

@kevinwilfong has imported this pull request. If you are a Meta employee, you can view this in D85479675.

Copy link
Copy Markdown
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
I don't see boost API being replaced though. So we optimized some re2 usage with fast_float and removed boost::regex from linking which was (apparently) not used anyway.

@MBkkt
Copy link
Copy Markdown
Collaborator Author

MBkkt commented Oct 27, 2025

@czentgr Yes, but It's because this PR exists too long.
Initially it was two places with boost regex, but after multiple updates from main and zero comments to PR...

Comment on lines +1928 to +1929
static const LazyRE2 kDurationRegex{
R"(^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$)"};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MBkkt, it looks like our internal build is a little stricter than what runs here.

-Wmissing-field-initializers is showing that the options_ field isn't being initialized here, combined with -Werror this is causing the build to fail and I'm unable to merge

could you either change this to invoke the constructor rather than the list initializer {} -> () or add an initializer for options_ .options_={}

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevinwilfong

  1. Can you change in future (in a separate PR) to make github CI same as internal in terms of warnings?
  2. I'm not sure it will help, but lets try this: (same as .options_ =, but without explicit field mention) f09e68f
// Silence warnings about missing initializers for members of LazyRE2.
#if defined(__GNUC__)
#pragma GCC diagnostic ignored "-Wmissing-field-initializers"
#endif

// Helper for writing global or static RE2s safely.
// Write
//     static LazyRE2 re = {".*"};
// and then use *re instead of writing
//     static RE2 re(".*");
// The former is more careful about multithreaded
// situations than the latter.
//
// N.B. This class never deletes the RE2 object that
// it constructs: that's a feature, so that it can be used
// for global and function static variables.
class LazyRE2 {
 private:
  struct NoArg {};

 public:
  typedef RE2 element_type;  // support std::pointer_traits

  // Constructor omitted to preserve braced initialization in C++98.

  // Pretend to be a pointer to Type (never NULL due to on-demand creation):
  RE2& operator*() const { return *get(); }
  RE2* operator->() const { return get(); }

  // Named accessor/initializer:
  RE2* get() const {
    absl::call_once(once_, &LazyRE2::Init, this);
    return ptr_;
  }

  // All data fields must be public to support {"foo"} initialization.
  const char* pattern_;
  RE2::CannedOptions options_;
  NoArg barrier_against_excess_initializers_;

  mutable RE2* ptr_;
  mutable absl::once_flag once_;

 private:
  static void Init(const LazyRE2* lazy_re2) {
    lazy_re2->ptr_ = new RE2(lazy_re2->pattern_, lazy_re2->options_);
  }

  void operator=(const LazyRE2&);  // disallowed
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that doesn't work (it complains about barrier_against_excess_initializers_ in that case)

Explicitly listing the fields does
{.pattern_ = R"(^\s*(\d+(?:.\d+)?)\s*([a-zA-Z]+)\s*$)", .options_ = {}}

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

155b0be something like this maybe can help with internal CI warning?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that does work internally, unfortunately it looks like it breaks the Mac build here in OS

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, 736cf47

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Merging, sorry for the slog

@MBkkt MBkkt force-pushed the mbkkt/remove-boost-regex branch 2 times, most recently from 51d004e to f09e68f Compare October 27, 2025 19:58
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Oct 29, 2025

@kevinwilfong merged this pull request in 909429d.

#pragma once

#define XXH_INLINE_ALL
#include <fast_float/fast_float.h>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to use velox_include_directories to include the fast-float header? If the environment does not have fast-float, the velox build will fail, I'm not sure why Velox does not build it from source. @MBkkt CC @czentgr
https://github.com/IBM/velox/actions/runs/18915929559/job/53999183287

FAILED: velox/buffer/CMakeFiles/velox.dir/__/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp.o 
/usr/bin/ccache /opt/rh/devtoolset-11/root/usr/bin/c++ -DAWS_MQTT_WITH_WEBSOCKETS -DAWS_SDK_VERSION_MAJOR=1 -DAWS_SDK_VERSION_MINOR=11 -DAWS_SDK_VERSION_PATCH=285 -DAWS_USE_EPOLL -DAZ_RTTI -DCURL_STATICLIB -DGEOS_INLINE -DGFLAGS_IS_A_DLL=0 -DNDEBUG -DSIMDJSON_EXPERIMENTAL_ALLOW_INCOMPLETE_JSON -DSIMDJSON_THREADS_ENABLED=1 -DUSE_UNSTABLE_GEOS_CPP_API -DVELOX_DISABLE_GOOGLETEST -DVELOX_ENABLE_ABFS -DVELOX_ENABLE_COMPRESSION_LZ4 -DVELOX_ENABLE_GCS -DVELOX_ENABLE_GEO -DVELOX_ENABLE_HDFS -DVELOX_ENABLE_PARQUET -DVELOX_ENABLE_S3 -I/work/. -I/work/velox/external/xxhash -I/work/_build/release -I/work/velox/tpch/gen/dbgen/include -I/work/velox/tpcds/gen/dsdgen/include -I/work/_build/release/_deps/simdjson-src/include -isystem /work/gluten/dev/vcpkg/vcpkg_installed/x64-linux-avx/include -isystem /work/velox -isystem /work/velox/external -isystem /work/gluten/dev/vcpkg/vcpkg_installed/x64-linux-avx/lib/pkgconfig/../../include -isystem /work/gluten/dev/vcpkg/vcpkg_installed/x64-linux-avx/include/geos -Wno-error=stringop-overflow -Wno-error=cpp -Wno-missing-field-initializers -Wno-unknown-warning-option -mavx2 -mfma -mavx -mf16c -mlzcnt -mbmi2 -D USE_VELOX_COMMON_BASE -D HAS_UNCAUGHT_EXCEPTIONS -DFOLLY_CFG_NO_COROUTINES -Wall -Wextra -Wno-unused        -Wno-unused-parameter        -Wno-sign-compare        -Wno-ignored-qualifiers        -Wno-implicit-fallthrough          -Wno-class-memaccess          -Wno-comment          -Wno-int-in-bool-context          -Wno-redundant-move          -Wno-array-bounds          -Wno-maybe-uninitialized          -Wno-unused-result          -Wno-format-overflow          -Wno-strict-aliasing -Werror -O3 -DNDEBUG -std=gnu++20 -fPIC -fdiagnostics-color=always -ffp-contract=off -DS2N_KYBER512R3_AVX2_BMI2 -DS2N_STACKTRACE -DS2N_CPUID_AVAILABLE -DS2N_FEATURES_AVAILABLE -fPIC -DS2N_FALL_THROUGH_SUPPORTED -DS2N___RESTRICT__SUPPORTED -DS2N_MADVISE_SUPPORTED -DS2N_CLONE_SUPPORTED -MD -MT velox/buffer/CMakeFiles/velox.dir/__/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp.o -MF velox/buffer/CMakeFiles/velox.dir/__/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp.o.d -o velox/buffer/CMakeFiles/velox.dir/__/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp.o -c /work/velox/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp
In file included from /work/velox/functions/prestosql/registration/DateTimeFunctionsRegistration.cpp:18:
/work/./velox/functions/prestosql/DateTimeFunctions.h:19:10: fatal error: fast_float/fast_float.h: No such file or directory
   19 | #include <fast_float/fast_float.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I shouldn't, because library target already linked in cmake

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you build folly in some specific way, because fast_float link provided from Folly::folly now I think.
Lets do this explicit

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh #15322 does it help?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IBM installs the folly by vcpkg, so the environment already has folly, but vcpkg does not support install fast-float version 8.0.2 apache/gluten#10977, so fast-float does not exist in the environment. It does not call setup-script.sh

Copy link
Copy Markdown
Collaborator Author

@MBkkt MBkkt Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so fast-float does not exist in the environment. It does not call setup-script.sh

So you don't have fast_float in vcpkg folly or how it works?

but vcpkg does not support install fast-float version 8.0.2

Hmm, strange, on their site they're claim support 8.0.2
image

Maybe I'm missing something because I'm not working with C++ package managers and system libraries.

As an alternative solution we probably can update fast_float version? I don't think it's bad, but it's also affects folly

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe because Gluten uses an old vcpkg version? I'm not sure for that.

But I think current #15322 can help, so we don't need to update fast_float version and care about Gluten vcpkg fast-float version.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets try current approach, if it doesn't help, please ping me. And we will try something else

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your in time fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants