Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pluggable regex engine support #837

Merged
merged 19 commits into from
Jan 22, 2025
Merged

add pluggable regex engine support #837

merged 19 commits into from
Jan 22, 2025

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Jan 10, 2025

This is a proposal. Not meant to be merged as it is.

The goal is to have an abstract class under ada::url_pattern_regex::provider, where any application can implement their own class that inherits from this provider class. Later, we will use this abstract method class methods inside the URLPattern to enable Node.js like applications to not depend on std::regex.

By default, we will create a std_regex_provider class that inherits from provider to test our changes locally.

Any feedback is welcome.

cc @lemire @jasnell @mikea

@lemire
Copy link
Member

lemire commented Jan 10, 2025

Sounds like a good idea!

@jasnell
Copy link
Contributor

jasnell commented Jan 10, 2025

Definite +1 on the idea. The only key requirement is that in practical use, it will be necessary for implementations using V8 to acquire the lock on the isolate if it doesn't have it already. Just worth keeping in mind when considering the overhead here.

I would make sure that the API allows passing in a void* argument as "user data" that is passed back to the callback when it is invoked.

@lemire lemire marked this pull request as draft January 14, 2025 02:36
@anonrig anonrig force-pushed the make-regexp-pluggable branch from 0f1c571 to 74f6cc3 Compare January 20, 2025 00:20
@lemire
Copy link
Member

lemire commented Jan 20, 2025

@anonrig I did a pass over it... simplifying the design somewhat. Nothing I did should be controversial. Note that our tests pass right now. (Although the work is not completed.)

Observe how I did away with the class polymorphism and instead I use C++ concepts to define providers:

template <typename T>
concept regex_concept = requires(T t, std::string_view pattern,
                                 bool ignore_case, std::string_view input) {
  // Ensure the class has a type alias 'regex_type'
  typename T::regex_type;

  // Function to create a regex instance
  {
    T::create_instance(pattern, ignore_case)
  } -> std::same_as<std::optional<typename T::regex_type>>;

  // Function to perform regex search
  {
    t.regex_search(input, std::declval<typename T::regex_type&>())
  } -> std::same_as<std::vector<std::string>>;

  // Function to match regex pattern
  {
    t.regex_match(input, std::declval<typename T::regex_type&>())
  } -> std::same_as<bool>;

  // Copy constructor
  { T(std::declval<const T&>()) } -> std::same_as<T>;

  // Move constructor
  { T(std::declval<T&&>()) } -> std::same_as<T>;
};

There is no need for inheritance and virtual functions.

Note that this might not be quite the right concept, see below.

There are two design issues. I could have fixed those with a few minutes of coding, but I think it is important to discuss them as they have consequences, and I could be wrong.

  1. Templates in C++ are powerful and they do allow users to customize the library (e.g., by providing your own definitions and so forth), but only if the customization code and the template definitions are accessible at compile time: e.g., template definitions are typically part of the header files. Currently, the template definitions are stored in .cpp files which would make it awkward to provide extensions. When the cpp file is compiled, it only has access to ada code... and can't compile the custom code provided by the user of the library. This goes away if the definition is included in header files. So I recommend moving everything having to do with url pattern to header files.
  2. I think that the std_regex_provider type and custom versions are stateless. That is, all we need are static functions. So there is no sense in passing around provider instances. I have not changed that as there may be reasons to have state. Related to this issue is the fact that the match template currently still calls std::regex_search when it could call regex_provider:: regex_search instead... It is almost trivial... but we need to be clear. The same is true with regex_match`.

To elaborate on point 2, consider this:

template <url_pattern_regex::regex_concept regex_provider>
bool protocol_component_matches_special_scheme(
    url_pattern_component<regex_provider>& component) {
  auto regex = component.regexp;
  // TODO: Use provider.regex_match
  return std::regex_match("http", regex) || std::regex_match("https", regex) ||
         std::regex_match("ws", regex) || std::regex_match("wss", regex) ||
         std::regex_match("ftp", regex);
}

If I turn regex_match into a static function, I can just do...

template <url_pattern_regex::regex_concept regex_provider>
bool protocol_component_matches_special_scheme(
    url_pattern_component<regex_provider>& component) {
  auto regex = component.regexp;
  return regex_provider::regex_match("http", regex) || regex_provider::regex_match("https", regex) ||
         regex_provider::regex_match("ws", regex) || regex_provider::regex_match("wss", regex) ||
         regex_provider::regex_match("ftp", regex);
}

and so, maybe the right concept is...

template <typename T>
concept regex_concept = requires(T t, std::string_view pattern,
                                 bool ignore_case, std::string_view input) {
  // Ensure the class has a type alias 'regex_type'
  typename T::regex_type;

  // Function to create a regex instance
  {
    T::create_instance(pattern, ignore_case)
  } -> std::same_as<std::optional<typename T::regex_type>>;

  // Function to perform regex search
  {
    T::regex_search(input, std::declval<typename T::regex_type&>())
  } -> std::same_as<std::vector<std::string>>;

  // Function to match regex pattern
  {
    T::regex_match(input, std::declval<typename T::regex_type&>())
  } -> std::same_as<bool>;

  // Disallow constructors
 requires (!std::is_constructible_v<T>); 
};

@anonrig
Copy link
Member Author

anonrig commented Jan 20, 2025

I think that the std_regex_provider type and custom versions are stateless. That is, all we need are static functions. So there is no sense in passing around provider instances. I have not changed that as there may be reasons to have state. Related to this issue is the fact that the match template currently still calls std::regex_search when it could call regex_provider:: regex_search instead... It is almost trivial... but we need to be clear. The same is true with regex_match`.

@jasnell Can we get away with using static functions for both Node.js and Cloudflare Workers while implementing this?

@jasnell
Copy link
Contributor

jasnell commented Jan 21, 2025

Maybe? Not 100% sure as I'm not familiar with std_regex_provider

@lemire
Copy link
Member

lemire commented Jan 21, 2025

@anonrig

The v8 regex class might not fit the pattern very well:

https://github.com/nodejs/node/blob/23c2d33592e72f7ba9bb5d30fe3181714bb508d1/deps/v8/include/v8-regexp.h#L20

Plugging this in could require substantial work.

@anonrig
Copy link
Member Author

anonrig commented Jan 21, 2025

Plugging this in could require substantial work.

Once we have a working provider support, I'll update my Node.js implementation to make sure this is doable before merging this PR.

@anonrig
Copy link
Member Author

anonrig commented Jan 21, 2025

@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@lemire

I've updated the implementation to use the new regex provider in everywhere, and got rid of regex usage except url_pattern_regex.h-cpp files. I'll test it with the node.js tomorrow.

@jasnell

For workers and node.js, we just need to create a class with this signature, and later pass it to the parse_url_pattern() function as a type

class v8_based_regex_provider {
 public:
  v8_based_regex_provider() = default;
  using regex_type = v8::RegExp;
  static std::optional<regex_type> create_instance(std::string_view pattern,
                                                   bool ignore_case);
  static std::optional<std::vector<std::string>> regex_search(
      std::string_view input, const regex_type& pattern);
  static bool regex_match(std::string_view input, const regex_type& pattern);
};

I know that we need to pass jsg::Lock& js as the first parameter for these static functions for workerd. Can we get around with the current function signature, or do we need to add a new parameter with void* to each of these static functions? Also can you think of anything else for Node.js?

@anonrig anonrig force-pushed the make-regexp-pluggable branch from 85ccb76 to a790af3 Compare January 22, 2025 00:47
@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@lemire I've force pushed and rebased from main to get rid of this fuzzer failures (which is fixed in main branch)

@lemire
Copy link
Member

lemire commented Jan 22, 2025

@anonrig

I've force pushed and rebased from main to get rid of this fuzzer failures (which is fixed in main branch)

hmmmm... did it work?

@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@lemire

Unfortunately after this PR, we use 1GB more memory with the fuzzer. I'll increase the limit but I think we should investigate this before releasing a new version.

ERROR: libFuzzer: out-of-memory (used: 4690Mb; limit: 3000Mb)

@anonrig anonrig force-pushed the make-regexp-pluggable branch from d130d4c to 40a965e Compare January 22, 2025 01:06
@anonrig anonrig force-pushed the make-regexp-pluggable branch from 40a965e to 367b27b Compare January 22, 2025 01:13
@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@lemire

Unfortunately after this PR, we use 1GB more memory with the fuzzer. I'll increase the limit but I think we should investigate this before releasing a new version.

ERROR: libFuzzer: out-of-memory (used: 4690Mb; limit: 3000Mb)

Now it's using 5GB. I don't think this is related to Ada at all.

@lemire
Copy link
Member

lemire commented Jan 22, 2025

@anonrig 5GB: it may depend on how the fuzzer is written. There might a bug in it.

@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@anonrig 5GB: it may depend on how the fuzzer is written. There might a bug in it.

I increased it to 10GB, and we will have a failure.

@lemire
Copy link
Member

lemire commented Jan 22, 2025

@anonrig Can this be reproduced locally?

@lemire
Copy link
Member

lemire commented Jan 22, 2025

@anonrig Do we actually fuzz randomly generated regular expressions on randomly generated strings?

https://github.com/ada-url/ada/blob/main/fuzz/url_pattern.cc

I would recommend working from a set of patterns and testing on random strings.

@anonrig anonrig force-pushed the make-regexp-pluggable branch from 0eb5c99 to 4e614af Compare January 22, 2025 18:00
@anonrig anonrig force-pushed the make-regexp-pluggable branch from 4e614af to 1c910f4 Compare January 22, 2025 18:04
@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

@anonrig Can this be reproduced locally?

This could help https://google.github.io/oss-fuzz/advanced-topics/reproducing/

@anonrig anonrig force-pushed the make-regexp-pluggable branch from a30ad2d to 7a8da2c Compare January 22, 2025 18:17
anonrig and others added 2 commits January 22, 2025 13:23
@anonrig anonrig marked this pull request as ready for review January 22, 2025 21:43
@anonrig
Copy link
Member Author

anonrig commented Jan 22, 2025

I recommend landing this PR as it is, and on a different PR we can move any url_pattern implementation that uses template arguments to their correct inl.h files. What do you think @lemire?

@lemire
Copy link
Member

lemire commented Jan 22, 2025

I recommend landing this PR as it is, and on a different PR we can move any url_pattern implementation that uses template arguments to their correct inl.h files. What do you think @lemire?

Yeah. Sure, you can break it down into several PRs. I would not release with this PR however. In fact, I would not release before we can make sure that it works in Node.

@lemire lemire self-requested a review January 22, 2025 23:24
@anonrig anonrig merged commit 57d3866 into main Jan 22, 2025
64 checks passed
@anonrig anonrig deleted the make-regexp-pluggable branch January 22, 2025 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants