add pluggable regex engine support #837
Sounds like a good idea! |
Definite +1 on the idea. The one key requirement is that, in practical use, implementations using V8 will need to acquire the lock on the isolate if they don't already hold it. Just worth keeping in mind when considering the overhead here. I would make sure that the API allows passing in a |
@anonrig I did a pass over it... simplifying the design somewhat. Nothing I did should be controversial. Note that our tests pass right now. (Although the work is not completed.) Observe how I did away with the class polymorphism and instead I use C++ concepts to define providers: template <typename T>
concept regex_concept = requires(T t, std::string_view pattern,
bool ignore_case, std::string_view input) {
// Ensure the class has a type alias 'regex_type'
typename T::regex_type;
// Function to create a regex instance
{
T::create_instance(pattern, ignore_case)
} -> std::same_as<std::optional<typename T::regex_type>>;
// Function to perform regex search
{
t.regex_search(input, std::declval<typename T::regex_type&>())
} -> std::same_as<std::vector<std::string>>;
// Function to match regex pattern
{
t.regex_match(input, std::declval<typename T::regex_type&>())
} -> std::same_as<bool>;
// Copy constructor
{ T(std::declval<const T&>()) } -> std::same_as<T>;
// Move constructor
{ T(std::declval<T&&>()) } -> std::same_as<T>;
}; There is no need for inheritance and virtual functions. Note that this might not be quite the right concept, see below. There are two design issues. I could have fixed those with a few minutes of coding, but I think it is important to discuss them as they have consequences, and I could be wrong.
To elaborate on point 2, consider this: template <url_pattern_regex::regex_concept regex_provider>
bool protocol_component_matches_special_scheme(
url_pattern_component<regex_provider>& component) {
auto regex = component.regexp;
// TODO: Use provider.regex_match
return std::regex_match("http", regex) || std::regex_match("https", regex) ||
std::regex_match("ws", regex) || std::regex_match("wss", regex) ||
std::regex_match("ftp", regex);
} If I turn regex_match into a static function, I can just do... template <url_pattern_regex::regex_concept regex_provider>
bool protocol_component_matches_special_scheme(
url_pattern_component<regex_provider>& component) {
auto regex = component.regexp;
return regex_provider::regex_match("http", regex) || regex_provider::regex_match("https", regex) ||
regex_provider::regex_match("ws", regex) || regex_provider::regex_match("wss", regex) ||
regex_provider::regex_match("ftp", regex);
} and so, maybe the right concept is... template <typename T>
concept regex_concept = requires(T t, std::string_view pattern,
bool ignore_case, std::string_view input) {
// Ensure the class has a type alias 'regex_type'
typename T::regex_type;
// Function to create a regex instance
{
T::create_instance(pattern, ignore_case)
} -> std::same_as<std::optional<typename T::regex_type>>;
// Function to perform regex search
{
T::regex_search(input, std::declval<typename T::regex_type&>())
} -> std::same_as<std::vector<std::string>>;
// Function to match regex pattern
{
T::regex_match(input, std::declval<typename T::regex_type&>())
} -> std::same_as<bool>;
// Disallow constructors
requires (!std::is_constructible_v<T>);
}; |
@jasnell Can we get away with using static functions for both Node.js and Cloudflare Workers while implementing this? |
Maybe? Not 100% sure as I'm not familiar with |
The v8 regex class might not fit the pattern very well: Plugging this in could require substantial work. |
Once we have a working provider support, I'll update my Node.js implementation to make sure this is doable before merging this PR. |
@lemire here is the workerd implementation for regex: https://github.com/cloudflare/workerd/blob/9cea433561971d6d0547e48eef244a5cb155daf0/src/workerd/api/urlpattern.c%2B%2B#L15 |
I've updated the implementation to use the new regex provider everywhere, and removed all direct regex usage outside the url_pattern_regex.h and url_pattern_regex.cpp files. I'll test it with Node.js tomorrow. For workers and Node.js, we just need to create a class with this signature, and later pass it to the class v8_based_regex_provider {
public:
v8_based_regex_provider() = default;
using regex_type = v8::RegExp;
static std::optional<regex_type> create_instance(std::string_view pattern,
bool ignore_case);
static std::optional<std::vector<std::string>> regex_search(
std::string_view input, const regex_type& pattern);
static bool regex_match(std::string_view input, const regex_type& pattern);
}; I know that for workerd we need to pass jsg::Lock& js as the first parameter to these static functions. Can we get by with the current function signatures, or do we need to add a new void* parameter to each of these static functions? Also, can you think of anything else for Node.js? |
@lemire I've force-pushed and rebased onto main to get rid of these fuzzer failures (which are fixed on the main branch). |
hmmmm... did it work? |
Unfortunately, after this PR, we use 1GB more memory with the fuzzer. I'll increase the limit, but I think we should investigate this before releasing a new version. |
Now it's using 5GB. I don't think this is related to Ada at all. |
@anonrig 5GB: it may depend on how the fuzzer is written. There might be a bug in it. |
I increased it to 10GB, and we still have a failure. |
@anonrig Can this be reproduced locally? |
@anonrig Do we actually fuzz randomly generated regular expressions on randomly generated strings? https://github.com/ada-url/ada/blob/main/fuzz/url_pattern.cc I would recommend working from a set of patterns and testing on random strings. |
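A minimal sketch of that recommendation, assuming a libFuzzer-style entry point: the patterns come from a fixed, known-valid set, and only the subject string is taken from the fuzzer input. The pattern list and the 256-byte cap are hypothetical choices for illustration, and the sketch exercises std::regex directly rather than the actual code in fuzz/url_pattern.cc.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical fixed set of patterns, compiled once and reused for every input.
static const std::array<std::regex, 4>& fixed_patterns() {
  static const std::array<std::regex, 4> patterns = {
      std::regex("https?"),
      std::regex("/users/([^/]+)"),
      std::regex("([a-z0-9-]+)\\.example\\.com", std::regex::icase),
      std::regex("/files/(.*)"),
  };
  return patterns;
}

// libFuzzer entry point: only the subject string is fuzzer-controlled.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // Assumption: cap the input length to bound matching time and memory.
  constexpr std::size_t kMaxInput = 256;
  if (size > kMaxInput) size = kMaxInput;

  const std::string input =
      size == 0 ? std::string()
                : std::string(reinterpret_cast<const char*>(data), size);

  for (const std::regex& re : fixed_patterns()) {
    std::smatch match;
    (void)std::regex_search(input, match, re);  // partial match with captures
    (void)std::regex_match(input, re);          // full match
  }
  return 0;
}
```

Keeping the patterns fixed means the fuzzer cannot synthesize pathological regular expressions, so any remaining memory growth would point at the matching code or the harness rather than at regex compilation.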
This could help https://google.github.io/oss-fuzz/advanced-topics/reproducing/ |
* reduce the scope of the fuzzer * lint
I recommend landing this PR as it is; in a separate PR, we can move any url_pattern implementation that uses template arguments into the correct inl.h files. What do you think @lemire? |
Yeah. Sure, you can break it down into several PRs. I would not release with this PR however. In fact, I would not release before we can make sure that it works in Node. |
This is a proposal. Not meant to be merged as it is.
The goal is to have an abstract class under ada::url_pattern_regex::provider, so that any application can implement its own class that inherits from this provider class. Later, we will use this abstract class's methods inside URLPattern so that Node.js-like applications do not have to depend on std::regex.
By default, we will create a std_regex_provider class that inherits from provider to test our changes locally.
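As a rough sketch of the inheritance-based shape described here: an abstract base with virtual methods and a std::regex-backed default. The class names `provider_sketch` and `std_regex_virtual_provider` and the method names `compile` and `matches` are hypothetical, chosen only for illustration; they are not the interface in this PR.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical abstract interface that an embedder would implement.
class provider_sketch {
 public:
  virtual ~provider_sketch() = default;
  // Compiles the pattern; returns false if it is not a valid regex.
  virtual bool compile(std::string_view pattern, bool ignore_case) = 0;
  // Returns true if the whole input matches the compiled pattern.
  virtual bool matches(std::string_view input) const = 0;
};

// Default implementation backed by std::regex, useful for local testing.
class std_regex_virtual_provider final : public provider_sketch {
 public:
  bool compile(std::string_view pattern, bool ignore_case) override {
    auto flags = std::regex_constants::ECMAScript;
    if (ignore_case) flags |= std::regex_constants::icase;
    try {
      regex_.emplace(pattern.data(), pattern.size(), flags);
      return true;
    } catch (const std::regex_error&) {
      regex_.reset();  // leave the provider without a compiled pattern
      return false;
    }
  }

  bool matches(std::string_view input) const override {
    return regex_.has_value() &&
           std::regex_match(input.begin(), input.end(), *regex_);
  }

 private:
  std::optional<std::regex> regex_;
};
```

An embedder would derive its own class from the base and hand it to URLPattern, at the cost of a virtual call per regex operation.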
Any feedback is welcome.
cc @lemire @jasnell @mikea