Use PCRE for builtins.match and builtins.split#7336
Use PCRE for builtins.match and builtins.split#7336yorickvP wants to merge 3 commits intoNixOS:masterfrom
builtins.match and builtins.split#7336Conversation
nix-repl> builtins.match "(?<date>(?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d))" "2020 - 10 - 10"
{ date = "2020 - 10 - 10"; day = "10"; month = "10"; year = "2020"; }
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/tweag-nix-dev-update-40/23480/1 |
| Returns a list composed of non matched strings interleaved with the | ||
| lists of the [extended POSIX regular | ||
| expression](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04) | ||
| *regex* matches of *str*. Each item in the lists of matched |
There was a problem hiding this comment.
PCRE is not POSIX ERE.
Also worth mentioning that PCRE supports arbitrary look-around, which can cause (maybe accidental) exponential run time. Maybe we should limit the input to a reasonable subset.
There was a problem hiding this comment.
So it should point to https://www.pcre.org/original/doc/html/pcrepattern.html.
| int errorcode; | ||
| PCRE2_SIZE erroffset; | ||
|
|
||
| code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr); |
There was a problem hiding this comment.
| code = pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr); | |
| code = pcre2_compile(static_cast<const unsigned char*>(re.data()), re.length(), 0, &errorcode, &erroffset, nullptr); |
| name_table.reserve(namecount); | ||
| for (size_t i = 0; i < namecount; i++) { | ||
| int n = tabptr[0] << 8 | tabptr[1]; | ||
| name_table.emplace_back((const char*)(tabptr+2), n); |
There was a problem hiding this comment.
| name_table.emplace_back((const char*)(tabptr+2), n); | |
| name_table.emplace_back(static_cast<const char*>(tabptr+2), n); |
| { | ||
| friend class MatchData; | ||
| protected: | ||
| pcre2_code* code; |
There was a problem hiding this comment.
if you made this a std::unique_ptr, then you wouldn't need to manually call pcre2_code_free in the destructor. In addition to that, the constructor function could not leak the successfully compiled pcre2 resource in case of exceptions.
relevant snippets:
// member declaration
using pcre_ptr = std::unique_ptr<Bar, void(*)(pcre2_code*)>;
pcre_ptr code;
// constructor:
Regex(std::string_view re) {
...
code = pcre_ptr(pcre2_compile((const unsigned char*)re.data(), re.length(), 0, &errorcode, &erroffset, nullptr), pcre2_code_free);
}
...|
|
||
| void compile() | ||
| { | ||
| assert(pcre2_jit_compile(code, PCRE2_JIT_COMPLETE) == 0); |
There was a problem hiding this comment.
i suggest putting the call to pcre2_jit_compile outside the assert and store its result in a variable. then, assert on the variable's value.
The reason i suggest that is that asserts are meant to be optimized out of production builds, but that would then remove the compile call...
| MatchData(MatchData&&) = delete; | ||
| ~MatchData() | ||
| { | ||
| pcre2_match_data_free(match); |
There was a problem hiding this comment.
std::unique_ptr with a custom destructor as above would be great.
| MatchData(Regex& re) noexcept | ||
| : re(re) | ||
| { | ||
| match = pcre2_match_data_create_from_pattern(re.code, NULL); | ||
| size_ = pcre2_get_ovector_count(match); | ||
| ovector = pcre2_get_ovector_pointer(match); | ||
| }; |
There was a problem hiding this comment.
| MatchData(Regex& re) noexcept | |
| : re(re) | |
| { | |
| match = pcre2_match_data_create_from_pattern(re.code, NULL); | |
| size_ = pcre2_get_ovector_count(match); | |
| ovector = pcre2_get_ovector_pointer(match); | |
| }; | |
| MatchData(Regex& re) noexcept | |
| : re{re} | |
| , match{pcre2_match_data_create_from_pattern(re.code, NULL)} | |
| , size_{pcre2_get_ovector_count(match)} | |
| , ovector {pcre2_get_ovector_pointer(match)} | |
| { | |
| }; |
plus the reordering comment from above
| v.mkAttrs(bindings); | ||
| } else { | ||
| // the first match is the whole string | ||
| const size_t len = match.size() - 1; |
There was a problem hiding this comment.
it feels like asserting that the size is > 0 would increase the safety of this function
| if (!match[i+1].has_value()) | ||
| (v.listElems()[i] = state.allocValue())->mkNull(); | ||
| else | ||
| (v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]); |
There was a problem hiding this comment.
| if (!match[i+1].has_value()) | |
| (v.listElems()[i] = state.allocValue())->mkNull(); | |
| else | |
| (v.listElems()[i] = state.allocValue())->mkString(*match[i + 1]); | |
| auto *val = state.allocValue() | |
| v.listElems()[i] = val; | |
| if (!match[i+1].has_value()) | |
| val->mkNull(); | |
| else | |
| val->mkString(*match[i + 1]); |
|
|
||
| void prim_match(EvalState & state, const PosIdx pos, Value * * args, Value & v) | ||
| { | ||
| auto re = state.forceStringNoCtx(*args[0], pos); |
| non-matching parts interleaved by the lists of the matching groups. */ | ||
| void prim_split(EvalState & state, const PosIdx pos, Value * * args, Value & v) | ||
| { | ||
| auto re = state.forceStringNoCtx(*args[0], pos); |
There was a problem hiding this comment.
same here, asserts would improve debuggability if we ever end up with code that inserts nullptrs
| result.push_back(prefix); | ||
|
|
||
| // Add a list for matched substrings. | ||
| auto elem = state.allocValue(); |
There was a problem hiding this comment.
the code is a bit inconsistent in choosing the type for allocValue calls. It's either Value* or auto and it would be nice to choose one variant and stick to it.
|
Looks good to me. The main objection is that this makes the Nix language specification depend on "whatever PCRE does". That's not a real problem for Nix, but might be a problem for people who want to reimplement it without having a dependency on a C library. |
thufschmitt
left a comment
There was a problem hiding this comment.
I like this overall 👍 left a few inline comments, but nothing too big.
Since this is a potential breaking change, I would like to see it feature-gated for at least a release cycle. Maybe on by default, but with an easy off switch in case it's hurting people. That would require keeping the old code around of course, but that'd just be temporary.
| struct RegexCache; | ||
|
|
||
| std::shared_ptr<RegexCache> makeRegexCache(); | ||
| size_t regexCacheSize(std::shared_ptr<RegexCache> cache); |
There was a problem hiding this comment.
Could this be replaced by a RegexCache::size() method?
|
|
||
| class MatchData; | ||
|
|
||
| class Regex |
There was a problem hiding this comment.
Don't let yourself be blocked on that, but it would be nicer to move the pure "PCRE abstraction layer" part under libutil
|
|
||
| } catch (RegexError & e) { | ||
| state.debugThrowLastTrace(EvalError({ | ||
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), |
There was a problem hiding this comment.
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
| .msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()), |
|
|
||
| } catch (RegexError & e) { | ||
| state.debugThrowLastTrace(EvalError({ | ||
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), |
There was a problem hiding this comment.
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
| .msg = hintfmt("error while evaluating regex '%s': %s", re, e.what()), |
| state.debugThrowLastTrace(EvalError({ | ||
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | ||
| .errPos = state.positions[pos] | ||
| })); |
There was a problem hiding this comment.
Actually the above doesn't look so great, maybe rather something like
| state.debugThrowLastTrace(EvalError({ | |
| .msg = hintfmt("error while evaluating regex '%s': ", re, e.what()), | |
| .errPos = state.positions[pos] | |
| })); | |
| e.addTrace(state.positions[pos], "while evaluating regex '%s'", re); | |
| state.debugThrowLastTrace(e); |
|
Is there no portable library we could use that implements POSIX ERE? That would avoid compatibility problems between Nix versions, avoid the footguns of backtracking etc., and be something more reasonable to find multiple implementations of for Nix reimpls. |
POSIX ERE is a relatively small and easy-to-implement standard, so that other languages usually also provide or have library of it. |
|
|
Technically it should not, if we don't use the Unicode part. The issue is that currently While |
|
If we can leave out icu support using an override on boost, I'm fine with that. |
|
One option is that we support both regexes during a migration period. |
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2022-12-02-nix-team-meeting-minutes-13/23731/1 |
|
Another option for a portable POSIX ERE implementation might be musl's? It's permissively licensed, and only a few thousand lines total, so it would be reasonable for Nix to just bundle it, IMO. |
|
Discussed in the Nix team meeting on 2022-12-05:
Complete discussion
|
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2022-12-05-nix-team-meeting-minutes-14/23836/1 |
|
boost builds without icu (it's mostly used for Boost.Locale afaict), but it requires an ugly (boost.override { icu = null; }).overrideAttrs (o: {
configureFlags = (lib.take 4 o.configureFlags) ++ [ "--without-icu" ];
}) |
I created #205166 to simplify this. |
|
While I do think fixing the stack behavior is a higher priority than extending the language with PCRE-specific features, adding PCRE may be feasible. I believe PCRE was designed to process all EREs correctly, but I haven't found definitive evidence of this. Finding this would greatly help, so if anyone knows better where to look... To some degree, any change in behavior could be a breaking change. (cue spacebar heating) However, I'd be willing to assume that nobody is validating ERE regexes by trying to use them in builtins.match regex "", in large part because tryEval can't catch a bad regex anyway. This should make the "boolean" whether the extended functionality is present unobservable from within the language. So we have two possible approaches: either "prove" that PCRE processes EREs correctly, or use a good ERE library instead. |
|
@roberth afaik, PCRE is mostly a superset of Posix EREs, but not strictly. $ echo e | grep --extendes-regexp '[[=e=]]'
e
$ echo e | grep --perl-regexp '[[=e=]]'
grep: POSIX collating elements are not supportedNow the question is whether this really matters, and I'm not really convinced it does given how niche it is |
Ooh, that's locale dependent. I didn't manage to exploit that as a potential impurity (but maybe I did it wrong). Maybe it's just not implemented in GNU stdc++, but it may be implemented in a library that does. What helps in this case is that libpcre prints an error, and doesn't continue with a garbage regex. If anyone does rely on regexes with collating elements, they'll be able to detect the problem in time, and work around it by applying regexes to their, presumably foreign input, regexes. If character equivalents are already truly broken in Nix (which they should be, as it's impure), this is hardly a regression. May have to dig through this... |
|
Just my 2 cents on this: |
|
Superseded by #7762 |
This avoids C++'s standard library regexes, which aren't the same across platforms, and have many other issues, like using stack so much that they stack overflow when processing a lot of data. To avoid backwards and forward compatibility issues, regexes are processed using a function converting libstdc++ regexes into Boost regexes, escaping characters that Boost needs to have escaped, and rejecting features that Boost has and libstdc++ doesn't. Related context: - Original failed attempt to use `boost::regex` in CppNix, failed due to boost icu dependency being large (disabling ICU is no longer necessary because linking ICU requires using a different header file, `boost/regex/icu.hpp`): NixOS/nix#3826 - An attempt to use PCRE, rejected due to providing less backwards compatibility with `std::regex` than `boost::regex`: NixOS/nix#7336 - Second attempt to use `boost::regex`, failed due to `}` regex failing to compile (dealt with by writing a wrapper that parses a regular expression and escapes `}` characters): NixOS/nix#7762 Closes #34. Closes #476. Change-Id: Ieb0eb9e270a93e4c7eed412ba4f9f96cb00a5fa4
Fixes #2147, fixes #4758
See also #3826.
This is not technically fully compatible, but I can't find a builtins.match invocation in the wild that doesn't work, since programs could only rely on quite a limited subset anyways.
Performance is similar because no one really used a lot of regexes, but this should theoretically be faster.
Additionally adds support for named captures, so you can do