Migrate to pcre2 #515
I am surprised this was merged on the deprecated nekovm before here: |
This one isn't green. :P |
@Simn Check again ;) |
I'm a bit irritated that there are so many new |
It's just because we have an entire copy of the pcre library within |
And yes, the files were reorganised with pcre2 as far as I can tell. This is the documentation file that describes how to build pcre2 on Windows, which I followed: |
@ncannasse Please check if this works for you and merge accordingly. We should make this change, but I'd like to make sure it doesn't break anything for you. |
By the way, for large dependencies included in the repo you might want to mark them as "vendored" in the attributes: |
Ah, I wasn't aware of that. We should probably just mark the entire |
#541 should mark these directories as third party libs as you suggested. |
Few comments: I'm quite worried with the following: // this is reinitialised for each new regex object...
match_context = pcre2_match_context_create_16(NULL);
pcre2_set_depth_limit_16(match_context, 3500); Seems like a memory leak because the context is never freed. |
@tobil4sk Why are you adding the I also agree with @ncannasse, if you are calling Incidentally, since it seems you are always setting the match recursion depth limit to a constant, you could instead set the compile-time option It seems like the same memory leak was already introduced into nekovm with HaxeFoundation/neko#249 (where oddly you decided to not use the |
@ncannasse This is required for the entire lifetime of the program, which is why it could not be freed anywhere. The memory leak existed before as well, however, I guess now is a good time to deal with it. With @Uzume's tip (thanks btw!) about the compile time option, we can avoid this issue altogether.
@Uzume I left the 16 suffixes because they were used before (whereas the neko files didn't have them). I can remove them but I assumed there was some reason to add them if they're optional. |
Although, the only issue with this is that now the variable must be set externally (either in config.h or every build system) rather than conveniently in the regexp file. Setting it manually in make/cmake/vs builds is a bit messy, but at the same time having it in EDIT: |
@tobil4sk No, I do not think there was a memory leak before. Before we had: // ...
static pcre16_extra limit;
// ...
// hl_regexp_new_options
// ...
limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
int res = pcre16_exec(e->p,&limit,(PCRE_SPTR16)s,pos+len,pos,PCRE_NO_UTF16_CHECK,e->matches,e->nmatches * 3);
// ...
Now we have: // ...
static pcre2_match_context_16 *match_context;
// ...
// hl_regexp_new_options
// ...
// this is reinitialised for each new regex object...
match_context = pcre2_match_context_create_16(NULL);
pcre2_set_depth_limit_16(match_context, 3500); // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
int res = pcre2_match_16(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
// ...
The repeated calls of nekovm uses 8-bit code units (i.e., ASCII, UTF-8, etc.) but hashlink uses 16-bit code units (i.e., UTF-16). In the old In the new For example, in the new I recommend you drop the |
@Uzume Thanks for the insight!
Yes, I see the difference now (and where the memory leak would be). The only reason it is repeatedly allocated is because, as far as I could tell, there was no way of allocating something when a module is first loaded, the same way that can be done with neko.
I think this is probably the best way, because aside from that there is no other reason for having a context object. Also, the context should really be global, as there is currently no reason to have a different context per regex object. The only open question is whether to put it in config.h or in make and the other build systems.
That explanation makes sense, so I do think now it would make sense to get rid of the suffix ;) |
Alternatively, I guess we could check whether the context object is null before allocating it, but I guess that might have some thread safety concerns attached to it. |
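For reference, the thread-safety concern with a "check for NULL, then allocate" approach can be sidestepped with C11 atomics. The following is only a sketch under assumptions: `demo_context`, `shared_context`, and `get_shared_context` are hypothetical names, and the struct is a stand-in rather than the real `pcre2_match_context`. If two threads race, both may allocate, but only one pointer is published and the loser frees its copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a match context; not the real PCRE2 type. */
typedef struct {
    uint32_t depth_limit;
} demo_context;

/* Shared, lazily created context. NULL until the first caller publishes one. */
static _Atomic(demo_context *) shared_context = NULL;

/* Returns the process-wide context, creating it on first use.
   Returns NULL only if allocation fails. */
demo_context *get_shared_context(void) {
    demo_context *ctx = atomic_load(&shared_context);
    if (ctx != NULL)
        return ctx;

    demo_context *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return NULL;
    fresh->depth_limit = 3500; /* the value discussed in this thread */

    demo_context *expected = NULL;
    if (atomic_compare_exchange_strong(&shared_context, &expected, fresh))
        return fresh; /* we won the race and published our context */

    /* Another thread published first; `expected` now holds its pointer. */
    assert(expected != NULL);
    free(fresh);
    return expected;
}
```

Every call to `get_shared_context()` after the first returns the same pointer. The single context is still never freed, so this only bounds the leak to one object rather than removing it.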
There is a lot of indirection in the code but you can see the difference in how // pcre2.h
// ...
#define PCRE2_TYPES_LIST \
//...
struct pcre2_real_match_context; \
typedef struct pcre2_real_match_context pcre2_match_context; \
// ...
#define PCRE2_JOIN(a,b) a ## b
#define PCRE2_GLUE(a,b) PCRE2_JOIN(a,b)
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a,PCRE2_LOCAL_WIDTH)
// ...
#define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_)
// ...
#define PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS \
PCRE2_TYPES_LIST \
// ...
#define PCRE2_LOCAL_WIDTH 8
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH
#define PCRE2_LOCAL_WIDTH 16
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH
#define PCRE2_LOCAL_WIDTH 32
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH
// ...
#undef PCRE2_SUFFIX
// ...
#if PCRE2_CODE_UNIT_WIDTH == 8 || \
PCRE2_CODE_UNIT_WIDTH == 16 || \
PCRE2_CODE_UNIT_WIDTH == 32
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a, PCRE2_CODE_UNIT_WIDTH)
#elif PCRE2_CODE_UNIT_WIDTH == 0
#undef PCRE2_JOIN
#undef PCRE2_GLUE
#define PCRE2_SUFFIX(a) a
// ...
#endif
// ...
// pcre2_intmodedep.h
// ...
typedef struct pcre2_real_match_context {
pcre2_memctl memctl;
#ifdef SUPPORT_JIT
pcre2_jit_callback jit_callback;
void *jit_callback_data;
#endif
int (*callout)(pcre2_callout_block *, void *);
void *callout_data;
int (*substitute_callout)(pcre2_substitute_callout_block *, void *);
void *substitute_callout_data;
PCRE2_SIZE offset_limit;
uint32_t heap_limit;
uint32_t match_limit;
uint32_t depth_limit;
} pcre2_real_match_context;
// ...
// pcre2_context.c
// ...
const pcre2_match_context PRIV(default_match_context) = {
{ default_malloc, default_free, NULL },
#ifdef SUPPORT_JIT
NULL, /* JIT callback */
NULL, /* JIT callback data */
#endif
NULL, /* Callout function */
NULL, /* Callout data */
NULL, /* Substitute callout function */
NULL, /* Substitute callout data */
PCRE2_UNSET, /* Offset limit */
HEAP_LIMIT,
MATCH_LIMIT,
MATCH_LIMIT_DEPTH };
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_set_depth_limit(pcre2_match_context *mcontext, uint32_t limit)
{
mcontext->depth_limit = limit;
return 0;
}
// ... The last field of // pcre2_match.c
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext)
// ...
#ifdef SUPPORT_JIT
// ...
rc = pcre2_jit_match(code, subject, length, start_offset, options,
match_data, mcontext);
// ...
#endif /* SUPPORT_JIT */
// ...
if (mcontext == NULL)
{
mcontext = (pcre2_match_context *)(&PRIV(default_match_context));
mb->memctl = re->memctl;
}
// ...
mb->match_limit_depth = (mcontext->depth_limit < re->limit_depth)?
mcontext->depth_limit : re->limit_depth;
// ...
I shall leave it to you to figure out that in the case of |
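The suffix indirection quoted from `pcre2.h` above boils down to a two-level token paste. Here is a minimal, self-contained sketch of that technique; all `DEMO_`/`demo_` names are hypothetical and only mimic the shape of the PCRE2 macros, they are not part of the library.

```c
#include <assert.h>

/* Two-level paste so that the width macro is expanded before ## is applied. */
#define DEMO_JOIN(a, b) a##b
#define DEMO_GLUE(a, b) DEMO_JOIN(a, b)

/* Width-specific functions, as a library would export them. */
int demo_width_8(void)  { return 8; }
int demo_width_16(void) { return 16; }

/* The application picks a code unit width once, before using generic names. */
#define DEMO_CODE_UNIT_WIDTH 16
static_assert(DEMO_CODE_UNIT_WIDTH == 8 || DEMO_CODE_UNIT_WIDTH == 16,
              "unsupported code unit width");

/* The generic name maps onto the suffixed one for the chosen width. */
#define DEMO_SUFFIX(a) DEMO_GLUE(a, DEMO_CODE_UNIT_WIDTH)
#define demo_width DEMO_SUFFIX(demo_width_)

/* demo_width() expands to demo_width_16() here. */
int selected_width(void) {
    return demo_width();
}
```

With the width set to 16, `selected_width()` calls `demo_width_16()`, which is why explicit `_16` suffixes become redundant once the width macro is defined before the header is included.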
@tobil4sk You might also be able to do something along the lines of: static pcre2_match_context *match_context = pcre2_match_context_create(NULL); In that case you would still have a memory leak but of just a single solitary Sadly you cannot do something like this: static pcre2_match_context match_context; Because of the way pcre2 has defined |
The only issue is that if Also, currently, there is a comment explaining why this value for
Unfortunately, it doesn't allow this because it is not a constant value (would have been a good solution though). |
It is a constant value (after initial allocation the pointer would never change during the entire execution) but you are right it is not a so called compile-time constant and static initialization would require such. I had a momentary lapse (sorry about that). It has been a while since I had to be concerned with lifetimes in C. |
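To make the distinction concrete: C only accepts compile-time constants as initializers for objects with static storage, which is why a statically initialized default struct works while a pointer obtained from a function call does not. The sketch below uses hypothetical names (`demo_context`, `resolve_context`, `effective_depth_limit`) and an illustrative default value; it only mirrors the NULL fallback and the "smaller limit wins" logic visible in the `pcre2_match()` excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a match context; not the real PCRE2 layout. */
typedef struct {
    uint32_t depth_limit;
} demo_context;

/* Allowed: the initializer is a compile-time constant (illustrative value). */
static const demo_context default_context = { 10000000u };

/* Not allowed in C, because the initializer is a function call:
   static demo_context *ctx = create_context(); */

/* A NULL context means "use the built-in defaults". */
const demo_context *resolve_context(const demo_context *ctx) {
    return ctx != NULL ? ctx : &default_context;
}

/* The effective limit is the smaller of the context's and the pattern's. */
uint32_t effective_depth_limit(const demo_context *ctx, uint32_t pattern_limit) {
    const demo_context *c = resolve_context(ctx);
    assert(c != NULL);
    return c->depth_limit < pattern_limit ? c->depth_limit : pattern_limit;
}
```

Passing `NULL` therefore costs no allocation at all, which is what makes a compile-time default attractive compared with a context created at run time.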
Upon thinking about this more, I am also concerned with some other issues (that were already in the original code). For example HL_PRIM bool hl_regexp_match( ereg *e, vbyte *s, int pos, int len ) {
static pcre2_match_context *match_context = NULL;
int res;
if (match_context == NULL) {
match_context = pcre2_match_context_create(NULL);
pcre2_set_depth_limit(match_context, 3500); // adapted based on Windows 1MB stack size;
}
res = pcre2_match(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
e->matched = res >= 0;
if( res >= 0 )
return true;
if( res != PCRE2_ERROR_NOMATCH )
hl_error("An error occurred while running pcre2_match()");
return false;
} It is not a limit of the regular expression but rather a limit of the matching so unless we decide to put this into |
@Uzume I think if we can handle it at compile time then that is the preferable solution. For now I have placed it in each build file as well as config.h just to be safe, in case pcre2 is updated in future and the config.h is replaced, which would cause the setting to be lost. Is this fine? @ncannasse |
If we ever decide to link pcre2 dynamically this will have to be handled at run time again, maybe if that happens using a local static might be a fair solution. |
@tobil4sk I am not sure carrying that match depth limit is really still necessary. I can see how you are just porting that from the PCRE1 integration code in hashlink but the internal architecture of PCRE has changed significantly, so I am not sure it is even interesting, much less necessary, to consider keeping that in the first place. For example, in
Based on just the above quote, I'd say newer default implementations (not dfa or jit) have traded some stack pressure for heap pressure. Also in NON-AUTOTOOLS-BUILD in the section labelled
Previously, in the same file and section for PCRE1, it stated:
You can see similar text in earlier PCRE2 (the latter two about
The code you are trying to port seems to refer to the same Windows 1Mb stack size: limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size And I am sure there were other drastic changes during the change from PCRE1 to PCRE2, so the question is do we need it and do we bother? I am not sure I am in a position to answer that but methinks it definitely makes sense to ask such. |
In following PCRE2 seems to provide four build possibilities with regard to
If I am doing a "manual" or custom build (where I am not using the build scripts from the PCRE sources), methinks I would prefer option 4 over option 3 as that route allows one to integrate things with one less extraneous file.
Option 4 and setting things externally in the build system(s) also alleviates the future risk you mention above, while also allowing one to integrate the code with one less file.
Removing PCRE1 also optionally used a |
There are multiple other settings for specifying static linking and disabling jit support that are defined in this config file and it wouldn't make sense to define the long list of flags everywhere. These settings are pretty obvious as they are mentioned by name in the guide. The only reason there is a risk of missing out the match limit depth setting is because you need to know that it was configured before. I guess we could have a file somewhere explaining how to update the library so that the settings are not lost. |
The stuff about the stack/heap changes sounds promising. If we can get rid of this setting then all the problems are solved. I doubt there were hashlink tests to make sure whatever cases this setting was for were not crashing, but hopefully we can just safely remove it altogether judging by the fact that the stack is no longer used here? |
@tobil4sk The stack is used but not to the same extent so it is much less likely to be causing an issue (which the code you are trying to port seems to be trying to avoid; likely because there was an issue in the past). For that reason, I believe it should not cause issues to now remove it but I have no way to actually test this theory. Maybe @ncannasse knows. He was the one that added that setting in the original PCRE1 integration on 2016-02-21, see df1828b (search for There is another potential option too. Since this setting seems only applicable to Windows, we could only set it on Windows builds. It might be possible to do this externally from the build scripts but if we opted to keep the #if defined(_WIN32) && !defined(MATCH_LIMIT_DEPTH)
#define MATCH_LIMIT_DEPTH 3500 // adapted based on Windows 1MB stack size
#endif Then at least it would not be applied across all builds when obviously it is a tweak specific to Windows. But frankly I do not think this does the same thing as it used to do. In digging through PCRE2 history I can see changes in /* The above limit applies to all backtracks, whether or not they are nested.
In some environments it is desirable to limit the nesting of backtracking
(that is, the depth of tree that is searched) more strictly, in order to
restrict the maximum amount of heap memory that is used. The value of
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
must be less than the value of MATCH_LIMIT. The default is to use the same
value as MATCH_LIMIT. There is a runtime method for setting a different
limit. In the case of pcre2_dfa_match(), this limit controls the depth of
the internal nested function calls that are used for pattern recursions,
lookarounds, and atomic groups. */
#ifndef MATCH_LIMIT_DEPTH
#define MATCH_LIMIT_DEPTH MATCH_LIMIT
#endif Notice: "limit the nesting of backtracking[...] in order to restrict the maximum amount of heap memory that is used." And earlier: /* The above limit applies to all backtracks, whether or not they are nested.
In some environments it is desirable to limit the nesting of backtracking
more strictly, in order to restrict the maximum amount of heap memory that
is used. The value of MATCH_LIMIT_RECURSION provides this facility. To have
any useful effect, it must be less than the value of MATCH_LIMIT. The
default is to use the same value as MATCH_LIMIT. There is a runtime method
for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif Apparently this changed names from And earlier still: /* The above limit applies to all calls of match(), whether or not they
increase the recursion depth. In some environments it is desirable to limit
the depth of recursive calls of match() more strictly, in order to restrict
the maximum amount of stack (or heap, if HEAP_MATCH_RECURSE is defined)
that is used. The value of MATCH_LIMIT_RECURSION applies only to recursive
calls of match(). To have any useful effect, it must be less than the value
of MATCH_LIMIT. The default is to use the same value as MATCH_LIMIT. There
is a runtime method for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif Notice: "limit the depth of recursive calls[...] in order to restrict the maximum amount of stack[...] that is used." So it used to limit stack usage during recursive backtracking but now limits heap usage during backtracking (which no longer uses recursion). I am assuming it was possible to optimize the algorithm to use tail-recursion (if it wasn't already using such) which was then turned into a loop. Thus fewer stack allocations and more heap allocations (which do not threaten the 1MB default stack on Windows).
I agree. I doubt there are tests for this and as such, I recommend just ripping this out and see if someone screams. It can always be put back in if necessary. This should likely also be done for nekovm too as the |
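As a rough illustration of why a depth limit now bounds heap rather than stack, here is a toy wildcard matcher that keeps its backtracking points in a heap array instead of recursing. This is not PCRE2's algorithm; `demo_match`, `demo_frame`, and the `DEMO_` codes are hypothetical, and the pattern language is just `?` (any one character) and `*` (any run of characters).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { DEMO_NOMATCH = 0, DEMO_MATCH = 1,
       DEMO_ERROR_DEPTHLIMIT = -1, DEMO_ERROR_NOMEMORY = -2 };

/* One saved backtracking point: where to resume in pattern and subject. */
typedef struct {
    size_t pat_pos;
    size_t sub_pos;
} demo_frame;

/* Backtracking points live in a heap array, never on the C call stack.
   depth_limit caps how many of them may be pending at once. */
int demo_match(const char *pat, const char *sub, size_t depth_limit) {
    demo_frame *frames = NULL;
    size_t depth = 0, capacity = 0, p = 0, s = 0;
    int result;

    assert(pat != NULL && sub != NULL);
    for (;;) {
        if (pat[p] == '*') {
            if (depth == depth_limit) { result = DEMO_ERROR_DEPTHLIMIT; break; }
            if (depth == capacity) {
                size_t grown = capacity ? capacity * 2 : 4;
                demo_frame *tmp = realloc(frames, grown * sizeof *tmp);
                if (tmp == NULL) { result = DEMO_ERROR_NOMEMORY; break; }
                frames = tmp;
                capacity = grown;
            }
            frames[depth].pat_pos = ++p; /* resume after the '*' */
            frames[depth].sub_pos = s;   /* '*' initially consumes nothing */
            depth++;
            continue;
        }
        if (pat[p] == '\0' && sub[s] == '\0') { result = DEMO_MATCH; break; }
        if (pat[p] != '\0' && sub[s] != '\0' &&
            (pat[p] == '?' || pat[p] == sub[s])) {
            p++; s++;
            continue;
        }
        /* Mismatch: drop frames whose '*' cannot consume another character. */
        while (depth > 0 && sub[frames[depth - 1].sub_pos] == '\0')
            depth--;
        if (depth == 0) { result = DEMO_NOMATCH; break; }
        /* Let the innermost remaining '*' consume one more character. */
        s = ++frames[depth - 1].sub_pos;
        p = frames[depth - 1].pat_pos;
    }
    free(frames);
    return result;
}
```

For example, `demo_match("a*b*c", "abc", 1)` returns `DEMO_ERROR_DEPTHLIMIT` because the second `*` would need a second pending frame. Hitting the limit here costs a bounded amount of heap, whereas in a recursive matcher the same nesting would have consumed call stack.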
@ncannasse We have resolved this issue. This setting is obsolete, so we just safely removed the offending lines. Everything is building properly now so it should be safe to merge now, unless there are any other questions/worries? |
Actually I noticed the
https://www.pcre.org/current/doc/html/pcre2compat.html EDIT: Done now. |
@tobil4sk It looks good. I hope this gets committed soon. Thanks for all your hard work.
According to my research `PCRE_JAVASCRIPT_COMPAT` was split into `PCRE2_ALT_BSUX`, `PCRE2_ALLOW_EMPTY_CLASS` and `PCRE2_MATCH_UNSET_BACKREF`. See 2015-01-05: [pcre-dev] PCRE2 is released:
[...] The PCRE_JAVASCRIPT_COMPAT option has been split into
independent functional options PCRE2_ALT_BSUX, PCRE2_ALLOW_EMPTY_CLASS, and
PCRE2_MATCH_UNSET_BACKREF.
Perhaps those should be added too. You might also want to consider switching from `PCRE2_ALT_BSUX` to `PCRE2_EXTRA_ALT_BSUX` (which implies the former), adding ECMAScript 6 style `\u{hhh..}` hexadecimal character codes.
Thanks as well for all your help!
Thanks for the shout. I missed those patch notes because the compat page seemed to suggest ALT_BSUX was all there was to it. Just for reference, here is a page explaining The problem with |
Also, reading further through those notes you sent, I wonder if it would be worth implementing a wrapper for
This would probably be faster than the current |
@tobil4sk: That is an interesting read, however, based on that, I wonder about the value of implementing either of
I was also thinking along similar lines to improve the Haxe EReg implementations after getting all backends ported from |
This was done following the steps in the `NON-AUTOTOOLS-BUILD` file found in the root of the pcre2 repository: https://github.com/PhilipHazel/pcre2/blob/pcre2-10.40/NON-AUTOTOOLS-BUILD
CMake now has to specify each pcre file individually, as some are not meant to be compiled by themselves.
These were necessary in pcre1 but are handled by PCRE2_CODE_UNIT_WIDTH in pcre2
Since pcre2 10.30, pcre2_match no longer uses recursive function calls, so this setting should no longer be needed. This avoids the memory leak caused by having to configure this setting using a context structure allocated at run time.
This is the pcre2 equivalent of the PCRE_JAVASCRIPT_COMPAT flag we used previously. Compare bullet point 4 of the pcre2 compatibility docs: https://www.pcre.org/current/doc/html/pcre2compat.html to bullet point 5 of the pcre1 compatibility docs: https://www.pcre.org/original/doc/html/pcrecompat.html
I've done some investigating regarding how other targets handle these escape characters: Details regarding Haxe target support for \x and \uFirstly, It is still possible to test it via A lot of targets (C++, Interp, Lua, Neko, PHP) give an error even for All targets support TLDR: There is not much consistency, but currently Hashlink is consistent with JavaScript and Flash.
For now I think the best thing to do here is to keep Hashlink's behaviour the same, which is achieved in pcre2 by enabling |
@ncannasse If that sounds alright, then I don't think there is anything else left to do here. We managed to get rid of the memory leak issue. |
`PCRE2_ALT_BSUX`, `PCRE2_ALLOW_EMPTY_CLASS`, and `PCRE2_MATCH_UNSET_BACKREF` are equivalent to the old `PCRE_JAVASCRIPT_COMPAT`. See: https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
Your investigation and detailed report are much appreciated. I agree getting the targets at least close to aligned while jettisoning older pcre1 is the first priority but in the larger picture it might make sense to consider changes here down the road (along with other changes to EReg like the I see you also added the aforementioned It seems odd to have so many compatibility issues with something called "PCRE" which by its very name implies compatibility with Perl regex but such is the nature of the beast. @tobil4sk: Thank you. |
I would really like to merge this and get this whole PCRE2 business out of the way. Is there any reason not to do so yet? |
All good to me, thanks to everyone who participated in the issue/PR, merge complete! |
Thanks! This just leaves hxcpp now... 😅 |
@ncannasse Thanks for getting this merged. @tobil4sk Thanks for your hard work and patches. Maybe PCRE1 can soon be laid to rest from a Haxe perspective and we can look into cleaning up Haxe EReg (with some possible optimizations like wrapping and using |
PCRE 1 is no longer maintained and doesn't get updates, so this PR ports the regexp module of the standard library to use PCRE 2 instead. All build systems have been updated to link PCRE 2.
See also:
HaxeFoundation/haxe#10491