Upgrade to PCRE2 #11447

Merged
merged 2 commits into from
Jun 1, 2015

@malmaud
Contributor

malmaud commented May 27, 2015

This will hopefully lay the groundwork for supporting named subpatterns (#11362) and general improvements in the regex functionality.

@@ -12,13 +12,13 @@ else
CPP_STDOUT = $(CPP) -E
endif

-all: pcre_h.jl errno_h.jl build_h.jl.phony fenv_constants.jl file_constants.jl uv_constants.jl version_git.jl.phony
+all: errno_h.jl build_h.jl.phony fenv_constants.jl file_constants.jl uv_constants.jl version_git.jl.phony

Contributor

Why delete the pcre_h.jl dependency here?

Contributor Author

Just temporary, since the pcre_h.jl autogenerator was failing on pcre2.h - it would run without error, but output a blank pcre_h.jl. I'm looking into it.

@malmaud malmaud force-pushed the pcre2 branch 2 times, most recently from 8424882 to 504f78b Compare May 28, 2015 12:04
@malmaud
Contributor Author

malmaud commented May 28, 2015

This is ready @tkelman, with two caveats:

  • There doesn't seem to be an apt-get package for PCRE2, so I changed the Travis CI configuration to download the source instead. Maybe we should make our own package and host it on @staticfloat's ppa. Not sure how to get the Windows appveyor CI working.
  • The \udddd syntax for specifying a codepoint literal in a regex doesn't seem to exist in PCRE2 anymore. Instead, you use \x{dddd}. That's a potentially breaking change.
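
For illustration only (not part of the PR), here is a minimal sketch of the `\x{dddd}` form from the second caveat. It assumes a current Julia build whose `Regex` is backed by PCRE2 in UTF mode; the subject string is made up for the example.

```julia
# Sketch only: spell a code point with the \x{...} escape that PCRE2 accepts.
# U+00E9 is "é"; the subject string is a made-up example.
subject = "café"

re = r"\x{00e9}"   # code point given as hex digits inside braces
m = match(re, subject)

println(m === nothing ? "no match" : "matched: $(m.match)")
```

The `r"..."` literal passes the backslash through untouched, so it is PCRE2 that interprets the escape, not Julia's string parser.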

@tkelman
Contributor

tkelman commented May 28, 2015

Since PCRE is relatively small and all C, we could just build it within this PR until it gets merged and into the nightlies. Try adding pcre on this line:

echo 'override STAGE1_DEPS = libuv' >> Make.user

function __init__()
JIT_STACK_START_SIZE = 32768
JIT_STACK_MAX_SIZE = 1048576
-global JIT_STACK = ccall((:pcre_jit_stack_alloc, :libpcre), Ptr{Void},
-(Cint, Cint), JIT_STACK_START_SIZE, JIT_STACK_MAX_SIZE)
+global JIT_STACK = ccall((:pcre2_jit_stack_create_8, "libpcre2-8"), Ptr{Void},

Contributor

Is this libpcre2-8 likely to change in the future? If so, would be best to make a single const variable for it.
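
A rough sketch of what that suggestion could look like, written against current Julia rather than the 0.4-era code in this PR. `PCRE_LIB` is the constant name that appears later in this thread. `unicode_enabled` and `CONFIG_UNICODE` are hypothetical names for the example, and the numeric value 9 is taken from `PCRE2_CONFIG_UNICODE` in `pcre2.h`.

```julia
# Sketch only: keep the shared-library name in one constant so that every
# ccall picks up a future rename. unicode_enabled is a hypothetical helper.
const PCRE_LIB = "libpcre2-8"

# Value of PCRE2_CONFIG_UNICODE in pcre2.h (assumed unchanged across releases).
const CONFIG_UNICODE = UInt32(9)

function unicode_enabled()
    out = Ref{UInt32}(0)
    # pcre2_config writes a uint32_t result through the second argument.
    rc = ccall((:pcre2_config_8, PCRE_LIB), Cint,
               (UInt32, Ref{UInt32}), CONFIG_UNICODE, out)
    rc < 0 && error("pcre2_config failed with code $rc")
    return out[] == 1
end

println("Unicode support: ", unicode_enabled())
```

With the name in one place, switching to a differently named library is a one-line change.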

@tkelman
Contributor

tkelman commented May 28, 2015

> The \udddd syntax for specifying a codepoint literal in a regex doesn't seem to exist in PCRE2 anymore. Instead, you use \x{dddd}. That's a potentially breaking change.

That's a bit annoying. Is there some optional configuration flag we could set when we build the library to get that back?

@malmaud
Contributor Author

malmaud commented May 28, 2015

> That's a bit annoying. Is there some optional configuration flag we could set when we build the library to get that back?

I found a match flag that re-enables it, intuitively named ALT_BSUX.
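
As a hedged illustration of that flag: in the PCRE2 C API, `PCRE2_ALT_BSUX` is an option bit supplied when a pattern is compiled. The sketch below assumes current Julia and relies on unexported internals (`Base.PCRE.ALT_BSUX`, `Base.DEFAULT_COMPILER_OPTS`, `Base.DEFAULT_MATCH_OPTS`, and the three-argument `Regex` constructor), all of which may change between releases.

```julia
# Sketch only, relying on unexported Base internals that may change.
# OR-ing ALT_BSUX into the compile options asks PCRE2 to read \uhhhh
# as a code point.
compile_opts = Base.DEFAULT_COMPILER_OPTS | Base.PCRE.ALT_BSUX

# "\\u00e9" in a normal Julia string is the six-character pattern \u00e9.
re = Regex("\\u00e9", compile_opts, Base.DEFAULT_MATCH_OPTS)

found = occursin(re, "café")
println(found)
```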

> Since PCRE is relatively small and all C, we could just build it within this PR until it gets merged and into the nightlies. Try adding pcre on this line:
>
> echo 'override STAGE1_DEPS = libuv' >> Make.user

Done, although there is still the issue of pcre_h.jl being incorrect in the nightlies (https://github.com/JuliaLang/julia/blob/master/contrib/windows/msys_build.sh#L96).

@malmaud malmaud changed the title WIP: Upgrade to PCRE2 Upgrade to PCRE2 May 28, 2015
@malmaud
Contributor Author

malmaud commented May 28, 2015

The Travis failure seems to be caused by a random parallel-processing error not related to this PR.

@tkelman
Contributor

tkelman commented May 29, 2015

Try commenting out these 4 lines - https://github.com/malmaud/julia/blob/afa14048a5551fec9ee79f9472a16fb10e260b56/contrib/windows/msys_build.sh#L94-L97

If you're downloading pcre sources, then we'll have the headers. Using pcre_h.jl from the nightlies is only necessary when trying to use the pcre dll from the nightlies, since that doesn't come with headers.

@tkelman
Contributor

tkelman commented May 29, 2015

If we can get this passing on AppVeyor, then is there any concern that this could break any regex-using packages in subtle ways? If it should be more or less compatible from the Julia side and it's just about ready, @StefanKarpinski could you take a look and make a judgment on whether or not this should be 0.4 material?

@malmaud
Contributor Author

malmaud commented May 29, 2015

> is there any concern that this could break any regex-using packages in subtle ways?

The API on the Julia side is totally unchanged, and at least officially the regex syntax and semantics of PCRE2 are not different than PCRE1. So the risk of breakage seems minimal.

@malmaud
Contributor Author

malmaud commented May 29, 2015

@tkelman
Contributor

tkelman commented May 29, 2015

Looks like that did the trick on appveyor, and this should be good to go now. No clue what's wrong with Travis though, that failure is pretty bizarre.

exception on 1: ERROR: KeyError: 2 not found
 in getindex at dict.jl:668
 in anonymous at multi.jl:1597
 in _mapreduce at reduce.jl:139
 in check_same_host at multi.jl:1597
 in anonymous at multi.jl:838
 in run_work_thunk at multi.jl:589
 in anonymous at task.jl:838
exception on 8: ERROR: LoadError: MethodError: `!` has no method matching !(::KeyError)
 in SharedArray at sharedarray.jl:41
 in include at ./boot.jl:252
 in runtests at /tmp/julia/share/julia/test/testdefs.jl:197
 in anonymous at multi.jl:838
 in run_work_thunk at multi.jl:589
 in anonymous at task.jl:838
while loading /tmp/julia/share/julia/test/parallel.jl, in expression starting on line 24
ERROR: LoadError: LoadError: MethodError: `!` has no method matching !(::KeyError)
 in anonymous at task.jl:1394
while loading /tmp/julia/share/julia/test/parallel.jl, in expression starting on line 24
while loading /tmp/julia/share/julia/test/runtests.jl, in expression starting on line 5
    From worker 8:       * parallel            

Takes about 90 seconds or so to download and build the library on appveyor, so once this gets merged and into the nightlies I'll put the appveyor script back the way it was to cut down on build time. If you need to update the library down the line in some way that needs a from-scratch version of pcre_h.jl you should know what to change to get that done, yeah?

I'll let Stefan make the final call, but thumbs up from me. Thanks for this!

@malmaud
Contributor Author

malmaud commented May 29, 2015

Thanks for working through it with me! I'm looking forward to the improved regex functionality this should enable down the line.

@tkelman
Contributor

tkelman commented May 29, 2015

Oh yeah @nalimilan any concerns about getting pcre2 packaged on any older rhel/fedora distros?

@tkelman
Contributor

tkelman commented Jun 1, 2015

Bump @StefanKarpinski could you take a look at this? The Travis error is unrelated.

@ScottPJones
Contributor

👍

@StefanKarpinski
Member

Just to be paranoid, I'm restarting that failed job (although I do believe that it's unrelated). After that, we can merge this. Thanks for putting in the elbow grease to make the upgrade happen, @malmaud!

@StefanKarpinski
Member

Ok, tests pass but I'm seeing this on startup:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5051 (2015-05-29 12:34 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 5b9b59f* (3 days old master)
|__/                   |  x86_64-apple-darwin14.1.0

ERROR: ArgumentError: embedded NUL chars are not allowed in C strings
 in unsafe_convert at /Users/stefan/projects/julia/usr/lib/julia/sys.dylib
 in exec at ./pcre.jl:126
 in match at ./regex.jl:130
 in hist_from_file at ./REPL.jl:332
 in __setup_interface#129__ at ./REPL.jl:719
 in run_frontend at ./REPL.jl:837
 in run_repl at ./REPL.jl:166
 in _start at ./client.jl:454

INFO: Disabling history file for this session.
julia>

Investigating...

function exec(re,subject,offset,options,match_data)
rc = ccall((:pcre2_match_8, PCRE_LIB), Cint,
(Ptr{Void}, Cstring, Csize_t, Csize_t, Cuint, Ptr{Void}, Ptr{Void}),
re, subject, sizeof(subject), offset, options, match_data, MATCH_CONTEXT)

Member

Ok, this is subtle but I think that using Cstring here may be inappropriate since we're a) passing the length of the data as another argument, rather than relying on NUL-termination of the data and b) apparently sometimes passing data that contains NUL bytes? cc: @stevengj

Contributor

I agree with @StefanKarpinski , it should not use Cstring, if the interface is with the length.

Member

This seems to be revealing a more troubling problem, which is that junk strings seem to be getting passed in here – such as "\t:(Int^M:N)\0|>dump\n" – note the embedded NUL byte. The real issue, of course, is what the heck is this data and why is it being passed into the exec function?

Member

Ok, this is not as nefarious as I thought it was – subject is just the string we're looking for matches in. But Cstring is inappropriate here and possibly in a few other places in this change.
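
To make the distinction concrete, here is a small sketch (assuming current Julia, and using libc's `strlen` purely as a stand-in for any C function) of how a `Cstring` argument differs from a raw pointer plus an explicit length. The subject string is a shortened, made-up one with an embedded NUL.

```julia
# Sketch only: a Cstring argument rejects embedded NULs, while a raw pointer
# plus explicit byte length carries the whole subject.
subject = "a\0b"

# Cstring promises NUL-terminated data, so the ccall conversion throws.
rejected = try
    ccall(:strlen, Csize_t, (Cstring,), subject)
    false
catch err
    err isa ArgumentError
end

# A NUL-terminated reader stops at the first NUL; sizeof counts every byte.
c_len = Int(ccall(:strlen, Csize_t, (Ptr{UInt8},), subject))
byte_len = sizeof(subject)

# The regex engine receives the explicit length, so it can match past the NUL.
m = match(r"b", subject)

println((rejected, c_len, byte_len, m.offset))
```

This is why a length-taking interface such as `pcre2_match` is better declared with a plain byte pointer for the subject argument.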

Contributor Author

You guys are totally right. And when you're right you're right.

On Monday, June 1, 2015, Stefan Karpinski wrote:

> In base/pcre.jl, #11447 (comment):
>
>  end
>
> -function exec(regex::Ptr{Void}, extra::Ptr{Void}, str::ByteString, offset::Integer,
> -              options::Integer, ovec::Vector{Int32})
> -    return exec(regex, extra, str, 0, offset, sizeof(str), options, ovec)
> +function exec(re,subject,offset,options,match_data)
> +    rc = ccall((:pcre2_match_8, PCRE_LIB), Cint,
> +               (Ptr{Void}, Cstring, Csize_t, Csize_t, Cuint, Ptr{Void}, Ptr{Void}),
> +               re, subject, sizeof(subject), offset, options, match_data, MATCH_CONTEXT)
>
> Ok, this is not as nefarious as I thought it was – subject is just the
> string we're looking for matches in. But Cstring is inappropriate here
> and possibly in a few other places in this change.

Member

I'm working on a patch that fixes this and a variety of other ccall signature problems. Some of which are probably my fault since they're inherited from the original PCRE code. We didn't have all the necessary Cfoo type aliases back then (or any of them, actually), so I just used the types that happened to be correct on my platform.
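
As a rough illustration of why the aliases matter (a sketch assuming current Julia, not taken from the patch): each alias resolves to whatever the platform's C ABI uses, so a signature written with them stays correct where a hard-coded width might not. `Clong`, for instance, differs between 64-bit Windows and 64-bit Linux.

```julia
# Sketch only: print what each C type alias resolves to on this machine.
aliases = ("Cint" => Cint, "Cuint" => Cuint, "Clong" => Clong, "Csize_t" => Csize_t)

for (name, T) in aliases
    println(rpad(name, 8), " => ", T, " (", sizeof(T), " bytes)")
end

# size_t is pointer-sized, which is why a length argument should be declared
# as Csize_t rather than a fixed 32-bit integer.
size_t_matches_pointer = sizeof(Csize_t) == sizeof(Ptr{Cvoid})
println(size_t_matches_pointer)
```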

Contributor Author

Thanks for fixing the ccall problems and merging this, Stefan.

Member

No problem. Thanks again for pushing it forward.

@nalimilan
Member

> Oh yeah @nalimilan any concerns about getting pcre2 packaged on any older rhel/fedora distros?

I haven't checked yet, but I guess it is parallel-installable with PCRE1, so shouldn't be an issue.

@StefanKarpinski StefanKarpinski merged commit 5b9b59f into JuliaLang:master Jun 1, 2015
StefanKarpinski added a commit that referenced this pull request Jun 1, 2015
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
garrison added a commit to garrison/TOML.jl that referenced this pull request Jun 14, 2015
The relevant change to Regex was made in JuliaLang/julia#11447
@stevengj stevengj mentioned this pull request Aug 17, 2015
5 participants