Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCRE compilation error for patterns ending with an escaped backslash #28175

Closed
wdebeaum opened this issue Jul 18, 2018 · 15 comments
Closed

PCRE compilation error for patterns ending with an escaped backslash #28175

wdebeaum opened this issue Jul 18, 2018 · 15 comments
Labels
strings "Strings!"

Comments

@wdebeaum
Copy link

If a Regex literal ends with a (singly) escaped backslash, you'll get this error:

julia> r"\\"
ERROR: LoadError: PCRE compilation error: \ at end of pattern at offset 1
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compile(::String, ::UInt32) at ./pcre.jl:103
 [3] compile(::Regex) at ./regex.jl:59
 [4] Regex(::String, ::UInt32, ::UInt32) at ./regex.jl:30
 [5] Regex(::String) at ./regex.jl:55
 [6] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at ./regex.jl:91
in expression starting at REPL[0]:1

julia> versioninfo()
Julia Version 0.7.0-beta2.unknown
Commit 5369849 (2018-07-18 12:53 UTC)
DEBUG build
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, core2)

It's possible to get around this by doubly escaping the backslash:

julia> occursin(r"\\\\", "\\")
true

But that's surprising to me for two reasons. One is that it's only for backslashes at the end of the pattern:

julia> occursin(r"\\\\=", "\\=")
false

julia> occursin(r"\\=", "\\=")
true

The other is that even at the end of the pattern this double escaping was unnecessary in 0.6:

julia> versioninfo()
Julia Version 0.6.3
Commit 239dcf7* (2018-07-09 17:16 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz
  WORD_SIZE: 64
  BLAS: libblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-4.0.1 (ORCJIT, core2)

julia> ismatch(r"\\", "\\")
true
@wdebeaum
Copy link
Author

This also seems to be a problem with substitution strings, e.g. in 0.6:

julia> replace("foo", r"(o+)", s"/\1\\")
"f/oo\\"

while in 0.7:

julia> replace("foo", r"(o+)" => s"/\1\\")
ERROR: Bad replacement string: /\1\
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] replace_err(::String) at ./regex.jl:283
 [3] _replace(::Base.GenericIOBuffer{Array{UInt8,1}}, ::SubstitutionString{String}, ::String, ::UnitRange{Int64}, ::Regex) at ./regex.jl:305
 [4] #replace#333(::Int64, ::Function, ::String, ::Pair{Regex,SubstitutionString{String}}) at ./strings/util.jl:459
 [5] replace(::String, ::Pair{Regex,SubstitutionString{String}}) at ./strings/util.jl:447
 [6] top-level scope at none:0

but:

julia> replace("foo", r"(o+)" => s"/\1\\\\")
"f/oo\\"

In general, I'd say that any @*_str macros that support backslash-escaped backslashes (i.e. \\ turns into \) ought to either refrain from applying that rule to trailing backslashes (since the parser already did it in order to find the close quote properly), or re-double the trailing backslashes before applying escaping rules.

@StefanKarpinski StefanKarpinski added the bug Indicates an unexpected problem or unintended behavior label Jul 27, 2018
@StefanKarpinski StefanKarpinski added this to the 1.1 milestone Jul 27, 2018
@vtjnash vtjnash added strings "Strings!" and removed bug Indicates an unexpected problem or unintended behavior labels Oct 10, 2018
@StefanKarpinski StefanKarpinski modified the milestones: 1.1, 1.2 Dec 6, 2018
@KristofferC KristofferC modified the milestones: 1.2, 1.x Apr 17, 2019
@tlienart
Copy link
Contributor

I got a similar problem while testing a package on a windows VM as well, the error message was

ERROR: PCRE compilation error: unknown property name after \P or \p at offset 33

The offending string was something like "c:\\blah\\pages" with the \\p causing an issue apparently.
This happened in my case when doing a walkdir in a windows environment and trying to match part of the file path. Adding an escape_string solves the issue.

@StefanKarpinski
Copy link
Member

Was this being passed to Regex? Or in an r"" string?

@tlienart
Copy link
Contributor

a Regex(...). Here's a contrived example which would lead to problems on windows but not on other systems, I'm definitely not saying it's the right way to do stuff. The aim of the function is to merge two folders.

function mergefolders(src, dst)
    for (root, _, files)  walkdir(src)
        for file  files
            newpath = replace(root, Regex("^$(escape_string(src))")=>"$dst")
            isdir(newpath) || mkpath(newpath)
            newpathfile = joinpath(newpath, file)
            cp(joinpath(root, file), newpathfile; force=true)
        end
    end
end

with the escape_string, things work fine, without it will work fine on linux/mac but not on windows as you may get things like \\p which would cause an error like the one mentioned above.

I wouldn't say it's a bug in this case but it definitely surprised me and the error message made little sense to me until I found this error and realised that there may be \\ in windows paths. Maybe one thing would be to escape_string elements when doing a walkdir on a windows machine?

@wdebeaum
Copy link
Author

I think @tlienart's problem really has very little to do with this issue beyond also involving regex compilation errors and escaping.

I would say that any time you're interpolating a string into a regex with the intent that it should match exactly and not be interpreted as a regex itself, and you're not absolutely sure that that string contains no special regex characters like \ or ., you should use escape_string on it. Even if you're doing a walkdir on non-Windows, filenames can still contain special characters, especially ..

@tlienart
Copy link
Contributor

That's fair enough (and I mentioned that I did not believe what I encountered was a bug), the comment was more pointing out that the error message was (to me) rather cryptic.

@zlatanvasovic
Copy link
Contributor

All you have to do to fix this is to replace \\ with [\\]. If you want one or more \ matches, use [\\]+. I experienced a similar error today and it solved the problem.

PCRE compiles regexes this way, so it's not an issue with Julia itself.

@wdebeaum
Copy link
Author

wdebeaum commented Jun 5, 2020

@zdroid just because there is a workaround for the issue doesn't mean there's no issue (you'll note I included a different workaround in the original description of this issue).

And it's not an issue with PCRE. Perl itself doesn't complain about the equivalent Perl code qr"\\". It even matches properly:

#!/usr/bin/perl
("\\" =~ qr"\\") && print "match\n";

The equivalent Julia program doesn't compile:

occursin(r"\\", "\\") && print("match\n")

@zlatanvasovic
Copy link
Contributor

I didn't say there is no issue, just that there is a simple fix.

About PCRE, well, look at the source code yourself: https://github.com/JuliaLang/julia/blob/master/base/pcre.jl#L124. It's the line that produces the error. Either the file was substantially changed for Julia 0.7 (which https://github.com/JuliaLang/julia/commits/master/base/pcre.jl doesn't agree with, as far as I can tell), or PCRE was updated to a newer version.

@wdebeaum
Copy link
Author

wdebeaum commented Jun 5, 2020

I don't think what changed in Julia 0.7 to cause this was anything to do with the actual call into the PCRE library that you point out. Rather it was a change in how custom string literals (which include regexes, substitutions, and others) are processed in general. The issue that change was supposed to fix is #22926, the change is f356869, and another (still open) issue that change seems to have caused besides this one is #28261. It might be that a fix for the present issue involves changing the definition of @r_str here: https://github.com/JuliaLang/julia/blob/master/base/regex.jl#L109 (as well as @s_str and any other custom string macro implementing backslash escapes).

@omus
Copy link
Member

omus commented Dec 18, 2020

Another example:

julia> Regex("\\\\")
r"\\"

julia> r"\\"
ERROR: LoadError: PCRE compilation error: \ at end of pattern at offset 1
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compile(::String, ::UInt32) at ./pcre.jl:128
 [3] compile(::Regex) at ./regex.jl:72
 [4] Regex(::String, ::UInt32, ::UInt32) at ./regex.jl:37
 [5] Regex(::String) at ./regex.jl:60
 [6] @r_str(::LineNumberNode, ::Module, ::Any) at ./regex.jl:109
in expression starting at REPL[5]:1

Julia 1.5.3

@vtjnash vtjnash closed this as completed Feb 3, 2021
@vtjnash
Copy link
Member

vtjnash commented Feb 3, 2021

The various parsing and printing issues mentioned here have been fixed

@wdebeaum
Copy link
Author

wdebeaum commented Apr 6, 2021

Have they, though? I just got around to testing this out, but the original example still didn't work how I expected it to in the latest release (1.6.0). So I got the nightly build and tested it too, and got the same results:

julia> r"\\"
ERROR: LoadError: PCRE compilation error: \ at end of pattern at offset 1
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[1]:1

julia> versioninfo()
Julia Version 1.7.0-DEV.847
Commit fedefe913a* (2021-04-06 03:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, nehalem)

and some of the later examples:

julia> replace("foo", r"(o+)" => s"/\1\\")
ERROR: Bad replacement string: /\1\
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] replace_err(repl::String)
   @ Base ./regex.jl:527
 [3] _replace(io::IOBuffer, repl_s::SubstitutionString{String}, str::String, r::UnitRange{Int64}, re::Base.RegexAndMatchData)
   @ Base ./regex.jl:552
 [4] replace(str::String, pat_repl::Pair{Regex, SubstitutionString{String}}; count::Int64)
   @ Base ./strings/util.jl:542
 [5] replace(str::String, pat_repl::Pair{Regex, SubstitutionString{String}})
   @ Base ./strings/util.jl:525
 [6] top-level scope
   @ REPL[3]:1

julia> occursin(r"\\", "\\") && print("match\n")
ERROR: LoadError: PCRE compilation error: \ at end of pattern at offset 1
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[4]:1

julia> Regex("\\\\")
r"\\\\"

This last one seems to be the only one that really changed, and I'm not sure that's even the right change. It does at least mean that the printed representation of a Regex can be parsed back as the same Regex. But I would prefer to see Regex("\\\\") print as r"\\", and r"\\" parse as Regex("\\\\"). That would more closely resemble how escaping within regex literals works in Perl and other languages that have regex literals:

Perl:

("\\" =~ qr"\\") && print "match\n";
# or more idiomatically:
("\\" =~ /\\/) && print "match\n";

Ruby:

("\\" =~ %r"\\") and puts "match"
("\\" =~ /\\/) and puts "match"

JavaScript:

/\\/.test("\\") && console.log("match")

All of the above lines print "match", and don't cause compilation errors.

Also, the equivalents of Regex("\\\\") in Ruby and JavaScript both return something that prints as /\\/ (I'm not sure how to express this in Perl):

Ruby:

irb(main):001:0> Regexp.new("\\\\")
=> /\\/

JavaScript:

> new RegExp("\\\\")
/\\/

@vtjnash
Copy link
Member

vtjnash commented Apr 6, 2021

That is not a parser issue, though it is a possible breaking change.

@wdebeaum
Copy link
Author

wdebeaum commented Apr 6, 2021

Maybe I should have said "evaluate to" instead of "parse as" in my last comment. But in the original description of this issue I didn't say it was only an issue with parsing and printing. So why did you close the issue when you considered only the parsing and printing issues to be fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

7 participants