Interaction with backreferences / variable-width escape sequences #17

nikic · 2015-06-18T17:57:09Z

Using something like new RegExp("(foo)\\1" + RegExp.escape(input)), if input were to start with a number, this would extend the backreference \1 to something like \11. Does this need to be accounted for?

The text was updated successfully, but these errors were encountered:

benjamingr · 2015-06-18T18:00:47Z

@nikic good question.

At this moment we are not escaping anything that has any context, similarly ?! isn't escaped, - isn't escaped and so on.

Basically, whenever you need context sensitivity you'd do:

new RegExp("(foo)\1(" + RegExp.escape(input) +")")

Accounting for all context cases sounds problematic.

nikic · 2015-06-18T18:03:41Z

@benjamingr I think there is a difference between this case and something like ?! though: It could only become an issue if you insert into a malformed regular expression, i.e. you'd have to write "(?" + RegExp.escape(...) + ")", where the syntax (?...) by itself (i.e. not followed by : or any of the other usual group flavors) is not valid.

bwoebi · 2015-06-18T18:35:08Z

@benjamingr It also is possible to just add a non-capturing group around the whole; that way it is context-independent. At that point it's then also enough to have the default set of escapes which we have currently. … Except that it will bork with [] character classes.

So, generally, I think that's an unavoidable issue and must be handled by the programmer.

benjamingr · 2015-06-19T09:21:00Z

@bwoebi I don't think we have those (non capturing groups).

@nikic yes, this is more like that - case or other cases where it can change the context of the RegExp.

bwoebi · 2015-06-19T12:45:44Z

@benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0} as prefix, which would work in JS too. But point is, that'd brok again with character classes.

Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.

benjamingr · 2015-06-19T12:48:09Z

Can you write a short "proof" illustrating that it is impossible (or hard)
to write escape in a way that would allow escaping in a way that would
preserve context? (Without non-capturing groups)

Sensitive escaping might be our plan for RegExp.tag in the future (name not
final).

On Fri, Jun 19, 2015 at 3:45 PM, Bob Weinand [email protected]
wrote:

@benjamingr https://github.com/benjamingr We don't have non capturing
groups, but that's just the "proper" way to delimit that… one also could
use .{0} as prefix, which would work in JS too. But point is, that'd brok
again with character classes.

Issue is that it's not really possible to have real context sensitive
escaping… the only idea I have for that is how e.g. prepared statements
work with mysql… you basically have placeholders in your query/regex and
then can context-sensitively escape them.

—
Reply to this email directly or view it on GitHub
#17 (comment)
.

nikic · 2015-06-19T12:50:33Z

I don't see what would be hard about it, just escape - and digits additionally (at least if occurring as first character). I don't think the problem here is technical, it's more a question of whether or not such cases should be supported.

(Where "escaping" for digits would mean encoding them.)

bwoebi · 2015-06-19T12:59:43Z

@benjamingr Point is that we cannot reliably prevent that very case @nikic did mention in his initial post. If you insert any characters before the string (note that we cannot escape numbers, that'd mean a backreference), it won't be compatible with a character class as it counts as a symbol there.

@nikic How do you want to escape numbers? By their octal representation? Then RegExp.escape("0") +"0" will end up as "\300" … character number 192 then. Putting a backslash in front would mean a back-reference… so, what now?

EDIT:

Hmm... okay, we can hex-escape that then… that will work. So, turns out I was wrong a bit and it's indeed possible to properly escape it

nikic · 2015-06-19T13:03:21Z

@bwoebi As said, by encoding them. I am not terribly familiar with details of JS regex flavor, but presumably it has some kind of fixed-width escape sequence. Probably \xHH or \uHHHH.

bwoebi · 2015-06-19T13:04:32Z

@nikic yeah, see my EDIT, noticed that just now.

benjamingr · 2015-06-19T13:09:11Z

Numbers like 1 can be escaped like [1] but I don't see a way to reliably escape things.

Adding a capturing group will mess with back-references.

Other solutions seem to have issues with quantifiers and context sensitive inserts. I don't see a way we can solve this from the library level - at least without parsing the RegExp ourselves first.

domenic · 2015-06-19T15:12:59Z

I'm not sure I fully understand all the issues here, but the sense I'm getting is that maybe just going for an escape-everything approach (#15) would solve things? (EDIT: I saw the doc about how other languages don't like that idea, interesting.)

domenic · 2015-06-19T15:13:54Z

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.

benjamingr · 2015-06-19T15:18:19Z

Python is moving to a escape only metacharacters approach in its new
regex module (replacing re) by the way.
https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-409

(Courtesy of Martijn Pieters)

What this issue points out is that escaping everything wouldn't work and
that escaping needs to be context aware so it is the programmer who has to
handle escaping in these cases.

The two other participants are PHP core contributors who I invited to
participate in the discussion by the way :)

On Fri, Jun 19, 2015 at 6:13 PM, Domenic Denicola [email protected]
wrote:

In any case, I think the important thing to distill out of this thread is
a list of failure scenarios for the current spec, and put those into the
readme. Then we can judge how problematic those are by asking if they're
ridiculous edge cases or actually quite believable.

—
Reply to this email directly or view it on GitHub
#17 (comment)
.

nikic · 2015-06-19T15:31:35Z

Speaking in formal, the suggestion here is basically this:

REPLACE

If c is matched by SyntaxCharacter then do:
1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
Append the elements of the UTF16Encoding (10.1.1) of c to cuList.

WITH

If c is matched by SyntaxCharacter then do:
1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
2. Append the elements of the UTF16Encoding (10.1.1) of c to cuList.
Otherwise, if c is matched by HexDigit (*) then do:
1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
2. Append code unit 0x0078 (LATIN SMALL LETTER X) to cuList.
3. Append two code units that are the result of converting the numeric code point value of c to a String of two hexadecimal digits. Alphabetic hexadecimal digits are presented as lowercase Latin letters.
Otherwise append the elements of the UTF16Encoding (10.1.1) of c to cuList.

(*) Potential addition: "And c is the first element of cpList"

Btw, just for the record PHP also ignores the issue with backreferences and escapes only meta characters (albeit a larger number). Doing that has a lot of precedent in other languages.

benjamingr · 2015-06-19T16:54:59Z

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme.

Definitely @domenic , I'm not sure the discussion has exhausted itself yet though.

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

It would also have the effect of making the regexp created less readable. Wouldn't it make more sense to just insert an empty capturing group into the RegExp if the first literal matches HexDigit?

ljharb · 2015-06-19T17:14:11Z

@benjamingr @bwoebi Isn't ?: how you denote a non-capturing group in JS regex? See RegExp.prototype.source as an example.

benjamingr · 2015-06-19T17:16:39Z

@ljharb oh wow yeah, it is. I have no idea how I totally missed that while I used it last week at least twice or when I read the spec for .source today.

Just prefixing and postfixing the RegExp with (?:) would not solve all context sensitivity issues. For example, it would not work if you're inside a set match ([ ]). See any way around it?

nikic · 2015-06-19T17:17:59Z

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

I'm not sure I see the issue you're referring to here, could you elaborate? Apart from use of - in there, but that can be escaped as well.

benjamingr · 2015-06-20T09:33:27Z

@nikic as an expert, do you think the readability impact of the resulting regular expression is an important factor? If so - do you think this guarantee is worth it?

benjamingr · 2015-06-20T11:04:15Z

Superseding this with #29

benjamingr added the discussion appreciated label Jun 18, 2015

benjamingr mentioned this issue Jun 19, 2015

Escape -? #28

Closed

benjamingr closed this as completed Jun 20, 2015

benjamingr mentioned this issue Jan 28, 2021

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

Closed

bergus mentioned this issue Nov 27, 2023

Which leading characters should be escaped? #66

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interaction with backreferences / variable-width escape sequences #17

Interaction with backreferences / variable-width escape sequences #17

nikic commented Jun 18, 2015

benjamingr commented Jun 18, 2015

nikic commented Jun 18, 2015

bwoebi commented Jun 18, 2015

benjamingr commented Jun 19, 2015

bwoebi commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

bwoebi commented Jun 19, 2015

nikic commented Jun 19, 2015

bwoebi commented Jun 19, 2015

benjamingr commented Jun 19, 2015

domenic commented Jun 19, 2015

domenic commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

benjamingr commented Jun 19, 2015

ljharb commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

benjamingr commented Jun 20, 2015

benjamingr commented Jun 20, 2015

Interaction with backreferences / variable-width escape sequences #17

Interaction with backreferences / variable-width escape sequences #17

Comments

nikic commented Jun 18, 2015

benjamingr commented Jun 18, 2015

nikic commented Jun 18, 2015

bwoebi commented Jun 18, 2015

benjamingr commented Jun 19, 2015

bwoebi commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

bwoebi commented Jun 19, 2015

nikic commented Jun 19, 2015

bwoebi commented Jun 19, 2015

benjamingr commented Jun 19, 2015

domenic commented Jun 19, 2015

domenic commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

benjamingr commented Jun 19, 2015

ljharb commented Jun 19, 2015

benjamingr commented Jun 19, 2015

nikic commented Jun 19, 2015

benjamingr commented Jun 20, 2015

benjamingr commented Jun 20, 2015