Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid match offsets with optional groups #7

Closed
guibou opened this issue Nov 8, 2019 · 6 comments
Closed

Invalid match offsets with optional groups #7

guibou opened this issue Nov 8, 2019 · 6 comments

Comments

@guibou
Copy link

guibou commented Nov 8, 2019

The following example is invalid:

Control.Lens.Regex.Text Control.Lens> toListOf ([regex|A(x)?(B)|] . matchAndGroups) "AB"
[("AB",["",""])]

I was expecting ("AB", ["", "B"]). Note that matching AxB returns the correct result [("AxB", ["x", "B"])] and changing the regex to Ax?(B) returns the correct content for the last group.

Note that the same issue happen with the Text and ByteString interfaces on version 1.0.0.0.

pcre-heavy does not have this problem:

Prelude Text.Regex.PCRE.Heavy> scan [re|A(x)?(B)|] "AB"
[("AB",["","B"])]
Prelude Text.Regex.PCRE.Heavy> scan [re|A(x)?(B)|] "AxB"
[("AxB",["x","B"])]
@guibou guibou changed the title Invalid matches offsets whith optional groups Invalid matche offsets with optional groups Nov 8, 2019
@guibou guibou changed the title Invalid matche offsets with optional groups Invalid match offsets with optional groups Nov 8, 2019
@ChrisPenner
Copy link
Owner

ChrisPenner commented Nov 8, 2019

Yeah; this is a confusing issue to solve, moreso do to ambiguity in semantics than anything, for instance what should be the behaviour of something like: (A)(B)|(C) for instance? There are many more complex examples than unfortunately make this pretty tricky to handle. You're the first to need it so far, so I'll take another look at it to see if I can find a decent strategy that works sensibly in all these situations (unfortunately pcre-heavy doesn't support named groups).

I've had an ignored test case for something like this sitting around, maybe now's a good time to figure out what the semantics should be haha.

Following pcre-heavy behaviour is probably the most sensible thing, but I want to make sure I do due diligence on this 😄

Can you add a bit of detail about the types of cases you'll be using it so I make sure I cover all the cases in the test suite? Thanks!

@ChrisPenner
Copy link
Owner

Looks like this is actually affecting the match groups too:

>>> "xQB" ^.. [regex|x(A)|x(Q)B|] . groups
[["","B"]]

Can hopefully solve them both at once 🤞

@guibou
Copy link
Author

guibou commented Nov 8, 2019

Hello.

I need to parse github urls, using something such as: (https?://)?(www.)?github.com/(.*?)/(.*?)/issues/([0-9]+). See the current results:

>>> fileContent = "hello https://github.com/foo/bar/issues/123 hello I'm a cow https://github.com/guibou/PyF/issues/12345 https://www.github.com/Foo/Bar/issues/1234"
>>> mapM_ print $ toListOf (regex [rx|(https?://)?(www.)?github.com/(.*?)/(.*?)/issues/([0-9]+)|] . matchAndGroups) fileContent
("https://github.com/foo/bar/issues/123",["http://","","/12","",""])
("https://github.com/guibou/PyF/issues/12345",["https://","","","",""])
("https://www.github.com/Foo/Bar/issues/1234",["https://","www.","Foo","Bar","1234"])

My issue is that non-optional capturing groups can be empty when they should not or may not be empty but wrong. Especially see the /12 group in the first result.

Note that if the "optional" groups are set to non capturing (with ?:), it works as expected:

mapM_ print $ toListOf (regex [rx|(?:https?://)?(?:www.)?github.com/(.*?)/(.*?)/issues/([0-9]+)|] . matchAndGroups) "hello https://github.com/foo/bar/issues/123 hello I'm a cow https://github.com/guibou/PyF/issues/12345 https://www.github.com/Foo/Bar/issues/1234"
("https://github.com/foo/bar/issues/123",["foo","bar","123"])
("https://github.com/guibou/PyF/issues/12345",["guibou","PyF","12345"])
("https://www.github.com/Foo/Bar/issues/1234",["Foo","Bar","1234"])

If I want to capture (https?://)?, it can also rewrite it like that:

mapM_ print $ toListOf (regex [rx|(https?://|)(?:www.)?github.com/(.*?)/(.*?)/issues/([0-9]+)|] . matchAndGroups) "hello https://github.com/foo/bar/issues/123 hello I'm a cow https://github.com/guibou/PyF/issues/12345 https://www.github.com/Foo/Bar/issues/1234 github.com/Chien/Rat/issues/1234 www.github.com/Poulet/tortue/issues/12 https://www.github.com/tortue/hibou/issues/12"
("https://github.com/foo/bar/issues/123",["http://","foo","bar","123"])
("https://github.com/guibou/PyF/issues/12345",["https://","guibou","PyF","12345"])
("https://www.github.com/Foo/Bar/issues/1234",["https://","Foo","Bar","1234"])
("github.com/Chien/Rat/issues/1234",["","Chien","Rat","1234"])
("www.github.com/Poulet/tortue/issues/12",["","Poulet","tortue","12"])
("https://www.github.com/tortue/hibou/issues/12",["https://","tortue","hibou","12"])

In this implementation, all "optional" groups are not capturing. Actually that's satisfying.

what should be the behaviour of something like: (A)(B)|(C) for instance?

Here pcre-heavy gives:

Prelude Text.Regex.PCRE.Heavy> scan [re|(A)(B)|(C)|] "AB"
[("AB",["A","B"])]
Prelude Text.Regex.PCRE.Heavy> scan [re|(A)(B)|(C)|] "C"
[("C",["","","C"])]

Which is not intuitive, but at least groups contains values which represents the content of actual groups as defined by the regex.

That one is weird too ;):

Prelude Text.Regex.PCRE.Heavy> scan [re|((.))+|] "ABCDEF"
[("ABCDEF",["F","F"])]

Following pcre-heavy behaviour is probably the most sensible thing, but I want to make sure I do due diligence on this 

Thank to our discussion, I'm now convinced that the problem is more fundamental and cannot be easily solved and the best solution is to rewrite the regex in a non-ambiguous way. However I agree with you that it may be nice to follow the result of pcre-heavy, which even if the returned groups may not be intuitive (see my last example especially), they are at least not truncated or "wrongly" empty.

@ChrisPenner
Copy link
Owner

ChrisPenner commented Nov 9, 2019

Thanks for the explanation, I'm glad you found something that works for you!

I've adapted the behaviour for optional groups to be more in line with pcre-heavy on latest head, so you can use: e83feec if you need the new behaviour..

I'm going to investigate a little further before cutting a new release because it looks like there are some issues with handling nested groups (which is what's showing up in your last example there: [re|((.))+|]); however I have a feeling those will be much trickier to solve; at the very least it will involve breaking a few traversal laws since the same match text can appear in multiple groups if they're nested. I'll need to figure out what it means to edit one of these... It may be that lens-regex-pcre will just require you to use non-capturing groups when nesting, and that should all work out.

I'll leave this open until I cut a release; let me know if you discover anything else. Hope the library is working out for you despite the issues!

@guibou
Copy link
Author

guibou commented Nov 15, 2019

Thank you. The fix indeed works!

The library is indeed wonderful and works for most of my needs. For this specific topic I ended using raw scan from Text.Regex.PCRE.Heavy because I realized that I don't need anything else, but I'll definitly use lens-regex-pcre in my future developments if it involves more heavy regex work.

@ChrisPenner
Copy link
Owner

Shipped an update: http://hackage.haskell.org/package/lens-regex-pcre-1.1.0.0

Also adds traversals for named groups which might also help with this 👍

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants