-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
--smart-case detection of upper-case characters is unintuitive #717
Comments
Haha yeah I am not a fan of it, but recognize that others like it. It should definitely work intuitively, and I basically agree with your analysis here. I think you are on the right track with respect to fixing this. The problem is that the regex "AST" isn't a true faithful AST. The parser itself does various transformations that cannot be reversed, so there is no way to "see" that a I have been working on a rewrite of the parser for quite some time now, and with it, a proper AST. Once that is active (I'm not sure when that will be), then we can scan the AST and apply this heuristic with as much precision as we care to. |
@okdana If you wanted to do this in the shorter term to get at least some simple cases right, then I'd be in favor of this. I believe the check is done in the |
Fixes BurntSushi#717 (partially) The previous implementation of the smart-case feature was actually *too* smart, in that it inspected the final character ranges in the AST to determine if the pattern contained upper-case characters. This meant that patterns like `foo\w` would not be handled case-insensitively, since `\w` includes the range of upper-case characters A–Z. As a medium-term solution to this problem, we now inspect the input pattern itself for upper-case characters, ignoring any that immediately follow a `\`. This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though it still has problems with more complex features like `\p{Ll}`. Handling those correctly will require improvements to the AST.
Fixes #717 (partially) The previous implementation of the smart-case feature was actually *too* smart, in that it inspected the final character ranges in the AST to determine if the pattern contained upper-case characters. This meant that patterns like `foo\w` would not be handled case-insensitively, since `\w` includes the range of upper-case characters A–Z. As a medium-term solution to this problem, we now inspect the input pattern itself for upper-case characters, ignoring any that immediately follow a `\`. This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though it still has problems with more complex features like `\p{Ll}`. Handling those correctly will require improvements to the AST.
@okdana With the new AST coming in soon, what do you think about this set of rules for smart case? When the Examples: |
Hmm. I can't think of any case where that wouldn't make sense. Yeah, sounds cool. |
The
--smart-case
option doesn't behave the way i expect in terms of determining whether the pattern contains upper-case characters. Specifically, a search like the following will fail:This is because it expands the class
\w
and discovers that it contains the range of upper-case charactersA-Z
.ag
uses a different metric: It simply checks the input pattern byte by byte to see if there's an upper-case character in the ASCII range. This works out in thefoo\w
case, but it will fail in the similarly obvious case offoo\S
(whichrg
happens to get right) — not to mention multi-byte stuff likeÉ
andß
.There are also more complicated cases where they both fail in different ways, like this:
rg
returns a match here, because\p{Ll}
expands to only lower-case characters. But i think this should be treated as an 'explicit' range, like[A-Z]
would be.At the very least, though, i feel like something as basic as
foo\w
should probably work as expected with smart case.Is there any way to determine through the AST the actual input that produced a class? Or would that be easy to add? I'm not sure i'm at the level yet where i should be messing with the
regex
crate, but that seems like it'd be a convenient way to sort it.The easiest 'good enough' alternative that comes to mind would be to do a semi-naïve scan of the pattern that's just smart enough to ignore stuff like short-hand classes, but that feels kind of dumb (especially since i get the impression that you don't really care for this feature in the first place...).
The text was updated successfully, but these errors were encountered: