--smart-case detection of upper-case characters is unintuitive #717

okdana · 2017-12-17T22:51:46Z

The --smart-case option doesn't behave the way i expect in terms of determining whether the pattern contains upper-case characters. Specifically, a search like the following will fail:

rg -S 'foo\w' <<< FOOBAR

This is because it expands the class \w and discovers that it contains the range of upper-case characters A-Z.

ag uses a different metric: It simply checks the input pattern byte by byte to see if there's an upper-case character in the ASCII range. This works out in the foo\w case, but it will fail in the similarly obvious case of foo\S (which rg happens to get right) — not to mention multi-byte stuff like É and ß.

There are also more complicated cases where they both fail in different ways, like this:

rg -S '\p{Ll}' <<< FOOBAR

rg returns a match here, because \p{Ll} expands to only lower-case characters. But i think this should be treated as an 'explicit' range, like [A-Z] would be.

At the very least, though, i feel like something as basic as foo\w should probably work as expected with smart case.

Is there any way to determine through the AST the actual input that produced a class? Or would that be easy to add? I'm not sure i'm at the level yet where i should be messing with the regex crate, but that seems like it'd be a convenient way to sort it.

The easiest 'good enough' alternative that comes to mind would be to do a semi-naïve scan of the pattern that's just smart enough to ignore stuff like short-hand classes, but that feels kind of dumb (especially since i get the impression that you don't really care for this feature in the first place...).

The text was updated successfully, but these errors were encountered:

BurntSushi · 2017-12-18T12:11:00Z

(especially since i get the impression that you don't really care for this feature in the first place...)

Haha yeah I am not a fan of it, but recognize that others like it. It should definitely work intuitively, and I basically agree with your analysis here.

I think you are on the right track with respect to fixing this. The problem is that the regex "AST" isn't a true faithful AST. The parser itself does various transformations that cannot be reversed, so there is no way to "see" that a \w was actually written as a \w instead of a [...].

I have been working on a rewrite of the parser for quite some time now, and with it, a proper AST. Once that is active (I'm not sure when that will be), then we can scan the AST and apply this heuristic with as much precision as we care to.

BurntSushi · 2017-12-18T12:12:01Z

The easiest 'good enough' alternative that comes to mind would be to do a semi-naïve scan of the pattern that's just smart enough to ignore stuff like short-hand classes

@okdana If you wanted to do this in the shorter term to get at least some simple cases right, then I'd be in favor of this. I believe the check is done in the grep crate in this repo.

Fixes BurntSushi#717 (partially) The previous implementation of the smart-case feature was actually *too* smart, in that it inspected the final character ranges in the AST to determine if the pattern contained upper-case characters. This meant that patterns like `foo\w` would not be handled case-insensitively, since `\w` includes the range of upper-case characters A–Z. As a medium-term solution to this problem, we now inspect the input pattern itself for upper-case characters, ignoring any that immediately follow a `\`. This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though it still has problems with more complex features like `\p{Ll}`. Handling those correctly will require improvements to the AST.

Fixes #717 (partially) The previous implementation of the smart-case feature was actually *too* smart, in that it inspected the final character ranges in the AST to determine if the pattern contained upper-case characters. This meant that patterns like `foo\w` would not be handled case-insensitively, since `\w` includes the range of upper-case characters A–Z. As a medium-term solution to this problem, we now inspect the input pattern itself for upper-case characters, ignoring any that immediately follow a `\`. This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though it still has problems with more complex features like `\p{Ll}`. Handling those correctly will require improvements to the AST.

BurntSushi · 2018-03-10T04:01:14Z

@okdana With the new AST coming in soon, what do you think about this set of rules for smart case?

When the --smart-case flag is given, ripgrep chooses to match case insensitively if and only if there is at least one literal character in the pattern and that all such literals are not considered as uppercase. If the pattern contains no literals or at least one uppercase literal character, then normal case sensitive search is used.

Examples: abc, [a-z], abc\w and abc\p{Ll} are all searched case insensitively while aBc, [A-Z], aBc\w and \p{Ll} are searched case sensitively.

okdana · 2018-03-10T05:11:32Z

Hmm. I can't think of any case where that wouldn't make sense. Yeah, sounds cool.

BurntSushi added the bug A bug. label Dec 18, 2017

okdana mentioned this issue Dec 18, 2017

Improve smartness of --smart-case #718

Merged

BurntSushi closed this as completed in #718 Dec 18, 2017

roblourens mentioned this issue Jan 4, 2018

Enable "smart case" for search microsoft/vscode#41119

Closed

BurntSushi mentioned this issue Mar 10, 2018

improve --smart-case once and for all #851

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

--smart-case detection of upper-case characters is unintuitive #717

--smart-case detection of upper-case characters is unintuitive #717

okdana commented Dec 17, 2017 •

edited

Loading

BurntSushi commented Dec 18, 2017

BurntSushi commented Dec 18, 2017 •

edited

Loading

BurntSushi commented Mar 10, 2018 •

edited

Loading

okdana commented Mar 10, 2018

--smart-case detection of upper-case characters is unintuitive #717

--smart-case detection of upper-case characters is unintuitive #717

Comments

okdana commented Dec 17, 2017 • edited Loading

BurntSushi commented Dec 18, 2017

BurntSushi commented Dec 18, 2017 • edited Loading

BurntSushi commented Mar 10, 2018 • edited Loading

okdana commented Mar 10, 2018

okdana commented Dec 17, 2017 •

edited

Loading

BurntSushi commented Dec 18, 2017 •

edited

Loading

BurntSushi commented Mar 10, 2018 •

edited

Loading