Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit custom character class ranges to single scalars #422

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

natecook1000
Copy link
Member

As shown by issue #401, the standard lexicographic-ordering-based comparisons for characters yield very unexpected results when matching with ranges in custom character classes. This change resolves that unexpected behavior by only matching single-scalar characters within ranges and only allowing single-scalar characters to be range endpoints.

Multi-scalar characters are still allowed within custom character classes, and meta characters and built-in character classes continue to function in the same way as before. With this change, we have the following behavior:

try /[1-2]/.wholeMatch(in: "1")     // "1"
try /[12]/.wholeMatch(in: "1")      // "1"
try /[1-2]/.wholeMatch(in: "1️⃣")    // nil
try /[12]/.wholeMatch(in: "1️⃣")     // nil

try /\d/.wholeMatch(in: "1️⃣")                      // "1️⃣"
try /\d/.asciiOnlyDigits().wholeMatch(in: "1️⃣")    // nil
try /[\d]/.wholeMatch(in: "1️⃣")                    // "1️⃣"
try /[\d]/.asciiOnlyDigits().wholeMatch(in: "1️⃣")  // nil

try /[🇦🇫-🇿🇼]/.wholeMatch(in: "Flags! 🇬🇭🇰🇷")       // error: invalid character class range

Character class ranges don't work well with multi-scalar inputs,
in either the range or the matched character. This change limits
range endpoints to single-scalar characters and matches only
characters that are themselves a single scalar.

Fixes issue swiftlang#407, which now displays this behavior:

```
try /[1-2]/.wholeMatch(in: "1️⃣")     // nil
try /[12]/.wholeMatch(in: "1️⃣")      // nil
try /(?U)[\d]/.wholeMatch(in: "1️⃣")  // nil
```
This applies the current matching semantics for character classes,
matching either characters or Unicode scalars depending on the
current options.
@natecook1000
Copy link
Member Author

@swift-ci Please test

The prior implementation didn't make a lot of sense, and couldn't
handle cases like `/(?i)[X-c]/`. This new approach uses simple case
matching to test if the character is within the range, then tests if
the uppercase or lowercase mappings are within the range.

Fixes swiftlang#395
@natecook1000
Copy link
Member Author

@swift-ci Please test

@@ -771,7 +771,7 @@ extension AST.Atom {
/// range.
public var isValidCharacterClassRangeBound: Bool {
// If we have a literal character value for this, it can be used as a bound.
if literalCharacterValue != nil { return true }
if literalCharacterValue?.hasExactlyOneScalar == true { return true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does normalization affect this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does for sure, it would help if we normalized all characters as we stored them in the custom character class model.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which normalization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NFC would allow us to permit as many characters as possible to act as endpoints.

@@ -19,6 +19,13 @@ extension Substring {
var string: String { String(self) }
}

extension Character {
/// Whether this character is made up of exactly one Unicode scalar value.
public var hasExactlyOneScalar: Bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this public?

Copy link
Member Author

@natecook1000 natecook1000 May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously in the _StringProcessing module; we need it in _RegexParser for the compile-time validation.

return input[curIdx].lowercased() == c.lowercased()
? nextIndex
: nil
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove else

return input[curIdx] == c
? nextIndex
: nil
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove else

? nextIndex
: nil
} else {
// FIXME: How do multi-scalar characters match in case insensitive mode?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an issue or something to clarify this?

Comment on lines +295 to +297
let nextIndex = isCharacterSemantic
? input.index(after: curIdx)
: input.unicodeScalars.index(after: curIdx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really feel like this should be a helper function or fully refactored. It seems ripe for bugs if vigilance is required to remember to support scalar semantics

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it a helper function, but the real issue is that the two views share the same index type, and we can't enforce that you call the helper function. Maybe we could look at designing a wrapper type that would distinguish between the two? Seems like it add a lot of friction…

struct ViewLockedIndex<View> {
    var index: String.Index
}

extension String {
    func index(after i: ViewLockedIndex<String>) -> ViewLockedIndex<String> {
        .init(index: index(after: i.index))
    }

    func index(after i: ViewLockedIndex<String.UnicodeScalarView>) -> ViewLockedIndex<String.UnicodeScalarView>
        .init(index: self.unicodeScalars.index(after: i.index))
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is rather than have a bunch of inner-most ifs, what would the code look like to have an outer-if? That is a more bottoms-up refactoring

let nextIndex = isCharacterSemantic
? input.index(after: curIdx)
: input.unicodeScalars.index(after: curIdx)
if isCharacterSemantic && !input[curIdx].hasExactlyOneScalar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do to non-NFC characters?

&& stringRange.contains(scalar.properties.lowercaseMapping))
|| (scalar.properties.changesWhenUppercased
&& stringRange.contains(scalar.properties.uppercaseMapping)) {
return nextIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going back to lexicographical contains?

// Ranges can be formed with single-scalar characters, and can only
// match as such.
// FIXME: Convert to canonical composed version before testing?
guard character.hasExactlyOneScalar else { return false }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just wrong and breaks canonical equivalence.

firstMatchTest(#"e[\u{301}]$"#, input: eComposed, match: nil)
firstMatchTest(#"e[\u{300}-\u{320}]$"#, input: eComposed, match: nil)
firstMatchTest(#"[e][\u{300}-\u{320}]$"#, input: eComposed, match: nil)
firstMatchTest(#"[a-z][\u{300}-\u{320}]$"#, input: eComposed, match: nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need positive tests for grapheme-semantics under canonical equivalence.

@hamishknight
Copy link
Contributor

Picking up this work in #570

@stephentyrone
Copy link
Contributor

Is this obsoleted by #570?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants