-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit custom character class ranges to single scalars #422
base: main
Are you sure you want to change the base?
Conversation
Character class ranges don't work well with multi-scalar inputs, in either the range or the matched character. This change limits range endpoints to single-scalar characters and matches only characters that are themselves a single scalar. Fixes issue swiftlang#407, which now displays this behavior: ``` try /[1-2]/.wholeMatch(in: "1️⃣") // nil try /[12]/.wholeMatch(in: "1️⃣") // nil try /(?U)[\d]/.wholeMatch(in: "1️⃣") // nil ```
This applies the current matching semantics for character classes, matching either characters or Unicode scalars depending on the current options.
@swift-ci Please test |
The prior implementation didn't make a lot of sense, and couldn't handle cases like `/(?i)[X-c]/`. This new approach uses simple case matching to test if the character is within the range, then tests if the uppercase or lowercase mappings are within the range. Fixes swiftlang#395
@swift-ci Please test |
@@ -771,7 +771,7 @@ extension AST.Atom { | |||
/// range. | |||
public var isValidCharacterClassRangeBound: Bool { | |||
// If we have a literal character value for this, it can be used as a bound. | |||
if literalCharacterValue != nil { return true } | |||
if literalCharacterValue?.hasExactlyOneScalar == true { return true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does normalization affect this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does for sure, it would help if we normalized all characters as we stored them in the custom character class model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which normalization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NFC would allow us to permit as many characters as possible to act as endpoints.
@@ -19,6 +19,13 @@ extension Substring { | |||
var string: String { String(self) } | |||
} | |||
|
|||
extension Character { | |||
/// Whether this character is made up of exactly one Unicode scalar value. | |||
public var hasExactlyOneScalar: Bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was previously in the _StringProcessing module; we need it in _RegexParser for the compile-time validation.
return input[curIdx].lowercased() == c.lowercased() | ||
? nextIndex | ||
: nil | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove else
return input[curIdx] == c | ||
? nextIndex | ||
: nil | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove else
? nextIndex | ||
: nil | ||
} else { | ||
// FIXME: How do multi-scalar characters match in case insensitive mode? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an issue or something to clarify this?
let nextIndex = isCharacterSemantic | ||
? input.index(after: curIdx) | ||
: input.unicodeScalars.index(after: curIdx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really feel like this should be a helper function or fully refactored. It seems ripe for bugs if vigilance is required to remember to support scalar semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make it a helper function, but the real issue is that the two views share the same index type, and we can't enforce that you call the helper function. Maybe we could look at designing a wrapper type that would distinguish between the two? Seems like it add a lot of friction…
struct ViewLockedIndex<View> {
var index: String.Index
}
extension String {
func index(after i: ViewLockedIndex<String>) -> ViewLockedIndex<String> {
.init(index: index(after: i.index))
}
func index(after i: ViewLockedIndex<String.UnicodeScalarView>) -> ViewLockedIndex<String.UnicodeScalarView>
.init(index: self.unicodeScalars.index(after: i.index))
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is rather than have a bunch of inner-most ifs, what would the code look like to have an outer-if? That is a more bottoms-up refactoring
let nextIndex = isCharacterSemantic | ||
? input.index(after: curIdx) | ||
: input.unicodeScalars.index(after: curIdx) | ||
if isCharacterSemantic && !input[curIdx].hasExactlyOneScalar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do to non-NFC characters?
&& stringRange.contains(scalar.properties.lowercaseMapping)) | ||
|| (scalar.properties.changesWhenUppercased | ||
&& stringRange.contains(scalar.properties.uppercaseMapping)) { | ||
return nextIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this going back to lexicographical contains?
// Ranges can be formed with single-scalar characters, and can only | ||
// match as such. | ||
// FIXME: Convert to canonical composed version before testing? | ||
guard character.hasExactlyOneScalar else { return false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just wrong and breaks canonical equivalence.
firstMatchTest(#"e[\u{301}]$"#, input: eComposed, match: nil) | ||
firstMatchTest(#"e[\u{300}-\u{320}]$"#, input: eComposed, match: nil) | ||
firstMatchTest(#"[e][\u{300}-\u{320}]$"#, input: eComposed, match: nil) | ||
firstMatchTest(#"[a-z][\u{300}-\u{320}]$"#, input: eComposed, match: nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need positive tests for grapheme-semantics under canonical equivalence.
Picking up this work in #570 |
Is this obsoleted by #570? |
As shown by issue #401, the standard lexicographic-ordering-based comparisons for characters yield very unexpected results when matching with ranges in custom character classes. This change resolves that unexpected behavior by only matching single-scalar characters within ranges and only allowing single-scalar characters to be range endpoints.
Multi-scalar characters are still allowed within custom character classes, and meta characters and built-in character classes continue to function in the same way as before. With this change, we have the following behavior: