Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exhaustive tests for RegExp Unicode property escapes #971

Merged
merged 1 commit into from
Apr 13, 2017
Merged

Add exhaustive tests for RegExp Unicode property escapes #971

merged 1 commit into from
Apr 13, 2017

Conversation

mathiasbynens
Copy link
Member

Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes

These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well.

Ref. #950.
Ref. tc39/proposal-regexp-unicode-property-escapes#4.

@anba
Copy link
Contributor

anba commented Apr 12, 2017

Initial observations:

  • There is a bug in the code generation for Noncharacter_Code_Point, it tests 10FFFE..10FFFF are not Noncharacter_Code_Point.
  • Code point ranges must not be consecutive when surrogate ranges are involved, e.g. when testing General_Category=Private_Use, the string "\uDBFF\uDC00" is generated (U+DBFF is the last private-use high surrogate and U+DC00 is the first low surrogate).
  • The buildString function should probably be optimized, otherwise debug builds will spent most of the time assembling the test string. For example when I changed it to this version
const buildString = ({ loneCodePoints, ranges }) => {
    let result = String.fromCodePoint(...loneCodePoints);
    for (const [start, end] of ranges) {
        let codePoints = [];
        for (let i = 0, codePoint = start; codePoint <= end; codePoint++) {
            codePoints[i++] = codePoint;
            if (i === 10000) {
                result += String.fromCodePoint(...codePoints);
                codePoints.length = i = 0;
            }
        }
        result += String.fromCodePoint(...codePoints);
    }
    return result;
};

a v8-debug build created the string for ASCII_Hex_Digit in 750ms compared to 9750ms with the current buildString function.

@mathiasbynens
Copy link
Member Author

@anba Thanks for taking a look!

There is a bug in the code generation for Noncharacter_Code_Point, it tests 10FFFE..10FFFF [which] are not Noncharacter_Code_Point.

Did you mean here? That’s in nonMatchSymbols, i.e. the symbols that are supposed to not match — so that’s fine. Am I overlooking something?

I’ll address your other feedback (excellent points/suggestions!) tomorrow — if you don’t beat me to it with a PR, that is.

}
}
return result;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if buildString is the same among all tests, we should add a new js file in the /harness/ folder and get it through the includes tag. i.e.:

esid: sec-static-semantics-unicodematchproperty-p
features: [regexp-unicode-property-escapes]
includes: [buildString.js]
---*/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@anba
Copy link
Contributor

anba commented Apr 12, 2017

Did you mean here? That’s in nonMatchSymbols, i.e. the symbols that are supposed to not match — so that’s fine. Am I overlooking something?

PropList.txt has:

10FFFE..10FFFF; Noncharacter_Code_Point # Cn [2] <noncharacter-10FFFE>..<noncharacter-10FFFF>

So U+10FFFE and U+10FFFF should match \p{Noncharacter_Code_Point} instead of \P{Noncharacter_Code_Point}, shouldn't it?

@littledan
Copy link
Member

These tests are great, and they seem to test the feature in a pretty thorough way. A couple thoughts:

  • The tests are somewhat slow. These tests took 5m30s to run on my machine for V8, whereas the rest of the test262 tests took 15m39s. That makes these tests not the slowest thing ever, but it could still be nice to cut down on them a little bit (and I wouldn't mind if more thought went into use of the test generation tool to ensure that the matrix of combinations really does add value...).
  • Nit: "matches all proper symbols" is a funny error message. At least, it could read "failed to match certain symbols". The best would be if, in the failure case, you could go through the code points and find the errant one.
  • FWIW V8 passes most (97%) of these if you turn the feature flag on, but fails some of these tests by disagreeing on the set of characters (different Unicode version?), and others because they are missing as property names: Emoji_Component, Prepended_Concatenation_Mark, and Composition_Exclusion. This is based on looking through all the properties in ICU 58; we're upgrading to 59, and I should be able to re-test afterwards. The first of these properties is new, but the second two are not. cc @hashseed @schuay @jungshik

@leobalter
Copy link
Member

leobalter commented Apr 12, 2017

Nit: "matches all proper symbols" is a funny error message. At least, it could read "failed to match certain symbols". The best would be if, in the failure case, you could go through the code points and find the errant one.

I'm accountable for this message. We avoid to use messages as failure messages, thou. They are actually a message of what we expect, a description of the assertion.

While we want to improve "matches all proper symbols", it should be using some better description of each assertion. We also agree it should be a different message for each part so it helps finding the errant point.

mathiasbynens added a commit to mathiasbynens/unicode-property-escapes-tests that referenced this pull request Apr 13, 2017
mathiasbynens added a commit to mathiasbynens/unicode-property-escapes-tests that referenced this pull request Apr 13, 2017
@mathiasbynens
Copy link
Member Author

So U+10FFFE and U+10FFFF should match \p{Noncharacter_Code_Point} instead of \P{Noncharacter_Code_Point}, shouldn't it?

@anba You’re absolutely right! Fixed.

The tests are somewhat slow. These tests took 5m30s to run on my machine for V8, whereas the rest of the test262 tests took 15m39s. That makes these tests not the slowest thing ever, but it could still be nice to cut down on them a little bit (and I wouldn't mind if more thought went into use of the test generation tool to ensure that the matrix of combinations really does add value...).

@littledan I wonder if they’re any faster now that buildString has been improved to use chunking.

We also agree it should be a different message for each part so it helps finding the errant point.

@littledan & @leobalter The string currently isn’t checked in parts — it’s all in one go.

  1. We could test the string in parts, i.e. create a separate string per X code points and test it, but that would also be slower than just checking once and would still not produce code-point-specific error messages.
  2. We could test each code point individually, with code-point-specific error messages, but that would likely be very slow.

I’m tempted to leave it as-is: let’s optimize for the common case, where engines run these tests at build time long after implementing support for this feature. Nicer, more specific error messages only help engineers that are currently implementing the feature, but do so at a performance cost for everyone.

@littledan
Copy link
Member

@mathiasbynens They took 1m32s with the new version on my machine. A huge improvement, but that's still 10% as much time as the rest of test262--I'd say this is a bit borderline, and I'd be interested in input from others.

@littledan
Copy link
Member

For code point specific error messages, just a thought, you could search through the individual code points when one of the bigger strings finds a mismatch.

@hashseed
Copy link

because they are missing as property names: Emoji_Component, Prepended_Concatenation_Mark, and Composition_Exclusion

ICU 59 doesn't seem to include these properties either.

@leobalter
Copy link
Member

I’m tempted to leave it as-is: let’s optimize for the common case, where engines run these tests at build time long after implementing support for this feature.

Yes, I can help with that in the source as a follow up.

Nicer, more specific error messages only help engineers that are currently implementing the feature, but do so at a performance cost for everyone.

I believe we can fix it just by having less generic strings. I'll follow up with a PR in the source as well.

@mathiasbynens
Copy link
Member Author

mathiasbynens/unicode-property-escapes-tests#6

Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes

These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well.

Ref. #950.
Ref. tc39/proposal-regexp-unicode-property-escapes#4.
@leobalter
Copy link
Member

Running on V8:

[~/dev/test262] (mathiasbynens-add-unicode-property-escape-tests)
$ time test262-harness --hostPath=`which d8` --hostType=d8 --hostArgs='--harmony_regexp_property' $(git diff --cached tc39/master --name-only --diff-filter=AM test/) -t 4
FAIL test/built-ins/RegExp/property-escapes/Assigned.js
  `\P{Assigned}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Assigned.js
  `\P{Assigned}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Composition_Exclusion.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Composition_Exclusion}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Composition_Exclusion.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Composition_Exclusion}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji.js
  `\p{Emoji}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji.js
  `\p{Emoji}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Component.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Emoji_Component}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji_Component.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Emoji_Component}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji_Presentation.js
  `\p{Emoji_Presentation}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Presentation.js
  `\p{Emoji_Presentation}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Modifier_Base.js
  `\P{Emoji_Modifier_Base}` should match U+01F93C (`🤼`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Modifier_Base.js
  `\P{Emoji_Modifier_Base}` should match U+01F93C (`🤼`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Other.js
  `\P{C}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Other.js
  `\P{C}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Surrogate.js
  `\P{Cs}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Surrogate.js
  `\P{Cs}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Prepended_Concatenation_Mark.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Prepended_Concatenation_Mark}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Prepended_Concatenation_Mark.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Prepended_Concatenation_Mark}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Script_Extensions_-_Inherited.js
  `\p{scx=Qaai}` should match U+00309A (`゚`)

FAIL test/built-ins/RegExp/property-escapes/Script_Extensions_-_Inherited.js
  `\p{scx=Qaai}` should match U+00309A (`゚`)

Ran 732 tests
712 passed
20 failed

real	1m4.783s
user	2m30.454s
sys	0m53.827s

I'm compiling SpiderMonkey to check these tests as well and the running time.

@leobalter leobalter merged commit 44b40e0 into tc39:master Apr 13, 2017
@anba
Copy link
Contributor

anba commented Apr 13, 2017

Assigned.js, General_Category_-_Other.js, and General_Category_-_Surrogate.js:
Has [0x00DC00, 0x00DFFF] in matchSymbols and nonMatchSymbols.

@anba
Copy link
Contributor

anba commented Apr 13, 2017

I'm compiling SpiderMonkey to check these tests as well and the running time.

SpiderMonkey doesn't yet support Unicode property escapes.

@leobalter
Copy link
Member

SpiderMonkey doesn't yet support Unicode property escapes.

Thanks for letting me know. I needed to build a new version of SM anyway. :)

@mathiasbynens
Copy link
Member Author

Looking into the unexpected failures in V8. (This is a problem with the tests — not V8’s implementation!)

@mathiasbynens mathiasbynens deleted the add-unicode-property-escape-tests branch April 13, 2017 15:50
@mathiasbynens
Copy link
Member Author

Fix for those three broken tests: #974

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants