Add exhaustive tests for RegExp Unicode property escapes #971

mathiasbynens · 2017-04-12T14:45:57Z

Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes

These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well.

Ref. #950.
Ref. tc39/proposal-regexp-unicode-property-escapes#4.

anba · 2017-04-12T17:37:28Z

Initial observations:

There is a bug in the code generation for Noncharacter_Code_Point, it tests 10FFFE..10FFFF are not Noncharacter_Code_Point.
Code point ranges must not be consecutive when surrogate ranges are involved, e.g. when testing General_Category=Private_Use, the string "\uDBFF\uDC00" is generated (U+DBFF is the last private-use high surrogate and U+DC00 is the first low surrogate).
The buildString function should probably be optimized, otherwise debug builds will spent most of the time assembling the test string. For example when I changed it to this version

const buildString = ({ loneCodePoints, ranges }) => {
    let result = String.fromCodePoint(...loneCodePoints);
    for (const [start, end] of ranges) {
        let codePoints = [];
        for (let i = 0, codePoint = start; codePoint <= end; codePoint++) {
            codePoints[i++] = codePoint;
            if (i === 10000) {
                result += String.fromCodePoint(...codePoints);
                codePoints.length = i = 0;
            }
        }
        result += String.fromCodePoint(...codePoints);
    }
    return result;
};

a v8-debug build created the string for ASCII_Hex_Digit in 750ms compared to 9750ms with the current buildString function.

mathiasbynens · 2017-04-12T17:51:19Z

@anba Thanks for taking a look!

There is a bug in the code generation for Noncharacter_Code_Point, it tests 10FFFE..10FFFF [which] are not Noncharacter_Code_Point.

Did you mean here? That’s in nonMatchSymbols, i.e. the symbols that are supposed to not match — so that’s fine. Am I overlooking something?

I’ll address your other feedback (excellent points/suggestions!) tomorrow — if you don’t beat me to it with a PR, that is.

leobalter · 2017-04-12T17:56:52Z

test/built-ins/RegExp/property-escapes/ASCII.js

+		}
+	}
+	return result;
+};


if buildString is the same among all tests, we should add a new js file in the /harness/ folder and get it through the includes tag. i.e.:

esid: sec-static-semantics-unicodematchproperty-p features: [regexp-unicode-property-escapes] includes: [buildString.js] ---*/

anba · 2017-04-12T18:12:21Z

Did you mean here? That’s in nonMatchSymbols, i.e. the symbols that are supposed to not match — so that’s fine. Am I overlooking something?

PropList.txt has:

10FFFE..10FFFF; Noncharacter_Code_Point # Cn [2] <noncharacter-10FFFE>..<noncharacter-10FFFF>

So U+10FFFE and U+10FFFF should match \p{Noncharacter_Code_Point} instead of \P{Noncharacter_Code_Point}, shouldn't it?

littledan · 2017-04-12T21:55:10Z

These tests are great, and they seem to test the feature in a pretty thorough way. A couple thoughts:

The tests are somewhat slow. These tests took 5m30s to run on my machine for V8, whereas the rest of the test262 tests took 15m39s. That makes these tests not the slowest thing ever, but it could still be nice to cut down on them a little bit (and I wouldn't mind if more thought went into use of the test generation tool to ensure that the matrix of combinations really does add value...).
Nit: "matches all proper symbols" is a funny error message. At least, it could read "failed to match certain symbols". The best would be if, in the failure case, you could go through the code points and find the errant one.
FWIW V8 passes most (97%) of these if you turn the feature flag on, but fails some of these tests by disagreeing on the set of characters (different Unicode version?), and others because they are missing as property names: Emoji_Component, Prepended_Concatenation_Mark, and Composition_Exclusion. This is based on looking through all the properties in ICU 58; we're upgrading to 59, and I should be able to re-test afterwards. The first of these properties is new, but the second two are not. cc @hashseed @schuay @jungshik

leobalter · 2017-04-12T22:50:34Z

Nit: "matches all proper symbols" is a funny error message. At least, it could read "failed to match certain symbols". The best would be if, in the failure case, you could go through the code points and find the errant one.

I'm accountable for this message. We avoid to use messages as failure messages, thou. They are actually a message of what we expect, a description of the assertion.

While we want to improve "matches all proper symbols", it should be using some better description of each assertion. We also agree it should be a different message for each part so it helps finding the errant point.

@anba

Thanks to @anba for the suggestion! tc39/test262#971 (comment)

@anba

Thanks to @anba for spotting this! tc39/test262#971 (comment)

mathiasbynens · 2017-04-13T10:00:59Z

So U+10FFFE and U+10FFFF should match \p{Noncharacter_Code_Point} instead of \P{Noncharacter_Code_Point}, shouldn't it?

@anba You’re absolutely right! Fixed.

The tests are somewhat slow. These tests took 5m30s to run on my machine for V8, whereas the rest of the test262 tests took 15m39s. That makes these tests not the slowest thing ever, but it could still be nice to cut down on them a little bit (and I wouldn't mind if more thought went into use of the test generation tool to ensure that the matrix of combinations really does add value...).

@littledan I wonder if they’re any faster now that buildString has been improved to use chunking.

We also agree it should be a different message for each part so it helps finding the errant point.

@littledan & @leobalter The string currently isn’t checked in parts — it’s all in one go.

We could test the string in parts, i.e. create a separate string per X code points and test it, but that would also be slower than just checking once and would still not produce code-point-specific error messages.
We could test each code point individually, with code-point-specific error messages, but that would likely be very slow.

I’m tempted to leave it as-is: let’s optimize for the common case, where engines run these tests at build time long after implementing support for this feature. Nicer, more specific error messages only help engineers that are currently implementing the feature, but do so at a performance cost for everyone.

littledan · 2017-04-13T10:39:52Z

@mathiasbynens They took 1m32s with the new version on my machine. A huge improvement, but that's still 10% as much time as the rest of test262--I'd say this is a bit borderline, and I'd be interested in input from others.

littledan · 2017-04-13T10:40:41Z

For code point specific error messages, just a thought, you could search through the individual code points when one of the bigger strings finds a mismatch.

hashseed · 2017-04-13T12:38:08Z

because they are missing as property names: Emoji_Component, Prepended_Concatenation_Mark, and Composition_Exclusion

ICU 59 doesn't seem to include these properties either.

leobalter · 2017-04-13T14:44:14Z

I’m tempted to leave it as-is: let’s optimize for the common case, where engines run these tests at build time long after implementing support for this feature.

Yes, I can help with that in the source as a follow up.

Nicer, more specific error messages only help engineers that are currently implementing the feature, but do so at a performance cost for everyone.

I believe we can fix it just by having less generic strings. I'll follow up with a PR in the source as well.

mathiasbynens · 2017-04-13T14:49:01Z

mathiasbynens/unicode-property-escapes-tests#6

Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well. Ref. #950. Ref. tc39/proposal-regexp-unicode-property-escapes#4.

leobalter · 2017-04-13T15:22:06Z

Running on V8:

[~/dev/test262] (mathiasbynens-add-unicode-property-escape-tests)
$ time test262-harness --hostPath=`which d8` --hostType=d8 --hostArgs='--harmony_regexp_property' $(git diff --cached tc39/master --name-only --diff-filter=AM test/) -t 4
FAIL test/built-ins/RegExp/property-escapes/Assigned.js
  `\P{Assigned}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Assigned.js
  `\P{Assigned}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Composition_Exclusion.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Composition_Exclusion}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Composition_Exclusion.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Composition_Exclusion}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji.js
  `\p{Emoji}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji.js
  `\p{Emoji}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Component.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Emoji_Component}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji_Component.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Emoji_Component}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Emoji_Presentation.js
  `\p{Emoji_Presentation}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Presentation.js
  `\p{Emoji_Presentation}` should match U+01F9E6 (`🧦`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Modifier_Base.js
  `\P{Emoji_Modifier_Base}` should match U+01F93C (`🤼`)

FAIL test/built-ins/RegExp/property-escapes/Emoji_Modifier_Base.js
  `\P{Emoji_Modifier_Base}` should match U+01F93C (`🤼`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Other.js
  `\P{C}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Other.js
  `\P{C}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Surrogate.js
  `\P{Cs}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/General_Category_-_Surrogate.js
  `\P{Cs}` should match U+00DFFF (`���`)

FAIL test/built-ins/RegExp/property-escapes/Prepended_Concatenation_Mark.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Prepended_Concatenation_Mark}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Prepended_Concatenation_Mark.js
  Expected no error, got SyntaxError: Invalid regular expression: /^\p{Prepended_Concatenation_Mark}+$/: Invalid property name

FAIL test/built-ins/RegExp/property-escapes/Script_Extensions_-_Inherited.js
  `\p{scx=Qaai}` should match U+00309A (`゚`)

FAIL test/built-ins/RegExp/property-escapes/Script_Extensions_-_Inherited.js
  `\p{scx=Qaai}` should match U+00309A (`゚`)

Ran 732 tests
712 passed
20 failed

real	1m4.783s
user	2m30.454s
sys	0m53.827s

I'm compiling SpiderMonkey to check these tests as well and the running time.

anba · 2017-04-13T15:27:35Z

Assigned.js, General_Category_-_Other.js, and General_Category_-_Surrogate.js:
Has [0x00DC00, 0x00DFFF] in matchSymbols and nonMatchSymbols.

anba · 2017-04-13T15:30:28Z

I'm compiling SpiderMonkey to check these tests as well and the running time.

SpiderMonkey doesn't yet support Unicode property escapes.

leobalter · 2017-04-13T15:32:11Z

SpiderMonkey doesn't yet support Unicode property escapes.

Thanks for letting me know. I needed to build a new version of SM anyway. :)

mathiasbynens · 2017-04-13T15:42:04Z

Looking into the unexpected failures in V8. (This is a problem with the tests — not V8’s implementation!)

mathiasbynens · 2017-04-13T16:01:38Z

Fix for those three broken tests: #974

mathiasbynens mentioned this pull request Apr 12, 2017

Advance to stage 4 tc39/proposal-regexp-unicode-property-escapes#4

Closed

leobalter suggested changes Apr 12, 2017

View reviewed changes

leobalter added the coverage label Apr 12, 2017

mathiasbynens added a commit to mathiasbynens/unicode-property-escapes-tests that referenced this pull request Apr 13, 2017

Use chunking in buildString

54bc320

Thanks to @anba for the suggestion! tc39/test262#971 (comment)

mathiasbynens added a commit to mathiasbynens/unicode-property-escapes-tests that referenced this pull request Apr 13, 2017

Add missing Noncharacter_Code_Point code points

c173191

Thanks to @anba for spotting this! tc39/test262#971 (comment)

mathiasbynens mentioned this pull request Apr 13, 2017

Ensure lone high surrogates aren’t followed by lone low surrogates mathiasbynens/unicode-property-escapes-tests#5

Closed

leobalter approved these changes Apr 13, 2017

View reviewed changes

leobalter merged commit 44b40e0 into tc39:master Apr 13, 2017

mathiasbynens deleted the add-unicode-property-escape-tests branch April 13, 2017 15:50

jugglinmike mentioned this pull request Sep 30, 2021

[Question] Format for Unicode property escape test suite? #950

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add exhaustive tests for RegExp Unicode property escapes #971

Add exhaustive tests for RegExp Unicode property escapes #971

mathiasbynens commented Apr 12, 2017

anba commented Apr 12, 2017

mathiasbynens commented Apr 12, 2017

leobalter Apr 12, 2017

mathiasbynens Apr 13, 2017

anba commented Apr 12, 2017

littledan commented Apr 12, 2017

leobalter commented Apr 12, 2017 •

edited

Loading

mathiasbynens commented Apr 13, 2017

littledan commented Apr 13, 2017

littledan commented Apr 13, 2017

hashseed commented Apr 13, 2017

leobalter commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

leobalter commented Apr 13, 2017

anba commented Apr 13, 2017

anba commented Apr 13, 2017

leobalter commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

Add exhaustive tests for RegExp Unicode property escapes #971

Add exhaustive tests for RegExp Unicode property escapes #971

Conversation

mathiasbynens commented Apr 12, 2017

anba commented Apr 12, 2017

mathiasbynens commented Apr 12, 2017

leobalter Apr 12, 2017

Choose a reason for hiding this comment

mathiasbynens Apr 13, 2017

Choose a reason for hiding this comment

anba commented Apr 12, 2017

littledan commented Apr 12, 2017

leobalter commented Apr 12, 2017 • edited Loading

mathiasbynens commented Apr 13, 2017

littledan commented Apr 13, 2017

littledan commented Apr 13, 2017

hashseed commented Apr 13, 2017

leobalter commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

leobalter commented Apr 13, 2017

anba commented Apr 13, 2017

anba commented Apr 13, 2017

leobalter commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

mathiasbynens commented Apr 13, 2017

leobalter commented Apr 12, 2017 •

edited

Loading