[Question] Format for Unicode property escape test suite? #950
As part of my work on the Unicode property escapes in regular expressions proposal, I’m planning to contribute tests for this feature. (Ref. tc39/proposal-regexp-unicode-property-escapes#4.)

My plan was to create a script (hosted in its own, separate repo) that generates the tests based on a given Unicode version, so they could easily be updated and upstreamed to test262. (Note that the Unicode standard is updated every year, so the tests need yearly updates as well.) I would maintain this script and submit PRs with the updated tests to test262 every year.

Now, because these tests will be generated programmatically, they can be as exhaustive as we want them to be. We can do better than randomly testing a couple of properties and symbols, which doesn’t guarantee 100% spec compliance. Ideally, every Unicode symbol would be tested against every supported property escape. This would ensure 100% interoperability (assuming all major engines run test262 tests). The downside is that these tests could take a long time to run. @syg, @tcare: thoughts?

Here’s an exhaustive list of the properties supported per the current version of the proposal: https://github.com/mathiasbynens/regexpu-core/blob/master/property-escapes.md

My questions:

- Should there be a separate *.js file for each individual property escape?
- Should the tests cover the code points beyond 0xFFFF (i.e. every Unicode plane)?

Any feedback is welcome.

cc @leobalter @jugglinmike @littledan
@dilijev may have some useful input here.
In general I like to have exhaustive tests available, but anything that runs for a significant amount of time should definitely be optional. My main concern for the file format would be striking a balance between parallelizability and not having too many files. Do we have an estimate of how long the suite would take to run under each of the organization options you listed? As for whether this belongs in 262, I lean towards yes, but it would definitely have to be optional. I don't imagine people would want to run this too often. By that logic, though, this could be a separate repo or optional submodule. Would also appreciate @bterlson's thoughts.
I agree it should be part of test262, yes. We can adopt strategies to allow filtering these tests if they take too long to run, for example by tagging them in the test metadata, as sketched below.
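For illustration, a generated test could be tagged in its test262 frontmatter so that runners can include or skip it. A minimal sketch follows; the `features` key is existing test262 metadata, while a dedicated `slow` flag would be a new, hypothetical convention:

```js
/*---
description: >
  \p{Bidi_Control} matches all expected code points
features: [regexp-unicode-property-escapes]
# `slow` is a hypothetical marker for long-running tests; it is not
# part of the current test262 frontmatter vocabulary.
flags: [slow]
---*/
```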
I'm pretty bad at using submodules and would avoid them if possible; the same goes for a separate repo. On generating the tests: I need to take a closer look at your current plan, @mathiasbynens, but maybe we could use the current test generation tool with templates and cases in the …
I wouldn't mind having a flag just for long-running tests either.
As I mentioned above, templates/cases are still useful for parsing the same Unicode data in different places. For each individual property escape we can identify assertions with respective messages, like this raw example:

```js
assert.sameValue(actual, expected, '<actual> should parse to <expected>');
```
A few thoughts:
How many tests, and how long a runtime, are we talking about here? While SpiderMonkey has a "slow" flag, in practice this means the tests marked slow are very rarely, if ever, run. The default on CI and in the local test harness skips slow tests. Nobody I know goes out of their way to run slow tests when running the whole suite.
@syg can you give me a link to where I can find the tests marked as slow? It would be a nice base for planning a shared slow flag in test262.
@leobalter Oh, it's quite unsophisticated. A test is marked as slow with a comment at the top of the file, for example at the top of this test. Then the harness skips it unless specifically given …
I somewhat like the idea of having a fully exhaustive test which does not run by default and is explicitly opt-in; however, I have some concerns. Since it's only one test (effectively testing only one feature / section of the spec), a long-running test would eat up a large proportion of CPU time compared to the representative chunk of the standard under test. Therefore, I like @littledan's approach of having a smaller test with a representative set of points which belong and do not belong to each set, and making that part of the suite by default, with a more exhaustive test belonging to another repo or somehow off by default with an explicit opt-in. This explicit opt-in means that someone needs to know about the test and has to indicate to the test runner that it should be run, knowing full well the time requirement (i.e. if I run …).

I had a similar idea for testing our case-insensitive match table in ChakraCore. With my recent update of that table, I generated a test of all the specifically expected (or known by the generator to be potentially problematic) positive and negative matches for case-insensitive matching. I also manually created a smaller test covering the problematic areas and representative regions from the larger test. These tests do not include all expected negative matches (which would be truly exhaustive) and therefore do not check for false positive matches. The approach of "what the generator expected is what the engine did (and that's all we tested)" also doesn't let us check whether other engines return true in more cases than we do.
To solve this problem, my plan was to write a test which would take a similarly generated table (but in JavaScript this time) and exhaustively test all 0x110000 code points against every other code point ((2^20 + 2^16)^2 tests). Knowing the implementation, the likelihood of this turning up something unexpected is very small, but not zero, so it's a marginally valuable test. However, running such a test would take approximately* 18.5 process-days.

* I took a few wall-clock-runtime data points for doing this many regex matches on my machine and extrapolated based on the number of alternatives, so this won't be super accurate, but it gives a good ballpark.

Here's the test I used to get the data. It doesn't include a check against an expected result.

```js
const UNICODE_MAX = 0x10FFFF; // 21 bits (actually 20 bits + 16 bits)
const START = 0;
const END = 0x2000; // vary this number to collect some data points
// const END = UNICODE_MAX;

function compareCodepoints(a, b) {
  // Use a \u{...} escape so that every code point forms a valid `u`-flag pattern.
  const re = new RegExp('\\u{' + a.toString(16) + '}', 'iu');
  const toMatch = String.fromCodePoint(b);
  // The result is discarded; this script only measures match throughput.
  const result = re.test(toMatch);
}

for (let a = START; a <= END; ++a) {
  for (let b = START; b <= END; ++b) {
    compareCodepoints(a, b);
  }
}
```

The question is whether such a test would be worth the run time. The conclusion @tcare and I have reached is that it would be better to run this like a fuzzer (a sketch follows below): choose random code points and run for a fixed amount of time every day, to catch issues early without a lot of runtime overhead, and maybe also run the whole exhaustive test once per release or on a similarly long cadence. It would also be useful (probably as a manual exercise) to check the behavior for the same tests against V8 and SpiderMonkey to ensure we match behavior.
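A minimal sketch of that fuzzer-style approach, assuming a fixed time budget and uniformly random code points; `expectedCaseInsensitiveMatch` stands in for a lookup into the generated table and is hypothetical:

```js
const UNICODE_MAX = 0x10FFFF;
const TIME_BUDGET_MS = 60 * 1000; // fixed daily budget instead of 18.5 process-days

const randomCodePoint = () => Math.floor(Math.random() * (UNICODE_MAX + 1));

const start = Date.now();
while (Date.now() - start < TIME_BUDGET_MS) {
  const a = randomCodePoint();
  const b = randomCodePoint();
  // \u{...} escapes keep every code point valid in a `u`-flag pattern.
  const re = new RegExp('\\u{' + a.toString(16) + '}', 'iu');
  const actual = re.test(String.fromCodePoint(b));
  const expected = expectedCaseInsensitiveMatch(a, b); // hypothetical table lookup
  if (actual !== expected) {
    throw new Error(
      `Case-insensitive mismatch: U+${a.toString(16)} vs U+${b.toString(16)}`
    );
  }
}
```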
@syg @littledan @mathiasbynens @tcare Obviously testing all 0x110000 code points against all others is way more than is being asked for here. My experiment shows that (for Chakra at least) the runtime we'd expect for testing all code points against a single flag would be in the realm of 1.5-2 seconds (assuming performance on par with checking a single character). Anything under 300 seconds is not an insanely long test, but I still wouldn't want to run it by accident.
@mathiasbynens Based on the above, if these tests were split apart, I think I'd prefer this option. In that case (a few seconds per test), perhaps marking them as slow would not even be necessary.
@dilijev Thanks for bringing up case-insensitive matching; it seems like a great idea to use a similar approach there. Another couple of cases which could similarly benefit are toUpperCase/toLowerCase and the definition of ID_Start/ID_Continue. (I'd leave out locale-dependent functionality, though.)
@littledan That’s a good point. I’d be willing to contribute …
For some of these tests it simply depends on how the test is performed. For example, iterating over all individual code points could be too slow, so instead we should test multiple code points at once (a sketch follows below this comment). In https://github.com/anba/es6draft/tree/master/src/test/scripts/suite/syntax/identifier, I also have tests for …
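A sketch of that batching idea, using an arbitrary property: concatenating all code points of a set into a single string means one regex call can cover the whole set, instead of one call per code point:

```js
// A handful of sample Bidi_Control code points, batched into one string.
const codePoints = [0x061C, 0x200E, 0x200F, 0x202A, 0x2066];
const allAtOnce = String.fromCodePoint(...codePoints);

// One `test()` call verifies every code point in the batch at once.
if (!/^\p{Bidi_Control}+$/u.test(allAtOnce)) {
  throw new Error('Some expected code point failed to match \\p{Bidi_Control}');
}
```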
Quick WIP test generator script that demonstrates what I had in mind originally: https://github.com/mathiasbynens/unicode-property-escapes-tests. The generated tests check every possible …
I think that a quick test like the one above would be a good default. However, I think we should still have a slow variant of the test which detects false negatives, to really call it truly covered.
With the current approach, each generated test file is ~13 MB, resulting in 4.5 GB of output. Instead of hardcoding the generated strings, we could output a list of character ranges to test. That would result in smaller generated files, possibly at the cost of run-time performance.
Looks like these tests check all code points. Even if it's grouped together into one string, I think that's still too big and slow to include in test262 to run all the time. What if you made a mode of those tests which took a smaller number (3? 100?) of random elements from the included set, and three from the excluded set, to generate a much shorter string?
I’ve given that a try. The script now generates test files containing something like the following, instead of the long hardcoded strings:

```js
const buildString = ({ loneCodePoints, ranges }) => {
  let result = String.fromCodePoint(...loneCodePoints);
  for (const [start, end] of ranges) {
    for (let codePoint = start; codePoint <= end; codePoint++) {
      result += String.fromCodePoint(codePoint);
    }
  }
  return result;
};

// `\p{Bidi_Control}`
const matchSymbols = buildString({
  loneCodePoints: [
    0x00061C
  ],
  ranges: [
    [0x00200E, 0x00200F],
    [0x00202A, 0x00202E],
    [0x002066, 0x002069]
  ]
});
```

This results in much smaller test files: the entire suite now takes up only 1.3 MB. Much better than the 4.5 GB we had previously, at the cost of test readability.
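A generated file can then verify the set in both directions with just two regex calls. A sketch of what the assertions might look like, assuming a `nonMatchSymbols` string built the same way from the complement ranges:

```js
// Every expected symbol must match...
if (!/^\p{Bidi_Control}+$/u.test(matchSymbols)) {
  throw new Error('\\p{Bidi_Control} failed to match an expected symbol');
}
// ...and no symbol outside the set may match (guards against false positives).
if (/\p{Bidi_Control}/u.test(nonMatchSymbols)) {
  throw new Error('\\p{Bidi_Control} matched a symbol outside the set');
}
```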
Before I do that, I’d like to see how much of a problem the current output (with fully exhaustive tests) is. I’d prefer the test generation script to have deterministic output for any given Unicode version. The plan is to contribute an additional (not auto-generated) test file to test262 which performs a few random checks, similar to what you mentioned. This file would also test things that the generated tests do not check for, such as invalid property names, combining multiple escapes in a character class, etc. (Similar to the tests that V8 has right now.)
Possibly silly idea: you could make things deterministic without being manual by using a simple pseudo-random number generator with a constant seed, as sketched below.
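A sketch of that idea using mulberry32, a tiny seeded PRNG: with a constant seed the "random" sample is identical on every run, so the generated output stays deterministic:

```js
// mulberry32: small deterministic PRNG; the same seed always yields the same sequence.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rng = mulberry32(0xC0FFEE); // constant seed => reproducible output

// Pick `n` pseudo-random elements from a set of code points.
const sample = (codePoints, n) =>
  Array.from({ length: n }, () => codePoints[Math.floor(rng() * codePoints.length)]);
```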
Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes

These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well.

Ref. #950. Ref. tc39/proposal-regexp-unicode-property-escapes#4.
Definitely a cost to readability. It's not possible to know which checks are run by reading the source; you only find out by actually running the test.
@dilijev The PRNG could be run as part of the build script that generates the test; in that case, the generated test source would be readable. Anyhow, the current test output is exhaustive and surprisingly compact and quick to execute, so hopefully we won’t need to worry about any of that.
Do other people who run the tests regularly have a point of view on these tests' runtime, which increases the total runtime of test262 by about 10%? cc @syg
10% is a fair chunk of time. I believe it took me about an hour to run the entire suite previously. Usually I run a set of tests targeted at whatever area I'm investigating. *shrug*
gh-971 reflects this approach, and it's been further refined since it was merged back in April of 2017. For anyone reading this discussion to understand the practical considerations behind this decision: you may be interested in gh-3199, which has a bit of discussion on the peer review process. In any case, it seems to me like this issue is resolved!