
[Question] Format for Unicode property escape test suite? #950

Closed
mathiasbynens opened this issue Apr 6, 2017 · 27 comments

@mathiasbynens
Member

mathiasbynens commented Apr 6, 2017

As part of my work on the Unicode property escapes in regular expressions proposal, I’m planning to contribute tests for this feature. (Ref. tc39/proposal-regexp-unicode-property-escapes#4.)

My plan was to create a script (hosted in its own, separate repo) that generates the tests based on a given Unicode version, so they could easily be updated + upstreamed to test262. (Note that the Unicode standard is updated every year, and thus, so should the tests.) I would be maintaining this script and submitting PRs with the updated tests to test262 every year.

Now, because these tests will be generated programmatically, they can be as exhaustive as we want them to be. We can do better than randomly testing a couple of properties and symbols, which doesn’t guarantee 100% spec compliance. Ideally, every Unicode symbol would be tested against every supported property escape. This would ensure 100% interoperability (assuming all major engines run test262 tests). The downside is that these tests could potentially take a long time to run. @syg, @tcare: thoughts?

Here’s an exhaustive list of the supported properties per this proposal (as per the current version): https://github.com/mathiasbynens/regexpu-core/blob/master/property-escapes.md

My questions:

  1. Would such a test suite make sense as part of test262?
  2. If so, what would the directory/file structure look like? Some options:
    • A separate *.js file for each individual property escape?
    • A separate file for every 0xFFFF code points (i.e. every Unicode plane)?
    • Is there a better idea?

Any feedback is welcome.

cc @leobalter @jugglinmike @littledan

@tcare
Member

tcare commented Apr 6, 2017

@dilijev may have some useful input here.

@tcare
Member

tcare commented Apr 6, 2017

In general I like to have exhaustive tests available, but anything that runs for a significant amount of time should definitely be optional.

My main concern for file format is striking a balance between parallelizability and not having too many files. Do we have an estimate of how long each of the two organization options you listed would take to run?

As for whether this belongs in test262, I lean towards yes, but it would definitely have to be optional. I don't imagine people would want to run this too often. By that logic, though, this could also be a separate repo or an optional submodule.

Would also appreciate @bterlson's thoughts.

@leobalter
Member

I agree it should be part of test262, yes.

We can adopt strategies for filtering these tests if they take too long to run, such as:

  1. My favorite: add a features: [unicode] flag in the frontmatter (sketched below).
  2. Use a specific folder for these tests.
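
For illustration, here is a rough sketch of what such frontmatter could look like in a generated test file. test262 frontmatter is YAML inside a /*--- ---*/ comment; the description below is a placeholder, and the final feature name might end up different from unicode:

/*---
description: >
  Unicode property escape \p{Bidi_Control} matches all Bidi_Control symbols
features: [unicode]
---*/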

I'm not great with submodules and would avoid them if possible; the same goes for a separate repo.


On generating the tests: I need to take a closer look at your current plan, @mathiasbynens, but maybe we could use the current test generation tool with templates and cases in the src/ folder here. This would also allow us to assert values for many cases where we expect Unicode to be parsed properly, e.g. strings, template literals, regexps, etc.

@tcare
Member

tcare commented Apr 6, 2017

I wouldn't mind having a flag just for long running tests either.

@leobalter
Member

If so, what would the directory/file structure look like? Some options:

  • A separate *.js file for each individual property escape?
  • A separate file for every 0xFFFF code points (i.e. every Unicode plane)?

As I mentioned above, templates/cases are still useful for parsing the same Unicode in different places. For each individual property escape we can write assertions with descriptive messages, like this raw example:

assert.sameValue(actual, expected, '<actual> should parse to <expected>');
assert.sameValue(actual, expected, '<actual> should parse to <expected>');
assert.sameValue(actual, expected, '<actual> should parse to <expected>');

@littledan
Member

A few thoughts:

  • It sounds like a good idea to check in some tests from this process to test262, both the output and, if you want, the code that generates the tests. There's other test generation code already checked in. I bet you'd be using significantly different logic to generate the tests here compared to the existing test generation code.
  • I don't think flags or a subdirectory are all we'd have in an optimal solution, though I do like the idea of a slow flag. Ideally, there should be a decent way to run tests that gives a good sense of correctness relatively rapidly, something users can afford to run interactively and in continuous integration. For example, this could be, for each property escape, three arbitrarily chosen code points in the set and three outside of it. This could be one of multiple modes for the test generator.
  • File per plane doesn't make that much sense to me, since there's a lot more going on in the BMP than in the other planes, and there aren't that many planes. I like the idea of a file per property escape. Within the files, if you do decide to check exhaustively, hopefully you don't generate one huge file, and instead iterate through data structures that hold the output.

@syg
Contributor

syg commented Apr 6, 2017

How many tests and how long a time are we talking about here?

While SpiderMonkey has a "slow" flag, in practice this means the tests marked slow are very rarely, if ever, run. The default on CI and the local test harness skips slow tests. Nobody I know really goes out of their way to run slow tests when running the whole suite.

@leobalter
Member

@syg Can you give me a link to where the tests marked as slow can be found?

It would be a nice basis for planning a shared slow flag in test262.

@syg
Contributor

syg commented Apr 6, 2017

@leobalter Oh, it's quite unsophisticated. A test is marked as slow with a comment at the top of the file, for example at the top of this test. Then the harness skips all of them unless it is specifically given --run-slow-tests.

@dilijev
Contributor

dilijev commented Apr 6, 2017

I somewhat like the idea of having a fully-exhaustive test which will not run by default and will be explicitly opt-in; however, I have some concerns.

Since it's only one test (effectively covering only one feature / section of the spec), a long-running test would eat up a disproportionate share of CPU time relative to the chunk of the standard it actually exercises.

Therefore, I like @littledan's approach of having a smaller test with a representative set of code points that belong and do not belong to each set, making that part of the suite by default, and putting a more exhaustive test in another repo or making it off by default with an explicit opt-in.

To me, this explicit opt-in means that someone needs to know about the test and has to indicate to the test runner that it should be run, knowing full well the time requirement (i.e. if I run test262-harness test262\test\**\*.js, this test should not run unless I also do something like pass a -includeVerySlowTests flag).

I had a similar idea for testing our case-insensitive match table in ChakraCore. With my recent update of that table, I generated a test of all of the specifically-expected (or known by the generator to be potentially problematic) positive and negative matches for case-insensitive matching. I also manually created a smaller test covering the problematic areas and representative regions from the larger test. These tests do not include all expected negative matches (which would be truly exhaustive) and therefore do not check for false positive matches. The approach of "what the generator expected is what the engine did (and that's all we tested)" also doesn't let us check whether other engines return true in more cases than we do.

To solve this problem, my plan was to write a test which would take a similarly-generated table (but in JavaScript this time) and exhaustively test all 0x110000 code points against every other code point ((2^20 + 2^16)^2 tests). Knowing the implementation, the likelihood of this turning up something unexpected is very small, but not zero, so it's a marginally valuable test.

However, running such a test would take approximately* 18.5 process-days.

* I took a few wall-clock runtime data points for performing this many regex matches on my machine and extrapolated based on the number of alternatives, so this won't be super accurate, but it gives a good ballpark. Here's the test I used to collect the data (it doesn't include a check against an expected result):

const UNICODE_MAX = 0x10FFFF; // 21 bits (actually 20 bits + 16 bits)
const START = 0;
const END = 0x2000; // vary this number to collect some data points
// const END = UNICODE_MAX;

function compareCodepoints(a, b) {
    // Build a case-insensitive Unicode regex for code point `a` and test it
    // against code point `b`. A \u{...} escape keeps the pattern valid for
    // every code point under the `u` flag.
    const re = new RegExp("\\u{" + a.toString(16) + "}", 'iu');
    const toMatch = String.fromCodePoint(b);
    return re.test(toMatch);
}

for (let a = START; a <= END; ++a) {
    for (let b = START; b <= END; ++b) {
        compareCodepoints(a, b);
    }
}

The question is whether such a test would be worth the run time. The conclusion @tcare and I have reached is that it would be better to run this like a fuzzer: choose random code points and run for a fixed amount of time every day, to catch issues early without a lot of runtime overhead. We could also run the whole exhaustive test once per release, or on a similarly long cadence. It would also be useful (probably as a manual exercise) to run the same tests against V8 and SpiderMonkey to make sure we match their behavior.
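
A rough sketch of that fuzzer-style approach; expectedCaseInsensitiveMatch below is a hypothetical lookup into a generated expectation table, and the time budget is an arbitrary placeholder:

const MAX_CODE_POINT = 0x10FFFF;
const TIME_BUDGET_MS = 60 * 1000; // hypothetical fixed daily time budget
const deadline = Date.now() + TIME_BUDGET_MS;

while (Date.now() < deadline) {
    // Pick two random code points and compare them under the `iu` flags.
    const a = Math.floor(Math.random() * (MAX_CODE_POINT + 1));
    const b = Math.floor(Math.random() * (MAX_CODE_POINT + 1));
    const re = new RegExp("\\u{" + a.toString(16) + "}", "iu");
    const actual = re.test(String.fromCodePoint(b));
    const expected = expectedCaseInsensitiveMatch(a, b); // hypothetical generated table lookup
    if (actual !== expected) {
        throw new Error("Mismatch for U+" + a.toString(16) + " vs. U+" + b.toString(16));
    }
}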

@dilijev
Contributor

dilijev commented Apr 6, 2017

@syg @littledan @mathiasbynens @tcare Obviously, testing all 0x110000 code points against all others is way more than is being asked for here. My experiment suggests that (for Chakra at least) the runtime for testing all code points against a single property escape would be in the realm of 1.5-2 seconds (assuming performance on par with checking a single character under /iu). To allow for this type of match being slower, let's call it 3 seconds. I don't know how many different possibilities for \p{...} there are, but let's call it 100.

300 seconds is not an insanely long test, but I still wouldn't want to run it by accident.

@dilijev
Contributor

dilijev commented Apr 6, 2017

@mathiasbynens Based on the above, if these tests were split apart, I think I'd prefer this option:

A separate *.js file for each individual property escape

In that case (a few seconds per test), perhaps marking them as slow would not even be necessary.

@littledan
Member

@dilijev Thanks for bringing up case-insensitive matching; it seems like a great idea to use a similar approach there. A couple of other cases that could similarly benefit are toUpperCase/toLowerCase and the definition of ID_Start/ID_Continue. (I'd leave out locale-dependent functionality, though.)

@mathiasbynens
Member Author

@littledan That’s a good point. I’d be willing to contribute Identifier / IdentifierStart / IdentifierPart tests as well. My old tests are here: https://mathias.html5.org/tests/javascript/identifiers/ They’re definitely similar in that the tests are large in size and slow to run.

@anba
Contributor

anba commented Apr 7, 2017

For some of these tests, it simply depends on how the test is performed. For example, iterating over all individual code points could be too slow, so instead we should test multiple code points at once. In https://github.com/anba/es6draft/tree/master/src/test/scripts/suite/syntax/identifier, I also have tests for IdentifierStart and IdentifierPart, but I always test (consecutive) ranges of code points to reduce the total execution time of the tests.
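
A small sketch of that range-based idea applied to property escapes; the range and property used here are just illustrative examples, not data from the suite linked above:

// Build one string from a consecutive range of code points and exercise it
// with a single anchored regex, instead of running one test per code point.
function buildRangeString(start, end) {
    let result = '';
    for (let codePoint = start; codePoint <= end; codePoint++) {
        result += String.fromCodePoint(codePoint);
    }
    return result;
}

// Example: the ASCII digits U+0030..U+0039 should all match \p{ASCII}.
const digits = buildRangeString(0x30, 0x39);
if (!/^\p{ASCII}+$/u.test(digits)) {
    throw new Error('Expected every code point in the range to match \\p{ASCII}');
}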

@mathiasbynens
Member Author

Quick WIP test generator script that demonstrates what I had in mind originally: https://github.com/mathiasbynens/unicode-property-escapes-tests

The generated tests check every possible \p{…} combination against all the symbols they’re supposed to match. Then, the negated \P{…} form is checked against some (but not all) of the remaining symbols. As such, the tests are not fully exhaustive and do not detect every false positive. (This could be tweaked by setting MAX_NON_MATCH_LENGTH to 0x10FFFF, but the tests would become very large and slow to generate.)

@dilijev
Contributor

dilijev commented Apr 11, 2017

I think a quick test like the MAX_NON_MATCH_LENGTH = 256 variant, which gives moderate coverage of both matches and non-matches, detects any false negatives, and ensures that negated matches are generally in good working order, is a good trade-off for speed and size.

However, I think we should still have a slow variant of the test which detects false positives, to really call the feature truly covered.

@mathiasbynens
Member Author

mathiasbynens commented Apr 12, 2017

However, I think we should still have a slow variant of the test which detects false positives, to really call the feature truly covered.

With the current approach, each generated test file is then ~13 MB, resulting in 4.5 GB of output.

Instead of hardcoding the generated strings, we could output a list of character ranges to test. That would result in smaller generated files, possibly at the cost of run-time performance.

@littledan
Member

littledan commented Apr 12, 2017

Looks like these tests check all code points. Even if it's grouped together into one string, I think that's still too big and slow to include in test262 to run all the time. What if you made a mode of those tests which took a smaller number (3? 100?) of random elements from the included set, and three from the excluded set, to generate a much shorter string?

@mathiasbynens
Member Author

mathiasbynens commented Apr 12, 2017

Instead of hardcoding the generated strings, we could output a list of character ranges to test. That would result in smaller generated files, possibly at the cost of run-time performance.

I’ve given that a try. The script now generates test files containing something like the following, instead of the long hardcoded strings:

const buildString = ({ loneCodePoints, ranges }) => {
	let result = String.fromCodePoint(...loneCodePoints);
	for (const [start, end] of ranges) {
		for (let codePoint = start; codePoint <= end; codePoint++) {
			result += String.fromCodePoint(codePoint);
		}
	}
	return result;
};

// \p{Bidi_Control}
const matchSymbols = buildString({
	loneCodePoints: [
		0x00061C
	],
	ranges: [
		[0x00200E, 0x00200F],
		[0x00202A, 0x00202E],
		[0x002066, 0x002069]
	]
});

This results in much smaller test files — the entire suite now takes up only 1.3 MB. Much better than the 4.5 GB we had previously, at the cost of test readability.
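
For illustration, a minimal sketch of how the matchSymbols string above might then be exercised, using test262's assert helper; this is not necessarily the exact assertion format the generator emits:

// Every symbol in matchSymbols should be matched by the property escape,
// so the anchored regex must consume the whole string.
const re = /^\p{Bidi_Control}+$/u;
assert(re.test(matchSymbols), 'every Bidi_Control symbol should match \\p{Bidi_Control}');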

@mathiasbynens
Member Author

mathiasbynens commented Apr 12, 2017

What if you made a mode of those tests which took a smaller number (3? 100?) random elements from the included set, and three from the excluded set, to generate a much shorter string?

Before I do that, I’d like to see how much of a problem the current output (with fully exhaustive tests) is.

I’d prefer the test generation script to have deterministic output for any given Unicode version.

The plan is to contribute an additional (not auto-generated) test file to test262 which performs a few random checks, similar to what you mentioned. This file would also test things that the generated tests don't check, such as invalid property names, combining multiple escapes in a character class, etc. (Similar to the tests that V8 has right now.)

@littledan
Member

Possibly silly idea: You could make things deterministic without being manual by using a simple pseudo-random number generator with a constant seed.
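
A minimal sketch of that idea; the LCG constants are standard Numerical Recipes values, while the seed and sample size are arbitrary choices:

// Deterministic pseudo-random number generator (a simple linear congruential
// generator) seeded with a constant, so "random" sampling is reproducible.
function makePrng(seed) {
    let state = seed >>> 0;
    return function next() {
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        return state / 0x100000000; // value in [0, 1)
    };
}

const random = makePrng(0xC0FFEE); // arbitrary constant seed

// Deterministically pick `count` code points from a candidate list.
function sampleCodePoints(codePoints, count) {
    const picked = [];
    for (let i = 0; i < count; i++) {
        picked.push(codePoints[Math.floor(random() * codePoints.length)]);
    }
    return picked;
}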

leobalter pushed a commit that referenced this issue Apr 13, 2017
Proposal: https://github.com/tc39/proposal-regexp-unicode-property-escapes

These tests have been generated by the script at https://github.com/mathiasbynens/unicode-property-escapes-tests. They check all the properties and values that should be supported by implementations against the symbols they’re supposed to match. False positives are detected as well.

Ref. #950.
Ref. tc39/proposal-regexp-unicode-property-escapes#4.

@dilijev
Contributor

dilijev commented Apr 14, 2017

simple pseudo-random number generator with a constant seed.

Definitely a cost to readability. It's not possible to know which checks are run by reading the source -- only by actually running the test.

@mathiasbynens
Member Author

@dilijev The PRNG could be run as part of the build script that generates the tests. In that case, the generated test source would be readable. Anyhow, the current test output is exhaustive yet surprisingly compact and quick to execute, so hopefully we don't need to worry about any of that.

@littledan
Member

littledan commented Apr 14, 2017

Do other people who run the tests regularly have a point of view on these tests' runtime, which increases the total runtime of the test262 suite by about 10%? cc @syg

@dilijev
Contributor

dilijev commented Apr 14, 2017

10% is a fair chunk of time. I believe it took me about an hour to run the entire suite previously. Usually I run a set of tests targeted at the area I'm investigating. shrug

@jugglinmike
Contributor

gh-971 reflects this approach, and it's been further refined since it was merged back in April of 2017.

For anyone reading this discussion to understand the practical considerations behind this decision: you may be interested in gh-3199 which has a bit of discussion on the peer review process.

In any case, it seems to me like this issue is resolved!
