Property-based testing for the parser #91
I did some initial investigation and I may be missing something :( From what I can see, the generated examples do not really cover many possibilities in the grammar space that would exercise the parser, as it mostly generates empty module objects. For instance, I tried this very simple test, which fails if the code generates an `ast.With` node:
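A minimal sketch of such a test (reconstructed along the lines described; the name `test_parser` is illustrative):

```python
import ast

import hypothesis
import hypothesmith

@hypothesis.given(hypothesmith.from_grammar())
def test_parser(code):
    # The test "fails" (i.e. reports a counterexample) as soon as any
    # generated program contains a `with` statement.
    assert "withitem" not in ast.dump(ast.parse(code))
```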
and after several minutes, the test did not find anything. Inspecting the generated AST for all the examples, I can see that almost all of them produce an empty module:
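An illustrative session (hedged reconstruction; the exact `ast.dump` format varies by Python version):

```python
>>> import ast, hypothesmith
>>> ast.dump(ast.parse(hypothesmith.from_grammar().example()))
'Module(body=[])'
```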
Any idea of what is happening? Is this expected?

---
Hmm. Turning up the number of examples will help a little, but probably not enough on its own.

The first thing I'd try is just adding a `.filter(str.strip)` plus some in-test `assume(...)` and `target(...)` calls, as in the test below.

The second is that I might need to rework hypothesmith - it's really at proof-of-concept stage at the moment and deserves its 0.0.x version number! This could be as simple as sprinkling a bunch of filters through the strategy to reject empty productions (inelegant and slower, but it works), or I might have to write that custom translation of the PEG grammar.
```python
import ast
import dis

import hypothesis
import hypothesmith
from hypothesis import assume, target

@hypothesis.given(hypothesmith.from_grammar("file_input").filter(str.strip))
@hypothesis.settings.get_profile("slow")  # assumes a registered "slow" settings profile
def test_parser(code):
    x = ast.dump(ast.parse(code))
    # assume (in-test filter) away any functionally empty inputs
    assume(x != "Module(body=[])")
    n_instructions = float(len(list(dis.Bytecode(compile(code, "<string>", "exec")))))
    assume(n_instructions > 0)
    # target larger inputs - the Hypothesis engine will do a rough multi-objective
    # hill-climbing search using these scores to generate 'better' examples.
    target(n_instructions, label="number of instructions in bytecode")
    target(float(len(x) - len("Module(body=[])")), label="length of dumped ast body")
    assert "withitem" not in x
```

Just adding some well-chosen filters and `target()` calls makes a big difference. You don't need to change the test though, as I've just released a new version of hypothesmith with equivalent filtering built into the strategy.

Longer term, we'll still need a new strategy based on a CST (e.g. the prototype at Zac-HD/hypothesmith#2) to get much further 😞

---
@pablogsal - I've just released Hypothesmith 0.1, with a new CST-based `from_node()` strategy:

```python
import ast

import hypothesis
import hypothesmith

@hypothesis.given(hypothesmith.from_node())
@hypothesis.settings.get_profile("slow")  # same "slow" profile as above
def test_parser(code):
    x = ast.dump(ast.parse(code))
    assert "withitem" not in x
```

Personally I'd run both of the tests above.

---
@pablogsal - just saw this twitter thread and wanted to clarify (I don't have twitter) - the main reason the examples skew so trivial is how the engine explores the input space. Longer and more interesting strings are correspondingly more likely to include something which is permitted by the grammar but not actually valid syntax, and the engine therefore ends up avoiding those areas of the input space, because there are few valid examples to try variations on. Unfortunately, this means it was spending most of its time exploring trivial examples like comments and whitespace. The new release should mitigate this.

---
Thanks for the clarification @Zac-HD! I am very excited about the latest improvements in hypothesmith.
There is also the problem that, although the set of valid strings is (so to speak) of measure zero, it is still an incredibly big and complex space to explore. For instance, some of the errors we found were in nested f-strings, and when I played with the generated examples I never saw anything close to those (see the rough check sketched below).

Sadly, the new PEG grammar won't help much with this, because PEG parsers are not generative (the grammar is not written in a way that makes it easy to generate valid strings, as opposed to CFGs) but analytical (the grammar is written in a way that makes it easy to check whether a given string is part of the language). It may still be possible, but certainly not as easy.

Another thing to consider is that testing almost-correct strings is actually very important, because we also want to make sure that our parser doesn't allow constructs we don't want. Right now, the only way to detect these cases is to stare at the grammar for long enough, or to try to break it on purpose.
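As a quick, unscientific illustration of that sparsity (hypothetical code - `.example()` is only meant for this kind of interactive eyeballing):

```python
import ast

import hypothesmith

# Draw a couple of hundred examples and count how many contain an
# f-string at all (ast.JoinedStr) - let alone a *nested* f-string.
strategy = hypothesmith.from_grammar()
hits = 0
for _ in range(200):
    tree = ast.parse(strategy.example())
    if any(isinstance(node, ast.JoinedStr) for node in ast.walk(tree)):
        hits += 1
print(f"{hits}/200 examples contained an f-string")
```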
When I have more time I will give the new version another go, but I fear that, given the aforementioned limitations, it will not be as much help as Hypothesis normally is.

---
Thanks for the feedback - it sounds like we're on the same page. Personally I'd still leave Hypothesmith tests running somewhere - while the expected and ideal case is that they don't find anything, CPU time is pretty cheap! Some notes-to-self, comments from anyone welcome:

---
I don't believe there are concrete action items for the CPython parser here. I have added "play with Hypothesis" to my personal TO-DO list, but it's a looong list...

---
Hi everyone!
I gave a talk about property-based testing at the Language Summit yesterday, and people seemed interested. Since it's particularly useful when you have two implementations that can be compared, I thought I should drop by and offer to help apply it to testing the new parser 🙂
**We've tested against the stdlib and many popular packages - why might this find additional bugs?**
Popular and well-written code may have zero examples of many weird syntactic constructs, since it's typically designed for clarity and compatibility. In my demo here, I discovered several strings that can be `compile`d but not `tokenize.tokenize`d - involving odd use of newlines, backslashes, non-ASCII characters, etc. It's also nice to automatically get minimal examples of any failing inputs.
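A minimal sketch of that kind of differential test (assuming only the public `compile` and `tokenize` APIs):

```python
import io
import tokenize

import hypothesis
import hypothesmith

@hypothesis.given(hypothesmith.from_grammar())
def test_tokenize_agrees_with_compile(code):
    # Property: any source string the compiler accepts should also be
    # tokenizable.  `compile` acts as our validity oracle here.
    compile(code, "<string>", "exec")
    # tokenize.tokenize expects a readline callable over bytes.
    list(tokenize.tokenize(io.BytesIO(code.encode("utf-8")).readline))
```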
**Say we decide to use this. What's the catch?**
I'll assume here that you're familiar with the basic idea of Hypothesis (e.g. saw my talk) - if not, the example above should be helpful, and I've collected all the links you could want here: https://github.com/Zac-HD/stdlib-property-tests#further-reading
I've already written a tool to generate Python source code from 3.8's grammar, called `hypothesmith` - it's a standard strategy, which generates strings that can be passed to the `compile` builtin (as a convenient predicate for "is this actually valid source code?"). `hypothesmith` is implemented by taking a reasonably good EBNF grammar and "inverting" it, using Hypothesis' built-in support for such cases (and the `st.from_regex()` strategy for terminals).

The catch is that, for exactly the same reasons PEP 617 exists, I have a bunch of hacks where I filter substrings using the `compile` builtin - and so there's no way to generate anything not accepted by that function. Pragmatically, I think it's still worth using, since it's pretty quick to set up!

**Extensions after vanilla `hypothesmith` is running against the parser**

First, updating `hypothesmith` to treat the PEG grammar as authoritative is a no-brainer - I'm aiming to finish that in the next week or two. This would remove the "`compile` is trusted" limitation entirely. It also shouldn't be too hard - strategies are literally parser combinators, and to generate a string we just "run it backwards": start at the root, choose random productions at each branch or generate random terminals, and write out the string that would drive such transitions.
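To make "running it backwards" concrete, here's a toy sketch (illustrative only - not hypothesmith's actual implementation) that inverts a two-production grammar into a Hypothesis strategy:

```python
import hypothesis.strategies as st

# Toy grammar:   expr := NAME | expr " + " expr
# Each production becomes a strategy: alternation -> st.one_of,
# sequencing -> st.tuples(...).map("".join), terminals -> st.from_regex.
name = st.from_regex(r"[a-z]+", fullmatch=True)
expr = st.deferred(
    lambda: st.one_of(
        name,
        st.tuples(expr, st.just(" + "), expr).map("".join),
    )
)

# expr.example() now yields strings like "a" or "foo + bar + x".
```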
Second, Hypothesis tests can be used as fuzz targets for coverage-guided fuzzers such as AFL or libFuzzer. Since we can also generate a wide variety of inputs from scratch, this can be a really powerful technique for ratcheting up our coverage of really weird edge cases. Or just `hypothesis.target(...)` things like the length of the generated code, the number of AST nodes, the number of node types, nodes-per-length, etc.
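As a sketch (assuming a recent Hypothesis release and an external harness such as python-afl or Atheris), any `@given` test already exposes a bytes-based fuzz target:

```python
# `test_parser` is the @hypothesis.given(...) test defined earlier in
# this thread.  Hypothesis attaches a raw fuzz entry point to it, which
# a coverage-guided fuzzer can drive with an arbitrary byte buffer:
def TestOneInput(buffer: bytes) -> None:
    test_parser.hypothesis.fuzz_one_input(buffer)
```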
**Please be very precise about what you suggest**

- Use `@hypothesis.given(hypothesmith.from_grammar())` to generate inputs to a copy of your existing tests which run against the stdlib / PyPI / other Python modules.
- Something I didn't think of.
You probably have more questions, and I hope I haven't been too presumptuous opening this issue. Please ask, and I'm happy to answer and help out (including reviewing and/or writing code) however I can, subject to other commitments and my Australian timezone!
Finally, thanks for all your work on this project - I'm really looking forward to what a precise and complete grammar will do for the core and wider Python community.