Strings generated by default should have more usual characters #99
Comments
This isn't a bad idea. It used to be ASCII only. But clearly, that is also an extreme. I often find myself defining new types that restrict the set of possible values to make debugging easier. See also: #77 |
Copying my idea from #77:
|
@shepmaster, Maybe generate the uncommon/tricky characters first (zero-width things, BOM, left-to-right marks), then the numerous common ones? Each Unicode codepoint has equal weight (time required to test with it), but unequal usefulness (probability that this codepoint catches some bug). |
It's an interesting thing - what is the most useful order to iterate through test cases? I'd think most people using quickcheck would want it to find things that they haven't thought of (at least it's true for me!). However, once something is found, we want it reduced to something that we can wrap our brains around. I think that "simple ASCII" will often be the easy-to-understand group of characters. The problem is going to be that different usages of |
But I expect that a problem will rarely come up with, for example, character U+12345 CUNEIFORM SIGN URU TIMES KI (𒍅) exactly (and not with other high-plane characters). Yet including all high-plane characters significantly increases the testing space, and they outnumber the more useful characters. So for "lesser" strings you can leave just one high-plane character. Imagine the table:
Small character classes (control characters, whitespace) should be included entirely. |
I somewhat feel like the obvious behavior for With that said, there's no reason why |
What does "a fair game" mean? In my idea any codepoint can appear, but the probabilities should be drastically different. A useful subset may fail to find a problem even if it runs long enough.
For example, if a function breaks only when fed a string with three spaces in a row, I expect quickcheck to find that fast. If a function breaks only when fed three 𒍅s in a row, I expect that to be found more slowly (because a space is much more likely to trigger bugs than some arbitrary character). |
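To put rough numbers on the three-spaces example: the estimate below assumes, as an idealization, that characters are drawn uniformly from all N = 1,112,064 Unicode scalar values (0x110000 codepoints minus 2,048 surrogates), and compares that with a weighted generator that gives the space character a probability of p = 0.01. The value of p is a made-up figure for illustration and does not come from this thread.

```latex
\begin{align*}
P_{\text{uniform}}(\text{three spaces at a given position})
  &= \left(\frac{1}{N}\right)^{3}
   = \left(\frac{1}{1\,112\,064}\right)^{3}
   \approx 7.3 \times 10^{-19} \\
P_{\text{weighted}}(\text{three spaces at a given position})
  &= p^{3} = (0.01)^{3} = 10^{-6} \\
\frac{P_{\text{weighted}}}{P_{\text{uniform}}}
  &\approx 1.4 \times 10^{12}
\end{align*}
```

Under these assumptions the weighted generator would hit the three-spaces case about a trillion times more often per position, while a run of three 𒍅s would remain possible but rare.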
@vi Ah, I see, I misunderstood. I think I'm fine with a smarter impl of |
@BurntSushi, Maybe smarter impl of |
I think that might be OK. |
Can such logic be also applied to Probably |
Sounds like a good idea to me! |
I wonder if it'd be worth looking at what other ports of quickcheck do. Does the Haskell quickcheck do anything fancy like this? If not, did they consider it? |
Asked on IRC. http://haddock.stackage.org/lts-3.17/QuickCheck-2.8.1/src/Test-QuickCheck-Arbitrary.html#line-471
|
This is a known crappy Haskell QuickCheck default that has bitten many people. The standard workaround is http://hackage.haskell.org/package/quickcheck-unicode |
Shall I try submitting a pull request for this, making it generate chars a bit like the aforementioned quickcheck-unicode (but with some emphasis on whitespace, special characters and specific tricky Unicode characters)? |
@vi That would be lovely! |
That's how generated strings typically looks now: O�ò»¹��[.? }'셥-�91(]ª!ñ�·�� #* "9ô�£´�:{乸0%㯓9똁⁔Rz릉¤tó£±�? (]>� <nf)*ᖯ'��ñ��°6>¦ó¤�¡匈�#$'`맽ô���cHX)�[r莅3*A ð¹�§7] G_媣<ꉟต8~^i7䱄釱fh)+��{G�0� ﵽ❔K/5‴9[꤅X1J[M&4[¥" ⇉Ɩ©�42폨ĒUñ�¸�5.`'O§)�-���*ñ·�¼r '@/@�骲6!ñ�§��,&E e?!�fó � 󶱬V�_ (]>el󯣿o+狪*="⁅ ñ���肖<{ó¿¿½\+巤 {T��*ô�¿½⁆?ó¿¿½ ꡯ칵쫨C}1<ʼn��*���..#ñ��º& J:,j=‹3“褙`}j¬ñ���+‐󾬲¦bO©S�ñ¡��~~�.�ª =㍃�&f�E&Q@ð¾�±R�笹 ⁁�D 6')�m9m�)�sqT�3H㹵035蹈\>^鯅ñ��»ó���ó¨£��‰쩻8 ⁋0�N\WGô�¡�¥�®��5UWñª���1钟[!�X��+<󿬹難"4�ó�®³ᔵ"ó¬�®!G 揟’O�1'ñ�¿��+髾@$Zvó�¹�䵃�;ð»�¸�h뢚ស9Yó¿¿¾_L蛇�AjpⰚ�㤩 ©揪)ò�®�-d�A){¥攝剟>~ó���=" «ó¿¿½1賬‟z⁉�VOô�¿¾�2I!mô�´¿N4;,ñ�¾»i>-\B��)裉᷈�f륯 +ाX~9[u 樴m‿ñ°��!=�=�C[ ط57_£=‴�`5�_�4}⁃‥�у灼 ¥1:�>ð�©» <$)>@, -"♄f<��ð¶¾�
QuickCheck's string generator's motto should be "I love characters you hate". |
@vi Haha, I like it! |
Strings typically generated look like this:
It fails (or takes a long time) to find simple cases, such as when something starts with a space, a special character, or similar.
I think normal characters should be preferred to deeply Unicode things.
I think by default there should be tiers of character classes:
And the character generator could aim for, for example, 10% buggy chars, 40% [a-zA-Z0-9_], 40% basic Unicode characters, and 10% everything else.
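The 10/40/40/10 split above could be sketched roughly as follows. This is only an illustration, not quickcheck's actual implementation: the names `XorShift64`, `TRICKY`, `WORD`, `char_in`, `tiered_char` and `tiered_string` are hypothetical, the contents of the "buggy" tier are a guess at what might belong there, and "basic Unicode" is assumed here to mean the non-ASCII part of the Basic Multilingual Plane. A tiny xorshift generator stands in for a real RNG so the sketch needs no external crates.

```rust
/// Tiny xorshift64 PRNG (stand-in for a real RNG; the seed must be non-zero).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in `0..n` (modulo bias is ignored in this sketch).
    fn below(&mut self, n: u32) -> u32 {
        (self.next_u64() % n as u64) as u32
    }
}

/// Hypothetical "buggy" tier: whitespace, controls, quoting, tricky Unicode.
const TRICKY: &[char] = &[
    ' ', '\t', '\n', '\r', '\0', '"', '\'', '\\', '/', '%',
    '\u{200B}', // zero-width space
    '\u{FEFF}', // byte order mark
    '\u{200E}', // left-to-right mark
    '\u{202E}', // right-to-left override
];

/// The [a-zA-Z0-9_] tier.
const WORD: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";

/// Draw a scalar value from `lo..=hi`, retrying when a surrogate is hit.
fn char_in(rng: &mut XorShift64, lo: u32, hi: u32) -> char {
    loop {
        if let Some(c) = char::from_u32(lo + rng.below(hi - lo + 1)) {
            return c;
        }
    }
}

/// Pick a tier with 10/40/40/10 weights, then a character within that tier.
fn tiered_char(rng: &mut XorShift64) -> char {
    match rng.below(100) {
        0..=9 => TRICKY[rng.below(TRICKY.len() as u32) as usize],
        10..=49 => WORD[rng.below(WORD.len() as u32) as usize] as char,
        // Assumption: "basic Unicode" = non-ASCII Basic Multilingual Plane.
        50..=89 => char_in(rng, 0x00A0, 0xFFFD),
        // Everything else: any scalar value at all, high planes included.
        _ => char_in(rng, 0, 0x10FFFF),
    }
}

/// Build a string of `len` characters drawn from the tiered distribution.
fn tiered_string(rng: &mut XorShift64, len: usize) -> String {
    (0..len).map(|_| tiered_char(rng)).collect()
}
```

Calling `tiered_string(&mut XorShift64(42), 20)` would yield a 20-character string in which any codepoint can still appear, but spaces, quotes and plain identifiers show up far more often than arbitrary high-plane characters. The weights are the ones proposed above and would presumably need tuning.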