Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that String is ASCII alphanumerics-only, maybe have a way of generating more Unicode? #77

Closed
shepmaster opened this issue May 7, 2015 · 5 comments

Comments

@shepmaster
Copy link
Contributor

Right now, Strings only have ASCII alphanumeric characters. I tried to use it to generate strings with spaces in it and was surprised when my code was never called 😺

It would be cool if there were more of the Unicode set included. Even awesomer might be if we had some tuning knobs where we could say "more punctuation, less letters", perhaps using the Unicode character classes?

Thanks for the fun-to-use library!

@BurntSushi
Copy link
Owner

Yeah, this definitely needs to be fixed. I think the ASCII limitation has been there and nobody has had cause enough to fix it.

There's basically two concerns here. The first is ergonomics. It's really nice when your witnesses are limited to characters that are likely to be meaningful in your terminal (i.e., printable characters). The second is that, of course, strings outside the ASCII range should be tested. When I first wrote quickcheck, I didn't know how to balance these so I just chose to address the first concern.

It seems like the principle of least surprise should apply here: String should be fuzzed with the set of Unicode scalar values and some other type (perhaps defined in quickcheck, i.e., AsciiString) should be more limited when you want a guarantee of nicer witnesses. I think it could deref to &String, which should mostly provide the ergonomics of a plain String.

@shepmaster
Copy link
Contributor Author

It seems like shrinking could also come into play. I could conceive of a string with only ascii as being "smaller" than one with ascii + ascii punctuation. Then it could "grow" to include more common unicode, then "grow" towards uncommon. Just thinking outloud, really.

@bluss
Copy link
Contributor

bluss commented Jun 15, 2015

Has this changed? I think I saw wildly different results than this.

Just as a data point, this is what I ended up using for a string search testing / fuzzing:

#[derive(Clone, Debug)]
struct Text(String);

//static ALPHABET: &'static str = "ABCD";
static ALPHABET: &'static str = "\0\u{1}\u{2}\u{3}";

impl Arbitrary for Text {
    fn arbitrary<G: qc::Gen>(g: &mut G) -> Self {
        let len = u16::arbitrary(g);
        let mut s = String::with_capacity(len as usize);
        for _ in 0..len {
            let i = usize::arbitrary(g);
            let i = i % ALPHABET.len();
            s.push(ALPHABET.as_bytes()[i] as char)
        }
        Text(s)
    }
    fn shrink(&self) -> Box<Iterator<Item=Self>> {
        Box::new(self.0.shrink().map(Text))
    }
}

Since the algorithm was working on just the byte slice, I switched to an alphabet of 0-4 for even easier to read intermediate debug outputs..

@shepmaster
Copy link
Contributor Author

@bluss it appears so! This could be considered closed via 848acee.

@BurntSushi
Copy link
Owner

Errm, right, I forgot to update this issue, but I grew tired of the ASCII-only behavior so I just changed it.

ASCII-only is still useful though, so we might want an AsciiString type or something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants