Validated strings #4

quinnj · 2015-06-19T21:27:38Z

This has been mentioned in a few places (most recently by @ScottPJones here).

I like the idea of validating strings on construction by default, but I think there also needs to be an escape hatch for performance in cases where you're certain you're getting valid code units/points (e.g. from a database). This issue is for brainstorming how this should work. Here are a few ideas:

1. Have immutable String{E<:Encoding,V}, where V is true/false indicating whether the string has been validated or not. By default, strings would be validated, but maybe there'd be a single inner constructor that would allow for skipping validation. I'm not sure what benefit we'd really get from having this be a type parameter, in terms of being able to dispatch on whether a string was valid or not.
2. Have a valid flag as part of the String type definition. I'm not a huge fan of this because String could no longer be immutable (unless validating created a new String instance pointing at the same data, but that seems wasteful). I guess this same argument would go against option 1 above, since it also needs to create a new instance when going from unvalidated -> validated.
3. We could create separate Raw8, Raw16, and Raw32 Encoding types (or change @ScottPJones's Binary encoding to Binary8, Binary16, etc.) that would represent non-validated code units of the corresponding size: 8, 16, or 32 bits. That way, ASCII, UTF8, UTF16, etc. would always be validated no matter what, and to get non-validated strings you'd have to use the Raw encodings. This might be nice, but I worry people might feel skittish if they're pulling data from ODBC and see String{Raw8} as the result just because the data wasn't validated (even though it's pretty much guaranteed to be valid data).
4. Another option is that we don't store whether a string has been validated or not. Strings would be validated on construction by default, with an optional constructor to skip validation if desired. This means that, given an arbitrary string, you can't be sure it's been validated (though it probably has been, unless it came from a process using the non-default constructor).
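To make the last option concrete, here is a rough sketch of "validate by default, with an escape hatch". It is written in Python purely for illustration, not as a proposed Julia implementation. The class name `ValidatedStr`, the `validate` keyword, and the `unchecked` constructor are all hypothetical names invented for this example, and UTF-8 is assumed as the only encoding.

```python
class ValidatedStr:
    """Hypothetical immutable-style UTF-8 string wrapper (illustration only).

    Validation happens on construction unless the caller opts out.
    Nothing is stored about whether validation was performed.
    """

    __slots__ = ("_data",)

    def __init__(self, data: bytes, validate: bool = True):
        if validate:
            # bytes.decode raises UnicodeDecodeError on malformed UTF-8.
            data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def unchecked(cls, data: bytes) -> "ValidatedStr":
        # Escape hatch: the caller promises the bytes are already valid
        # (e.g. they came from a trusted database driver).
        return cls(data, validate=False)

    @property
    def data(self) -> bytes:
        return self._data


ok = ValidatedStr(b"caf\xc3\xa9")           # validated, accepted
raw = ValidatedStr.unchecked(b"\xff\xfe")   # not validated, accepted as-is
```

Note the trade-off this sketch makes visible: once `raw` exists, nothing distinguishes it from a validated instance, so downstream code has to trust whoever constructed it.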

I think I kind of lean towards option 4 since it's minimal.

ScottPJones · 2015-06-20T02:12:10Z

I'd recommend one thing at the start: don't begin by trying to decide on implementation details or even architecture yet. First come up with a list of the goals/requirements that people want from a new string implementation. (Not that having these starting ideas isn't good!)
Once we know those, it becomes a lot easier to sift through the ideas and see which architectures would best achieve them.

Also, it's not just validation: there's also conversion of UTF-8 variants, and the possibility of using a replacement character/string to handle truly invalid data (currently, there is a convert method for UTF8String that has a replacement option, but none for ASCIIString, UTF16String, or UTF32String).
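As a small illustration of the difference between rejecting invalid data and substituting a replacement character, here is a sketch using Python's built-in UTF-8 decoder. The helper name `decode_utf8` and its `replace` flag are made up for this example. It says nothing about how the Julia convert methods mentioned above behave.

```python
def decode_utf8(data: bytes, replace: bool = False) -> str:
    """Decode UTF-8, either raising on invalid input or substituting U+FFFD.

    Hypothetical helper for illustration; relies only on the standard
    'strict' and 'replace' error handlers of bytes.decode.
    """
    return data.decode("utf-8", errors="replace" if replace else "strict")


bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

repaired = decode_utf8(bad, replace=True)  # invalid byte becomes U+FFFD

try:
    decode_utf8(bad)                       # strict mode rejects the input
    rejected = False
except UnicodeDecodeError:
    rejected = True
```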

One of my goals would be that immutable strings are always 100% valid, and that invalid substrings are not allowed (error thrown). I think that is also important for security reasons, and also allows for much more efficient code, by avoiding run-time validity checks.

For mutable strings (which is another thing I think should be a goal), I think a valid flag would be good: if the string is written to, the flag gets cleared, and an explicitly called validity checker sets it again if the string passes.
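The flag behaviour described above could look roughly like the following. This is a Python sketch under assumptions, not a design for the actual package: `MutableStr`, `write`, and `check` are hypothetical names, and UTF-8 is assumed as the encoding.

```python
class MutableStr:
    """Hypothetical mutable byte string with a validity flag (illustration only)."""

    def __init__(self, data: bytes = b""):
        self._buf = bytearray(data)
        self._valid = False  # unknown until check() is called

    @property
    def valid(self) -> bool:
        return self._valid

    def write(self, index: int, byte: int) -> None:
        # Any write clears the flag; the contents may no longer be valid.
        self._buf[index] = byte
        self._valid = False

    def check(self) -> bool:
        # Explicit validity check; sets the flag only if the bytes
        # are well-formed UTF-8.
        try:
            self._buf.decode("utf-8")
        except UnicodeDecodeError:
            self._valid = False
        else:
            self._valid = True
        return self._valid


m = MutableStr(b"abc")
m.check()         # flag set: the contents are valid UTF-8
m.write(0, 0xFF)  # flag cleared by the write
```

Code that needs a guaranteed-valid string would then only accept a `MutableStr` whose flag is set, or call `check()` first.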

This is great that you've opened this up!
