Validated strings #4

Open
quinnj opened this issue Jun 19, 2015 · 1 comment
@quinnj
Owner

quinnj commented Jun 19, 2015

This has been mentioned in a few places (most recently by @ScottPJones here).

I like the idea of validating strings on construction by default, but I think there also needs to be an escape hatch for performance in cases where you're certain you're getting valid code units/points (e.g. from a database). This issue is to brainstorm how this should work. Here are a few ideas:

  1. Have immutable String{E<:Encoding,V}, where V is true/false indicating whether the string has been validated or not. By default, strings would be validated, but maybe there'd be a single inner constructor that would allow for skipping validation. I'm not sure what benefit we'd really get from having this be a type parameter, in terms of being able to dispatch on whether a string was valid or not.
  2. Have a valid flag as a part of the String type definition; I'm not a huge fan of this because String could no longer be immutable (unless validating created a new String instance pointed at the same data, but that seems wasteful). I guess this same argument would go against the 1st option above by needing to create a new instance when going from unvalidated -> validated.
  3. We could create separate Raw8, Raw16, and Raw32 Encoding types (or change @ScottPJones's Binary encoding to Binary8, Binary16, etc.) that would represent non-validated code units of the corresponding size: 8, 16, or 32 bits. That way, ASCII, UTF8, UTF16, etc. would always be validated no matter what, and to get non-validated strings you'd have to use the Raw encodings. This might be nice, but I worry people might feel skittish if they're pulling data from ODBC and see String{Raw8} as the result just because the data wasn't validated (even though it's pretty much guaranteed to be valid data).
  4. Another option is that we don't store whether a string has been validated or not. Strings would be validated on construction by default, with an optional constructor to skip validation if desired. This means that, given an arbitrary string, you can't be sure whether it's been validated (though it probably has been, unless it came from a process using the non-default constructor).
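
To make the trade-off concrete, here is a rough sketch of option 1 in current Julia syntax (`struct` rather than the 2015 `immutable`). All of the names (`EncodedString`, `validated`, `unvalidated`, `isvalidated`) are hypothetical and not part of any existing package; only `isvalid(String, bytes)` comes from Base. It covers UTF-8 only, to stay short.

```julia
# Hypothetical sketch of option 1; none of these names are an existing API.
abstract type Encoding end
struct UTF8 <: Encoding end

# V is a Bool type parameter recording whether validation ran.
struct EncodedString{E<:Encoding,V}
    data::Vector{UInt8}
end

# Default path: check the bytes, then tag the result as validated.
function validated(::Type{UTF8}, data::Vector{UInt8})
    isvalid(String, data) || throw(ArgumentError("invalid UTF-8 data"))
    return EncodedString{UTF8,true}(data)
end

# Escape hatch: skip the check when the source is trusted.
unvalidated(::Type{UTF8}, data::Vector{UInt8}) = EncodedString{UTF8,false}(data)

# The flag is read from the type, so methods can dispatch on it.
isvalidated(::EncodedString{E,V}) where {E,V} = V
```

Option 4 would be roughly the same sketch with the `V` parameter dropped: both functions would return the same type, so nothing downstream could tell the two apart.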

I think I kind of lean towards option 4 since it's minimal.

@ScottPJones
Collaborator

I'd recommend one thing at the start: don't begin by trying to decide on implementation details or even architecture yet, but first come up with a list of the goals/requirements that people want to have with a new string implementation. (Not that having these starting ideas isn't good!)
Once we know those, it becomes a lot easier to sift through all of the ideas on what architectures would best achieve them.

Also, it's not just validation: there's also converting variants of UTF-8, and the possibility of using a replacement character/string to handle truly invalid data (currently, there is a convert method for UTF8String that has a replacement option, but none for ASCIIString, UTF16String, or UTF32String).
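
As an illustration of the replacement idea, here is a minimal sketch written against current Julia rather than the 2015 release under discussion. The function name `replace_invalid` and its `replacement` keyword are hypothetical; the sketch assumes current Julia's behavior of yielding invalid bytes as `Char` values for which `isvalid` returns false when a string is iterated.

```julia
# Hypothetical helper: substitute a replacement character for invalid UTF-8.
function replace_invalid(data::Vector{UInt8}; replacement::Char='\ufffd')
    # String(::Vector{UInt8}) takes ownership of its argument, so copy first.
    s = String(copy(data))
    # Keep valid characters and swap out anything that fails the check.
    return map(c -> isvalid(c) ? c : replacement, s)
end
```

For example, `replace_invalid(UInt8[0x61, 0xff, 0x62])` replaces the stray `0xff` byte with U+FFFD and leaves the surrounding `a` and `b` untouched.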

One of my goals would be that immutable strings are always 100% valid, and that invalid substrings are not allowed (error thrown). I think that is also important for security reasons, and also allows for much more efficient code, by avoiding run-time validity checks.

For mutable strings (which is another thing I think should be a goal), I think a valid flag would be good: if the string is written to, the flag gets cleared, and an explicitly called validity checker sets it again if the string passes.
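
A small sketch of that flag behavior, in current Julia syntax. The type `MutableUTF8` and the functions `setunit!` and `validate!` are hypothetical names chosen for illustration; only `isvalid(String, bytes)` is from Base.

```julia
# Hypothetical mutable string that tracks whether its bytes were checked.
mutable struct MutableUTF8
    data::Vector{UInt8}
    valid::Bool
end

# New strings start out unchecked.
MutableUTF8(data::Vector{UInt8}) = MutableUTF8(data, false)

# Any write clears the flag, since the bytes may no longer be valid.
function setunit!(s::MutableUTF8, i::Integer, unit::UInt8)
    s.data[i] = unit
    s.valid = false
    return s
end

# The explicit checker sets the flag only when the data passes.
function validate!(s::MutableUTF8)
    s.valid = isvalid(String, s.data)
    return s.valid
end
```

Code that needs valid data can then test `s.valid` once rather than re-scanning the bytes on every operation.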

This is great that you've opened this up!
