This has been mentioned in a few places (most recently by @ScottPJones here).
I like the idea of validating strings on construction by default, but I think there also needs to be an escape hatch for performance in cases where you're certain you're getting valid code units/points (e.g. from a database). This issue is to brainstorm how this should work. Here are a few ideas:
1. Have `immutable String{E<:Encoding,V}`, where `V` is `true`/`false`, indicating whether the string has been validated or not. By default, strings would be validated, but maybe there'd be a single inner constructor that would allow for skipping validation. I'm not sure what benefit we'd really get from having this be a type parameter in terms of being able to dispatch on whether a string was valid or not.
2. Have a `valid` flag as part of the `String` type definition. I'm not a huge fan of this because `String` could no longer be `immutable` (unless validating created a new `String` instance pointing at the same data, but that seems wasteful). I guess the same argument goes against the 1st option above, since it would need to create a new instance when going from unvalidated -> validated.
3. Create separate `Raw8`, `Raw16`, and `Raw32` `Encoding` types (or change @ScottPJones's `Binary` encoding to `Binary8`, `Binary16`, etc.) that would represent non-validated code units of the corresponding size 8, 16, or 32. That way, `ASCII`, `UTF8`, `UTF16`, etc. would always be validated no matter what, and to get non-validated data, you'd have to use the `Raw` encodings. This might be nice, but I worry people might feel skittish if they're pulling data from ODBC and see `String{Raw8}` as the result, just because the data wasn't validated (even though it's pretty much guaranteed to be valid data).
4. Don't store whether a string has been validated or not. Strings would be validated on construction by default, with an optional constructor to avoid validating if desired. This means that, given an arbitrary string, you can't be sure it has been validated (though it probably has been, unless it came from a process using the non-default constructor).
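To make the dispatch question in option 1 a bit more concrete, here is a rough sketch of what a validation type parameter could look like. It uses current Julia syntax (`struct` rather than the older `immutable` keyword), and the names `Str`, `validated`, `unchecked`, and `isvalidated` are made up for illustration; only `isvalid(String, bytes)` is an existing Base function.

```julia
# Hypothetical sketch of option 1; none of these names are an existing API.
abstract type Encoding end
struct UTF8 <: Encoding end

# V is a Bool type parameter recording whether the bytes were validated.
struct Str{E<:Encoding,V}
    data::Vector{UInt8}
end

# Default path: check the bytes, then tag the result as validated.
function validated(::Type{UTF8}, bytes::Vector{UInt8})
    isvalid(String, bytes) || throw(ArgumentError("invalid UTF-8 data"))
    return Str{UTF8,true}(bytes)
end

# Escape hatch: skip the check and tag the result as unvalidated.
unchecked(::Type{E}, bytes::Vector{UInt8}) where {E<:Encoding} = Str{E,false}(bytes)

# Code can then dispatch on the flag.
isvalidated(::Str{E,true}) where {E} = true
isvalidated(::Str{E,false}) where {E} = false
```

Note that going from unvalidated to validated still means constructing a new `Str{UTF8,true}` wrapper around the same data, which is the cost mentioned under option 2.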
I think I kind of lean towards option 4 since it's minimal.
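As a rough sketch of what option 4 might look like, assuming a hypothetical `UTF8Str` type and a `validate` keyword (neither is an existing API, and the syntax is current Julia rather than the `immutable` keyword used above):

```julia
# Hypothetical sketch of option 4: validate by default, no stored flag.
struct UTF8Str
    data::Vector{UInt8}

    # Single inner constructor; callers who trust their source can opt out.
    function UTF8Str(data::Vector{UInt8}; validate::Bool=true)
        if validate && !isvalid(String, data)
            throw(ArgumentError("invalid UTF-8 data"))
        end
        return new(data)
    end
end

checked = UTF8Str(UInt8[0x68, 0x69])                 # validated on construction
trusted = UTF8Str(UInt8[0x68, 0x69]; validate=false) # skips the check
```

Nothing in the resulting value records which path was taken, which is exactly the trade-off described in option 4.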
I'd recommend one thing at the start: don't begin by trying to decide on implementation details or even architecture yet. First come up with a list of the goals/requirements that people want from a new string implementation. (Not that having these starting ideas isn't good!)
Once we know those, it becomes a lot easier to sift through all of the ideas on what architectures would best achieve them.
Also, it's not just validation. There's also converting between variants of UTF-8, and the possibility of using a replacement character/string to handle truly invalid data (currently, there is a `convert` method for `UTF8String` that has a replacement option, but none for `ASCIIString`, `UTF16String`, or `UTF32String`).
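For illustration only, the replacement idea could be sketched like this in current Julia. `sanitize` is a hypothetical helper (not the existing `convert` method mentioned above), and U+FFFD as the default replacement is an assumption:

```julia
# Hypothetical helper: swap every invalid character for a replacement.
# Relies on Base behavior where iterating a String with bad bytes yields
# Char values for which isvalid returns false.
sanitize(s::String; replacement::Char='\ufffd') =
    map(c -> isvalid(c) ? c : replacement, s)

dirty = String(UInt8[0x61, 0xff, 0x62])  # 'a', a stray 0xff byte, 'b'
clean = sanitize(dirty)
```

A real version would also need to decide what to do for the other encodings and for the UTF-8 variants, which this sketch ignores.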
One of my goals would be that immutable strings are always 100% valid, and that invalid substrings are not allowed (an error is thrown). I think that is important for security reasons, and it also allows for much more efficient code by avoiding run-time validity checks.
For mutable strings (which is another thing I think should be a goal), I think a valid flag would be good: if the string is written to, the flag gets cleared, and an explicitly called validity checker sets it if the string passes.
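A minimal sketch of that flag behavior, assuming a hypothetical `MutableStr` type that holds UTF-8 bytes and a hypothetical `validate!` function (neither exists in Base):

```julia
# Hypothetical mutable string with a validity flag.
mutable struct MutableStr
    data::Vector{UInt8}
    valid::Bool
end

# Start out unvalidated; only an explicit check can set the flag.
MutableStr(data::Vector{UInt8}) = MutableStr(data, false)

# Any write clears the flag.
function Base.setindex!(s::MutableStr, byte::UInt8, i::Integer)
    s.data[i] = byte
    s.valid = false
    return s
end

# Explicitly called checker: sets the flag only if the data passes.
function validate!(s::MutableStr)
    s.valid = isvalid(String, s.data)
    return s.valid
end
```

Code that needs valid data can then check `s.valid` and call `validate!` only when the flag has been cleared by a write.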