-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend symbolic string support to 8-bit and 16-bit strings #6
Comments
To follow up, Java in fact uses UTF-16 encoding for its strings. From the Java
|
Also Haskell |
Indeed, I was mistaken about UTF-8. However, note that Haskell |
Further follow up: as best I can tell, Windows wide character support is also based on UTF-16 |
Additional follow-up: the Java classfile format uses a nonstandard UTF-8-like encoding for its string constants. https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.7 I suppose we should ensure that we are doing the correct decoding here. |
I've started work on a more nuanced approach to strings, which can be found on the |
With the merge of GaloisInc/crucible#333, we now have realistic symbolic string support for 8-bit, 16-bit and unicode strings. However, SMT solver support only exists for 8-bit strings currently. Z3 has arbitrary sequence types, so it would be relative straightforward to add 16-bit and unicode support. CVC4, it seems, only supports 8-bit strings. I'm going to leave this bug open to track the task of hooking up other character sequence support, and also perhaps using encoding tricks to reduce other strings to CVC4's 8-bit string support. |
Update. Z3 and CVC4 now have settled on unicode support for their string theory, and #147 updates What4 to match. As a result, what we actually need to add support for is 8-bit and 16-bit strings. |
#177 has now been merged, which updates string support for Z3 and CVC4 to the SMTLib unicode theory. |
I started a refactoring to implement more realistic symbolic string support in
what4
, which can be found on therwd/strings
branch.As I was doing this refactoring, I (re)discovered that string handling is stupid and complicated. The available SMT theories for strings treat strings as a sequence of bytes (as does raw C), but Java wants strings to be sequences of 16-byte unicode codepoints (with some custom encoding for longer codepoints, IIRC), and Haskell's
Text
datatype assumes UTF-8 encoding, etc.Doing this correctly will require a lot of care sigh. We'll probably need to implement a family of string representations to actually get the semantics right and it's going to be kind of a mess.
The text was updated successfully, but these errors were encountered: