Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend symbolic string support to 8-bit and 16-bit strings #6

Open
robdockins opened this issue Aug 13, 2019 · 9 comments
Open

Extend symbolic string support to 8-bit and 16-bit strings #6

robdockins opened this issue Aug 13, 2019 · 9 comments
Labels
enhancement New feature or request

Comments

@robdockins
Copy link
Contributor

I started a refactoring to implement more realistic symbolic string support in what4, which can be found on the rwd/strings branch.

As I was doing this refactoring, I (re)discovered that string handling is stupid and complicated. The available SMT theories for strings treat strings as a sequence of bytes (as does raw C), but Java wants strings to be sequences of 16-byte unicode codepoints (with some custom encoding for longer codepoints, IIRC), and Haskell's Text datatype assumes UTF-8 encoding, etc.

Doing this correctly will require a lot of care sigh. We'll probably need to implement a family of string representations to actually get the semantics right and it's going to be kind of a mess.

@robdockins
Copy link
Contributor Author

To follow up, Java in fact uses UTF-16 encoding for its strings. From the Java String class documentation:

A String represents a string in the UTF-16 format in which supplementary
characters are represented by surrogate pairs (see the section
Unicode Character Representations in the Character class for more information).
Index values refer to char code units, so a supplementary character uses 
two positions in a String.

@travitch
Copy link
Contributor

Also Haskell Text is currently UTF-16 internally

@robdockins
Copy link
Contributor Author

Indeed, I was mistaken about UTF-8. However, note that Haskell Text is codepoint oriented, so the length function computes the number of codepoints. Thus, a UTF-16 surrogate character pair counts as 1 character, whereas Java would count it as two.

@robdockins
Copy link
Contributor Author

Further follow up: as best I can tell, Windows wide character support is also based on UTF-16

@robdockins
Copy link
Contributor Author

Additional follow-up: the Java classfile format uses a nonstandard UTF-8-like encoding for its string constants. https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.7

I suppose we should ensure that we are doing the correct decoding here.

@robdockins
Copy link
Contributor Author

I've started work on a more nuanced approach to strings, which can be found on the rwd/strings2 branch. It introduces three different flavors of strings, corresponding to the various situations I've actually seen in the wild: sequences of bytes, sequences of 16-bit code units, and sequences of Unicode code points.

@robdockins
Copy link
Contributor Author

With the merge of GaloisInc/crucible#333, we now have realistic symbolic string support for 8-bit, 16-bit and unicode strings. However, SMT solver support only exists for 8-bit strings currently. Z3 has arbitrary sequence types, so it would be relative straightforward to add 16-bit and unicode support. CVC4, it seems, only supports 8-bit strings.

I'm going to leave this bug open to track the task of hooking up other character sequence support, and also perhaps using encoding tricks to reduce other strings to CVC4's 8-bit string support.

@robdockins robdockins changed the title Strings are dumb and complicated Extend symbolic string support to 16-bit and unicode strings Nov 26, 2019
@robdockins robdockins transferred this issue from GaloisInc/crucible Feb 26, 2020
@robdockins robdockins added the enhancement New feature or request label Apr 17, 2020
@robdockins robdockins self-assigned this Jul 27, 2021
@robdockins
Copy link
Contributor Author

Update. Z3 and CVC4 now have settled on unicode support for their string theory, and #147 updates What4 to match. As a result, what we actually need to add support for is 8-bit and 16-bit strings.

@robdockins robdockins changed the title Extend symbolic string support to 16-bit and unicode strings Extend symbolic string support to 8-bit and 16-bit strings Aug 3, 2021
@robdockins robdockins removed their assignment Aug 4, 2021
@robdockins
Copy link
Contributor Author

#177 has now been merged, which updates string support for Z3 and CVC4 to the SMTLib unicode theory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants