Extend symbolic string support to 8-bit and 16-bit strings #6

robdockins · 2019-08-13T16:34:14Z

I started a refactoring to implement more realistic symbolic string support in what4, which can be found on the rwd/strings branch.

As I was doing this refactoring, I (re)discovered that string handling is stupid and complicated. The available SMT theories for strings treat strings as a sequence of bytes (as does raw C), but Java wants strings to be sequences of 16-byte unicode codepoints (with some custom encoding for longer codepoints, IIRC), and Haskell's Text datatype assumes UTF-8 encoding, etc.

Doing this correctly will require a lot of care sigh. We'll probably need to implement a family of string representations to actually get the semantics right and it's going to be kind of a mess.

The text was updated successfully, but these errors were encountered:

robdockins · 2019-08-21T18:25:12Z

To follow up, Java in fact uses UTF-16 encoding for its strings. From the Java String class documentation:

A String represents a string in the UTF-16 format in which supplementary
characters are represented by surrogate pairs (see the section
Unicode Character Representations in the Character class for more information).
Index values refer to char code units, so a supplementary character uses 
two positions in a String.

travitch · 2019-08-21T20:02:13Z

Also Haskell Text is currently UTF-16 internally

robdockins · 2019-08-21T21:50:14Z

Indeed, I was mistaken about UTF-8. However, note that Haskell Text is codepoint oriented, so the length function computes the number of codepoints. Thus, a UTF-16 surrogate character pair counts as 1 character, whereas Java would count it as two.

robdockins · 2019-08-29T21:02:17Z

Further follow up: as best I can tell, Windows wide character support is also based on UTF-16

robdockins · 2019-08-30T21:04:23Z

Additional follow-up: the Java classfile format uses a nonstandard UTF-8-like encoding for its string constants. https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.7

I suppose we should ensure that we are doing the correct decoding here.

robdockins · 2019-09-03T16:57:13Z

I've started work on a more nuanced approach to strings, which can be found on the rwd/strings2 branch. It introduces three different flavors of strings, corresponding to the various situations I've actually seen in the wild: sequences of bytes, sequences of 16-bit code units, and sequences of Unicode code points.

robdockins · 2019-11-26T22:07:30Z

With the merge of GaloisInc/crucible#333, we now have realistic symbolic string support for 8-bit, 16-bit and unicode strings. However, SMT solver support only exists for 8-bit strings currently. Z3 has arbitrary sequence types, so it would be relative straightforward to add 16-bit and unicode support. CVC4, it seems, only supports 8-bit strings.

I'm going to leave this bug open to track the task of hooking up other character sequence support, and also perhaps using encoding tricks to reduce other strings to CVC4's 8-bit string support.

robdockins · 2021-08-03T19:42:56Z

Update. Z3 and CVC4 now have settled on unicode support for their string theory, and #147 updates What4 to match. As a result, what we actually need to add support for is 8-bit and 16-bit strings.

robdockins · 2022-01-11T21:01:56Z

#177 has now been merged, which updates string support for Z3 and CVC4 to the SMTLib unicode theory.

robdockins changed the title ~~Strings are dumb and complicated~~ Extend symbolic string support to 16-bit and unicode strings Nov 26, 2019

robdockins transferred this issue from GaloisInc/crucible Feb 26, 2020

robdockins added the enhancement New feature or request label Apr 17, 2020

robdockins self-assigned this Jul 27, 2021

robdockins changed the title ~~Extend symbolic string support to 16-bit and unicode strings~~ Extend symbolic string support to 8-bit and 16-bit strings Aug 3, 2021

robdockins removed their assignment Aug 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend symbolic string support to 8-bit and 16-bit strings #6

Extend symbolic string support to 8-bit and 16-bit strings #6

robdockins commented Aug 13, 2019

robdockins commented Aug 21, 2019

travitch commented Aug 21, 2019

robdockins commented Aug 21, 2019

robdockins commented Aug 29, 2019

robdockins commented Aug 30, 2019

robdockins commented Sep 3, 2019

robdockins commented Nov 26, 2019

robdockins commented Aug 3, 2021

robdockins commented Jan 11, 2022

Extend symbolic string support to 8-bit and 16-bit strings #6

Extend symbolic string support to 8-bit and 16-bit strings #6

Comments

robdockins commented Aug 13, 2019

robdockins commented Aug 21, 2019

travitch commented Aug 21, 2019

robdockins commented Aug 21, 2019

robdockins commented Aug 29, 2019

robdockins commented Aug 30, 2019

robdockins commented Sep 3, 2019

robdockins commented Nov 26, 2019

robdockins commented Aug 3, 2021

robdockins commented Jan 11, 2022