strlen #468

weaversa · 2020-04-10T13:49:28Z

I'm really not sure what is wrong here. Is something wrong with saw's notion of strlen? Or, am I doing something wrong? Any help would be appreciated.

strlen_test.c

#include <string.h>

size_t strlen_test (char *string) {
  size_t nLength = strlen(string);
  return nLength;
}

strlen_test.saw

let alloc_init ty v = do {
  p <- crucible_alloc ty;
  crucible_points_to p v;
  return p;
};

let ptr_to_fresh n ty = do {
  x <- crucible_fresh_var n ty;
  p <- alloc_init ty (crucible_term x);
  return (x, p);
};

let strlen_test_spec = do {
  (string, pstring) <- ptr_to_fresh "string" (llvm_array 10 (llvm_int 8));

  crucible_precond {{ string!0 == 0 }}; //NULL terminated string

  crucible_execute_func [pstring];
};

strlen_test_bc <- llvm_load_module "strlen_test.bc";

strlen_test_result <- crucible_llvm_verify strlen_test_bc "strlen_test" [] false strlen_test_spec z3;

$ clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$ clang -emit-llvm strlen_test.c -c -o strlen_test.bc -g
$ saw --version
0.4.0.99 (d87d0d73)
$ saw strlen_test.saw
[13:45:56.549] Loading file "strlen_test.saw"
[13:45:56.576] Verifying strlen_test ...
[13:45:56.578] Simulating strlen_test ...
[13:45:56.579] "crucible_llvm_verify" (strlen_test.saw:23:23-23:43):
Symbolic execution failed.
Abort due to false assumption:
  Error during memory load
  in strlen_test at strlen_test.c:4:20
  Details:
  No previous write to this location was found
    No matching writes to this memory were found.
Stack frame
  Allocations:
    StackAlloc 6 0x8:[64] Mutable 8-byte-aligned "strlen_test.c:4:10"
    StackAlloc 5 0x8:[64] Mutable 8-byte-aligned "strlen_test.c:3:27"
  Writes:
    Indexed chunk:
      5 |->
        *(5, 0x0:[64]) := (4, 0x0:[64])
Base memory
  Allocations:
    HeapAlloc 4 0xa:[64] Mutable 1-byte-aligned strlen_test.saw:2:8
    GlobalAlloc 3 0x0:[64] Immutable 1-byte-aligned strlen_test
    GlobalAlloc 2 0x0:[64] Immutable 1-byte-aligned strlen
    GlobalAlloc 1 0x0:[64] Immutable 1-byte-aligned llvm.dbg.declare
  Writes:
    Indexed chunk:
      4 |->
        *(4, 0x0:[64]) := [c@6:bv
                          ,c@7:bv
                          ,c@8:bv
                          ,c@9:bv
                          ,c@10:bv
                          ,c@11:bv
                          ,c@12:bv
                          ,c@13:bv
                          ,c@14:bv
                          ,c@15:bv]

The text was updated successfully, but these errors were encountered:

robdockins · 2020-04-10T16:24:37Z

Ah, this is an interesting one. The code for strlen walks down an allocation byte-by-byte, constructing a if/then/else tree searching for the first 0 byte. In this case, you've assumed that a particular byte is 0 rather than writing an explicit 0... this means the strlen loop doesn't immediately realize it is 0 and keeps going until it reads off the end of the allocation. This should be OK because a concrete run would only do that if all the previous bytes were 0... and by assumption at least one will be 0.

However, we haven't correctly piped the relevant assumptions into the path condition, so the simulator believes that this load must succeed regardless and emits an impossible proof obligation.

As a workaround, you can insert a concrete 0 at the end of the array instead of assuming the final byte is 0; I believe that should work. Meantime, I'll see about getting this bug fixed.

robdockins · 2020-04-10T16:39:55Z

Here is a crux-llvm input demonstrating the same issue:

#include <stdlib.h>
#include <string.h>

#include "crucible.h"

#define MAX 10

char* mkstr() {
  char* x = malloc(MAX);
  for( int i=0; i<MAX; i++ ) {
    x[i] = crucible_int8_t( "x" );
  }

  //x[MAX-1] = 0;
  assuming( x[MAX-1] == 0 );

  return x;
}

int main() {
  char *str = mkstr(); 
  size_t sz = strlen(str);
  check( sz < MAX );
}

weaversa · 2020-04-10T17:24:54Z

Changing the saw-script to this still gives the same error.

let strlen_test_spec = do {
  string <- crucible_fresh_var "string" (llvm_array 10 (llvm_int 8));
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ string # [0] }});

  crucible_execute_func [pstring];
};

weaversa · 2020-04-10T17:38:31Z

@robdockins ...something sinister is happening here...

weaversa · 2020-04-10T17:41:09Z

Prepending zeros -- still broken

let strlen_test_spec = do {
  string <- crucible_fresh_var "string" (llvm_array 1 (llvm_int 8));
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ (zero : [10][8]) # string }});

  crucible_execute_func [pstring];
};

All zeroes -- works!

let strlen_test_spec = do {
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ (zero : [11][8]) }});

  crucible_execute_func [pstring];
};

robdockins · 2020-04-10T17:42:19Z

Yeah, after little more thinking about it I know why that happens; it's quite annoying. Basically, the information about the final array value being concrete doesn't get piped down to the symbolic simulator. Only fully concrete values get piped into the simulator, everything else is abstracted. It's something that's difficult to improve, but something I've wanted for some time.

Previously, `strlen` would recurse down a string until it ran into a concrete zero. This is fine, except that it would assert that each load along the way would succeed, without taking into account the previous zero tests that had to fail. To correct this we need to emulate a path condition such that to load at position `n` each of the `n-1` previous locations must have been non-zero, and the loads are only required to succeed under that condition. This allows one, for example, to assume that some value in a string is 0 (but not specify which one). Previously, `strlen` would index off the end of the allocation of such a string and fail. Now it will succeed because the path condition required to index out-of-bounds is inconsistent. Fixes #468

weaversa · 2020-04-10T17:50:30Z

I figured that, at the very least, using 'sat branches' is supposed to bridge this gap --- that is, the SAT solver is supposed to kick off during symbolic execution at conditional branches, taking preconditions into account to alleviate these exact kind of false-positive error conditions and symbolic termination conditions. Maybe the crucible 'strlen' built-in isn't talking to the SAT solver?

robdockins · 2020-04-10T17:54:59Z

Yes, the built-in strlen doesn't invoke a full symbolic branch at each zero test. That would be a lot of simulator bookkeeping for even moderately-sized strings.

Previously, `strlen` would recurse down a string until it ran into a concrete zero. This is fine, except that it would assert that each load along the way would succeed, without taking into account the previous zero tests that had to fail. To correct this we need to emulate a path condition such that to load at position `n` each of the `n-1` previous locations must have been non-zero, and the loads are only required to succeed under that condition. This allows one, for example, to assume that some value in a string is 0 (but not specify which one). Previously, `strlen` would index off the end of the allocation of such a string and fail. Now it will succeed because the path condition required to index out-of-bounds is inconsistent. Fixes #468

weaversa · 2020-05-20T14:16:13Z

Something happened that caused strlen to not work correctly anymore. It worked on the 13th, but not on the 19th. Example here: string_issue.tar.gz

$ ~/local/packages/saw-0.4.0.99-2020-05-13-OSX-64/bin/saw bitvector.saw
[14:12:19.703] Loading file "bitvector.saw"
[14:12:19.704] Loading file "llvm.saw"
[14:12:20.479] Verifying bitvector_t_fromHexString ...
[14:12:20.484] Simulating bitvector_t_fromHexString ...
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
i = 10
i = 11
i = 12
i = 13
i = 14
i = 15
i = 16
i = 17
[14:12:20.973] Symbolic simulation completed with side conditions.
[14:12:20.998] Checking proof obligations bitvector_t_fromHexString ...
[14:12:37.057] Proof succeeded! bitvector_t_fromHexString

versus

$ ~/local/packages/saw-0.4.0.99-2020-05-19-OSX-64/bin/saw bitvector.saw
[14:13:29.447] Loading file "bitvector.saw"
[14:13:29.448] Loading file "llvm.saw"
[14:13:30.221] Verifying bitvector_t_fromHexString ...
[14:13:30.225] Simulating bitvector_t_fromHexString ...
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
i = 10
i = 11
i = 12
i = 13
i = 14
i = 15
i = 16
i = 17
i = 18
i = 19
i = 20
i = 21
i = 22
i = 23
i = 24
i = 25
i = 26
i = 27
i = 28
i = 29
i = 30
i = 31
i = 32
i = 33
i = 34
i = 35
i = 36
i = 37
i = 38
i = 39
i = 40
i = 41
i = 42
i = 43
i = 44
i = 45
i = 46
i = 47
i = 48
i = 49
i = 50
i = 51
i = 52
i = 53
i = 54
i = 55
i = 56
i = 57
i = 58
i = 59
i = 60
i = 61
i = 62
i = 63
i = 64
i = 65
i = 66
i = 67
i = 68
i = 69
i = 70
i = 71
^C[14:13:31.915] user interrupt

robdockins · 2020-05-20T17:55:59Z

The crucible unit test for strlen is still working. However, I bisected SAW and it looks like the first commit with the regression is d679eabe89939d5f61e555ec3129d06181c15081. This is a submodule update commit, so I'll have to dig in and see what happened.

weaversa · 2020-05-20T20:30:57Z

A small example is attached. Works with 5/13, doesn't with 5/19.

strlen_iter.tar.gz

weaversa · 2020-05-20T20:32:26Z

It looks like 5/13 realizes when the string is accessed out of bounds and stops the symbolic execution. 5/19 is oblivious. 5/13 is also way faster at symbolic execution (but maybe that's another ticket).

robdockins · 2020-05-20T20:40:38Z

Further bisection reveals that the following What4 patch is the proximate cause of this regression: cce137e0247ef9ee970ae13d280dea1ec9449413

That patch changes the way the bitvector abstract domains are implemented. The fact that this causes a regression in this strlen case is deeply mysterious to me, especially given that the corresponding crux test case succeeds even after applying this patch.

I conclude that there is some nontrivial interaction between the SAW glue code, the What4 bitvector abstract domains, and the LLVM memory model that gives rise to this behavior.

robdockins · 2020-05-20T21:14:38Z

The minimized strlen_iter test case likewise breaks at the same commit.

robdockins · 2020-05-21T16:42:53Z

This issue here turns out to be a really interesting combination of effects.

First, the only reason this program terminated quickly before was that we are able to learn abstract domain interval information about the computed value of strlen. Because of the way the result of strlen was constructed, we learned that it must be in the (signed) interval from -1 to 16. (Why -1? Well, that is a bogus sentinel value returned for the case you walk of the end off the array; originally, I didn't think this value mattered, as it should only occur in dead code branches). This is eventually enough to show that the memory access in the body of the loop will definitely fail, which aborts the loop (subsequently we can prove the failed memory access can only occur in an inconsistent state, but this requires information only available on the SAW side, so the simulator doesn't notice it during symbolic simulation).

Now, with the change to the bitvector domain implementation, I made another arbitrary choice that I didn't think would matter: I started singleton values in the "bitwise" abstract domain mode. In retrospect, this is probably the wrong choice. When you union together a contiguous sequence of singleton values in the arithmetic mode, you get a nice dense interval; however, if you union together such a sequence in the bitwise mode, you get a quite less accurate approximation, especially if your interval crosses 0 (then it goes all the way to top). Thus: instead of a nice, small interval we forget all abstract information about the result of strlen. As a result, we cannot abort or exit the loop early.

So, there are several things to fix here, any of which makes this issue solved. First, the sentinel value for strlen should be 0 instead of -1 to avoid spurious values in our abstract domains and path satisfiability checks. Second, the default mode for bitevector abstract domains should be the arithmetic one to properly handle the (relatively common) case of short intervals generated by repeatedly adjoining singleton values. Third, we should improve the flow of information from SAW preconditions into the symbolic simulator so that both abstract domain information and path satisfiability checks are more useful.

robdockins · 2020-05-22T00:21:50Z

This should be fixed via #485

robdockins mentioned this issue Apr 10, 2020

Fix the semantics of strlen on symbolic strings. #469

Merged

robdockins added bug llvm labels Apr 16, 2020

robdockins closed this as completed in #469 Apr 20, 2020

weaversa mentioned this issue Apr 22, 2020

SAT branches not working? GaloisInc/saw-script#683

Closed

weaversa reopened this May 20, 2020

robdockins closed this as completed Jun 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

strlen #468

strlen #468

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented Apr 10, 2020

weaversa commented Apr 10, 2020

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented May 20, 2020

robdockins commented May 20, 2020

weaversa commented May 20, 2020

weaversa commented May 20, 2020

robdockins commented May 20, 2020

robdockins commented May 20, 2020

robdockins commented May 21, 2020

robdockins commented May 22, 2020

strlen #468

strlen #468

Comments

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented Apr 10, 2020

weaversa commented Apr 10, 2020

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented Apr 10, 2020

robdockins commented Apr 10, 2020

weaversa commented May 20, 2020

robdockins commented May 20, 2020

weaversa commented May 20, 2020

weaversa commented May 20, 2020

robdockins commented May 20, 2020

robdockins commented May 20, 2020

robdockins commented May 21, 2020

robdockins commented May 22, 2020