Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strlen #468

Closed
weaversa opened this issue Apr 10, 2020 · 16 comments · Fixed by #469
Closed

strlen #468

weaversa opened this issue Apr 10, 2020 · 16 comments · Fixed by #469

Comments

@weaversa
Copy link

I'm really not sure what is wrong here. Is something wrong with saw's notion of strlen? Or, am I doing something wrong? Any help would be appreciated.

strlen_test.c

#include <string.h>

size_t strlen_test (char *string) {
  size_t nLength = strlen(string);
  return nLength;
}

strlen_test.saw

let alloc_init ty v = do {
  p <- crucible_alloc ty;
  crucible_points_to p v;
  return p;
};

let ptr_to_fresh n ty = do {
  x <- crucible_fresh_var n ty;
  p <- alloc_init ty (crucible_term x);
  return (x, p);
};

let strlen_test_spec = do {
  (string, pstring) <- ptr_to_fresh "string" (llvm_array 10 (llvm_int 8));

  crucible_precond {{ string!0 == 0 }}; //NULL terminated string

  crucible_execute_func [pstring];
};

strlen_test_bc <- llvm_load_module "strlen_test.bc";

strlen_test_result <- crucible_llvm_verify strlen_test_bc "strlen_test" [] false strlen_test_spec z3;
$ clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$ clang -emit-llvm strlen_test.c -c -o strlen_test.bc -g
$ saw --version
0.4.0.99 (d87d0d73)
$ saw strlen_test.saw
[13:45:56.549] Loading file "strlen_test.saw"
[13:45:56.576] Verifying strlen_test ...
[13:45:56.578] Simulating strlen_test ...
[13:45:56.579] "crucible_llvm_verify" (strlen_test.saw:23:23-23:43):
Symbolic execution failed.
Abort due to false assumption:
  Error during memory load
  in strlen_test at strlen_test.c:4:20
  Details:
  No previous write to this location was found
    No matching writes to this memory were found.
Stack frame
  Allocations:
    StackAlloc 6 0x8:[64] Mutable 8-byte-aligned "strlen_test.c:4:10"
    StackAlloc 5 0x8:[64] Mutable 8-byte-aligned "strlen_test.c:3:27"
  Writes:
    Indexed chunk:
      5 |->
        *(5, 0x0:[64]) := (4, 0x0:[64])
Base memory
  Allocations:
    HeapAlloc 4 0xa:[64] Mutable 1-byte-aligned strlen_test.saw:2:8
    GlobalAlloc 3 0x0:[64] Immutable 1-byte-aligned strlen_test
    GlobalAlloc 2 0x0:[64] Immutable 1-byte-aligned strlen
    GlobalAlloc 1 0x0:[64] Immutable 1-byte-aligned llvm.dbg.declare
  Writes:
    Indexed chunk:
      4 |->
        *(4, 0x0:[64]) := [c@6:bv
                          ,c@7:bv
                          ,c@8:bv
                          ,c@9:bv
                          ,c@10:bv
                          ,c@11:bv
                          ,c@12:bv
                          ,c@13:bv
                          ,c@14:bv
                          ,c@15:bv]
@robdockins
Copy link
Contributor

Ah, this is an interesting one. The code for strlen walks down an allocation byte-by-byte, constructing a if/then/else tree searching for the first 0 byte. In this case, you've assumed that a particular byte is 0 rather than writing an explicit 0... this means the strlen loop doesn't immediately realize it is 0 and keeps going until it reads off the end of the allocation. This should be OK because a concrete run would only do that if all the previous bytes were 0... and by assumption at least one will be 0.

However, we haven't correctly piped the relevant assumptions into the path condition, so the simulator believes that this load must succeed regardless and emits an impossible proof obligation.

As a workaround, you can insert a concrete 0 at the end of the array instead of assuming the final byte is 0; I believe that should work. Meantime, I'll see about getting this bug fixed.

@robdockins
Copy link
Contributor

Here is a crux-llvm input demonstrating the same issue:

#include <stdlib.h>
#include <string.h>

#include "crucible.h"

#define MAX 10

char* mkstr() {
  char* x = malloc(MAX);
  for( int i=0; i<MAX; i++ ) {
    x[i] = crucible_int8_t( "x" );
  }

  //x[MAX-1] = 0;
  assuming( x[MAX-1] == 0 );

  return x;
}

int main() {
  char *str = mkstr(); 
  size_t sz = strlen(str);
  check( sz < MAX );
}

@weaversa
Copy link
Author

Changing the saw-script to this still gives the same error.

let strlen_test_spec = do {
  string <- crucible_fresh_var "string" (llvm_array 10 (llvm_int 8));
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ string # [0] }});

  crucible_execute_func [pstring];
};

@weaversa
Copy link
Author

@robdockins ...something sinister is happening here...

@weaversa
Copy link
Author

Prepending zeros -- still broken

let strlen_test_spec = do {
  string <- crucible_fresh_var "string" (llvm_array 1 (llvm_int 8));
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ (zero : [10][8]) # string }});

  crucible_execute_func [pstring];
};

All zeroes -- works!

let strlen_test_spec = do {
  pstring <- alloc_init (llvm_array 11 (llvm_int 8)) (crucible_term {{ (zero : [11][8]) }});

  crucible_execute_func [pstring];
};

@robdockins
Copy link
Contributor

Yeah, after little more thinking about it I know why that happens; it's quite annoying. Basically, the information about the final array value being concrete doesn't get piped down to the symbolic simulator. Only fully concrete values get piped into the simulator, everything else is abstracted. It's something that's difficult to improve, but something I've wanted for some time.

robdockins added a commit that referenced this issue Apr 10, 2020
Previously, `strlen` would recurse down a string until it
ran into a concrete zero.  This is fine, except that it would
assert that each load along the way would succeed, without taking
into account the previous zero tests that had to fail.  To correct
this we need to emulate a path condition such that to load
at position `n` each of the `n-1` previous locations must have
been non-zero, and the loads are only required to succeed under
that condition.  This allows one, for example, to assume that
some value in a string is 0 (but not specify which one).  Previously,
`strlen` would index off the end of the allocation of such a string
and fail.  Now it will succeed because the path condition required
to index out-of-bounds is inconsistent.

Fixes #468
@weaversa
Copy link
Author

I figured that, at the very least, using 'sat branches' is supposed to bridge this gap --- that is, the SAT solver is supposed to kick off during symbolic execution at conditional branches, taking preconditions into account to alleviate these exact kind of false-positive error conditions and symbolic termination conditions. Maybe the crucible 'strlen' built-in isn't talking to the SAT solver?

@robdockins
Copy link
Contributor

Yes, the built-in strlen doesn't invoke a full symbolic branch at each zero test. That would be a lot of simulator bookkeeping for even moderately-sized strings.

robdockins added a commit that referenced this issue Apr 20, 2020
Previously, `strlen` would recurse down a string until it
ran into a concrete zero.  This is fine, except that it would
assert that each load along the way would succeed, without taking
into account the previous zero tests that had to fail.  To correct
this we need to emulate a path condition such that to load
at position `n` each of the `n-1` previous locations must have
been non-zero, and the loads are only required to succeed under
that condition.  This allows one, for example, to assume that
some value in a string is 0 (but not specify which one).  Previously,
`strlen` would index off the end of the allocation of such a string
and fail.  Now it will succeed because the path condition required
to index out-of-bounds is inconsistent.

Fixes #468
@weaversa
Copy link
Author

Something happened that caused strlen to not work correctly anymore. It worked on the 13th, but not on the 19th. Example here: string_issue.tar.gz

$ ~/local/packages/saw-0.4.0.99-2020-05-13-OSX-64/bin/saw bitvector.saw
[14:12:19.703] Loading file "bitvector.saw"
[14:12:19.704] Loading file "llvm.saw"
[14:12:20.479] Verifying bitvector_t_fromHexString ...
[14:12:20.484] Simulating bitvector_t_fromHexString ...
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
i = 10
i = 11
i = 12
i = 13
i = 14
i = 15
i = 16
i = 17
[14:12:20.973] Symbolic simulation completed with side conditions.
[14:12:20.998] Checking proof obligations bitvector_t_fromHexString ...
[14:12:37.057] Proof succeeded! bitvector_t_fromHexString

versus

$ ~/local/packages/saw-0.4.0.99-2020-05-19-OSX-64/bin/saw bitvector.saw
[14:13:29.447] Loading file "bitvector.saw"
[14:13:29.448] Loading file "llvm.saw"
[14:13:30.221] Verifying bitvector_t_fromHexString ...
[14:13:30.225] Simulating bitvector_t_fromHexString ...
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
i = 10
i = 11
i = 12
i = 13
i = 14
i = 15
i = 16
i = 17
i = 18
i = 19
i = 20
i = 21
i = 22
i = 23
i = 24
i = 25
i = 26
i = 27
i = 28
i = 29
i = 30
i = 31
i = 32
i = 33
i = 34
i = 35
i = 36
i = 37
i = 38
i = 39
i = 40
i = 41
i = 42
i = 43
i = 44
i = 45
i = 46
i = 47
i = 48
i = 49
i = 50
i = 51
i = 52
i = 53
i = 54
i = 55
i = 56
i = 57
i = 58
i = 59
i = 60
i = 61
i = 62
i = 63
i = 64
i = 65
i = 66
i = 67
i = 68
i = 69
i = 70
i = 71
^C[14:13:31.915] user interrupt

@weaversa weaversa reopened this May 20, 2020
@robdockins
Copy link
Contributor

The crucible unit test for strlen is still working. However, I bisected SAW and it looks like the first commit with the regression is d679eabe89939d5f61e555ec3129d06181c15081. This is a submodule update commit, so I'll have to dig in and see what happened.

@weaversa
Copy link
Author

A small example is attached. Works with 5/13, doesn't with 5/19.

strlen_iter.tar.gz

@weaversa
Copy link
Author

It looks like 5/13 realizes when the string is accessed out of bounds and stops the symbolic execution. 5/19 is oblivious. 5/13 is also way faster at symbolic execution (but maybe that's another ticket).

@robdockins
Copy link
Contributor

Further bisection reveals that the following What4 patch is the proximate cause of this regression: cce137e0247ef9ee970ae13d280dea1ec9449413

That patch changes the way the bitvector abstract domains are implemented. The fact that this causes a regression in this strlen case is deeply mysterious to me, especially given that the corresponding crux test case succeeds even after applying this patch.

I conclude that there is some nontrivial interaction between the SAW glue code, the What4 bitvector abstract domains, and the LLVM memory model that gives rise to this behavior.

@robdockins
Copy link
Contributor

The minimized strlen_iter test case likewise breaks at the same commit.

@robdockins
Copy link
Contributor

This issue here turns out to be a really interesting combination of effects.

First, the only reason this program terminated quickly before was that we are able to learn abstract domain interval information about the computed value of strlen. Because of the way the result of strlen was constructed, we learned that it must be in the (signed) interval from -1 to 16. (Why -1? Well, that is a bogus sentinel value returned for the case you walk of the end off the array; originally, I didn't think this value mattered, as it should only occur in dead code branches). This is eventually enough to show that the memory access in the body of the loop will definitely fail, which aborts the loop (subsequently we can prove the failed memory access can only occur in an inconsistent state, but this requires information only available on the SAW side, so the simulator doesn't notice it during symbolic simulation).

Now, with the change to the bitvector domain implementation, I made another arbitrary choice that I didn't think would matter: I started singleton values in the "bitwise" abstract domain mode. In retrospect, this is probably the wrong choice. When you union together a contiguous sequence of singleton values in the arithmetic mode, you get a nice dense interval; however, if you union together such a sequence in the bitwise mode, you get a quite less accurate approximation, especially if your interval crosses 0 (then it goes all the way to top). Thus: instead of a nice, small interval we forget all abstract information about the result of strlen. As a result, we cannot abort or exit the loop early.

So, there are several things to fix here, any of which makes this issue solved. First, the sentinel value for strlen should be 0 instead of -1 to avoid spurious values in our abstract domains and path satisfiability checks. Second, the default mode for bitevector abstract domains should be the arithmetic one to properly handle the (relatively common) case of short intervals generated by repeatedly adjoining singleton values. Third, we should improve the flow of information from SAW preconditions into the symbolic simulator so that both abstract domain information and path satisfiability checks are more useful.

@robdockins
Copy link
Contributor

This should be fixed via #485

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants