add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall #10994

stevengj · 2015-04-24T18:47:06Z

This adds two new types, Cstring and Cwstring, which act just like Ptr{UInt8} and Ptr{Cwchar_t} as arguments to ccall, except that they throw an error if the string contains an embedded NUL character. I went through the Julia code and used these types wherever they seemed necessary.

This is necessary for safe passing of strings to C routines expecting NUL-terminated strings, to avoid silent truncation. See #10958, #10991.

Note that you can avoid the test, for strings known not to contain NUL, simply by using Ptr{UInt8} as before.

Note also that I changed several of our own C API routines to accept a char* and a size_t length parameter, rather than assuming NUL termination.

Still missing:

Regression tests
Documentation
Check for NUL in Cmd strings (command-line params)
Change Cstring to a bitstype

cc: @JeffBezanson, @StefanKarpinski, @ScottPJones, @jiahao

StefanKarpinski · 2015-04-24T19:03:41Z

base/c.jl

+# C NUL-terminated string pointers; these can be used in ccall
+# instead of Ptr{UInt8} and Ptr{Cwchar_t}, respectively, to enforce
+# a check for embedded NUL chars in the string (to avoid silent truncation).
+immutable Cstring; p::Ptr{UInt8}; end


Should this be Cchar although I'm not familiar with any systems where Cchar != UInt8 and we're likely to be broken for many reasons on any such system?

Eh, that's the least convincing argument ever, pls ignore.

sizeof(char) is always one.

that was an amusing read.

we have had some signedness-of-char issues with unicode on XP in the past

stevengj · 2015-04-24T22:08:00Z

Along the way, I noticed (and fixed + test) an unrelated bug where parse and tryparse were using length(s) instead of endof(s), and hence were giving incorrect answers for Unicode whitespace chars. Also I improved BigInt parsing so that it no longer makes a copy of the string when parsing a whole bytestring.

ScottPJones · 2015-04-24T23:01:33Z

Yes, it should be Ptr{Cchar}, not Ptr{UInt8}, if you really want to be correct...
In most C implementations, by default char is signed, although on IBM systems the default is unsigned (because they needed to handle EBCDIC, which is 8-bit, not 7-bit like ASCII).
I remember how horrible it was back in 1989 when I first ported the product of the company I was consulting for (ended up consulting there for 29 years) to AIX... I had been careful in my changes to not assume anything about char, but other developers had not, and there was probably 20-30MB of C code then... (it was around 56MB as of the beginning of this year). What was worse, the IBM compiler had no flag at the time to switch char from unsigned to signed, and C89 (that added things like void, signed, volatile) was not available yet (and it took several years before we had it on all platforms we supported)

ScottPJones · 2015-04-24T23:02:55Z

This is great, @stevengj !

stevengj · 2015-04-24T23:18:34Z

Ptr{Cchar} vs. Ptr{Uint8} doesn't make a difference, regardless of whether char is signed; the same pointer address gets passed regardless.

stevengj · 2015-04-24T23:29:41Z

@tkelman, any idea why I would be getting an undefined-var error on Windows for _getenvlen? Cwstring is defined in c.jl, which should have been loaded by this point.

jl_unprotect_stack at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
jl_throw at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
jl_undefined_var_error at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
_getenvlen at env.jl:27
get at env.jl:49
tty_size at env.jl:195
term at markdown/render/terminal/render.jl:15

ScottPJones · 2015-04-24T23:52:02Z

@StevenG There isn't anything in Julia that would look at whether Cchar is signed or unsigned on the platform, when dereferencing a Ptr{Cchar} from C? (I know, my newbieness with Julia is showing! 😀)

stevengj · 2015-04-25T00:01:29Z

@ScottPJones, we are just talking about the argument type declared in ccall; at the binary level, all that is being passed is a memory address, no type information. It is equivalent in C to linking to a function compiled as void foo(char *c), but calling it from a prototype mis-declared as void foo(unsigned char *c). If we call foo((unsigned char*) a) with a char *a array, the same pointer to the same data is passed to the underlying foo routine regardless of whether char is signed — the ABI doesn't care what the pointer type declaration is, and nothing we do in the ccall affects the underlying foo machine code.

Not that it would hurt to change the declarations to Ptr{Cchar}, of course. We currently aren't very consistent, but I didn't have the patience to go through and change everything. I'll go ahead and change it in Cstring itself for tidiness' sake.

tkelman · 2015-04-25T00:06:32Z

base/string.jl

+    if findfirst(s.data, 0) != length(s.data)
+        throw(ArgumentError("embedded NUL chars are not allowed in C strings"))
+    end
+    return Cwstring(unsafe_convert(Ptr{wchar_t}, s))


return Cwstring(unsafe_convert(Ptr{Cwchar_t}, s)) ^

vtjnash · 2015-04-25T02:23:25Z

base/c.jl

+# instead of Ptr{Cchar} and Ptr{Cwchar_t}, respectively, to enforce
+# a check for embedded NUL chars in the string (to avoid silent truncation).
+immutable Cstring; p::Ptr{Cchar}; end
+immutable Cwstring; p::Ptr{Cwchar_t}; end


this isn't consistent with some platform ABIs, unfortunately, and will break ccall on some platforms. it must be declared bitstype WORD_SIZE Cstring. i realize that creating new bits types is quite opaque currently and should be better documented. one extremely limited example is

julia/test/core.jl

Lines 2255 to 2257 in 54d8cc7

bitstype 24 Int24

Int24(x::Int) = Intrinsics.box(Int24,Intrinsics.trunc_int(Int24,Intrinsics.unbox(Int,x)))

Int(x::Int24) = Intrinsics.box(Int,Intrinsics.zext_int(Int,Intrinsics.unbox(Int24,x)))

pointer.jl is a more complex example (https://github.com/JuliaLang/julia/blob/master/base/pointer.jl)

(fwiw, and iirc, the only place it will actually matter (and thus segfault) is if this is used as a return type on x86 linux)

Good to know. Unfortunately, WORD_SIZE is defined in int.jl, but I need it in c.jl which is included before int.jl in sysimg.jl. Any suggestions? I guess I can ~~use Ptr.size*8~~ move the WORD_SIZE definition to base.jl.

No, I'm getting ErrorException("invalid number of bits in type Cstring") ... does the 2nd argument to bitstype have to be a numeric literal?

It won't let me define a bitstype inside an if WORD_SIZE === 64 statement either; it says type definition not allowed inside a local scope, which is confusing since normally if does not start a new scope. (Hmm, later on it seems to work; I'm not sure what I was doing wrong.)

I am getting pretty frustrated with bitstype.

Okay, I have it working as a bitstype if I hard-code the size as 64. I'm trying to figure out how to use WORD_SIZE, but it is tricky because a lot of things aren't defined yet at this point in the bootstrap process.

For example,

if Ptr.size === 8 bitstype 64 Cstring bitstype 64 Cwstring else bitstype 32 Cstring bitstype 32 Cwstring end

doesn't work for some reason (I get a box: argument is of incorrect size error later in the build process).

... aha, the problem is that .size is an Int32.

And

if box(Int, unbox(Int32, Int.size)) === 8 bitstype 64 Cstring bitstype 64 Cwstring else bitstype 32 Cstring bitstype 32 Cwstring end

is giving the mysterious type definition not allowed inside a local scope error.

Okay,

const sizeof_Int = box(UInt32, unbox(Int32, Int.size)) if sizeof_Int === 0x00000008 # need === since == doesn't exist yet bitstype 64 Cstring bitstype 64 Cwstring else bitstype 32 Cstring bitstype 32 Cwstring end

seems to work, though why I need the const declaration to avoid type definition not allowed inside a local scope is mysterious to me.

vtjnash · 2015-04-25T02:32:41Z

with this commit, i think you can also get rid of the following check in SubString-to-Ptr conversion which is equivalent to this test for an embedded null:

julia/base/string.jl

Lines 650 to 652 in 54d8cc7

    
           if s.offset+s.endof < endof(s.string) 
        
               throw(ArgumentError("a SubString must coincide with the end of the original string to be convertible to pointer")) 
        
           end

stevengj · 2015-04-25T02:48:34Z

@vtjnash, no, that test is not equivalent. The point of that test is that if you have a substring whose end does not coincide with the end of the string, then the substring is not NUL-terminated. This is true regardless of whether there are embedded NULs anywhere.

Or maybe your point is that people will stop using Ptr{UInt8} for things that are supposed to be NUL-terminated, in which case that unsafe_convert logic can be moved to a Cstring conversion? It still can't be deleted, in any case. Unfortunately, for the foreseeable future, lots of code will continue to use Ptr{UInt8} even when they should be using Cstring.

vtjnash · 2015-04-25T03:45:51Z

it's equivalent in the sense that it is a verification of whether the SubString can be used as a c-style null-terminated string.

but yes, i don't see any way of making this transition, since it is leaving the existing common case unsafe, while adding an alternative safe method.

vtjnash · 2015-04-25T04:33:28Z

base/c.jl

+# instead of Ptr{Cchar} and Ptr{Cwchar_t}, respectively, to enforce
+# a check for embedded NUL chars in the string (to avoid silent truncation).
+const sizeof_Int = box(UInt32, unbox(Int32, Int.size))
+if sizeof_Int === 0x00000008 # need === since == doesn't exist yet


wow, this is low level bit twiddling. the way boot.jl does this is probably cleaner (if is(Int,Int64) ... end)

Duh, yes, that would be cleaner...

stevengj · 2015-04-25T04:45:10Z

We might as well delete the unsafe_convert for SubString, since it tries to guarantee that it is NUL-terminated but actually is wrong because it doesn't check for embedded NUL. It's not clear whether this case (a substring that coincides with the end of a string) is worth optimizing anyway.

tkelman · 2015-04-25T08:15:46Z

I can reproduce the file test giving rmdir: Directory not empty on a local Windows build of this branch.

stevengj · 2015-04-25T13:08:10Z

Thanks @tkelman, what is the file that is in the directory?

… to ccall (see JuliaLang#10958, JuliaLang#10991)

StefanKarpinski · 2015-04-28T19:52:32Z

Does this work safely for non-ByteString strings? I.e. suppose we write

ccall(:func_that_takes_a_cstring, Cint, (Cstring,), str)

but str is some kind of abstract string type that doesn't actually store a contiguous bunch of bytes that Cstring can point at. You can materialize that contiguous bunch of bytes in the conversion process, but how is it rooted? Does cconvert take care of that somehow?

vtjnash · 2015-04-28T20:11:34Z

yes, in v0.4 cconvert was introduced largely for the purpose of introducing gcroots for arbitrary conversions

StefanKarpinski · 2015-04-28T20:15:36Z

That's what I thought, just wanted to make sure. In that case I'm cool with this. @JeffBezanson's opinion would be good though. I have a feeling he might have one (he's at LL today, so it may take a bit).

tkelman · 2015-04-30T17:06:59Z

This isn't merged onto master yet and I want to tag 0.3.8 today, so the sendfile bugfix can probably wait until 0.3.9.

stevengj · 2015-04-30T20:54:08Z

@tkelman, note that there are two bugfix patches in this PR: one to sendfile, and one to tryparse.

tkelman · 2015-04-30T20:58:26Z

Thanks, noted, but tryparse doesn't exist on release-0.3, does it?

stevengj · 2015-05-01T13:41:29Z

Oh, right.

vtjnash · 2015-05-01T15:56:32Z

fwiw, this seems like a good way to help avoid zero-day exploits such as described in http://lucumr.pocoo.org/2010/12/24/common-mistakes-as-web-developer/

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall

stevengj · 2015-05-02T18:25:58Z

@JeffBezanson said it was okay to merge.

tkelman · 2015-05-04T01:12:54Z

I'm not sure exactly why, but this seems to cause an MSVC build to freeze while starting the second stage of bootstrap - no exports.jl appears, julia takes an entire core but an unchanging amount of memory (117,836 K to be specific, that might change from run to run though).

(cherry picked from commit 4ff969f) ref PR #10994 Conflicts: base/fs.jl use error instead of ArgumentError for release-0.3 to avoid changing the thrown exception type here

tkelman · 2015-05-04T05:28:42Z

backported the sendfile bugfix in f5e0074

tkelman · 2015-05-12T02:51:23Z

Whatever my bootstrapping problem with an msvc build was seems to have gone away. Now just hitting the segfault from 8b8b261...

stevengj added the unicode Related to unicode characters and encodings label Apr 24, 2015

StefanKarpinski reviewed Apr 24, 2015
View reviewed changes

stevengj force-pushed the nullsafe2 branch 2 times, most recently from ade3444 to df1aa34 Compare April 24, 2015 22:06

tkelman reviewed Apr 25, 2015
View reviewed changes

stevengj force-pushed the nullsafe2 branch 4 times, most recently from 7841275 to 71ac26c Compare April 25, 2015 01:36

vtjnash reviewed Apr 25, 2015
View reviewed changes

stevengj force-pushed the nullsafe2 branch from 71ac26c to 71ab84a Compare April 25, 2015 03:32

stevengj force-pushed the nullsafe2 branch from 71ab84a to 6f60b3c Compare April 25, 2015 04:31

vtjnash reviewed Apr 25, 2015
View reviewed changes

stevengj force-pushed the nullsafe2 branch 2 times, most recently from 874e63c to f53d861 Compare April 25, 2015 04:48

tkelman mentioned this pull request Apr 25, 2015

AppVeyor build is unhappy #11001

Closed

stevengj added 3 commits April 28, 2015 11:30

bugfix: incorrect use of length rather than endof in tryparse

74c2f72

bugfix: sendfile did not close files on exception

4ff969f

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

896b1a6

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj force-pushed the nullsafe2 branch from 2384f23 to 896b1a6 Compare April 28, 2015 15:32

vtjnash mentioned this pull request May 1, 2015

Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004

Closed

stevengj added a commit that referenced this pull request May 2, 2015

Merge pull request #10994 from stevengj/nullsafe2

803193e

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall

stevengj merged commit 803193e into JuliaLang:master May 2, 2015

tkelman removed the backport pending label May 4, 2015

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request May 4, 2015

added Cstring (JuliaLang/julia#10994)

923613c

stevengj mentioned this pull request May 4, 2015

add Cstring/Cwstring JuliaLang/Compat.jl#83

Merged

stevengj deleted the nullsafe2 branch May 4, 2015 14:24

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request May 5, 2015

added Cstring (JuliaLang/julia#10994)

dea6750

rsrock mentioned this pull request May 11, 2015

OSX: convert error when reading png file JuliaImages/Images.jl#302

Closed

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request May 12, 2015

added Cstring (JuliaLang/julia#10994)

1d52f5f

stevengj mentioned this pull request May 21, 2015

Package system on libgit2 #11196

Merged

2 tasks

yuyichao mentioned this pull request May 22, 2015

Do not require new line or ";" at the end of import expression #11333

Merged

tshort mentioned this pull request Mar 1, 2022

Null termination issues in StaticString brenhinkeller/StaticTools.jl#2

Closed

	bitstype 24 Int24
	Int24(x::Int) = Intrinsics.box(Int24,Intrinsics.trunc_int(Int24,Intrinsics.unbox(Int,x)))
	Int(x::Int24) = Intrinsics.box(Int,Intrinsics.zext_int(Int,Intrinsics.unbox(Int24,x)))

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall #10994

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall #10994

Conversation

stevengj commented Apr 24, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevengj commented Apr 24, 2015

ScottPJones commented Apr 24, 2015

ScottPJones commented Apr 24, 2015

stevengj commented Apr 24, 2015

stevengj commented Apr 24, 2015

ScottPJones commented Apr 24, 2015

stevengj commented Apr 25, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vtjnash commented Apr 25, 2015

stevengj commented Apr 25, 2015

vtjnash commented Apr 25, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevengj commented Apr 25, 2015

tkelman commented Apr 25, 2015

stevengj commented Apr 25, 2015

StefanKarpinski commented Apr 28, 2015

vtjnash commented Apr 28, 2015

StefanKarpinski commented Apr 28, 2015

tkelman commented Apr 30, 2015

stevengj commented Apr 30, 2015

tkelman commented Apr 30, 2015

stevengj commented May 1, 2015

vtjnash commented May 1, 2015

stevengj commented May 2, 2015

tkelman commented May 4, 2015

tkelman commented May 4, 2015

tkelman commented May 12, 2015