incorrect handling of NUL characters in strings #10958
Comments
Looks like
julia> length("\0\uff")
0
julia> length(UTF8String("\0w"))
0 |
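For comparison, here is a small check of the result the report presumably expects. It assumes a current Julia release, where `UTF8String` no longer exists and `String` holds UTF-8 data; it is an illustration, not output from the version discussed in this issue.

```julia
# Assumption: Julia 1.x, where String is UTF-8 encoded.
s = "\0\uff"

# NUL is an ordinary character here, so it is counted like any other.
println(length(s))      # number of characters: 2
println(ncodeunits(s))  # number of UTF-8 bytes: 3 (NUL is 1 byte, U+00FF is 2)
```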
I think we need to stop relying on the u8 support functions for various UTF-8 things. They assume that strings are NUL-terminated, which in Julia they are not. There may be other assumptions they make that are invalid. @JeffBezanson might know what those assumptions may be. |
We could use "modified UTF-8", like Java, and use the overlong encoding for the NUL codepoint if it occurs in the string. The problem with encoding NUL as the 0 byte is that any Julia code that calls external C code expecting NUL-terminated strings could silently produce unexpected/truncated results if an arbitrary Or we could just disallow NUL bytes in UTF8 or ASCII strings. Or we could allow NUL bytes, making sure all our own code is "NUL-clean" (i.e. pass an explicit length to any C string routines), and caveat emptor for other packages calling C string routines. |
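The truncation risk described above can be sketched in pure Julia. The helper `c_visible_length` is hypothetical (it is not part of Base or of `utf8.c`); it only imitates what a `strlen`-style routine would see, assuming a current Julia release with `codeunits`.

```julia
# Hypothetical helper: the length a NUL-terminated C routine would see,
# i.e. the number of bytes before the first 0x00 byte.
function c_visible_length(s::String)
    bytes = codeunits(s)
    i = findfirst(b -> b == 0x00, bytes)
    return i === nothing ? length(bytes) : i - 1
end

s = "ab\0cd"
println(sizeof(s))            # 5 bytes stored on the Julia side
println(c_visible_length(s))  # 2 bytes visible to a strlen-style scan
```

The mismatch between the two numbers is what would go unnoticed if such a string were handed to a C function without a check.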
Note that all of the The more I look at this, the more I think that allowing NUL characters in strings is a gaping security problem. If you want an arbitrary sequence of bytes, use a |
Similarly for all the Windows (e.g. |
See also RFC 3629, the discussion of invalid-byte-sequence handling in the Wikipedia UTF-8 article, and PEP 0383 on non-decodable bytes. It seems like a subtle issue to decide what to do with things like this, because it may be really hard to be sure that no invalid byte-sequences crop up. |
One possibility might be to have a special |
@JeffBezanson, you wrote a lot of the original UTF-8 code, right? What was the general philosophy on handling of embedded NUL characters, or of invalid UTF-8? Note that |
Please! Don’t make it so that you can’t have \0 characters in Julia strings! Also, I can bend the UTF-8 rules a bit to accept a single technically invalid sequence, but my intention was to never produce any invalid UTF-8, UTF-16, or UTF-32 strings. Also, ASCIIStrings also can have embedded \0 bytes in them... are you going to make that invalid as well? |
@ScottPJones, the point is that you think you are passing What is the use-case for strings with embedded NUL bytes? (Most POSIX and ISO C functions don't allow them.) Note that we are talking about strings, not |
Delimited strings... people have been using them for many, many, many years. I always avoided using any functions that relied on \0 termination... it kills performance compared to having fixed width characters and a length. |
About security risks: creating overlong UTF-8 sequences causes a security problem. |
I’d also thought of having a Cstring / Cwstring type that did the checks, only for use when you really needed it (which is less and less an issue, since more APIs now use length/pointer instead). |
I might call it CZstring, and CZwstring... to make it clear that these are JUST for the case where nul termination is important... not all C strings are \0 terminated anyway (just literals, and ones you are making to pass to particular APIs that still expect them) |
@ScottPJones, it's totally not true that all modern APIs (defined as APIs in current widespread use, or even defined as APIs of libraries created in the last 10-15 years) avoid NUL-terminated strings, for better or for worse. In both Julia base and in the packages you can find lots of examples calling external C libraries which pass NUL-terminated strings. Essentially all of these need to have checks if we continue to allow strings with embedded NUL. A Whether we should accept (and silently convert) modified UTF-8 to standard UTF-8 is a separate issue; I tend to agree, but let's keep that out of this discussion. After reading the RFCs, I agree that we shouldn't produce the overlong NUL encoding ourselves. Using NUL as a delimiter inside of a string is cute, but is it really that useful? Is there any popular library that returns strings in this format, for example? |
My intent was to support NUL characters inside julia strings. If you pass such a string to a C function that accepts NUL-terminated strings, that's a separate problem we can't do anything about. I definitely don't want to use "modified UTF-8". Not all the functions in utf8.c assume nul termination. Several accept lengths or have non-nul-terminated variants. For length we should just call |
If we pass such a string to a C function that accepts NUL-terminated strings (and we potentially do that, in many places), that is our problem and we do need a check. Some of the functions in |
I think the way forward for passing strings to C APIs that expect NUL-terminated strings is to ensure that the string is properly NUL-terminated and that it has no embedded NULs either – embedded NULs should be an error. This guarantees that the Julia and C sides agree on what the string is. This can be accomplished either with a type on the Julia side that guarantees this property upon construction (potentially sharing data with the original Julia string when possible), or with a function that the user is expected to call on the string. |
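A minimal sketch of the "function the user is expected to call" variant of this proposal. The name `require_nul_free` is made up for illustration and is not an existing Base function; the sketch assumes a current Julia release and only covers the embedded-NUL check, not the trailing terminator.

```julia
# Hypothetical helper: refuse strings that a NUL-terminated C API
# would interpret differently from Julia.
function require_nul_free(s::String)
    if any(b -> b == 0x00, codeunits(s))
        throw(ArgumentError("embedded NUL in string: $(repr(s))"))
    end
    return s
end

ok = require_nul_free("hello")   # passes through unchanged

# A string with an embedded NUL is rejected instead of being truncated.
rejected = try
    require_nul_free("hel\0lo")
    false
catch err
    err isa ArgumentError
end
println(rejected)
```

Raising an error rather than truncating keeps the failure visible at the call site, which is the agreement between the Julia and C sides that the comment asks for.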
@stevengj I didn't say that all modern APIs do, just that they tend that way now... Also, as a C programmer for 35 years, I disagree with saying that in common parlance string means NUL-terminated... maybe it depends on what particular community of C (and not C++) programmers you are in. |
@StefanKarpinski, adding yet another set of string types would be a huge hassle for everyone. We could add a |
Leaving aside NULs for a moment, I don't think it's sane to expect every operation on invalid utf-8 data to have some defined behavior. Currently our |
Examples like this one: |
@StefanKarpinski Yes, a new Cstring (and Cwstring) type with their own constructors would do the trick... I'd also say that the normal String types should actually not require a terminating \0. |
@ScottPJones, there is a difference between a new string type and a new type. I think I'm skeptical that there is a significant performance cost to the current approach for UTF-8 nul-termination, in which all |
I suspect that the current approach of sneaking a NUL byte at the end of every UInt8 vector is not a performance issue. It is, however, annoying for other reasons and should probably, IMO, go away. C APIs that take NUL-terminated strings tend not to be used for very long strings, so I think that checking that the passed string has no embedded NULs and ensuring the trailing NUL (possibly by creating a copy with the NUL appended) will not be a big performance issue either. Having a Cstring type that's just a pointer with special conversion behavior would be ok, but do we also use it for NUL-terminated UTF-16 and UTF-32 strings? Those are not what one would classically consider "C strings". Do NUL-terminated UTF-16 and UTF-32 strings require only a single trailing NUL byte? Or do they require an entire NUL code unit? |
@JeffBezanson, I agree, but I just want to make sure Julia never crashes on invalid UTF-8 data, even if it gives junk results. |
Yes, I'd like to contribute to the core of Julia there, and also with decimal floating point support, improved ODBC support, and maybe an Aerospike package... |
@ScottPJones a sidenote: as a part of what has turned out to be a syntax decision with hilarious consequences, please try to always code-quote the names of macros with backticks. Otherwise, you create a GitHub notification to a (sometimes-irritable) user with the same name as the macro. |
@pao Oops!!! Thanks :-) |
np, took us a while to figure that out 😄 |
fix #10958: buggy handling of embedded NUL chars
This gets a BoundsError: attempt to access 1-element Array{Char,1}
in convert at utf32.jl:37