Incorrect encoding of null character in buffer #394

brianc · 2010-11-02T16:07:58Z

In node v0.3.0 buffers are not encoding null character '\0' correctly

~$ nvm use v0.2.4
Now using node v0.2.4
~$ node
> Buffer('\0')
<Buffer 00>

~$ nvm use v0.3.0
Now using node v0.3.0
~$ node
> Buffer('\0')
<Buffer >

Totally breaks creation of null terminated strings within buffers

The text was updated successfully, but these errors were encountered:

Sannis · 2010-11-02T16:35:43Z

You should use Buffer('\0', 'binary') instead of Buffer('\0') with Node.js v0.3.x, because default encoding is 'utf8' now.

brianc · 2010-11-02T17:47:02Z

In v0.2.x this worked: Buffer('!\0', 'utf8') => <Buffer 33 00> and now same thing results in <Buffer 33>

mernst-github · 2010-12-27T15:04:47Z

I stumbled over this, too. It feels like an arbitrary workaround for a string terminator problem somewhere else. Buffer.byteLength('\u0000') actually does return 1 but new Buffer('\u0000') cuts off a trailing null byte: https://github.com/ry/node/blob/master/src/node_buffer.cc#L430

'binary' encoding works, but "this encoding method is depreciated and should be avoided in favor of Buffer objects where possible. This encoding will be removed in future versions of Node." (http://nodejs.org/docs/v0.3.2/api/buffers.html#buffers)

mattcg · 2011-06-10T01:47:39Z

This is still a problem. In v0.4.7, new Buffer('\0', 'ascii') results in <Buffer 20>.

bnoordhuis · 2011-06-12T21:38:33Z

It's a bug / feature in v8::String::WriteAscii() that translates nul bytes into spaces[1]. V8 has always done that but Node didn't trigger that code path in 0.2.x.

[1] https://github.com/v8/v8/blob/e3319f4/src/api.cc#L3615 (warning, big file)

v8::String::Utf8Write() appends '\0' if there is room left in the output buffer. Pass in the exact decoded length obtained with v8::String::Utf8Length() to make it stop doing that. Fixes nodejs#394.

Assert that the trailing '\0' is not swallowed in UTF-8 input to the Buffer constructor.

bnoordhuis · 2011-07-10T23:32:15Z

The problem is that V8 appends a nul byte if there is still room left in the output buffer. We can work around that by passing in the exact byte length as returned by v8::String::Utf8Length(). It's slower but correct (and not even that much slower). Ad-hoc benchmark:

buf = Buffer(64);
for (var i = 0; i < 10e6; ++i) {
  buf.write('ö日本語abcdefäëö', 0, 'utf8');
}

Current master:

$ time ./node tmp/utf8.js 
real    0m9.346s
user    0m9.350s
sys     0m0.050s

With 22fabd4 applied:

$ time ./node tmp/utf8.js 
real    0m11.213s
user    0m11.230s
sys     0m0.020s

An alternative take is to patch V8 to not append '\0' if requested. That might be necessary for #297 anyway (0x00 is converted to 0x20 in ASCII input) because there is no way right now to side-step that.

@ry Can you review?

koichik · 2011-07-11T01:18:17Z

with 6079f1a applied?

I have tested using long long string when I wrote my first patch (e50a751), it was twice slower if I called v8::String::Utf8Length() in my environment (v0.4/V8 3.1).

koichik · 2011-07-11T08:26:31Z

long string version:

var a = [];
for (var i = 0; i < 2000; ++i) {
  a.push('蛯原友里'); // 3bytes * 4
}
var s = a.join(''); 
var buf = new Buffer(24000);
for (i = 0; i < 10e4; ++i) {
  buf.write(s, 0, 'utf8');
}

current master:

real    0m9.739s
user    0m10.660s
sys     0m4.480s

with 22fabd4 applied:

real    0m12.709s
user    0m14.390s
sys     0m0.970s

with 6079f1a applied:

real    0m10.170s
user    0m10.580s
sys     0m5.110s

koichik · 2011-07-11T08:40:59Z

I can't wait HINT_NO_NULL_TERMINATOR!
http://code.google.com/p/v8/issues/detail?id=1537

bnoordhuis · 2011-07-11T11:00:14Z

I see a similar ~40% performance drop with Utf8Length() on long strings so that's out the window. But having a workaround for a workaround doesn't strike me as a great idea either, no offense to your fine work, koichik.

I suppose we should lobby the V8 guys to fix it but I suspect the patch from that issue you linked to won't pass muster.

koichik · 2011-07-11T12:09:47Z

Never mind, I agree Ben. My patch is workaround to avoid calling Utf8Lenght().
But many Web application especially using template engine such as EJS makes long string.
So we need this workaround I think.

May I tag "v8" on this issue? :-)

bnoordhuis · 2011-07-11T22:39:56Z

@koichik Go for it. :)

Refs #394. Fixes #1902.

bnoordhuis added a commit to bnoordhuis/node that referenced this issue Jul 10, 2011

Test cases for nodejs#394, graciously stolen from Koichi Kobayash.

6c46304

Assert that the trailing '\0' is not swallowed in UTF-8 input to the Buffer constructor.

koichik closed this as completed in 5208abe Jul 13, 2011

koichik mentioned this issue Oct 17, 2011

Use NO_NULL_TERMINATION flag for the conversion from String to bytes #1902

Closed

koichik added a commit that referenced this issue Oct 19, 2011

buffer: use NO_NULL_TERMINATION flag

6a72e52

Refs #394. Fixes #1902.

koichik mentioned this issue Jan 5, 2012

Buffer#write adds trailing 0x00 for utf-8 but not ascii #2446

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect encoding of null character in buffer #394

Incorrect encoding of null character in buffer #394

brianc commented Nov 2, 2010

Sannis commented Nov 2, 2010

brianc commented Nov 2, 2010

mernst-github commented Dec 27, 2010

mattcg commented Jun 10, 2011

bnoordhuis commented Jun 12, 2011

bnoordhuis commented Jul 10, 2011

koichik commented Jul 11, 2011

koichik commented Jul 11, 2011

koichik commented Jul 11, 2011

bnoordhuis commented Jul 11, 2011

koichik commented Jul 11, 2011

bnoordhuis commented Jul 11, 2011

Incorrect encoding of null character in buffer #394

Incorrect encoding of null character in buffer #394

Comments

brianc commented Nov 2, 2010

Sannis commented Nov 2, 2010

brianc commented Nov 2, 2010

mernst-github commented Dec 27, 2010

mattcg commented Jun 10, 2011

bnoordhuis commented Jun 12, 2011

bnoordhuis commented Jul 10, 2011

koichik commented Jul 11, 2011

koichik commented Jul 11, 2011

koichik commented Jul 11, 2011

bnoordhuis commented Jul 11, 2011

koichik commented Jul 11, 2011

bnoordhuis commented Jul 11, 2011