
Crash deserializing IPC message using advanced serialization #34797

Closed

novemberborn opened this issue Aug 16, 2020 · 20 comments

Labels
child_process Issues and PRs related to the child_process subsystem. confirmed-bug Issues with confirmed bugs. v8 module Issues and PRs related to the "v8" subsystem.

@novemberborn

  • Version: 14.8.0
  • Platform: MacOS
  • Subsystem: child_process

What steps will reproduce the bug?

I start a child process using fork() and { serialization: 'advanced' }. In the child process I synchronously call process.send() until it returns false, and I keep process.channel referenced until all of my send() callbacks have been called. Every so often, the main process crashes. Other times it exits gracefully, though it does not log all received messages, presumably because the child process exits without flushing its IPC buffer. That's manageable and probably not a bug.

main.js:

const childProcess = require('child_process')

const channel = childProcess.fork('./child.js', [], { serialization: 'advanced' })
channel.on('message', message => {
  console.error('main:', message.count)
})

child.js:

// Keep the process alive until all messages have been sent.
process.channel.ref()

let pending = 0
const drain = () => {
  if (--pending === 0) {
    console.error('all messages sent')
    if (!process.connected) {
      console.error('already disconnected')
    } else {
      console.error('unref channel')
      // Allow the process to exit.
      process.channel.unref()
    }
  }
}

// Fill up any internal buffers.
const filler = Buffer.alloc(2 ** 12, 1)

// Enqueue as many messages as possible until we're told to back off.
let count = 0
let ok
do {
  pending++
  ok = process.send({count: ++count, filler}, drain)
  console.error('child:', count, ok)
} while (ok)

And then run node main.js.

How often does it reproduce? Is there a required condition?

It's intermittent.

What is the expected behavior?

The main process does not crash.

What do you see instead?

The main process crashes with the following error:

internal/child_process/serialization.js:69
      deserializer.readHeader();
                   ^

Error: Unable to deserialize cloned data due to invalid or unsupported version.
    at parseChannelMessages (internal/child_process/serialization.js:69:20)
    at parseChannelMessages.next (<anonymous>)
    at Pipe.channel.onread (internal/child_process.js:595:18)

Additional information

If I modify child.js to schedule sends using setImmediate(), there is no crash and the main process receives all messages:

let count = 0
do {
  pending++
  setImmediate(() => {
    process.send({count: ++count, filler}, drain)
    console.error('child:', count)
  })
} while (pending < 100)
@targos
Member

targos commented Aug 16, 2020

@nodejs/workers

@targos targos added v8 module Issues and PRs related to the "v8" subsystem. child_process Issues and PRs related to the child_process subsystem. labels Aug 16, 2020
novemberborn added a commit to avajs/ava that referenced this issue Aug 16, 2020
* Use advanced serialization (when available) for test worker communication

Use advanced IPC on 12.17.0. This is currently our future minimal Node.js 12 version.

* Buffer IPC sends to avoid crashes

Workaround for nodejs/node#34797

* Tweak reporter integration tests

Maybe it's the IPC changes, but they started to fail on Windows.
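For illustration only, here is a minimal sketch of one way the "Buffer IPC sends to avoid crashes" workaround above could look: queue process.send() calls and only start the next send once the previous send's callback has fired. The helper name bufferedSend and the queueing scheme are my own and are not taken from the actual ava change.

// Hypothetical helper: serialize process.send() calls so that only one
// message is in flight at a time.
const sendQueue = []
let sending = false

function bufferedSend(message) {
  return new Promise((resolve, reject) => {
    sendQueue.push({ message, resolve, reject })
    drainQueue()
  })
}

function drainQueue() {
  if (sending || sendQueue.length === 0) return
  sending = true
  const { message, resolve, reject } = sendQueue.shift()
  process.send(message, error => {
    sending = false
    if (error) reject(error)
    else resolve()
    drainQueue() // start the next queued send, if any
  })
}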
@addaleax
Member

Fwiw, I can’t seem to reproduce this locally. (And it’s more of a case for @nodejs/child_process, if we go by teams.)

@targos
Member

targos commented Aug 17, 2020

And it’s more of a case for @nodejs/child_process, if we go by teams.

Yeah, sorry. I initially thought it was maybe a general issue with our implementation of the serialization API and AFAIK there's no team for the v8 module.

Now that I'm looking again at the issue, I wonder if it may be related to the use of different incompatible Node.js/V8 versions between parent and child processes.

@addaleax
Member

Now that I'm looking again at the issue, I wonder if it may be related to the use of different incompatible Node.js/V8 versions between parent and child processes.

I think that would lead to more consistent failures? My best guess would be that we somehow mess up the message boundaries and start parsing in the middle of another message, but I’m not really sure how that would happen…

@novemberborn
Author

I'm using fork() so it should select the same version. It feels like a data corruption issue to me.

@novemberborn
Author

When I was digging into the JS code that drives this, I saw a bunch of array buffers, so maybe copying the data off the channel is corrupting a shared buffer? (I'm just thinking out loud; I'm not really sure what the code is doing.)

@addaleax
Member

@novemberborn There aren’t any SharedArrayBuffer instances involved, if that’s what you’re referring to.

The format for messages is relatively simple: It’s 4 bytes that contain the length of the message, plus the rest of the message. We use that first field to determine the message boundaries, so if that calculation goes wrong at some point, that might be the cause of a crash like this (the code does look okay to me, though). Another possibility would be that data is indeed corrupted during transfer, but I’m not sure how that would happen either.

If you’re up for building Node.js to debug this, you could probably print the contents of the Buffers on the sending and the receiving side to compare them (or at least that’s what I would do if I could reproduce this).
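To make the 4-byte length prefix described above concrete, here is a rough sketch of how such framing can be parsed. It is illustrative only and not the actual internal code (the byte order of the prefix is chosen arbitrarily here); if the length field or the offsets ever go wrong, the parser starts reading in the middle of a message, which would produce exactly the kind of deserializer error shown above.

// Illustrative only: split an incoming byte stream into length-prefixed
// messages. Each message is assumed to be [4-byte length][payload].
function makeMessageParser(onMessage) {
  let buffered = Buffer.alloc(0)
  return function onData(chunk) {
    buffered = Buffer.concat([buffered, chunk])
    while (buffered.length >= 4) {
      const size = buffered.readUInt32BE(0)
      if (buffered.length < 4 + size) break // wait for the rest of the message
      onMessage(buffered.subarray(4, 4 + size))
      buffered = buffered.subarray(4 + size) // continue with the next message
    }
  }
}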

@novemberborn
Author

@addaleax what OS are you using? #30009 also hints at a macOS issue, though that example isn't respecting backpressure.

@addaleax
Member

I’m on x64 Linux, so it’s definitely possible that this is platform-specific, yes.

@lpinca
Member

lpinca commented Aug 17, 2020

I can reproduce the described issue on macOS 10.15.6.

@lpinca
Member

lpinca commented Aug 17, 2020

FWIW it seems to be fixed on master.

@lundibundi
Member

lundibundi commented Aug 17, 2020

@addaleax @lpinca I also thought it was fixed on master, but then it randomly crashed.
It looks like a very intermittent bug: I ran it about 1000 times without a crash, and then it crashed after roughly 20 more runs in a new session.
I'm on Linux x64 as well.

Edit: got it to crash on current master after like 3000 runs =)
for i in {1..1000}; do if ! ./node test-1-worker.js; then break; fi; done

@lpinca
Member

lpinca commented Aug 17, 2020

Hmm no crash on master with 10000 runs. I guess this will be hard to debug.

@novemberborn
Author

Increasing the filler size might make it appear more quickly.

futpib added a commit to futpib/tglib-execa that referenced this issue Sep 21, 2020
@d0b3

d0b3 commented May 15, 2021

This makes v16.1.0 reliably crash on my Linux machines (x64 and rpi/armv7). It works on Windows, though.

const assert = require('assert'),
      cluster = require('cluster');

const SIZE = 7 * 1024 * 1024;

if (cluster.isPrimary) {

  cluster.setupPrimary({ serialization: 'advanced' });
  
  const worker = cluster.fork();

  worker.send(new Uint8Array(SIZE).fill(0x00));
  worker.send(new Uint8Array(SIZE).fill(0x01));

} else {

  process.on('message', data => {
    console.log(Buffer.from(data));
    assert(data.length === SIZE, 'invalid size');
    for (let i = 1; i < SIZE; i++) assert(data[0] === data[i], 'invalid byte at offset '+i);
  });

}

Either the worker receives the first message but it's corrupt: the beginning and the size are correct, but after a couple of bytes the second message starts (including its size header and serialization header).

AssertionError [ERR_ASSERTION]: invalid byte at offset 219252
    at process.<anonymous> (/home/dobe/Work/nodejs-adv-ser-fix/easy-crash.js:21:36)
    at process.emit (node:events:377:35)
    at emit (node:internal/child_process:928:12)
    at processTicksAndRejections (node:internal/process/task_queues:84:21)

The invalid byte offset on the Raspberry Pi is 183027, though it looks more or less random.

Or no message gets received at all (but no infinite loop).

I tried digging into it. The arguments to req.oncomplete are currently being ignored here:

req.oncomplete = () => {
  control.unrefCounted();
  if (typeof callback === 'function')
    callback(null);
};

If I log them instead:

req.oncomplete = (...args) => {
  console.trace('req.oncomplete', args);
  control.unrefCounted();
  if (typeof callback === 'function')
    callback(null);
};

Node delivers an error code:

Trace: req.oncomplete [
  -14,
  Pipe {
    buffering: false,
    pendingHandle: null,
    sockets: { got: {}, send: {} },
    [Symbol(kMessageBuffer)]: <Buffer >
  },
  undefined
]
    at WriteWrap.req.oncomplete (node:internal/child_process:842:19)

Which I guess means [ 'EFAULT', 'bad address in system call argument' ].
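That mapping can be double-checked from Node itself: util.getSystemErrorName() translates a negative system error number into its symbolic name (a quick sketch; the result is platform-dependent):

const util = require('util')

// Translate the libuv/system error number seen in the trace above.
console.log(util.getSystemErrorName(-14)) // 'EFAULT' on Linux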

Some observations I made:

  • After analyzing the raw chunks the worker receives, it looks like the first message cuts off at the point where req.oncomplete is called with the error. The following chunks belong to the second message.
  • When waiting for .send() to complete with a callback before the next .send() call, I still regularly get the -14 errors.
  • When I tried to send equally large utf8-strings without advanced serialization, I had no problems.

I'm sort of lost with the system call stuff and Node, but I think once the error occurs the channel is unusable, so handling it in req.oncomplete is not possible. My guess is that the bug needs to be fixed somewhere in libuv, or in the C++ layer in between, but I don't really know.

@peeter-tomberg

Hello!

We ran into the same issue, and I played around a bit with this. To me it seems to be a buffer overflow / flush timing issue. Even with smaller messages you run into the same problem if you submit multiple messages within a short time period. For example, running the following script on my machine

const assert = require('assert'),
  cluster = require('cluster');

const SIZE = 6708878;
const sleep = timeout => new Promise(resolve => setTimeout(resolve, timeout));

if (cluster.isMaster) {

  cluster.setupMaster({ serialization: 'advanced' });

  const worker = cluster.fork();

  const message1 = new Uint8Array(SIZE).fill(0x00);
  const message2 = new Uint8Array(SIZE).fill(0x01);
  console.log(message1.length, message2.length);

  const sendMessages = async () => {
    worker.send(message1);
    worker.send(message2);
    await sleep(100);
    worker.send(message1);
    worker.send(message2);
  };
  sendMessages();
} else {
  process.on('message', data => {
    console.log(Buffer.from(data));
    assert(data.length === SIZE, 'invalid size');
    for (let i = 1; i < SIZE; i++) assert(data[0] === data[i], 'invalid byte at offset '+i);
  });

}

Generates the following output (the offset where the error happens changes by a couple of thousand each run)

AssertionError [ERR_ASSERTION]: invalid byte at offset 1482739
    at process.<anonymous> (/Users/petja/projects/sportsbet-graphql-server/test.js:29:36)
    at process.emit (events.js:327:22)
    at emit (internal/child_process.js:903:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21) {
  generatedMessage: false,
  code: 'ERR_ASSERTION',
  actual: false,
  expected: true,
  operator: '=='
}

If I change the sleep timer to await sleep(1000); I get the following output:

AssertionError [ERR_ASSERTION]: invalid byte at offset 4644851
    at process.<anonymous> (/Users/petja/projects/sportsbet-graphql-server/test.js:29:36)
    at process.emit (events.js:327:22)
    at emit (internal/child_process.js:903:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21) {
  generatedMessage: false,
  code: 'ERR_ASSERTION',
  actual: false,
  expected: true,
  operator: '=='
}

The offset again changes by a couple of thousand each run, so it seems to depend on how fast your machine actually transfers the message and on the size of the message. If you send more data than can be transferred to the child process, new messages start overwriting old ones.

@addaleax addaleax added the confirmed-bug Issues with confirmed bugs. label May 18, 2021
addaleax added a commit to addaleax/node that referenced this issue May 18, 2021
Do the same thing we do for other streams, and retain a reference to
the Buffer that was sent over IPC while the write request is active,
so that it doesn’t get garbage collected while the data is still in
flight.

(This is a good example of why it’s a bad idea that we’re not re-using
the general streams implementation for IPC and instead maintain
separate usage of the low-level I/O primitives.)

Fixes: nodejs#34797
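Very loosely, the idea behind the fix can be sketched like this in JavaScript. The property and function names below are made up for illustration and do not match the actual patch, which lives in the internal child_process/IPC code.

// Illustrative pseudocode of the "retain the buffer" pattern, not the patch.
function writeSerializedMessage(channel, req, buffer, callback) {
  // Pin the serialized Buffer on the write request so the garbage collector
  // cannot reclaim it while libuv is still writing it out.
  req.retainedBuffer = buffer // hypothetical property name
  req.oncomplete = (status) => {
    req.retainedBuffer = null // safe to release once the write has finished
    if (typeof callback === 'function')
      callback(status === 0 ? null : new Error('write failed: ' + status))
  }
  channel.writeBuffer(req, buffer) // hypothetical low-level write call
}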
@addaleax
Member

@peeter-tomberg Thanks, the reproduction here was really valuable. I’ve opened #38728 which should address this.

@peeter-tomberg

Awesome! Will this be applied to older versions of Node as well? We're experiencing this on Node 12 LTS.

@KishanBagaria

+1 we're using electron 11 / node 12

would be great to have this back-ported!

@addaleax
Member

@peeter-tomberg @KishanBagaria It should follow the regular release schedule, so, yes, that should not be an issue.

danielleadams pushed a commit that referenced this issue May 31, 2021
Do the same thing we do for other streams, and retain a reference to
the Buffer that was sent over IPC while the write request is active,
so that it doesn’t get garbage collected while the data is still in
flight.

(This is a good example of why it’s a bad idea that we’re not re-using
the general streams implementation for IPC and instead maintain
separate usage of the low-level I/O primitives.)

Fixes: #34797

PR-URL: #38728
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
richardlau pushed a commit that referenced this issue Jul 16, 2021
richardlau pushed a commit that referenced this issue Jul 19, 2021
richardlau pushed a commit that referenced this issue Jul 20, 2021
foxxyz pushed a commit to foxxyz/node that referenced this issue Oct 18, 2021
richardlau pushed a commit that referenced this issue Dec 14, 2021