Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uncaught exception in socket code: "Cannot read properties of null (reading 'finishWrite')" #46094

Open
danfuzz opened this issue Jan 4, 2023 · 16 comments
Labels
http2 Issues or PRs related to the http2 subsystem. tls Issues and PRs related to the tls subsystem.

Comments

@danfuzz
Copy link

danfuzz commented Jan 4, 2023

Version

v18.4.0

Platform

Linux i-00f751a18edbe3eb6.us-west-2.compute.internal 5.15.73-45.135.amzn2022.x86_64 #1 SMP Fri Oct 14 17:47:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

http2 (maybe actually net or tls)

What steps will reproduce the bug?

Apologies that I don't have an easy repro case yet. It's one of those "run my server for a while and then it does this." Near as I can tell, it's along these lines:

  1. Connection accepted to an Http2Server (via TLS socket).
  2. Session established.
  3. Session remains idle for five minutes. (No requests ever made.)
  4. My code notices the idleness and calls close() on the session. (Tick/turn concludes.)
  5. Boom! Uncaught exception: TypeError: Cannot read properties of null (reading 'finishWrite')

When I get a chance (hopefully soon) I will try to distill down a real example. Thanks for any help you can offer in the meantime.

How often does it reproduce? Is there a required condition?

Unclear. This shows up about once a day on my not-very-active server. It does correspond one-to-one with a call to session.close(). That is, every time my server does this right now, this uncaught exception seems to follow.

What is the expected behavior?

No uncaught exception. That is, either no exception at all, or an exception that can be safely caught and handled without shutting the system down.

What do you see instead?

This uncaught exception:

TypeError: Cannot read properties of null (reading 'finishWrite')
    at JSStreamSocket.finishWrite (node:internal/js_stream_socket:211:12)
    at Immediate.<anonymous> (node:internal/js_stream_socket:196:14)
    at process.processImmediate (node:internal/timers:471:21)
    at process.callbackTrampoline (node:internal/async_hooks:130:17)

Additional information

This is the actual code which sets up the timeout. The timeout time is 5 minutes, and the logger... call simply writes a structured log line.

    session.setTimeout(Http2Wrangler.#SESSION_TIMEOUT_MSEC, () => {
      logger?.idleTimeout();
      session.close();
    });

Http2Session.close() does accept an optional callback, which is documented as being bound to the close event. However, this code already binds a handler to close (which does not seem to be getting called before the uncaught exception is generated).

@nodejs nodejs deleted a comment Jan 5, 2023
@nodejs nodejs deleted a comment Jan 5, 2023
@lpinca lpinca added tls Issues and PRs related to the tls subsystem. http2 Issues or PRs related to the http2 subsystem. labels Jan 5, 2023
@danfuzz
Copy link
Author

danfuzz commented Jan 5, 2023

I have now also managed to reproduce this error on v19.3.0 (on the Mac).

In addition to the error noted in the original report, my repro-case-in-progress also manages to get a second error to be thrown (moments [CORRECTION] after before the original error is thrown), though this one ends up getting emitted in an error event on a socket. Unclear if it is truly meaningful with regards to the original report, but FWIW:

Error [ERR_STREAM_WRITE_AFTER_END]: write after end
    at new NodeError (node:internal/errors:400:5)
    at _write (node:internal/streams/writable:322:11)
    at Writable.write (node:internal/streams/writable:337:10)
    at JSStreamSocket.doWrite (node:internal/js_stream_socket:176:19)
    at JSStream.onwrite (node:internal/js_stream_socket:33:57)
    at JSStream.callbackTrampoline (node:internal/async_hooks:130:17)
    at Socket._final (node:net:495:28)
    at callFinal (node:internal/streams/writable:700:12)
    at prefinish (node:internal/streams/writable:712:7)
    at finishMaybe (node:internal/streams/writable:722:5) {
  code: 'ERR_STREAM_WRITE_AFTER_END'

@danfuzz
Copy link
Author

danfuzz commented Jan 5, 2023

Hi again! I'm afraid I still don't have a succinct repro, but I can tell you a bit more about what's going on.

In my codebase, I'm wrapping the raw socket that gets accepted from a net.Server, in order to implement rate-limiting on the write side. The wrapped Duplex stream is set up with allowHalfOpen: true.

When the session times out, (per original report) I call session.close(). This has the effect of doing a little bit of IO to the socket including reading, and when reading is done, endReadableNT() is ultimately called inside internal/streams/readable.js. Surprisingly, this calls through to endWritableNT(), which is only supposed to happen if allowHalfOpen is false. I never set it to false, but in fact at some point, something has indeed set it so, though not in my code. I confirmed that socket.allowHalfOpen === true just before the call to session.close() and that socket.allowHalfOpen === false by the time my wrapped Duplex is asked to close.

After the call to endWritableNT(), something in the HTTP2 code still wants to write some data, and this is what causes the write after end exception to be thrown (per my previous comment). And at that point, I think the internals of the library are kinda messed up, which is what then quickly causes the uncaught exception.

I have an "investigatory workaround" which "fixes" the problem by delaying my code acting on the socket.end() call so that the session can write its final bits. That said, I'm guessing it would be better to figure out why allowHalfOpen suddenly becomes false, as that seems at least wrong-ish. And, even with that fix maybe there is still a bug here in that the original exception should arguably be catch-able (even though it is a consequence of another problem).

@danfuzz
Copy link
Author

danfuzz commented Jan 6, 2023

I figured out what's going on in my code, and though I still think there's a Node bug worth addressing, I am no longer triggering it:

I made a mistake with my log-based instrumentation (from the previous comment), and I was wrong about allowHalfOpen switching from true to false. The actual case was that the unwrapped Socket was always allowHalfOpen: false, and my Duplex wrapper was always allowHalfOpen: true.

What I was right about was that HTTP2 tacitly expects allowHalfOpen: true of its socket/stream, and I am pretty sure simply setting it to false and then proactively closing an Http2Session will ultimately reproduce the uncaught exception described in my original report.

The solution in my code was (of course?) to specify allowHalfOpen: true in my server socket constructor options.

I checked the Node docs for the http2 module, and didn't see anything about allowHalfOpen: true being required. So, in addition to any other action on this bug, maybe it makes sense to note this effective requirement in the docs. (To be clear, I don't know the protocol well enough to understand whether it's truly required by the protocol.)

Thank you for going on this journey with me, and wishing you all a happy new year.

@danfuzz
Copy link
Author

danfuzz commented Jan 6, 2023

Maybe this is the same bug, or at least related?: #35695

That one is closed, but it sure looks similar.

@danfuzz
Copy link
Author

danfuzz commented Mar 21, 2023

BTW, I just ran into another way to get this error:

  • Connection accepted.
  • Set up a short (e.g. a few seconds) socket timeout.
  • Session established.
  • Set up a longer (e.g. a minute) session timeout.
  • Handle a request on the session.
  • Idle.
  • Socket timeout fires; callback does socket.destroy().
  • Idle.
  • Session timeout fires; callback does session.close().
  • Exception thrown, Cannot read properties of null (reading 'finishWrite').

danfuzz added a commit to danfuzz/lactoserv that referenced this issue Mar 21, 2023
This PR reworks the last couple days' worth of effort to do something
useful when connected sockets time out.

Notably, during the course of this work, I ran into another case of the
`null...finishWrite` exception. Existing Node bugs (the latter was filed
by me):

* <nodejs/node#35695>
* <nodejs/node#46094>
@ywave620
Copy link
Contributor

ywave620 commented Mar 22, 2023

node does not allow manipulation againt http2session.socket, how did you manage to call socket.destroy() @danfuzz

https://nodejs.org/docs/latest-v14.x/api/http2.html#http2_http2session_socket

@danfuzz
Copy link
Author

danfuzz commented Mar 22, 2023

node does not allow manipulation againt http2session.socket, how did you manage to call socket.destroy() @danfuzz

I'm accepting sockets in my code directly, wrapping them (to do rate limiting, timeout, and logging at the low level), then handing the wrapped sockets off to an Http2Server instance. I set up event listeners on the wrapped socket at my layer (or the original socket directly, depending on configuration), not on the one wrapped inside the HTTP2 library. You can find the salient code here:

@ywave620
Copy link
Contributor

ywave620 commented Apr 3, 2023

How did you

handing the wrapped sockets off to an Http2Server instance

could you point out which lines of code is responsible for this? (I glanced your code but couldn't find out)

@ywave620
Copy link
Contributor

ywave620 commented Apr 3, 2023

Actually, I have managed to do the same thing but in a different way.

  1. create a http(s) server without listening at any port, called httpServer
  2. given a connected socket(maybe from the 'connection' of net.Server), called sock, invoke httpServer.emit('connection', sock)

@danfuzz
Copy link
Author

danfuzz commented Apr 3, 2023

@ywave620 Not sure if you still want an answer, but just in case, here's (I think) what you're looking for, with a bit of extra context:

handing the wrapped sockets off to an Http2Server instance

could you point out which lines of code is responsible for this? (I glanced your code but couldn't find out)

(NB, I've been working on this area of the code lately, so the line numbers are off from my links above. I'll link to a tag here instead of main.)

  1. Listen for connection event from plain net.Server: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/TcpWrangler.js#L65
  2. Callback from previous: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/TcpWrangler.js#L105
  3. Accepted socket gets wrapped for rate limiting: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/TcpWrangler.js#L129
  4. setTimeout() on socket (original or wrapped): https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/TcpWrangler.js#L149
  5. connection event emitted on the protocol server instance (Http2Server or similar), passing it the socket (original or wrapped): https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/export/ProtocolWrangler.js#L245
  6. (In the case of HTTP2) Where the protocol server instance got set up: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/Http2Wrangler.js#L58
  7. And when an HTTP2 session is established, where it ends up back in my code: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/private/Http2Wrangler.js#L87
  8. Where the protocol server gets hooked up to Express (or in the case of HTTP2, the Express wrapper) to handle requests: https://github.com/danfuzz/lactoserv/blob/v0.5.8/src/network-protocol/export/ProtocolWrangler.js#L460

@danfuzz
Copy link
Author

danfuzz commented Apr 3, 2023

Actually, I have managed to do the same thing but in a different way.

  1. create a http(s) server without listening at any port, called httpServer
  2. given a connected socket(maybe from the 'connection' of net.Server), called sock, invoke httpServer.emit('connection', sock)

That's indeed what I'm doing (though along with other stuff too).

@pimterry
Copy link
Member

@danfuzz are you still seeing this issue?

I have a fix for a closely related issue that I think likely resolves this too. I can't reliably reproduce the finishWrite exception (although I have seen it intermittently) but I do have a reliable test following the outline of the code you described above that triggers a segfault due to broken TLS socket shutdowns with HTTP/2. I think that's a different symptom of the same bad state.

If you can still reproduce this, can you try rebuilding node with the changes from #49327, and then see if that fixes it for you?

@danfuzz
Copy link
Author

danfuzz commented Aug 28, 2023

@pimterry I have not seen the issue since that comment you referenced, because I adjusted my timeouts (etc.) to avoid the situation. I'm afraid my development workflow doesn't involve (re)building Node, so it'd take more time than I have available right now to do that in a timely fashion for you.

On the flip-side, the project in which I ran into the problem — https://github.com/danfuzz/lactoserv — is open source and builds pretty quickly. The allowHalfOpen fix is very easy to undo, switching true -> false in this line of src/network-protocol/private/AsyncServer.js:

    allowHalfOpen:         { default: true },

The un-fix for the other case (the one you reference) isn't too bad either. The two timeouts in question are SOCKET_TIMEOUT_MSEC in src/network-protocol/private/TcpWrangler.js and SESSION_TIMEOUT_MSEC in src/network-protocol/private/Http2Wrangler.js. Set the latter to be longer than the former to trigger the issue.

In either case, if you build and run using the default (development) config, then just use a browser (or at least something that does lasting HTTP2 sessions, e.g. not curl) to visit https://localhost:8443, close the tab, and then wait for the timeouts.

Build / run instructions are at https://github.com/danfuzz/lactoserv/blob/main/doc/development.md. If you decide to do this and run into any trouble, please do not hesitate to reach out.

@pimterry
Copy link
Member

I've tried to do that @danfuzz, but it doesn't reproduce for me. Diff:

diff --git a/src/network-protocol/private/AsyncServer.js b/src/network-protocol/private/AsyncServer.js
index 5d76748e..496f706b 100644
--- a/src/network-protocol/private/AsyncServer.js
+++ b/src/network-protocol/private/AsyncServer.js
@@ -275,7 +275,7 @@ export class AsyncServer {
    * `ProtocolWrangler` class doc for details.
    */
   static #CREATE_PROTO = Object.freeze({
-    allowHalfOpen:         { default: true },
+    allowHalfOpen:         { default: false },
     keepAlive:             null,
     keepAliveInitialDelay: null,
     noDelay:               null,
diff --git a/src/network-protocol/private/Http2Wrangler.js b/src/network-protocol/private/Http2Wrangler.js
index 1118d47e..49c2508c 100644
--- a/src/network-protocol/private/Http2Wrangler.js
+++ b/src/network-protocol/private/Http2Wrangler.js
@@ -262,5 +262,5 @@ export class Http2Wrangler extends TcpWrangler {
    * @type {number} How long in msec to wait for a session to have activity
    * before considering it "timed out" and telling it to close.
    */
-  static #SESSION_TIMEOUT_MSEC = 1 * 60 * 1000; // One minute.
+  static #SESSION_TIMEOUT_MSEC = 10 * 1000; // Ten seconds.
 }
diff --git a/src/network-protocol/private/TcpWrangler.js b/src/network-protocol/private/TcpWrangler.js
index 03d47d0a..e24ea5c2 100644
--- a/src/network-protocol/private/TcpWrangler.js
+++ b/src/network-protocol/private/TcpWrangler.js
@@ -253,7 +253,7 @@ export class TcpWrangler extends ProtocolWrangler {
    * socket (a/o/t a server socket doing a `listen()`) to be "timed out." When
    * timed out, a socket is closed proactively.
    */
-  static #SOCKET_TIMEOUT_MSEC = 3 * 60 * 1000; // Three minutes.
+  static #SOCKET_TIMEOUT_MSEC = 3 * 1000; // Three seconds.
 
   /**
    * @type {number} Grace period in msec after trying to close a socket due to

Testing with both Node v18.17.1 and v20.5.1, running with ./scripts/ubik build, then ./scripts/ubik run, then loading https://localhost:8443 in Chrome. It shows 'Hello!' successfully, I can see in the dev tools it's using HTTP/2, the logs show a socketTimeout.closed() about 3 seconds later, but no mention of a session timeout and no errors at all.

Cumulatively between #49327 and #49400 I'm fairly confident I've fixed a whole class of issues that should cover this case, so it'd be great to be able to confirm that here if you can help at all.

@danfuzz
Copy link
Author

danfuzz commented Aug 29, 2023

@pimterry Sorry for the trouble. It's been a while since my head's been in this code; I thought what I suggested would work for you, but clearly I was wrong!

But I think the news is at least good-ish: After trying a few things I figured out that I could induce the problem once your diffs (above) were applied, including the error I mentioned in #46094 (comment), by using curl (very sorry to suggest that curl wouldn't induce the behavior!). Specifically, curl --insecure https://localhost:8443.

This was on Node v20.0.0, as installed on macOS via brew. Log transcript of the errors:

...
Uncaught exception:
TypeError: Cannot read properties of null (reading 'finishShutdown')
    at JSStreamSocket.finishShutdown (node:internal/js_stream_socket:160:12)
    at node:internal/js_stream_socket:147:14
    at processTicksAndRejections (node:internal/process/task_queues:81:21)
    at runNextTicks (node:internal/process/task_queues:64:3)
    at process.processImmediate (node:internal/timers:449:9)
    at process.callbackTrampoline (node:internal/async_hooks:130:17)
...
20230829-16:07:51.6077 <framework> endpoint.secure.conn.qx-ea287-0001.dataRateLimiter.errorFromInner(@Error { {
  name: 'Error',
  code: 'ERR_STREAM_WRITE_AFTER_END',
  message: 'write after end',
  stack: @StackTrace { [
    { name: 'new NodeError', file: 'node:internal/errors', line: 399, col: 5 },
    { name: '_write', file: 'node:internal/streams/writable', line: 329, col: 11 },
    { name: 'Writable.write', file: 'node:internal/streams/writable', line: 344, col: 10 },
    { name: 'JSStreamSocket.doWrite', file: 'node:internal/js_stream_socket', line: 176, col: 19 },
    { name: 'JSStream.onwrite', file: 'node:internal/js_stream_socket', line: 33, col: 57 },
    { name: 'JSStream.callbackTrampoline', file: 'node:internal/async_hooks', line: 130, col: 17 },
    { name: 'Socket._final', file: 'node:net', line: 524, col: 28 },
    { name: 'callFinal', file: 'node:internal/streams/writable', line: 707, col: 12 },
    { name: 'prefinish', file: 'node:internal/streams/writable', line: 719, col: 7 },
    { name: 'finishMaybe', file: 'node:internal/streams/writable', line: 729, col: 5 }
  ] }
} })
...

@pimterry
Copy link
Member

Thanks @danfuzz! I can reproduce it perfectly now. It seems that #49327 alone doesn't fix this, but #49400 does - with the latter change (and/or both) in place, there are no errors and everything works perfectly 👍. Great! This can be closed once that's merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http2 Issues or PRs related to the http2 subsystem. tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants