100% CPU native-only (no JS runs) loop for 2-5 minutes on Mac OS 10.15 in libuv.stream.uv__try_write #43916

Closed
huntharo opened this issue Jul 20, 2022 · 9 comments
Labels
libuv Issues and PRs related to the libuv dependency or the uv binding.

Comments

@huntharo

Version

v16.13.1

Platform

Darwin hhunt-mbp-m1pro 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64

Subsystem

libuv

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

Within my team, every few minutes.

We do not know how to reproduce this in isolation.

The libuv ticket mentions "specific VPN software" that can exacerbate the problem and we are all using the same VPN software on Mac OS (both Intel and ARM) so that could be the key.

Crucially, the process is stuck in a tight loop that runs entirely in native code, so no JS runs. As a result, Ctrl-C will not interrupt the application and it must be forced to exit.

What is the expected behavior?

No infinite loops in native code.

What do you see instead?

Infinite loops in native code (stepped through in debugger).

Additional information

The next.js issue linked below appears to be caused by this bug. All the reports are from Mac OS users, and the behavior described is "does not respond to Ctrl-C until a few minutes have passed" (paraphrasing). So it seems this is happening in the wild, but nobody had looked at it in a debug build to find that the loop was already known and fixed :)

vercel/next.js#10061

These two fixes cherry-picked from libuv will address the issue:

libuv/libuv#3405
libuv/libuv#3413

I can prepare PRs if desired.

We'd really like to have this fix for at least node16 and later versions.

Thanks!

@daeyeon daeyeon added the libuv label Jul 21, 2022
@bnoordhuis
Member

PR #42340 should fix this when merged and back-ported to v16.x. Unfortunately, it's not exactly been smooth sailing...

@huntharo
Author

huntharo commented Jul 21, 2022

@bnoordhuis - Can we cherry-pick a patch for libuv, or is that not something typically done?

For example, here is what is needed to resolve the issue on v16 (same actually on v18, possibly v14 too):

https://github.com/nodejs/node/compare/v16.x...huntharo:node:v16-mac-loop-fix?expand=1

@bnoordhuis
Member

Cherry-picking libuv commits is hardly ever done. I can't remember the last time we did that.

@huntharo
Author

Hrm... bummer... in an internal survey we got ~30 responses, and I think 28 of the 30 people are hitting this 100% CPU loop several times a day. Some have said they are surprised people are not leaving in droves as it's so frustrating. Some report that it happens within 2 minutes every time they start node. It's very hard to get work done like that.

We applied the patches in my diff above and the issue is resolved.

From the next.js ticket it seems that quite a few other developers in the wild are hitting this too. For the longest time we suspected our own code, then next.js, then webpack hmr, then the ws module used by next's webpack hmr... we tried removing everything that we could think of before we thought maybe it was a node bug... I suspect others are going through the same process.

I would really encourage a quick patch if possible for 14 and 16. I think the reputation of the ecosystem is being negatively impacted by this.

Thanks for looking at this and all that node is!

@ajjahn

ajjahn commented Jul 22, 2022

> Some have said they are surprised people are not leaving in droves as it's so frustrating. Some report that it happens every time they start node within 2 minutes. It's very hard to get work done like that.

I can also attest this has been a major blocker and pain point on our team. Exploring options to implement a targeted patch would be greatly appreciated!

@huntharo huntharo changed the title from "100% CPU loop for 2-5 minutes on Mac OS 10.15 in libuv.stream.uv__try_write" to "100% CPU native-only (no JS runs) loop for 2-5 minutes on Mac OS 10.15 in libuv.stream.uv__try_write" Jul 22, 2022
bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Jul 22, 2022
Original commit log follows:

darwin: remove EPROTOTYPE error workaround (nodejs#3405)

It's been reported in the past that OS X 10.10, because of a race
condition in the XNU kernel, sometimes returns a transient EPROTOTYPE
error when trying to write to a socket. Libuv handles that by retrying
the operation until it succeeds or fails with a different error.

Recently it's been reported that current versions of the operating
system formerly known as OS X fail permanently with EPROTOTYPE under
certain conditions, resulting in an infinite loop.

Because Apple isn't exactly forthcoming with bug fixes or even details,
I'm opting to simply remove the workaround and have the error bubble up.

Refs: libuv/libuv#482
Fixes: nodejs#43916
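
To illustrate the commit message above, here is a minimal C sketch of why a retry loop keyed on EPROTOTYPE spins forever once the error becomes permanent, and what changes when the workaround is removed. This is not libuv's actual code. `write_fn`, `always_eprototype`, `retrying_write`, `bubbling_write`, and the `max_attempts` cap are hypothetical names introduced for the demo, and the EINTR retry is an assumption about how such write loops are usually written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a socket write such as write(2) or sendmsg(2). */
typedef ssize_t (*write_fn)(const void *buf, size_t len);

/* Simulates the permanent failure described above: every call
 * fails with EPROTOTYPE. */
static ssize_t always_eprototype(const void *buf, size_t len) {
  (void) buf;
  (void) len;
  errno = EPROTOTYPE;
  return -1;
}

/* Old behaviour (sketch): keep retrying while the error is EINTR or
 * EPROTOTYPE. The loop described in the commit log had no attempt limit;
 * max_attempts exists only so that this demo terminates.
 * Returns the number of write attempts made. */
static unsigned long retrying_write(write_fn do_write, const void *buf,
                                    size_t len, unsigned long max_attempts,
                                    ssize_t *result) {
  unsigned long attempts = 0;
  ssize_t n;

  assert(max_attempts > 0);
  do {
    n = do_write(buf, len);
    attempts++;
  } while (n == -1 && (errno == EINTR || errno == EPROTOTYPE) &&
           attempts < max_attempts);

  *result = n;
  return attempts;
}

/* New behaviour (sketch): retry only on EINTR and let every other error
 * bubble up to the caller as a negative errno value. */
static ssize_t bubbling_write(write_fn do_write, const void *buf, size_t len) {
  ssize_t n;

  do
    n = do_write(buf, len);
  while (n == -1 && errno == EINTR);

  return n == -1 ? -errno : n;
}
```

With a writer that always fails with EPROTOTYPE, `retrying_write` uses up every attempt it is allowed, whereas `bubbling_write` returns `-EPROTOTYPE` after a single call.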
bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Jul 22, 2022
Original commit log follows:

darwin: translate EPROTOTYPE to ECONNRESET (nodejs#3413)

macOS versions 10.10 and 10.15 - and presumably 10.11 to 10.14, too -
have a bug where a race condition causes the kernel to return EPROTOTYPE
because the socket isn't fully constructed.

It's probably the result of the peer closing the connection and that is
why libuv translates it to ECONNRESET.

Previously, libuv retried until the EPROTOTYPE error went away but some
VPN software causes the same behavior except the error is permanent, not
transient, turning the retry mechanism into an infinite loop.

Refs: libuv/libuv#482
Refs: libuv/libuv#3405
Fixes: nodejs#43916
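
The translation step described in this commit log can be sketched in C as follows. Again, this is only an illustration under stated assumptions, not the libuv patch itself: `write_fn`, `fails_with_eprototype`, `translate_write_errno`, and `try_write_once` are hypothetical names, and the sketch applies the mapping unconditionally, whereas the commit title indicates the real change is darwin-specific.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a socket write such as write(2) or sendmsg(2). */
typedef ssize_t (*write_fn)(const void *buf, size_t len);

/* Simulates a write to a socket that is not fully constructed. */
static ssize_t fails_with_eprototype(const void *buf, size_t len) {
  (void) buf;
  (void) len;
  errno = EPROTOTYPE;
  return -1;
}

/* Map the errno of a failed write to the error reported to the caller.
 * EPROTOTYPE is treated as "peer closed the connection". */
static int translate_write_errno(int err) {
  if (err == EPROTOTYPE)
    return ECONNRESET;
  return err;
}

/* Single write attempt with no retry on EPROTOTYPE: a failure is returned
 * as a negative, already-translated errno value. */
static ssize_t try_write_once(write_fn do_write, const void *buf, size_t len) {
  ssize_t n;

  assert(do_write != NULL);
  n = do_write(buf, len);
  if (n == -1)
    return -translate_write_errno(errno);
  return n;
}
```

The design point is that the caller sees a familiar, terminal error (connection reset) instead of a retry that may never end. Other errors pass through `translate_write_errno` unchanged.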
bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Jul 22, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Jul 22, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
@bnoordhuis
Member

I've opened #43950 to cherry-pick the fixes but be aware that the way the release process works means they're first released in the next v18.x before getting back-ported to v16 and finally v14 - and that's of course assuming they actually get merged.

@huntharo
Author

> I've opened #43950 to cherry-pick the fixes but be aware that the way the release process works means they're first released in the next v18.x before getting back-ported to v16 and finally v14 - and that's of course assuming they actually get merged.

THANK YOU!!!

I understand it may take some time. Thanks so much!

nodejs-github-bot pushed a commit that referenced this issue Jul 25, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)

Refs: libuv/libuv#482
Refs: libuv/libuv#3405
Fixes: #43916

PR-URL: #43950
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
@huntharo
Author

Should we keep this open to track the backport PRs? Or do we normally let this stay closed and just use the PRs for those specific versions to do the tracking?

@bnoordhuis
Member

Back-porting is normally an automated process unless the commits don't apply cleanly (but I expect they will.)

danielleadams pushed a commit that referenced this issue Jul 26, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)

Refs: libuv/libuv#482
Fixes: #43916

PR-URL: #43950
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
danielleadams pushed a commit that referenced this issue Jul 26, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
targos pushed a commit that referenced this issue Jul 28, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
targos pushed a commit that referenced this issue Jul 28, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
targos pushed a commit that referenced this issue Jul 31, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
targos pushed a commit that referenced this issue Jul 31, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
Fyko pushed a commit to Fyko/node that referenced this issue Sep 15, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
Fyko pushed a commit to Fyko/node that referenced this issue Sep 15, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)
danielleadams pushed a commit that referenced this issue Oct 29, 2022
darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)
danielleadams pushed a commit that referenced this issue Oct 29, 2022
darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)