Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: sometimes debug-port value for child process can be broken #1524

Closed
wants to merge 6 commits into from

Conversation

Olegas
Copy link
Contributor

@Olegas Olegas commented Apr 25, 2015

In case of long running cluster, if worker processes is restarting
periodically, --debug-port value may become out of range [1024, 65535].

In case of long running cluster, if worker processes is restarting
periodically, --debug-port value may become out of range [1024, 65535].
@brendanashworth brendanashworth added cluster Issues and PRs related to the cluster subsystem. land-on-master labels Apr 25, 2015
@brendanashworth
Copy link
Contributor

Is there a good way to test this?

@Olegas
Copy link
Contributor Author

Olegas commented Apr 26, 2015

May be we can start every test with --debug-port=65535? Without this fix every test, which is using fork will fail.

For example this in v1.x branch will fail:

./iojs --debug-port=65535 test/parallel/test-cluster-worker-disconnect.js 

@brendanashworth
Copy link
Contributor

@Olegas that probably wouldn't be accepted. Could we spin off a child process with those arguments (same file) and test it that way?

@Olegas
Copy link
Contributor Author

Olegas commented Apr 26, 2015

Ok, I'll test it this way.

@brendanashworth
Copy link
Contributor

@Olegas sorry if I was vague in my last response, but it'd be spawning a new child process of the same file (kind of like this one).

@Olegas
Copy link
Contributor Author

Olegas commented Apr 27, 2015

@brendanashworth please take a look, I've added a test case.

@Olegas
Copy link
Contributor Author

Olegas commented Apr 27, 2015

@brendanashworth also I've noted land-on-master tag. Do I need to rewrite my PR to master branch instead of v1.x?

@rvagg
Copy link
Member

rvagg commented Apr 27, 2015

@Olegas yes, nothing new should be going on to v1.x at this stage, we've moved to master

@@ -282,7 +283,7 @@ function masterInit() {
function createWorkerProcess(id, env) {
var workerEnv = util._extend({}, process.env);
var execArgv = cluster.settings.execArgv.slice();
var debugPort = process.debugPort + id;
var debugPort = (process.debugPort + id) % debugPortRange + 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a little neater to turn the number literal into a constant and use it in the range calculation above.

Aside, it's something of a (not consistently enforced) convention to name constants kFoo, i.e.:

const kDebugPortStart = 1024;
const kDebugPortRange = (65536 - kDebugPortStart);

@Olegas
Copy link
Contributor Author

Olegas commented Apr 27, 2015

@bnoordhuis Fixed
@rvagg So, I'll do a rebase+squash and rewrite to master when it will be ready

}
} else {
// iojs --debug-port=65535 test-cluster-debugport-overflow.js master
spawn(process.argv[0], ['--debug-port=65535', __filename, 'master']).on('close', function(code){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nits: line > 80 columns and missing space before brace. If you assign the arguments to a var args first, you stay under the limit.

Ditto for the line in lib/cluster.js; eyeballing it, it looks to be just over 80 columns.

@bnoordhuis
Copy link
Member

LGTM sans style errors.

@Olegas
Copy link
Contributor Author

Olegas commented Apr 27, 2015

@bnoordhuis fixed. Splitted line in lib/cluster into 2 statements.

@@ -285,6 +287,8 @@ function masterInit() {
var debugPort = process.debugPort + id;
var hasDebugArg = false;

debugPort = debugPort % kDebugPortRange + kDebugPortStart;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it earlier today but there is a subtle change in behavior here that's also a potential bug.

Say debugPort equals 12345. That's within the legal range but your commit changes it to 12345 % 65536 + 1024 = 13369.

Worse, with debugPort == 65000, it's changed to 65000 % 65536 + 1024 = 66024 and that's not a legal port number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second is not the case, cause' kDebugPortRange is not 65536, it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... 64512

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debugPort is equals to 12345 only in cluster's master. First worker created will receive 12346. After my fix it'll receive 12346 % kDebugPortRange + 1024. Yes, behaviour is changed, but I'm wondering was it documented? One have no any control on worker's debugPort value, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, valid point but consider debugPort == 64511: 64511 % (65536 - 1024) + 1024 = 65536.

Apropos the behavior change, I think node-inspector assumes consecutive ports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll rewrite this another way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I use generators here? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis 64511 % (65536 - 1024) + 1024 == 65535.
range == 64512
x % range ∈ [ 0, range )
port == x % range + 1024;
port ∈ [ 0 + 1024, range + 1024 )
port ∈ [ 1024, 64512 + 1024 )
port ∈ [ 1024, 65536 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I use generators here?

Fwiw the answer is yes if it's nicer. :)

@Olegas
Copy link
Contributor Author

Olegas commented Apr 27, 2015

@bnoordhuis I've rewrited it from scratch. I think we don't need worker ID in debug port generation. We need just monotonic sequence in allowed range here.

if (cluster.isMaster) {
cluster.fork().on('exit', function(code) {
process.exit(code);
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: semicolon

@mjseidel
Copy link

I was pinged to this by @cjihrig on my related PR at nodejs/node-v0.x-archive#14816; thanks, Colin. Looking forward to a resolution however it comes.

I believe this commit will result in debug port collisions in some circumstances (after nextDebugPort has wrapped around and started doubly assigning the port numbers of long-lived processes).

@sam-github
Copy link
Contributor

Before this, the worker would die because the debug port couldn't be listened to because its out-of-range, right?

After this PR, the worker will die whenever it happens to be using a debug port that is already in use by another process?

That's an improvement, because the former problem is unresolvable, whereas the second will probably be better when the next replacement worker starts: it might get an unused port.

I guess there is no reasonable way to modify the debug protocol to include the target pid, and tunnel multiple debug sessions to the master's debug port, and have it then re-direct the session to an ephemeral-but-known debug port of it's worker? Since we are now maintaining a fork of the debug protocol, this might be more approachable now.

I reiterate nodejs/node-v0.x-archive#14816 (comment)

And in response to nodejs/node-v0.x-archive#14816 (comment), @cjihrig, I'm not certain what, if anything, would break if the worker ID for a new fork was always the lowest free.

It would certainly break all kinds of unit tests (mine, also io.js I assume). Its a bit subtle figuring out when a worker ID is "unused", as well. At some points ("point" meaning "versions of node"), in master, cluster workers are removed from cluster.workers when disconnected, but while they still exist as a child process. I assume the worker ID could only become free when the control IPC disconnects, the worker has exited, and it has "closed" (all it's stdout/stderr has been read).

Despite that, I like the idea, but I wory its a breaking change: worker IDs are never recycled now, and its easy to have code that relies on this.

Having an auxiliary CLUSTER_INDEX wouldn't be breaking, and we could notify people that eventualy the worker ID would be made the same as the CLUSTER_INDEX, in io.js 2.x.

@Olegas
Copy link
Contributor Author

Olegas commented Apr 29, 2015

What if we'll maintain a list of used debug ports in master and check getDebugPort() value against it?

@Olegas
Copy link
Contributor Author

Olegas commented Apr 29, 2015

Just pick next in case of conflict

@mjseidel
Copy link

@sam-github, unless I'm missing something, I don't see how this patch will result in a worker dying upon being handed a duplicate debug port, because, as you said, which ports are in use aren't tracked. Unless dupe debug ports will reliably and mercifully cause crashes in some way I haven't tracked down, they'll run with possibly incorrect (and developer-insanity-producing) behavior. If I can get a little time I'll just check this by manually assigning the same debug port to multiple workers.

@Olegas, I think that would be great, but I have two reservations: One, it's not necessarily trivial to keep track of what ports are in use and guarantee that there are no collisions. I could be wrong about that, but I feel a bit more confident in my second reservation: that it might be difficult to ensure no collisions in a timely manner, since, after all, this is overhead that will be incurred on every call to fork.

@Olegas
Copy link
Contributor Author

Olegas commented Jun 8, 2015

@sam-github

TL;DR

This PR does not change any behavior. It only fixes a bug. AFAICS no need to change any docs. And it is not an imaginary bug. I hit this in production environment.

Details...

Before, in Node 0.10.x every process by default use port 5858 for debugger session. And all was fine, until we start using a cluster.

When cluster is used, we can connect to specified worker cause' all they share same debug port.

So, the fix was introduced - --debug-port=N command line argument. One can use it to specify debug port, which will be used then SIGUSR1 is received. Cluster master is using this argument to specify separate debug port to each worker started. Port value is based on master's own debug port value and worker ID. Worker ID is an integer starting from 1.

All went fine, until long living cluster starts to recycle it's children for some reason (some internal logic or in case of error and process crash).

Debug port must be in range from 1024 to 65535. When cluster is alive for a long period of time and worker processes dying and restarting, after a while we can see a cluster can't start new worker, cause' it is trying to start new worker with debug port out of allowed range. And it can't be fixed, except a full cluster restart.

With this PR, debug port numbers will be generated cyclically, starting from 5859 up to 65535 and then again from 1024 to 65535 and so on.

I need not to go through 60000 debug ports to find one I need. Every worker has a PID. In my app, it is exposed as a special response header (X-Worker-ID). Knowing a pid, I can get a process and it's args. When I got args, I know the port number.

I think what extending a debug protocol is not a good idea here, cause' it is far more complex task and puts a knowledge of clustering mechanisms into debugging.

@sam-github
Copy link
Contributor

With this PR, debug port numbers will be generated cyclically, starting from 5859 up to 65535
and then again from 1024 to 65535 and so on.

This a change in behaviour, the debug port is no longer base + worker ID, it now wraps. Please document.

@Olegas
Copy link
Contributor Author

Olegas commented Jun 8, 2015

@sam-github
Copy link
Contributor

I guess so. Though I don't see any docs at all for the new debug options. Maybe add a section after advanced usage.

@mjseidel
Copy link

mjseidel commented Jun 9, 2015

@sam-github fwiw, I'm regularly going through over 60,000 workers in a production environment, never attaching a debugger, and this is still a nuisance. Killing and restarting master on debug port out of bounds error is the workaround.

The fact that this is a problem in environments in which the debugger (and thus debug port) will never be used makes me question the approach of attempting to retain unique debug port, since a correct implementation will be slow, and forking is already painful. That said, this PR isn't correct insofar as retaining unique debug port, but it changes crashing behavior into rarely-colliding debug port behavior, it would solve my particular problem, and I'd welcome it with open arms and a Ritter Sport.

@cjihrig
Copy link
Contributor

cjihrig commented Jun 10, 2015

This may sound like a crazy idea, but what if we just didn't set --debug-port unconditionally? This problem seems to be encountered by people who use very long running cluster applications, but who aren't using the debugger anyway. We could even add a flag to retain the current behavior.

@mjseidel
Copy link

@cjihrig, that's what my open PR on Node does. I don't need that flag, and I don't think failing to set it is a crazy idea.

@cjihrig
Copy link
Contributor

cjihrig commented Jun 10, 2015

I'm proposing either removing these lines, or putting them behind a flag.

@sam-github
Copy link
Contributor

@bajtos do you have thoughts? I'm ok with not setting it unless debug was enabled by CLI, but it means that only one node process can be debugged at a time. Its an open question how many people are benefitting from debuggable cluster workers, thought.

@bajtos
Copy link
Contributor

bajtos commented Jun 11, 2015

@sam-github I don't have a strong opinion, I think this is really a task of picking a lesser evil.

If we decide to disable auto-increment, the we should take care to correctly handle the case where the master process is started in debug mode and cluster workers inherit the debug mode flag too. IIRC, in v0.10, running node --debug cluster-master.js produces a bunch of error messages as all worker processes attempted to open a debugger server on the same port.

I think @cjihrig proposal (#1524 (comment)) is a good solution:

  1. When the master process has a --debug, --debug-brk or --debug-port flag in exec args, cluster workers auto-increment debugger's port number. No change from the current behaviour, enabling debug mode in cluster stays easy. If somebody wants to enable debug mode later via SIGUSR1, then they should use start their app with the flag --debug-port=5858.
  2. When the master process does not have any of those three debug flags, then all workers share the same default debug port. This is great for long-running processes in production, where one does not want to debug anyways and thus it's not a big deal that only a single process can enter the debug mode via SIGUSR1.

@ChALkeR
Copy link
Member

ChALkeR commented Jun 16, 2015

@Olegas Shouldn't a more narrow port range be used instead?
You are now mixing registered ports and dynamic ports ranges.

Ah, I misread. You are now incrementing the supplied port number by 1, and loop only when the port number reaches 65535. I don't think that anything significantly better could be done here, though we could try to first populate the 49152-65535 port range, then fallback to 49151-1024 (decreasing). What do you think?

@cjihrig
Copy link
Contributor

cjihrig commented Jun 17, 2015

Worker debug ports are no longer set by default as of 309c0f4.

@brendanashworth
Copy link
Contributor

Is this still an issue then?

@cjihrig
Copy link
Contributor

cjihrig commented Jun 25, 2015

Technically, yes. You can still run into the same issue, but you would have to do so after explicitly turning on debugging.

@Olegas
Copy link
Contributor Author

Olegas commented Jun 25, 2015

I think we can close this now. And I plan to add some docs to cluster about this "issue"

@brendanashworth
Copy link
Contributor

Okay, closing per #1524 (comment).

cjihrig added a commit that referenced this pull request Jul 22, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Jul 24, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Jul 30, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Aug 1, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Aug 3, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Aug 4, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
cjihrig added a commit that referenced this pull request Aug 4, 2015
Currently, each cluster worker is assigned an ever increasing
--debug-port argument. A long running cluster application that
does not use the debugger can run into errors related to the
port range. This commit mitigates the problem by only setting
the debug port if the master is started with debug arguments, or
the user explicitly defines debug arguments for the worker. This
commit also adds a new debug port offset counter that is only
incremented when a worker is created that utilizes debugging.

Fixes: nodejs/node-v0.x-archive#8159
Refs: #1524
PR-URL: #1949
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Oleg Elifantiev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants