[SPARK-27868][core] Better default value and documentation for socket server backlog.#24732
vanzin wants to merge 1 commit into apache:master
… server backlog. First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors. Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies. So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem.
|
Test build #105881 has finished for PR 24732 at commit
|
- /** Requested maximum length of the queue of incoming connections. Default -1 for no backlog. */
- public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1); }
+ /** Requested maximum length of the queue of incoming connections. Default is 64. */
+ public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 64); }
What's the difference between setting the default to -1 and setting it to 64? Does this change any existing behavior?
Looks like the default previously was effectively 50 from the PR comments
It's just being explicit, in case there are differences between various JREs. For the most common ones, at least, the default was 50.
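For readers unfamiliar with the JRE behaviour referred to here, a minimal sketch using plain `java.net.ServerSocket` follows. It only illustrates the documented rule that a backlog of 0 or less selects an implementation-specific default (50 in common JREs, per the comments above). The class and method names are hypothetical, and the sketch says nothing about how Spark's own transport layer binds its sockets.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical demo class; not part of Spark.
public class ServerSocketBacklogSketch {

    // Opens a listening socket on an ephemeral loopback port.
    // Per the ServerSocket javadoc, a backlog <= 0 means
    // "use the implementation-specific default".
    static ServerSocket open(int backlog) throws IOException {
        return new ServerSocket(0, backlog, InetAddress.getLoopbackAddress());
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket implicitDefault = open(-1);
             ServerSocket explicit = open(64)) {
            // The effective backlog cannot be read back through this API;
            // both sockets simply bind successfully.
            System.out.println("implicit default bound: " + implicitDefault.isBound());
            System.out.println("explicit 64 bound: " + explicit.isBound());
        }
    }
}
```

The requested backlog is only a hint to the operating system, which may cap or adjust it.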
|
I checked |
|
It's such a small change that it shouldn't affect anybody. (For comparison, the default value on sockets not created from Java is 128.) In fact it makes it clearer that Spark is setting a value explicitly instead of relying on the JVM default (for those who know that detail). |
|
Got it. Thank you, @vanzin . +1, LGTM. Merged to master. |
|
Also, merged to |
… server backlog. First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors. Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies. So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem. Closes #24732 from vanzin/SPARK-27868. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 09ed64d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|
Thank you for review, @felixcheung , @srowen , @gaborgsomogyi , @squito , too! |
|
This change, despite being
in fact broke our production configuration and led to almost a month of instability in our Spark workflows while we were investigating the root cause (we had a totally different suspect, and it took some time to realise that there was more than one contributing factor). In our setup we had the default JVM value overridden, and While we learned our own lesson from this story, I'd like to emphasise that one should not omit a change from the release notes just because it is believed to be insignificant, especially if it changes the names or default values of properties. |
|
Sorry you ran into problems, but I'm curious about this:
How did you do that? I just looked at the source again and it's hardcoded to 50. Which is why in my mind this wouldn't cause any problems. |
|
First of all, sorry, "the default JVM value" was just poor wording on my part. I blindly repeated it from the initial post of this thread. What I meant is the OS default. Here's what it looks like:
protected final ServerSocket javaSocket;
private volatile int backlog = NetUtil.SOMAXCONN;
The first one is in fact a
protected void doBind(SocketAddress localAddress) throws Exception {
    if (PlatformDependent.javaVersion() >= 7) {
        javaChannel().bind(localAddress, config.getBacklog());
    } else {
        javaChannel().socket().bind(localAddress, config.getBacklog());
    }
}
And the value of For So, the default value in fact neither Not only the default value is twice (more than three times for Windows) bigger than |
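As a rough illustration of the point about `NetUtil.SOMAXCONN`, the sketch below shows one way an OS-derived backlog default can be computed: read the kernel setting if it is exposed, otherwise fall back to a constant. This is not Netty's code. The class and method names are hypothetical, the `/proc/sys/net/core/somaxconn` path is the usual Linux location, and the fallbacks of 128 and 200 (Windows) are my reading of Netty's hardcoded values, consistent with the ratios quoted above, so verify them against the Netty version in use. It assumes Java 11+ for `Files.readString`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical demo class approximating an OS-derived backlog default.
public class SomaxconnSketch {

    // Assumed fallbacks when the OS value cannot be read.
    static int platformFallback() {
        String os = System.getProperty("os.name", "").toLowerCase(Locale.ROOT);
        return os.startsWith("windows") ? 200 : 128;
    }

    // Reads an integer from the given file, returning the fallback if the
    // file is missing, unreadable, or does not contain a number.
    static int readSomaxconn(Path procFile, int fallback) {
        try {
            return Integer.parseInt(Files.readString(procFile).trim());
        } catch (IOException | NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Path linuxSetting = Paths.get("/proc/sys/net/core/somaxconn");
        int backlog = readSomaxconn(linuxSetting, platformFallback());
        System.out.println("OS-derived default backlog: " + backlog);
    }
}
```

On a host where an administrator has raised `somaxconn`, a default derived this way follows the OS setting, which is why a hardcoded application default of 64 can silently lower the effective backlog.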
|
Ah, in that case it might be a better idea to revert the configuration change. Want to open a PR? |
|
Does it make any sense to leave the default but make it higher, if the default can vary a lot and is generally lower than what's practical? (not sure if that's true) |
|
I think that since Netty makes an effort to look at the OS configuration to define the default value, that it makes more sense to not have Spark override that. |
|
@vanzin will do. @srowen to be honest, I don't see any flaws in the previous implementation (except the poor documentation, which is already resolved). Meanwhile, making any assumptions without context will lead to bigger problems anyway. There is basically no issue that needs to be addressed by adjusting the default value. So, let it be |
|
cc @dbtsai , too. |
|
+1 for reverting. |
The default value for backLog set back to -1, as any other value may break existing configuration by overriding Netty's default io.netty.util.NetUtil#SOMAXCONN. The documentation accordingly adjusted. See discussion thread: apache#24732
|
Pull request has been sent. Not sure if this case requires a separate Jira ticket. Reused original SPARK-27868. |
The default value for backLog set back to -1, as any other value may break existing configuration by overriding Netty's default io.netty.util.NetUtil#SOMAXCONN. The documentation accordingly adjusted. See discussion thread: #24732 ### What changes were proposed in this pull request? Partial rollback of #24732 (default for backLog set back to -1). ### Why are the changes needed? Previous change introduces backward incompatibility by overriding default of Netty's `io.netty.util.NetUtil#SOMAXCONN` Closes #27230 from xCASx/master. Authored-by: Maxim Kolesnikov <swe.kolesnikov@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
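The effect of restoring -1 can be sketched as a sentinel check: a non-positive configured value means "do not override", so the networking library's own default stays in force. The code below is a self-contained illustration, not Spark's implementation. The class and method names are hypothetical, and the key `spark.shuffle.io.backLog` is an assumed example of what `SPARK_NETWORK_IO_BACKLOG_KEY` resolves to for the shuffle module.

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical demo class; mirrors the "-1 means no override" convention.
public class BacklogResolutionSketch {

    // Assumed example key; the real key depends on the transport module.
    static final String BACKLOG_KEY = "spark.shuffle.io.backLog";

    // Returns the user's backlog only when it is a positive number.
    // An empty result means the caller should leave the library default alone.
    static OptionalInt backlogOverride(Map<String, String> conf) {
        int configured = Integer.parseInt(conf.getOrDefault(BACKLOG_KEY, "-1"));
        return configured > 0 ? OptionalInt.of(configured) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // No setting: nothing is overridden.
        System.out.println(backlogOverride(Map.of()).isPresent());
        // Explicit setting: the user's value is applied.
        System.out.println(backlogOverride(Map.of(BACKLOG_KEY, "4096")).getAsInt());
    }
}
```

With this convention, only users who set the property explicitly change the listen queue length, and everyone else keeps whatever the underlying library derives from the OS.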