-
Notifications
You must be signed in to change notification settings - Fork 7.3k
ZOOKEEPER-2549 Add exception handling to sendResponse #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| sendResponse(h, e, "notification"); | ||
| } catch (IOException ex) { | ||
| LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex); | ||
| close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting change -- how/when were we closing the connection before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, that would be a pretty bad leak given all what goes on in close():
@Override
public void close() {
if (!factory.removeCnxn(this)) {
return;
}
if (zkServer != null) {
zkServer.removeCnxn(this);
}
if (sk != null) {
try {
// need to cancel this selection key from the selector
sk.cancel();
} catch (Exception e) {
if (LOG.isDebugEnabled()) {
LOG.debug("ignoring exception during selectionkey cancel", e);
}
}
}
closeSock();
}
should we add a test case for making sure close is called then? although i think it's handled somewhere else, let me re-read the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it must be closing somewhere - just need to trace where
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException
Yes, I think the connection was not closing before in cases of exception thrown from NIOServerCnxn.sendResponse which swallows everything. The change in this PR changes the behavior by closing the connection in case of exceptions occur in sendResponse. I am leaning towards the old behavior of NOT closing the connection, because the connection looks pretty innocent - in fact NIOServerCnxn.sendResponse does not involve any socket IO I believe, it just queuing stuff to be send over sockets. So if something goes wrong, we just do our best effort by logging what's wrong - rather than trying mess up with sockets which seems out of responsibilities of NIOServerCnxn.sendResponse. Similarly since NIOServerCnxn.sendResponse does not directly involve sockets, there should not be any leaks in case sendResponse screw up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove close() from catch block
| try { | ||
| sendResponse(h, e, "notification"); | ||
| } catch (IOException ex) { | ||
| LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have mixed feelings with concatenating strings in a hot path (IOException happening here is a hot path when, for instance, a network blip happens).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestion here? Not to getRemoteSocketAddress() at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using LOG.debug, so it shouldn't be an issue on prod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would use a modern debug format:
LOG.debug("Problem sending to {}", getRemoteSocketAddress(), ex);
| */ | ||
| public NIOServerCnxnFactory() { | ||
| String serverCnxnClassName = System.getProperty(ServerCnxnFactory.ZOOKEEPER_SERVER_CNXN); | ||
| if ( serverCnxnClassName != null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra spaces
|
|
||
| NIOServerCnxn cnxn = null; | ||
|
|
||
| if (serverCnxnClassCtr != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not set serverCnxnClassCtr to NIOServerCnxn by default to avoid looking up the right constructor every time we want to create a new connection (a bit of a hot path)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I tend to come back to double instantiation - as it will be used only in UnitTests (serverCnxnClassCtr != null), while manipulation with constructor (invocation of newInstance) instead of directly creating instance of NIOServerCnxn will be really a hot path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's wrong with setting serverCnxnClassCtr to NettyServerCnxn by default (by doing all the reflection magic)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing is wrong with it per say - just overhead, as we will call "newInstance" on the constructor for every code path - test or not.
| NettyServerCnxn cnxn = new NettyServerCnxn(ctx.getChannel(), | ||
| zkServer, NettyServerCnxnFactory.this); | ||
| NettyServerCnxn cnxn = null; | ||
| if (serverCnxnClassCtr != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for NIO, lets set serverCnxnClassCtr to NettyServerCnxn by default
|
Addressed review comments |
| try { | ||
| sendResponse(h, e, "notification"); | ||
| } catch (IOException ex) { | ||
| LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using LOG.debug, so it shouldn't be an issue on prod.
| bos.writeRecord(r, tag); | ||
| } | ||
| baos.close(); | ||
| } catch (IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This IOException is swallowed either, should we re-throw it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not modify this code - it was like that before, but potentially - yes it makes sense to rethrow
I would say there are multiple places I came across where exceptions are swallowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on re-throw the exception here. We could just remove the try - catch block here given there is a new try - catch block added in this PR that wraps the entire sendResponse.
We can start triaging other places where the exceptions are swallowed but should re-throw after this PR merging in.
hanm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I checked the patch locally and tests are green.
One place I'd like further discussion is the change of behavior of NIOServerCnxn.sendResponse in terms of close connection upon exception.
| boolean initialized; | ||
|
|
||
| NettyServerCnxn(Channel channel, ZooKeeperServer zks, NettyServerCnxnFactory factory) { | ||
| public NettyServerCnxn(Channel channel, ZooKeeperServer zks, NettyServerCnxnFactory factory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not obvious to me why the access specifier of NettyServerCnxn should be changed public here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did this to match NIOServerCnxn constructor. It can be kept package level, as my test class is in the same package namespace. I can change it back, but it will be inconsistent with NIO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take my words back. I need it to be public as I use reflection to create it in NettyServerCnxnFactory and if it is not I would have to do couple of more steps during init to set access to public which is unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yufeldman Thanks for explaining, makes sense to me. It is required to run the tests, however I do find another problem with tests: it looks like the Netty tests (testNetty*) never run with current configuration. Proof: remove the public access specifier appertain to NettyServerCnxn and all tests of ServerCxnExceptionsTest still pass. We expect Netty related tests fail here without public access specifier, right? Now put back the public for NettyServerCnxn but remove the public access specifier appertains to NIOServercCnxn, now all tests failed while we expect only NIO tests fail but Netty tests pass.
It's likely caused by the intervening of the java system properties that controls the Netty vs NIO server selection. One solution is to split the ServerCxnExceptionsTest into Netty and NIO specific tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanm - there is nothing wrong with the tests, they do run fine. They do not use ctor from NettyServerCnxn, but from MockNettyServerCnxn - that has public ctor.
You can make MockNettyServerCnxn not public and you will have the same issue. And BTW
org.apache.zookeeper.server.NettyServerCnxnTest fails with not public ctor in NettyServerCnxn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yufeldman Yeah the tests are fine. What I mentioned that leads to NettyServerCnxnx not get instantiated only happens in erroneous cases when ZOOKEEPER_SERVER_CNXN_FACTORY is not initialized properly - this happens when the public specifier was removed so all tests fall back to create NIOServerCnxnx instead. That's not a real alarm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanm - >>> so all tests fall back to create NIOServerCnxnx instead
It is not the case, they don't fall back to NIO - they fail.
Here is printout from running tests with both Netty and MockNettyServerCnxn not having public ctor:
[junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.159 sec
[junit] Test org.apache.zookeeper.server.NettyServerCnxnTest FAILED
[junit] Running org.apache.zookeeper.server.PrepRequestProcessorTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.279 sec
[junit] Running org.apache.zookeeper.server.PurgeTxnTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.431 sec
[junit] Running org.apache.zookeeper.server.ReferenceCountedACLCacheTest
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.089 sec
[junit] Running org.apache.zookeeper.server.SerializationPerfTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.114 sec
[junit] Running org.apache.zookeeper.server.ServerCxnExceptionsTest
[junit] Tests run: 6, Failures: 0, Errors: 6, Skipped: 0, Time elapsed: 0.284 sec
[junit] Test org.apache.zookeeper.server.ServerCxnExceptionsTest FAILED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not the case, they don't fall back to NIO - they fail.
The fallback I was referring to is https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ServerCnxnFactory.java#L130. It gets hit when public was removed from NIOServerCnxn for Netty* tests. An example call stack (note that a Netty test complaining about NIOServerCnxnFactory):
` INFO [main:ZKTestCase$1@70] - FAILED testNettyRunTimeException
java.io.IOException: Couldn't instantiate org.apache.zookeeper.server.NIOServerCnxnFactory
-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:142)
-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:158)
-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:152)`
Thanks for persisting on this, but I don't think this erroneous case need to be investigated further as it would not happen when real test cases were running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. For me it actually is:
2016-12-05 11:11:02,523 [myid:] - INFO [Time-limited test:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testNettyRunTimeException
java.io.IOException: Couldn't instantiate org.apache.zookeeper.server.NettyServerCnxnFactory at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:141) at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:157)
at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:151)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanm - thank you for all the reviews and feedback. Really appreciate it.
| bos.writeRecord(r, tag); | ||
| } | ||
| baos.close(); | ||
| } catch (IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on re-throw the exception here. We could just remove the try - catch block here given there is a new try - catch block added in this PR that wraps the entire sendResponse.
We can start triaging other places where the exceptions are swallowed but should re-throw after this PR merging in.
| sendResponse(h, e, "notification"); | ||
| } catch (IOException ex) { | ||
| LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex); | ||
| close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException
Yes, I think the connection was not closing before in cases of exception thrown from NIOServerCnxn.sendResponse which swallows everything. The change in this PR changes the behavior by closing the connection in case of exceptions occur in sendResponse. I am leaning towards the old behavior of NOT closing the connection, because the connection looks pretty innocent - in fact NIOServerCnxn.sendResponse does not involve any socket IO I believe, it just queuing stuff to be send over sockets. So if something goes wrong, we just do our best effort by logging what's wrong - rather than trying mess up with sockets which seems out of responsibilities of NIOServerCnxn.sendResponse. Similarly since NIOServerCnxn.sendResponse does not directly involve sockets, there should not be any leaks in case sendResponse screw up.
|
+1 with latest update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this patch, but I think the whole reflection/mock thing is kind of reinventing a fault injection inside the test classes. The final result is pretty cool but it raises a fundamental question: why not use a production ready framework as Byteman (byteman.jboss.org)?
I wrote a PR that strips the boilerplate stuff while leaving the feature of this PR: #123 (btw, this is an old JIRA issue raised by @andreisavu).
ant test -Dtestcase=ServerCxnExceptionsTest
Still WIP (would need runs on BSD and Windows, only tested on Linux and MacOS), so any suggestions are welcome. :) The cons is the addition of yet-another-dependencies (byteman classes), but IMHO ZooKeeper project will benefit a lot from Byteman in the long term. @rgs1 @hanm @fpj @phunt, wdyt?
| try { | ||
| sendResponse(h, e, "notification"); | ||
| } catch (IOException ex) { | ||
| LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would use a modern debug format:
LOG.debug("Problem sending to {}", getRemoteSocketAddress(), ex);
| import static org.junit.Assert.fail; | ||
|
|
||
| /** | ||
| * Unit tests to test different exceptions scenarious in sendResponse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: scenarios
|
|
||
| private static final Logger LOG = LoggerFactory.getLogger(ServerCxnExceptionsTest.class); | ||
|
|
||
| private String exceptionType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exceptionType is never used!
| zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, | ||
| CreateMode.EPHEMERAL); | ||
| fail("Should not come here"); | ||
| Stat stats = zk.exists(path, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get why we need the lines 87-90, because fail() throws an AssertionError that will interrupt the processing flow, so those lines are effectively unreachable, right? There's should be nothing more after the fail(), I guess.
Also, I would suggest to put a more meaningful message, something along the lines of sendResponse should have thrown IOException and failed this test.
|
|
||
| private String exceptionType; | ||
|
|
||
| private void NettySetup() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method name doesn't follow camel case convention (nettySetup instead of NettySetup).
| String returnPath = zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, | ||
| CreateMode.EPHEMERAL); | ||
| Stat stats = zk.exists(returnPath, false); | ||
| if ( stats != null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra spaces after the parenthesis.
| zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, | ||
| CreateMode.EPHEMERAL); | ||
| Stat stats = zk.exists(path, false); | ||
| if ( stats != null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra spaces after the parenthesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what we are even bothering of adding lines 122-124. Why?
| zk.close(); | ||
| } | ||
| } | ||
| @Test (timeout = 10000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10000 and not 60000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
| testZKRunTimeExceptionHelper(); | ||
| } | ||
|
|
||
| @Test (timeout = 10000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10000 and not 60000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
| } | ||
| } catch(Exception e) { | ||
| LOG.warn("Unexpected exception. Destruction averted.", e); | ||
| throw new IOException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion: wdyt about bubbling up a more custom message with the IOException instead of just encapsulate the Exception? I mean, something like:
throw new IOException("sendMessage exception: blah blah", e);
|
@yufeldman could you please fix some coding style issues pointed out by @eribeiro (4 spaces indentation, extra white space, etc)? I'll merge your patch once those comments are addressed. Thanks. |
|
will do tomorrow. Thanks |
Fix for:
As NettyServerCnxn.sendResponse() allows all the exception to bubble up it can stop main ZK requests processing thread
Same changes done for NIOServerCnxn