Skip to content

Conversation

@yufeldman
Copy link

Fix for:
As NettyServerCnxn.sendResponse() allows all the exception to bubble up it can stop main ZK requests processing thread
Same changes done for NIOServerCnxn

sendResponse(h, e, "notification");
} catch (IOException ex) {
LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex);
close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an interesting change -- how/when were we closing the connection before?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException

Copy link
Member

@rgs1 rgs1 Nov 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that would be a pretty bad leak given all what goes on in close():

    @Override                                                                                                                                                                     
    public void close() {                                                                                                                                                         
        if (!factory.removeCnxn(this)) {                                                                                                                                          
            return;                                                                                                                                                               
        }                                                                                                                                                                         

        if (zkServer != null) {                                                                                                                                                   
            zkServer.removeCnxn(this);                                                                                                                                            
        }                                                                                                                                                                         

        if (sk != null) {                                                                                                                                                         
            try {                                                                                                                                                                 
                // need to cancel this selection key from the selector                                                                                                            
                sk.cancel();                                                                                                                                                      
            } catch (Exception e) {                                                                                                                                               
                if (LOG.isDebugEnabled()) {                                                                                                                                       
                    LOG.debug("ignoring exception during selectionkey cancel", e);                                                                                                
                }                                                                                                                                                                 
            }                                                                                                                                                                     
        }                                                                                                                                                                         

        closeSock();                                                                                                                                                              
    } 

should we add a test case for making sure close is called then? although i think it's handled somewhere else, let me re-read the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it must be closing somewhere - just need to trace where

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException

Yes, I think the connection was not closing before in cases of exception thrown from NIOServerCnxn.sendResponse which swallows everything. The change in this PR changes the behavior by closing the connection in case of exceptions occur in sendResponse. I am leaning towards the old behavior of NOT closing the connection, because the connection looks pretty innocent - in fact NIOServerCnxn.sendResponse does not involve any socket IO I believe, it just queuing stuff to be send over sockets. So if something goes wrong, we just do our best effort by logging what's wrong - rather than trying mess up with sockets which seems out of responsibilities of NIOServerCnxn.sendResponse. Similarly since NIOServerCnxn.sendResponse does not directly involve sockets, there should not be any leaks in case sendResponse screw up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove close() from catch block

try {
sendResponse(h, e, "notification");
} catch (IOException ex) {
LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have mixed feelings with concatenating strings in a hot path (IOException happening here is a hot path when, for instance, a network blip happens).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestion here? Not to getRemoteSocketAddress() at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using LOG.debug, so it shouldn't be an issue on prod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would use a modern debug format:

LOG.debug("Problem sending to {}", getRemoteSocketAddress(), ex);

*/
public NIOServerCnxnFactory() {
String serverCnxnClassName = System.getProperty(ServerCnxnFactory.ZOOKEEPER_SERVER_CNXN);
if ( serverCnxnClassName != null ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra spaces


NIOServerCnxn cnxn = null;

if (serverCnxnClassCtr != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not set serverCnxnClassCtr to NIOServerCnxn by default to avoid looking up the right constructor every time we want to create a new connection (a bit of a hot path)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I tend to come back to double instantiation - as it will be used only in UnitTests (serverCnxnClassCtr != null), while manipulation with constructor (invocation of newInstance) instead of directly creating instance of NIOServerCnxn will be really a hot path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong with setting serverCnxnClassCtr to NettyServerCnxn by default (by doing all the reflection magic)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing is wrong with it per say - just overhead, as we will call "newInstance" on the constructor for every code path - test or not.

NettyServerCnxn cnxn = new NettyServerCnxn(ctx.getChannel(),
zkServer, NettyServerCnxnFactory.this);
NettyServerCnxn cnxn = null;
if (serverCnxnClassCtr != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for NIO, lets set serverCnxnClassCtr to NettyServerCnxn by default

@yufeldman
Copy link
Author

Addressed review comments

try {
sendResponse(h, e, "notification");
} catch (IOException ex) {
LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using LOG.debug, so it shouldn't be an issue on prod.

bos.writeRecord(r, tag);
}
baos.close();
} catch (IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This IOException is swallowed either, should we re-throw it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not modify this code - it was like that before, but potentially - yes it makes sense to rethrow
I would say there are multiple places I came across where exceptions are swallowed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on re-throw the exception here. We could just remove the try - catch block here given there is a new try - catch block added in this PR that wraps the entire sendResponse.
We can start triaging other places where the exceptions are swallowed but should re-throw after this PR merging in.

@rgs1
Copy link
Member

rgs1 commented Dec 3, 2016

@fpj, @hanm could you take a look? I am +1, but another look would be nice.

Copy link
Contributor

@hanm hanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I checked the patch locally and tests are green.

One place I'd like further discussion is the change of behavior of NIOServerCnxn.sendResponse in terms of close connection upon exception.

boolean initialized;

NettyServerCnxn(Channel channel, ZooKeeperServer zks, NettyServerCnxnFactory factory) {
public NettyServerCnxn(Channel channel, ZooKeeperServer zks, NettyServerCnxnFactory factory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not obvious to me why the access specifier of NettyServerCnxn should be changed public here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did this to match NIOServerCnxn constructor. It can be kept package level, as my test class is in the same package namespace. I can change it back, but it will be inconsistent with NIO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take my words back. I need it to be public as I use reflection to create it in NettyServerCnxnFactory and if it is not I would have to do couple of more steps during init to set access to public which is unnecessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yufeldman Thanks for explaining, makes sense to me. It is required to run the tests, however I do find another problem with tests: it looks like the Netty tests (testNetty*) never run with current configuration. Proof: remove the public access specifier appertain to NettyServerCnxn and all tests of ServerCxnExceptionsTest still pass. We expect Netty related tests fail here without public access specifier, right? Now put back the public for NettyServerCnxn but remove the public access specifier appertains to NIOServercCnxn, now all tests failed while we expect only NIO tests fail but Netty tests pass.

It's likely caused by the intervening of the java system properties that controls the Netty vs NIO server selection. One solution is to split the ServerCxnExceptionsTest into Netty and NIO specific tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanm - there is nothing wrong with the tests, they do run fine. They do not use ctor from NettyServerCnxn, but from MockNettyServerCnxn - that has public ctor.
You can make MockNettyServerCnxn not public and you will have the same issue. And BTW
org.apache.zookeeper.server.NettyServerCnxnTest fails with not public ctor in NettyServerCnxn

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yufeldman Yeah the tests are fine. What I mentioned that leads to NettyServerCnxnx not get instantiated only happens in erroneous cases when ZOOKEEPER_SERVER_CNXN_FACTORY is not initialized properly - this happens when the public specifier was removed so all tests fall back to create NIOServerCnxnx instead. That's not a real alarm.

Copy link
Author

@yufeldman yufeldman Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanm - >>> so all tests fall back to create NIOServerCnxnx instead

It is not the case, they don't fall back to NIO - they fail.
Here is printout from running tests with both Netty and MockNettyServerCnxn not having public ctor:
[junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.159 sec
[junit] Test org.apache.zookeeper.server.NettyServerCnxnTest FAILED
[junit] Running org.apache.zookeeper.server.PrepRequestProcessorTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.279 sec
[junit] Running org.apache.zookeeper.server.PurgeTxnTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.431 sec
[junit] Running org.apache.zookeeper.server.ReferenceCountedACLCacheTest
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.089 sec
[junit] Running org.apache.zookeeper.server.SerializationPerfTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.114 sec
[junit] Running org.apache.zookeeper.server.ServerCxnExceptionsTest
[junit] Tests run: 6, Failures: 0, Errors: 6, Skipped: 0, Time elapsed: 0.284 sec
[junit] Test org.apache.zookeeper.server.ServerCxnExceptionsTest FAILED

Copy link
Contributor

@hanm hanm Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the case, they don't fall back to NIO - they fail.

The fallback I was referring to is https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ServerCnxnFactory.java#L130. It gets hit when public was removed from NIOServerCnxn for Netty* tests. An example call stack (note that a Netty test complaining about NIOServerCnxnFactory):
` INFO [main:ZKTestCase$1@70] - FAILED testNettyRunTimeException
java.io.IOException: Couldn't instantiate org.apache.zookeeper.server.NIOServerCnxnFactory

-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:142)
-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:158)
-------at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:152)`

Thanks for persisting on this, but I don't think this erroneous case need to be investigated further as it would not happen when real test cases were running.

Copy link
Author

@yufeldman yufeldman Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. For me it actually is:

2016-12-05 11:11:02,523 [myid:] - INFO [Time-limited test:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testNettyRunTimeException
java.io.IOException: Couldn't instantiate org.apache.zookeeper.server.NettyServerCnxnFactory at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:141) at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:157)
at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:151)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanm - thank you for all the reviews and feedback. Really appreciate it.

bos.writeRecord(r, tag);
}
baos.close();
} catch (IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on re-throw the exception here. We could just remove the try - catch block here given there is a new try - catch block added in this PR that wraps the entire sendResponse.
We can start triaging other places where the exceptions are swallowed but should re-throw after this PR merging in.

sendResponse(h, e, "notification");
} catch (IOException ex) {
LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex);
close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not closing (I think) before as exception was swallowed since sendResponse in NIOServerCnxn was not throwing IOException

Yes, I think the connection was not closing before in cases of exception thrown from NIOServerCnxn.sendResponse which swallows everything. The change in this PR changes the behavior by closing the connection in case of exceptions occur in sendResponse. I am leaning towards the old behavior of NOT closing the connection, because the connection looks pretty innocent - in fact NIOServerCnxn.sendResponse does not involve any socket IO I believe, it just queuing stuff to be send over sockets. So if something goes wrong, we just do our best effort by logging what's wrong - rather than trying mess up with sockets which seems out of responsibilities of NIOServerCnxn.sendResponse. Similarly since NIOServerCnxn.sendResponse does not directly involve sockets, there should not be any leaks in case sendResponse screw up.

@hanm
Copy link
Contributor

hanm commented Dec 5, 2016

+1 with latest update.

Copy link
Contributor

@eribeiro eribeiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this patch, but I think the whole reflection/mock thing is kind of reinventing a fault injection inside the test classes. The final result is pretty cool but it raises a fundamental question: why not use a production ready framework as Byteman (byteman.jboss.org)?

I wrote a PR that strips the boilerplate stuff while leaving the feature of this PR: #123 (btw, this is an old JIRA issue raised by @andreisavu).

ant test -Dtestcase=ServerCxnExceptionsTest

Still WIP (would need runs on BSD and Windows, only tested on Linux and MacOS), so any suggestions are welcome. :) The cons is the addition of yet-another-dependencies (byteman classes), but IMHO ZooKeeper project will benefit a lot from Byteman in the long term. @rgs1 @hanm @fpj @phunt, wdyt?

try {
sendResponse(h, e, "notification");
} catch (IOException ex) {
LOG.debug("Problem sending to " + getRemoteSocketAddress(), ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would use a modern debug format:

LOG.debug("Problem sending to {}", getRemoteSocketAddress(), ex);

import static org.junit.Assert.fail;

/**
* Unit tests to test different exceptions scenarious in sendResponse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: scenarios


private static final Logger LOG = LoggerFactory.getLogger(ServerCxnExceptionsTest.class);

private String exceptionType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exceptionType is never used!

zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
fail("Should not come here");
Stat stats = zk.exists(path, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get why we need the lines 87-90, because fail() throws an AssertionError that will interrupt the processing flow, so those lines are effectively unreachable, right? There's should be nothing more after the fail(), I guess.

Also, I would suggest to put a more meaningful message, something along the lines of sendResponse should have thrown IOException and failed this test.


private String exceptionType;

private void NettySetup() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method name doesn't follow camel case convention (nettySetup instead of NettySetup).

String returnPath = zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
Stat stats = zk.exists(returnPath, false);
if ( stats != null ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra spaces after the parenthesis.

zk.create(path, "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
Stat stats = zk.exists(path, false);
if ( stats != null ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra spaces after the parenthesis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get what we are even bothering of adding lines 122-124. Why?

zk.close();
}
}
@Test (timeout = 10000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10000 and not 60000?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

testZKRunTimeExceptionHelper();
}

@Test (timeout = 10000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10000 and not 60000?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

}
} catch(Exception e) {
LOG.warn("Unexpected exception. Destruction averted.", e);
throw new IOException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion: wdyt about bubbling up a more custom message with the IOException instead of just encapsulate the Exception? I mean, something like:

throw new IOException("sendMessage exception: blah blah", e);

@hanm
Copy link
Contributor

hanm commented Dec 23, 2016

@yufeldman could you please fix some coding style issues pointed out by @eribeiro (4 spaces indentation, extra white space, etc)? I'll merge your patch once those comments are addressed. Thanks.

@yufeldman
Copy link
Author

will do tomorrow. Thanks

@anmolnar anmolnar closed this Jan 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants