-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resource leak in netty tests #3353
Comments
BTW, I see slightly different stacktrace at our production grpc-based backend:
|
@alexvas What you are seeing is a different bug. Would you mind filing a separate issue? This one is specifically for tests. (and thus not as urgent as yours) @ericgribkoff Message deframing change possibly? |
@carl-mastrangelo done in #3422 |
@kannanjgithub As per the defect details I can run the whole test class as well as single unit test (NettyServerHandlerTest.headersWithInvalidContentTypeShouldFail()) which is mentioned in error logs of the defect successfully. PFA the Successful Test class which ran in my local machine. Even I tried some of the approaches/details which is mentioned in the below documentation for finding the Resource leaks in unit tests but unable to see any. Also has you mentioned earlier some classes are mocked which are going to used in the setup() method the get required streamListner/streamTraces object instance from Factory which is required by all the tests otherwise we will be getting NPE saying streamTracerFactory is null. Need to have a connect and have more discussion on the same with you to make sure I am not missing anything to see this issue |
As per the suggestion from @kannanjgithub working in the direction to enable to info/debug logs through logback.xml for the UT case to re-produce issue and the same in progress. |
@kannanjgithub As per your suggestion I am able to configure logback.xml and enable the loggers for UT cases but still it’s not showing required stack trace with any resource leakage issues/errors as mentioned in the defect. We need to have connect and discuss more on this to make sure we are not missing anything to Reproduce this issue. Please suggest. |
Can we try using the older Netty dependencies from before the time this issue got reported and see if we are able to repro it? We can try with netty 4.0.50 (release history for netty) netty-tcnative 2.0.5 (release history for netty-tcnative) and run the test. You need to modify the netty and nettytcnative versions in gradle/libs.versions.toml. |
@kannanjgithub tried with above mentioned old versions for netty and netty-tcnnative along multiple combination versions but most of them are giving issues with project build phase as mentioned in below snippet also followed some of the working combinations for above libraries mentioned in the "https://github.com/vinodhabib/grpc-java/blob/master/SECURITY.md" but here also initial combinations are failing at build level. not sure the old libraries available in the maven antifactory as i am not able to access the same to check. Please suggest further? are you able to build the grpc-java project with Old versions of netty as mentioned above @kannanjgithub? |
For this we need to enable the leak detector at PARANOID level. https://netty.io/wiki/reference-counted-objects.html#leak-detection-levels |
@ejona86 Thanks for your reply as per your above suggestion I tried with using the JVM option ( -Dio.netty.leakDetection.level=paranoid') for NettyServerHandlerTest and able to see the below error with leakage issue.
But the problem is I am unable to reproduce the same error with individual UT case running (i,e UT case where I can see the same issue while running the complete test class which I can see in the trace logs) as that will be helpful for me to Debug the issue. Any Suggestions here? @ejona86 @kannanjgithub Expected fix for this issue is we should not see this above Leakage error while running the NettyServerHandlerTest class, right? Please confirm. Further Analysis/Debug in progress for the same. |
These sorts of failures won't tend to show up if you run a single test by itself, because it requires the GC running to notice the object was leaked. You might be able to add an |
Thanks for the suggestions @ejona86. I will look into this. |
@ejona86 Tried the above suggestion for NettyServerHandlerTest class but it's not giving the Leakage issue for individual test, also not sure I am missing something as I don't see any existing reference for the same in the grpc-java codebase. |
ResourceLeakDetector checks for leaks on certain ByteBuf operations, so after the GC, create a new ByteBuf. |
@ejona86 I tried your above suggestions but still not getting the leakage info while running the single UT case, can see the below tearDown method is getting called after every individual UT run in NettyServerHandlerTest class. Looks like we need to connect on this and discuss further along with @kannanjgithub, please let us know your suitable time/availability. |
@vinodhabib, it sounds like you already have a reproduction, just not as small of a reproduction as you want. NettyServerHandlerTest runs in ~3 seconds, so that should be fast enough to work with. I suggest looking into the leak information you've already gathered. If we need to meet to discuss that, we can; I'll let kannanjgithub create any meetings as appropriate. |
@ejona86 Thanks for your response. Yes, I am already working in this direction as you mentioned above but I feel debug process will be more effective with Single UT case instead of checking all the 66 UT at once. |
@ejona86 @kannanjgithub I was going through the Individual UT cases and trying to close the created ByteBuf operations in the finally block but still the same Leakage issue is coming as per suggestion in the netty document for reference counted objects to fix the leakage in JUnit's as mentioned in the below snippet. Note: ReferenceCountUtil.releaseLater() is deprecated but still tried. Currently I am blocked with this issue and not sure like this is just a test issue at netty or it's a code issue? also not sure like my analysis is going in right direction? I found one similar issue related to Resource Leak (#9563) which was already merged which says about fix at code level of removing the deadline and disableRetry() Need your help/support on this, better if we schedule a meeting on this to discuss more. |
Randomly modifying reference counts generally won't get you far. You need to look at where the buffer was created, and use that to help you understand where it should have been closed. Looking at the original stack trace, as an example, was on the read path. And was a buffer created by grpc/netty in response to the read frame and would have been sent back out on the channel. So in this case the buffer would have been passed to the test via the
For the one you posted earlier:
That stack shows a timer fired and caused the shutdown of the channel (probably maxConnectionAge or maxConnectionIdle). During the shutdown process it sends a PING and expects a PING ack in reply. That ping would be in the same
And looking at the test, it doesn't free a single message from |
@ejona86 Thanks for the Reply. |
In NettyServerHandler::handlerAdded there is a ChannelHandlerContext passed by the grpc server side code, and during the grpc server's handling of a rpc read or write, the code sending this context that wraps the EmbeddedChannel will have the responsibility of freeing up buffers. That should be already existing and be occurring when an rpc gets handled, and that may be in another layer of code outside the grpc netty package, or lie in some post processing code path in grpc netty code that gets called during real rpc invocation. |
FYI: I just stumbled on an old branch of mine. It might be helpful, but I make no guarantees. I see |
@ejona86 Thanks for the above response and suggestion. Remaining 30% Tests still giving Leakage error randomly along with 3 UT Failures (currently disabled locally) Further working on them. Failed UT: UT which are Failing Randomly with Leakage Error : NettyHandlerTestBase.windowUpdateMatchesTarget cc : @kannanjgithub |
Fixed below 3 UT cases for Leakage issue and Its working as Expected. further UT fixes are in progress. NettyHandlerTestBase.bdpPingWindowResizing |
@kannanjgithub @ejona86 with my continuous analysis i can see below are the 4 UT's which are failing so frequently/intermittently with Leakage issue even after free up the buffer as i can see in the debug mode the outbound messages getting released after execution of this channel().releaseOutbound() call. NettyHandlerTestBase.windowUpdateMatchesTarget Further looking into these and other Failed UT's which are commented currently. |
@kannanjgithub Fixed below 4 Failing UTs at Netty Server and Client Handler Tests. NettyServerHandlerTest.streamErrorShouldNotCloseChannel() NettyClientHandlerTest.sendFrameShouldSucceed() |
@kannanjgithub Out of 4 UTs now with recent changes 2 UTs got Fixed and 2 are Frequently Failing. Further Analysis/Debug/Fix in progress for the below Failing UTs. NettyHandlerTestBase.testPingBackoff Note : Validated the same by commenting these 2 UTs locally. |
Hi @kannanjgithub @ejona86 As per my recent Analysis found like below are the 4 UT's in NettyHandlerTestBase are tightly coupled or inter dependent on each other as i can see when i run UT's 1 & 2 with commenting/disable UT's 3 and 4, I don't see any leakage issue and its stable.
When i run the UT's with uncommenting/enable 3 & 4, I can see the Leakage issue in UT's 1 & 2 also in 3 & 4 with Leakage error with most of the time with method readXCopies(1, data1Kb); which is called multiple times with different params/input. While debug these UT's (3 and 4) i found like resource was not closed then i have called channel().releaseOutbound() in all the possible places to release the buffer but still the UT's giving the intermittent leakage behavior. Please suggest and need your help/support here with these UTs (3 & 4) Stack Trace of the Leakage Error in UT 3 & 4 with readXCopies(1, data1Kb); most of the times. |
The unreleased ByteBuf mentioned in the stacktrace is not about the outbound queue messages. The ByteBuf created in NettyHandlereTestBase::grpcFrame needs to be released. |
@kannanjgithub As discussed in today's call below changes in the readXcopies() method not making any difference behavior is still same. private void readXCopies(int copies, byte[] data) throws Exception { Further i will look into NettyHandlereTestBase::grpcFrame as per your above comment. |
My suggestion was to call ByteBuf.release(), not ByteBuf.clear(). |
You can't modify the frame after you've passed it to The buffers are most likely stuck in |
@kannanjgithub if if use the ByteBuf.release() then i will get below Exception so i used ByteBuf.clear() |
@ejona86 Thanks for your above suggestion and i will look into it further. |
@kannanjgithub I tried all the possible ways to release the buffer (around dataFrame() method ) for the pending 2 UT's mentioned above in previous comments also tried the suggestions (around streamListenerMessageQueue) but still facing the same and below is the only common issue as mentioned in the snippet as per the log stack trace. Kindly suggest. Note : Dropped a mail with details about all tried approaches for the same. |
@vinodhabib, the grpcDataFrame are messages sent on an RPC but never read. The data needs to be consumed via streamTransportState... and there's no stream object, so... seems like requestMessagesFromDeframerForTesting() to request messages and streamListenerMessageQueue to close the InputStreams holding the messages. (At least that's how it'd look on client-side.) |
@ejona86 @kannanjgithub was working on this suggested approach and simulated similar changes in NettyServerHandlerTest class. Kindly suggest further on the same. Below is the code snippet and console statements which says the messages got released.
|
Please provide the leak detection error message that occurs even after the changes. |
|
That is broken. |
I don't see a problem with the loop. i is unused (and it should have just been a while loop) but the condition to continue the loop (streamListenerMessageQueue.size() > 0 ) is correct. |
Oh, I see. It just didn't use |
I haven't tracked down the responsible, but I ran with leak detection and found this:
The text was updated successfully, but these errors were encountered: