
Conversation

@caneGuy (Contributor) commented Jul 24, 2017

What changes were proposed in this pull request?

In our production cluster, an OOM occurs when NettyBlockRpcServer receives an OpenBlocks message. The cause we observed is as follows:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses Unpooled.wrappedBuffer(ByteBuffer... buffers), which relies on the default maxNumComponents=16 of the underlying CompositeByteBuf. When the number of components exceeds 16, CompositeByteBuf executes consolidateIfNeeded:

    int numComponents = this.components.size();
    if (numComponents > this.maxNumComponents) {
        int capacity = ((CompositeByteBuf.Component) this.components.get(numComponents - 1)).endOffset;
        // A new buffer as large as the whole payload is allocated here...
        ByteBuf consolidated = this.allocBuffer(capacity);

        // ...and every existing component is copied into it.
        for (int c = 0; c < numComponents; ++c) {
            CompositeByteBuf.Component c1 = (CompositeByteBuf.Component) this.components.get(c);
            ByteBuf b = c1.buf;
            consolidated.writeBytes(b);
            c1.freeIfNecessary();
        }

        CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated);
        var7.endOffset = var7.length;
        this.components.clear();
        this.components.add(var7);
    }

in CompositeByteBuf, which consumes extra memory during the buffer copy.
We can use the other overload, Unpooled.wrappedBuffer(int maxNumComponents, ByteBuffer... buffers), to avoid this consumption.
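The cost described above can be sketched with a small, self-contained model (plain java.nio, not netty; the class and method names below are hypothetical, written only to illustrate the mechanism): once the component count exceeds maxNumComponents, the entire payload is copied into one freshly allocated buffer, so the whole payload transiently exists twice in memory.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of CompositeByteBuf's consolidation cost.
public class ConsolidationSketch {
    static final int MAX_NUM_COMPONENTS = 16; // netty's default limit

    // Returns the number of extra bytes allocated to "consolidate":
    // zero while the component count stays within the limit, otherwise
    // the full payload size (one big allocation plus a copy of every
    // component), which is what spiked memory on the server side.
    static long consolidationCost(List<ByteBuffer> components, int maxNumComponents) {
        if (components.size() <= maxNumComponents) {
            return 0L; // components are kept as-is, no copy
        }
        long capacity = 0L;
        for (ByteBuffer b : components) {
            capacity += b.remaining();
        }
        return capacity;
    }

    public static void main(String[] args) {
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            chunks.add(ByteBuffer.allocate(4 * 1024 * 1024)); // 32 x 4 MiB
        }
        // Default limit of 16: the full 128 MiB payload is copied once more.
        System.out.println(consolidationCost(chunks, MAX_NUM_COMPONENTS));
        // Limit raised to the chunk count (the idea behind this PR): no copy.
        System.out.println(consolidationCost(chunks, chunks.size()));
    }
}
```

Passing a maxNumComponents at least as large as the number of chunks keeps the composite buffer in its zero-copy path, which is exactly what the proposed overload enables.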

How was this patch tested?

Tested in our production cluster.

@caneGuy (Contributor, Author) commented Jul 25, 2017

Can anyone help verify this? Thanks very much.

The change to ChunkedByteBuffer#toNetty (as first proposed, with chunks.length + 1):

    def toNetty: ByteBuf = {
    -  Unpooled.wrappedBuffer(getChunks(): _*)
    +  Unpooled.wrappedBuffer(chunks.length + 1, getChunks(): _*)
    }
Review comment on the diff (Member):

Why chunks.length+1 instead of chunks.length?

Reply (Contributor, Author):

I have changed the value to chunks.length.

@kiszk (Member) commented Jul 25, 2017

IIUC, this PR avoids the data copy for consolidation in netty.buffer.CompositeByteBuf.consolidateIfNeeded. I think that is reasonable.

I have two questions to understand this change.

  1. Where does the OOM happen in consolidateIfNeeded in the original code?
  2. Is there any downside (e.g. performance degradation) to keeping many components in CompositeByteBuf?

@caneGuy (Contributor, Author) commented Jul 25, 2017

@kiszk Thanks for your time.
Yes, you are right.
1. The OOM happens at io.netty.buffer.CompositeByteBuf.allocBuffer when one executor (the client) fetches blocks from another (the server); the OOM occurs on the server side before it responds to the client.
2. The number of components depends on the number of chunks. The underlying structure is a list, and since we never remove or update components in the list, I cannot think of any downside in our usage. By not consolidating, we avoid the copy, which improves performance.
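The "no downside in our usage" point can be illustrated with a minimal sketch (plain java.nio, not netty; the class and method names are hypothetical): reading from the logical concatenation of many components just walks the list and copies only the requested range, so a long component list adds no cost for a write-once, read-sequential buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch of a gather-read over a list of buffer components.
public class GatherRead {
    // Copies dst.length bytes starting at absolute offset `offset` out of
    // the logical concatenation of `components` into `dst`, without ever
    // materializing the concatenation.
    static void read(List<ByteBuffer> components, long offset, byte[] dst) {
        int copied = 0;
        long pos = 0; // absolute offset of the current component
        for (ByteBuffer comp : components) {
            int size = comp.remaining();
            if (copied < dst.length && offset < pos + size) {
                int start = (int) Math.max(0, offset - pos);
                int n = Math.min(size - start, dst.length - copied);
                ByteBuffer dup = comp.duplicate(); // don't disturb the original
                dup.position(dup.position() + start);
                dup.get(dst, copied, n);
                copied += n;
            }
            pos += size;
        }
    }

    public static void main(String[] args) {
        List<ByteBuffer> chunks = List.of(
            ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8)),
            ByteBuffer.wrap("def".getBytes(StandardCharsets.UTF_8)),
            ByteBuffer.wrap("ghi".getBytes(StandardCharsets.UTF_8)));
        byte[] out = new byte[4];
        read(chunks, 2, out); // spans the first and second component
        System.out.println(new String(out, StandardCharsets.UTF_8)); // cdef
    }
}
```

The read path never allocates anything proportional to the payload, which is why keeping all chunks as separate components is safe here.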

@kiszk (Member) commented Jul 25, 2017

Thank you. 1. makes sense, since memory is briefly doubled at that point.
While I looked at a PR for netty, I could not understand why 16 was chosen.

I have another question to understand why this happens. Does this OOM occur whenever any OpenBlocks message is received, or only in a specific scenario (e.g. receiving a large message, a lot of messages at once, and so on)?

@caneGuy (Contributor, Author) commented Jul 25, 2017

@kiszk Actually, I am confused by the default value of 16 too.
Yes, it occurs in a specific scenario. In our case, large block data caused this issue.

@kiszk (Member) commented Jul 25, 2017

Sounds good to me.
cc @JoshRosen

@zsxwing (Member) commented Jul 25, 2017

ok to test

@zsxwing (Member) commented Jul 25, 2017

LGTM

@SparkQA commented Jul 25, 2017

Test build #79943 has finished for PR 18723 at commit 4c576a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Jul 26, 2017

Thanks! Merging to master.

asfgit closed this in 1661263 on Jul 26, 2017
@caneGuy (Contributor, Author) commented Jul 26, 2017

Thanks @kiszk @zsxwing

caneGuy deleted the zhoukang/fix-chunkbuffer branch on July 26, 2017 02:44