Conversation

@szetszwo
Contributor

What changes were proposed in this pull request?

Change ChunkBuffer to allocate direct buffers in order to avoid buffer copying.
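
For context, a minimal sketch (not code from this PR) of the difference the change targets: a heap buffer is backed by a byte[] and typically gets copied into native memory before channel I/O, while a direct buffer can be handed to the OS without that extra copy.

```java
import java.nio.ByteBuffer;

public class DirectVsHeap {
  public static void main(String[] args) {
    // Heap buffer: backed by a byte[]; channel I/O usually copies it
    // into a temporary native buffer before writing to a socket or file.
    ByteBuffer heap = ByteBuffer.allocate(4 * 1024 * 1024);

    // Direct buffer: allocated in native memory; channels can use it
    // without the extra copy, which is what this change relies on.
    ByteBuffer direct = ByteBuffer.allocateDirect(4 * 1024 * 1024);

    System.out.println("heap.isDirect()   = " + heap.isDirect());
    System.out.println("direct.isDirect() = " + direct.isDirect());
  }
}
```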

What is the link to the Apache JIRA

HDDS-9536

How was this patch tested?

Modified existing tests.

@duongkame
Contributor

duongkame commented Oct 27, 2023

Thanks for the patch @szetszwo. Is this optimization for the WriteChunk path, or only for the read path? Looks like it's only for reads.

I believe the problem with WriteChunk starts from the moment the ContainerCommandRequestProto is parsed from the Ratis log entry. The resulting StateMachineLogEntry is not a NioByteString backed by a direct buffer, probably because LogEntryProto (StateMachineLogEntryProto#data) is not backed by a direct buffer.
I think the notion of "zero copy" has to start with the Ratis log reader, i.e., the starting point of the data flow, and I guess that is part of ratis-streaming and cannot be achieved with an Ozone-only code change.

@umamaheswararao
Contributor

Yes, looking at the changes in KeyValueHandler.java, the changes seem to be for readChunk. Are you planning a separate JIRA for writeChunk?

@szetszwo
Contributor Author

@duongkame, @umamaheswararao, this PR mainly changes ChunkBuffer so that it allocates direct instead of non-direct buffers. You are right that it mostly benefits reads, not writes.

For Write, the buffers are allocated by the gRPC server when it receives the requests. Let me check whether those buffers are direct. If the buffers from gRPC are direct, we might be copying them into non-direct buffers somewhere in our code.

... WriteChunk starts from the moment the ContainerCommandRequestProto is parsed from the Ratis log entry. The resulting StateMachineLogEntry is not a NioByteString backed by a direct buffer. ...

Thanks for the hint. Let me check.
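
A quick way to check this (a sketch, not code in this PR; `data` stands for the ByteString taken from the incoming request):

```java
import java.nio.ByteBuffer;
import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

final class ByteStringProbe {
  /** Log-friendly description of how a ByteString is backed. */
  static String describe(ByteString data) {
    // A LiteralByteString wraps a heap byte[]; a NioByteString wraps a
    // ByteBuffer, which may or may not be direct.
    final ByteBuffer view = data.asReadOnlyByteBuffer();
    return data.getClass().getSimpleName() + ", direct=" + view.isDirect();
  }
}
```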

@szetszwo
Contributor Author

2023-10-27 09:52:40,873 INFO server.GrpcClientProtocolService (GrpcClientProtocolService.java:onNext(244)) - XXX gRPC class org.apache.ratis.thirdparty.com.google.protobuf.ByteString$LiteralByteString

In a Ratis test, it shows that the ByteString received by gRPC is unfortunately a LiteralByteString, not a NioByteString.

@szetszwo
Contributor Author

... I think the notion of "zero copy" has to start with the Ratis log reader, i.e., the starting point of the data flow. ...

The client requests are received from the network (for the leader, directly from the client; for followers, as log entries from the leader). It should not be related to the log reader unless the cache is full and the entry has been invalidated. This reminds me that Ozone does have a ContainerStateMachine cache bug.

Anyway, the ByteString received by gRPC unfortunately does not seem to be a NioByteString. Let me verify it in Ozone.

... I guess that is part of ratis-streaming and cannot be achieved with an Ozone-only code change.

Yes, Ratis Streaming uses Netty directly. It uses neither gRPC nor Protobuf for data. (It does use Protobuf for headers.)

@szetszwo
Contributor Author

I tried testing Ratis with a larger message size (32 MB) to see if gRPC would switch to using NioByteString. Unfortunately, no!

2023-10-27 10:49:51,701 INFO server.GrpcClientProtocolService (GrpcClientProtocolService.java:onNext(246)) - XXX message size 33554454, class LiteralByteString

@szetszwo
Contributor Author

Found that gRPC uses CodedInputStream to decode incoming messages, and there is an UnsafeDirectNioDecoder implementation. Not sure if we can configure gRPC to use it.

@duongkame
Contributor

duongkame commented Oct 29, 2023

Found that gRPC uses CodedInputStream to decode incoming messages, and there is an UnsafeDirectNioDecoder implementation. Not sure if we can configure gRPC to use it.

Thanks for digging into this, @szetszwo.

It seems gRPC recently finalized the APIs to support zero-copy; details in grpc/grpc-java#7387. This implies some effort to configure the right marshaller for the CodedInputStream to leverage NIO, for example GoogleCloudPlatform/grpc-gcp-java#77.

Today datanodes clone and copy a WriteChunk data buffer 3+ times, because the LogEntryProto is not marshaled using an NIO ByteBuffer (so any subsequent operation on the log proto just copies buffers instead of deriving views).
I'm not sure if we should follow the route of trying to make Ratis with gRPC zero-copy, or directly switch to ratis-streaming and deprecate the current approach. @kerneltime @szetszwo
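
For illustration, a rough sketch of the parsing side of such a marshaller, assuming the inbound bytes are already available as a (preferably direct) ByteBuffer. Wiring this into gRPC's stream types (the zero-copy interfaces referenced in grpc/grpc-java#7387) is not shown, and the class and method names here are illustrative only:

```java
import java.nio.ByteBuffer;
import org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream;
import org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException;
import org.apache.ratis.thirdparty.com.google.protobuf.Message;
import org.apache.ratis.thirdparty.com.google.protobuf.Parser;

final class ZeroCopyParse {
  /** Parse a message from a ByteBuffer, letting bytes fields alias it. */
  static <T extends Message> T parseAliased(Parser<T> parser, ByteBuffer buffer)
      throws InvalidProtocolBufferException {
    // For a direct buffer, CodedInputStream can pick the UnsafeDirectNioDecoder;
    // enableAliasing(true) lets bytes fields wrap the buffer instead of copying.
    final CodedInputStream in = CodedInputStream.newInstance(buffer);
    in.enableAliasing(true);
    return parser.parseFrom(in);
  }
}
```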

@umamaheswararao
Contributor

supportsUnsafeByteBufferOperations seems to depend on MEMORY_ACCESSOR. I am not sure we have a way to control it. Do we?

```java
private static boolean supportsUnsafeByteBufferOperations() {
  if (MEMORY_ACCESSOR == null) {
    return false;
  }
  return MEMORY_ACCESSOR.supportsUnsafeByteBufferOperations();
}
```

@umamaheswararao
Contributor

I was just doing some experimenting here:
I tried passing a direct buffer as input to CodedInputStream, and it seems we get CodedInputStream$UnsafeDirectNioDecoder in that case.

Here is my sample:

```java
public static void main(String[] args) throws IOException {
  ByteBuffer buf = ByteBuffer.allocateDirect(1024);
  for (int i = 0; i < 1024; i++) {
    buf.put((byte) i);
  }
  CodedInputStream input = CodedInputStream.newInstance(buf);
  input.enableAliasing(true);
  System.out.println(input.readBytes());
}
```

However, I get an exception since my input is just random bytes and not a proper proto format. But the stack trace shows that the bytes are being read from the direct buffer.

```
Exception in thread "main" org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
  at org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:107)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawByte(CodedInputStream.java:1954)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawVarint64SlowPath(CodedInputStream.java:1856)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawVarint32(CodedInputStream.java:1751)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readBytes(CodedInputStream.java:1621)
```

What I am thinking is: what if we load the log into a direct buffer and make CodedInputStream backed by that buffer? I am not sure whether that is a possible option in the Ratis code, but just throwing out the thought in case it makes sense.

@szetszwo
Contributor Author

What I am thinking is: what if we load the log into a direct buffer ...

@umamaheswararao, thanks for testing it! The log entry comes from the network; it is not loaded from the log.

@umamaheswararao
Contributor

@szetszwo yeah, I had an offline chat with @duongkame. He is actually trying to get direct buffers from the stream directly. I'll let him comment once he has reasonable results!

@duongkame
Contributor

What I am thinking is: what if we load the log into a direct buffer ...

@umamaheswararao, thanks for testing it! The log entry comes from the network; it is not loaded from the log.

I filed two JIRAs to make zero-copy work for Ratis gRPC:
https://issues.apache.org/jira/browse/RATIS-1925
https://issues.apache.org/jira/browse/RATIS-1926

(I did a quick experiment with zero-copy in GrpcService and I'm positive it's feasible.)

Let's keep this PR for readChunk only.

@szetszwo
Contributor Author

@duongkame , sure, let's do only readChunk here. Could you review this?

@szetszwo szetszwo requested a review from duongkame October 31, 2023 17:50
@szetszwo
Contributor Author

@duongkame , found some bugs in this PR. Let me fix them first.

@szetszwo
Contributor Author

szetszwo commented Nov 1, 2023

... found some bugs in this PR. ...

The bug is that a buffer can be released only after the proto has been sent out to the network. We use gRPC onNext(..), which is asynchronous; however, onNext(..) does not return a future, so I am not sure how to wait for the asynchronous task to complete.

@duongkame , do you have any idea?
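
One possible direction (just a sketch, assuming the buffers are reference-counted, e.g. Netty ByteBufs): since StreamObserver#onNext gives no per-message completion signal, hold a reference for every queued response and release everything when the stream terminates. Whether stream termination is a strong enough guarantee depends on how gRPC serializes the message, so this only shows the bookkeeping:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.ratis.thirdparty.io.grpc.stub.StreamObserver;

/** Sketch: release buffers only when the whole stream completes or fails. */
final class ReleasingObserver<T> implements StreamObserver<T> {
  /** Stand-in for whatever reference-counted buffer backs a reply. */
  interface RefCounted {
    void release();
  }

  private final StreamObserver<T> delegate;
  private final List<RefCounted> pending = new ArrayList<>();

  ReleasingObserver(StreamObserver<T> delegate) {
    this.delegate = delegate;
  }

  /** Send a reply and keep its backing buffer alive until the call ends. */
  synchronized void onNext(T reply, RefCounted buffer) {
    pending.add(buffer);
    delegate.onNext(reply);
  }

  @Override
  public void onNext(T reply) {
    delegate.onNext(reply);
  }

  @Override
  public synchronized void onError(Throwable t) {
    releaseAll();
    delegate.onError(t);
  }

  @Override
  public synchronized void onCompleted() {
    releaseAll();
    delegate.onCompleted();
  }

  private void releaseAll() {
    pending.forEach(RefCounted::release);
    pending.clear();
  }
}
```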

@szetszwo szetszwo marked this pull request as draft November 1, 2023 18:06
Contributor

@duongkame duongkame left a comment


Thanks for the patch @szetszwo. It will solve the memory/GC inefficiency not only in the datanode but also on the client side (BlockOutputStream).
I put a few inline comments below.

```java
static ChunkBuffer preallocate(long capacity, int increment) {
  Preconditions.assertTrue(increment > 0);
  if (capacity <= increment) {
    final CodecBuffer c = CodecBuffer.allocateDirect(Math.toIntExact(capacity));
```
Contributor


It would be cleaner if we dealt directly with ByteBufAllocator and ByteBuf in ChunkBuffer. The CodecBuffer logic doesn't add much here, but it introduces an unnecessary dependency.
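
A rough sketch of that direction (illustrative names, not the actual ChunkBuffer API), assuming the Netty classes relocated under ratis-thirdparty:

```java
import org.apache.ratis.thirdparty.io.netty.buffer.ByteBuf;
import org.apache.ratis.thirdparty.io.netty.buffer.ByteBufAllocator;
import org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator;

/** Sketch of a ChunkBuffer-like type built directly on Netty's ByteBuf. */
final class ByteBufChunkBuffer implements AutoCloseable {
  private final ByteBuf buf;

  static ByteBufChunkBuffer preallocate(int capacity) {
    // Pooled direct memory; returned to the pool in close().
    final ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;
    return new ByteBufChunkBuffer(allocator.directBuffer(capacity, capacity));
  }

  private ByteBufChunkBuffer(ByteBuf buf) {
    this.buf = buf;
  }

  ByteBuf buffer() {
    return buf;
  }

  @Override
  public void close() {
    buf.release();
  }
}
```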


```java
@Override
public void onNext(ContainerCommandRequestProto request) {
  final DispatcherContext context;
```
Contributor


I think we have to do the same in ContainerStateMachine.readStateMachineData; otherwise there will be a memory leak. Not sure if I should put this comment in #5805.

@kerneltime
Contributor

@szetszwo @duongkame should we continue working on this PR after #6153?

@adoroszlai
Contributor

/pending conflicts; Q: should we continue working on this PR after 6153?


@github-actions github-actions bot left a comment


Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

Please note that the PR will be closed after 21 days of inactivity from now. (But can be re-opened anytime later...)

conflicts; Q: should we continue working on this PR after 6153?

@adoroszlai
Contributor

@szetszwo The issue is marked as a duplicate, but this PR is still open. Should we close this or reopen HDDS-9536?

@szetszwo
Contributor Author

szetszwo commented Apr 3, 2025

Sure, let's close this.

@szetszwo szetszwo closed this Apr 3, 2025