Conversation

@szetszwo
Contributor

What changes were proposed in this pull request?

Change ChunkBuffer to allocate direct buffers in order to avoid buffer copying.
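
For context, a minimal sketch (not code from this PR) of the difference the change targets: a heap buffer is backed by a byte[] and typically gets copied into native memory before channel I/O, while a direct buffer can be handed to the OS without that extra copy.

```java
import java.nio.ByteBuffer;

public class DirectVsHeap {
  public static void main(String[] args) {
    // Heap buffer: backed by a byte[]; channel I/O usually copies it
    // into a temporary native buffer before writing to a socket or file.
    ByteBuffer heap = ByteBuffer.allocate(4 * 1024 * 1024);

    // Direct buffer: allocated in native memory; channels can use it
    // without the extra copy, which is what this change relies on.
    ByteBuffer direct = ByteBuffer.allocateDirect(4 * 1024 * 1024);

    System.out.println("heap.isDirect()   = " + heap.isDirect());
    System.out.println("direct.isDirect() = " + direct.isDirect());
  }
}
```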

What is the link to the Apache JIRA

HDDS-9536

How was this patch tested?

Modified existing tests.

@duongkame
Contributor

duongkame commented Oct 27, 2023

Thanks for the patch @szetszwo. Is this optimization for the WriteChunk path, or only for the read path? Looks like it's only for reads.

I believe the problem with WriteChunk starts from the moment the ContainerCommandRequestProto is parsed from the Ratis log entry. The resulting StateMachineLogEntry is not a NioByteString backed by a direct buffer, probably because LogEntryProto (StateMachineLogEntryProto#data) is not backed by a direct buffer.
I think the notion of "zero copy" has to start with the Ratis log reader, i.e., the starting point of the data flow, and I guess that is part of ratis-streaming and cannot be achieved with an Ozone-only code change.

@umamaheswararao
Contributor

Yes, looking at the changes in KeyValueHandler.java, the changes seem to be for readChunk. Are you planning a separate JIRA for writeChunk?

@szetszwo
Contributor Author

@duongkame, @umamaheswararao, this PR mainly changes ChunkBuffer so that it allocates direct instead of non-direct buffers. You are right that it mostly benefits reads, not writes.

For Write, the buffers are allocated by the gRPC server when it receives the requests. Let me check whether those buffers are direct. If the buffers from gRPC are direct, we might be copying them into non-direct buffers somewhere in our code.

... WriteChunk starts from the moment the ContainerCommandRequestProto is parsed from the Ratis log entry. The resulting StateMachineLogEntry is not a NioByteString backed by a direct buffer. ...

Thanks for the hint. Let me check.
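
A quick way to check this (a sketch, not code in this PR; `data` stands for the ByteString taken from the incoming request):

```java
import java.nio.ByteBuffer;
import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

final class ByteStringProbe {
  /** Log-friendly description of how a ByteString is backed. */
  static String describe(ByteString data) {
    // A LiteralByteString wraps a heap byte[]; a NioByteString wraps a
    // ByteBuffer, which may or may not be direct.
    final ByteBuffer view = data.asReadOnlyByteBuffer();
    return data.getClass().getSimpleName() + ", direct=" + view.isDirect();
  }
}
```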

@szetszwo
Contributor Author

2023-10-27 09:52:40,873 INFO server.GrpcClientProtocolService (GrpcClientProtocolService.java:onNext(244)) - XXX gRPC class org.apache.ratis.thirdparty.com.google.protobuf.ByteString$LiteralByteString

In a Ratis test, it shows that the ByteString received by gRPC is unfortunately a LiteralByteString, not a NioByteString.

@szetszwo
Contributor Author

... I think the notion of "zero copy" has to start with the Ratis log reader, i.e., the starting point of the data flow. ...

The client requests are received from the network (for the leader, directly from the client; for followers, as log entries from the leader). It should not be related to the log reader unless the cache is full and the entry has been invalidated. This reminds me that Ozone does have a ContainerStateMachine cache bug.

Anyway, the ByteString received by gRPC unfortunately does not seem to be a NioByteString. Let me verify it in Ozone.

... I guess that is part of ratis-streaming and cannot be achieved with an Ozone-only code change.

Yes, Ratis Streaming uses Netty directly. It uses neither gRPC nor Protobuf for data. (It does use Protobuf for headers.)

@szetszwo
Contributor Author

I tried testing Ratis with a larger message size (32 MB) to see if gRPC would switch to using NioByteString. Unfortunately, no!

2023-10-27 10:49:51,701 INFO server.GrpcClientProtocolService (GrpcClientProtocolService.java:onNext(246)) - XXX message size 33554454, class LiteralByteString

@szetszwo
Contributor Author

Found that gRPC uses CodedInputStream to decode incoming messages, and there is an UnsafeDirectNioDecoder implementation. Not sure if we can configure gRPC to use it.

@duongkame
Contributor

duongkame commented Oct 29, 2023

Found that gRPC uses CodedInputStream to decode incoming messages, and there is an UnsafeDirectNioDecoder implementation. Not sure if we can configure gRPC to use it.

Thanks for digging into this, @szetszwo.

It seems gRPC recently finalized the APIs to support zero-copy; details in grpc/grpc-java#7387. This implies some effort to configure the right marshaller for the CodedInputStream to leverage NIO, for example GoogleCloudPlatform/grpc-gcp-java#77.

Today datanodes clone and copy a WriteChunk data buffer 3+ times, because the LogEntryProto is not marshaled using an NIO ByteBuffer (so any subsequent operation on the log proto just copies buffers instead of deriving views).
I'm not sure if we should follow the route of trying to make Ratis with gRPC zero-copy, or directly switch to ratis-streaming and deprecate the current approach. @kerneltime @szetszwo
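
For illustration, a rough sketch of the parsing side of such a marshaller, assuming the inbound bytes are already available as a (preferably direct) ByteBuffer. Wiring this into gRPC's stream types (the zero-copy interfaces referenced in grpc/grpc-java#7387) is not shown, and the class and method names here are illustrative only:

```java
import java.nio.ByteBuffer;
import org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream;
import org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException;
import org.apache.ratis.thirdparty.com.google.protobuf.Message;
import org.apache.ratis.thirdparty.com.google.protobuf.Parser;

final class ZeroCopyParse {
  /** Parse a message from a ByteBuffer, letting bytes fields alias it. */
  static <T extends Message> T parseAliased(Parser<T> parser, ByteBuffer buffer)
      throws InvalidProtocolBufferException {
    // For a direct buffer, CodedInputStream can pick the UnsafeDirectNioDecoder;
    // enableAliasing(true) lets bytes fields wrap the buffer instead of copying.
    final CodedInputStream in = CodedInputStream.newInstance(buffer);
    in.enableAliasing(true);
    return parser.parseFrom(in);
  }
}
```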

@umamaheswararao
Contributor

supportsUnsafeByteBufferOperations seems to depend on MEMORY_ACCESSOR. I am not sure we have a way to control it. Do we?

```java
private static boolean supportsUnsafeByteBufferOperations() {
  if (MEMORY_ACCESSOR == null) {
    return false;
  }
  return MEMORY_ACCESSOR.supportsUnsafeByteBufferOperations();
}
```

@umamaheswararao
Contributor

I was just doing some experimenting here:
I tried passing a direct buffer as input to CodedInputStream, and it seems we get CodedInputStream$UnsafeDirectNioDecoder in that case.

Here is my sample:

```java
public static void main(String[] args) throws IOException {
  ByteBuffer buf = ByteBuffer.allocateDirect(1024);
  for (int i = 0; i < 1024; i++) {
    buf.put((byte) i);
  }
  CodedInputStream input = CodedInputStream.newInstance(buf);
  input.enableAliasing(true);
  System.out.println(input.readBytes());
}
```

However, I get an exception since my input is just random bytes and not a proper proto format. But the stack trace shows that the bytes are being read from the direct buffer.

```
Exception in thread "main" org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
  at org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:107)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawByte(CodedInputStream.java:1954)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawVarint64SlowPath(CodedInputStream.java:1856)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readRawVarint32(CodedInputStream.java:1751)
  at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder.readBytes(CodedInputStream.java:1621)
```

What I am thinking is: what if we load the log into a direct buffer and make CodedInputStream backed by that buffer? I am not sure whether that is a possible option in the Ratis code, but just throwing out the thought in case it makes sense.

@szetszwo
Contributor Author

What I am thinking is: what if we load the log into a direct buffer ...

@umamaheswararao, thanks for testing it! The log entry comes from the network; it is not loaded from the log.

@umamaheswararao
Contributor

@szetszwo yeah, I had an offline chat with @duongkame. He is actually trying to get direct buffers from the stream directly. I'll let him comment once he has reasonable results!

@duongkame
Contributor

What I am thinking is: what if we load the log into a direct buffer ...

@umamaheswararao, thanks for testing it! The log entry comes from the network; it is not loaded from the log.

I filed two JIRAs to make zero-copy work for Ratis gRPC:
https://issues.apache.org/jira/browse/RATIS-1925
https://issues.apache.org/jira/browse/RATIS-1926

(I did a quick experiment with zero-copy in GrpcService and I'm positive it's feasible.)

Let's keep this PR for readChunk only.

@szetszwo
Contributor Author

@duongkame , sure, let's do only readChunk here. Could you review this?

@szetszwo szetszwo requested a review from duongkame October 31, 2023 17:50
@szetszwo
Contributor Author

@duongkame , found some bugs in this PR. Let me fix them first.

@szetszwo
Contributor Author

szetszwo commented Nov 1, 2023

... found some bugs in this PR. ...

The bug is that a buffer can be released only after the proto has been sent out to the network. We use gRPC onNext(..), which is asynchronous; however, onNext(..) does not return a future, so I am not sure how to wait for the asynchronous task to complete.

@duongkame , do you have any idea?
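
One possible direction (just a sketch, assuming the buffers are reference-counted, e.g. Netty ByteBufs): since StreamObserver#onNext gives no per-message completion signal, hold a reference for every queued response and release everything when the stream terminates. Whether stream termination is a strong enough guarantee depends on how gRPC serializes the message, so this only shows the bookkeeping:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.ratis.thirdparty.io.grpc.stub.StreamObserver;

/** Sketch: release buffers only when the whole stream completes or fails. */
final class ReleasingObserver<T> implements StreamObserver<T> {
  /** Stand-in for whatever reference-counted buffer backs a reply. */
  interface RefCounted {
    void release();
  }

  private final StreamObserver<T> delegate;
  private final List<RefCounted> pending = new ArrayList<>();

  ReleasingObserver(StreamObserver<T> delegate) {
    this.delegate = delegate;
  }

  /** Send a reply and keep its backing buffer alive until the call ends. */
  synchronized void onNext(T reply, RefCounted buffer) {
    pending.add(buffer);
    delegate.onNext(reply);
  }

  @Override
  public void onNext(T reply) {
    delegate.onNext(reply);
  }

  @Override
  public synchronized void onError(Throwable t) {
    releaseAll();
    delegate.onError(t);
  }

  @Override
  public synchronized void onCompleted() {
    releaseAll();
    delegate.onCompleted();
  }

  private void releaseAll() {
    pending.forEach(RefCounted::release);
    pending.clear();
  }
}
```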

@szetszwo szetszwo marked this pull request as draft November 1, 2023 18:06
Contributor

@duongkame duongkame left a comment


Thanks for the patch @szetszwo. It will solve the memory/GC inefficiency not only in the datanode but also on the client side (BlockOutputStream).
I put a few inline comments below.

```java
static ChunkBuffer preallocate(long capacity, int increment) {
  Preconditions.assertTrue(increment > 0);
  if (capacity <= increment) {
    final CodecBuffer c = CodecBuffer.allocateDirect(Math.toIntExact(capacity));
```
Contributor


It would be cleaner if we dealt directly with ByteBufAllocator and ByteBuf in ChunkBuffer. The CodecBuffer logic doesn't add much here, but it introduces an unnecessary dependency.
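
A rough sketch of that direction (illustrative names, not the actual ChunkBuffer API), assuming the Netty classes relocated under ratis-thirdparty:

```java
import org.apache.ratis.thirdparty.io.netty.buffer.ByteBuf;
import org.apache.ratis.thirdparty.io.netty.buffer.ByteBufAllocator;
import org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator;

/** Sketch of a ChunkBuffer-like type built directly on Netty's ByteBuf. */
final class ByteBufChunkBuffer implements AutoCloseable {
  private final ByteBuf buf;

  static ByteBufChunkBuffer preallocate(int capacity) {
    // Pooled direct memory; returned to the pool in close().
    final ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;
    return new ByteBufChunkBuffer(allocator.directBuffer(capacity, capacity));
  }

  private ByteBufChunkBuffer(ByteBuf buf) {
    this.buf = buf;
  }

  ByteBuf buffer() {
    return buf;
  }

  @Override
  public void close() {
    buf.release();
  }
}
```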


```java
@Override
public void onNext(ContainerCommandRequestProto request) {
  final DispatcherContext context;
```
Contributor


I think we have to do the same in ContainerStateMachine.readStateMachineData; otherwise there will be a memory leak. Not sure if I should put this comment in #5805.

@kerneltime
Contributor

@szetszwo @duongkame should we continue working on this PR after #6153?

@adoroszlai
Contributor

/pending conflicts; Q: should we continue working on this PR after 6153?


@github-actions github-actions bot left a comment


Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

Please note that the PR will be closed after 21 days of inactivity from now. (But can be re-opened anytime later...)

conflicts; Q: should we continue working on this PR after 6153?

@adoroszlai
Contributor

@szetszwo The issue is marked as a duplicate, but this PR is still open. Should we close this or reopen HDDS-9536?

@szetszwo
Contributor Author

szetszwo commented Apr 3, 2025

Sure, let's close this.

@szetszwo szetszwo closed this Apr 3, 2025