
Conversation

@hanishakoneru (Contributor) commented Dec 10, 2020

What changes were proposed in this pull request?

When a ReadChunk operation is performed, all the data to be read from one chunk is read into a single ByteBuffer.

ChunkUtils#readData():

public static void readData(File file, ByteBuffer buf,
    long offset, long len, VolumeIOStats volumeIOStats)
    throws StorageContainerException {
  .....
  try {
    bytesRead = processFileExclusively(path, () -> {
      try (FileChannel channel = open(path, READ_OPTIONS, NO_ATTRIBUTES);
           FileLock ignored = channel.lock(offset, len, true)) {
        // The entire requested range is read into the single 'buf'.
        return channel.read(buf, offset);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  } catch (UncheckedIOException e) {
    throw wrapInStorageContainerException(e.getCause());
  }
  .....

This Jira proposes to read the data from the channel into an array of ByteBuffers to optimize reads. Currently we hold onto the buffer until the ChunkInputStream is closed or the last chunk byte is read, which can leave up to 4MB of data cached in memory per ChunkInputStream. With smaller buffers, each one can be released sooner, helping optimize client memory utilization (HDDS-4553). This Jira is a prerequisite for that optimization.

We propose to add a ReadChunk version to the ReadChunkRequestProto to determine whether the response should carry all the chunk data as a single ByteString (V0) or as a list of ByteStrings (V1). The default version will be V0, so older clients get data back as a single ByteString and wire compatibility is maintained.

For new clients, data will be returned as a list of ByteStrings, each sized to the chunk's bytes-per-checksum. This keeps checksum verification simple and avoids extra buffer copying. For chunks with no checksum, the buffer capacity will be set to a configurable default.
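To illustrate why this sizing helps, here is a minimal sketch (not the actual Ozone client code; CRC32 stands in for whatever checksum type the chunk was written with): each returned ByteString lines up with exactly one checksum entry, so it can be verified through a read-only view with no extra copy.

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.zip.CRC32;

final class V1ReadSketch {
  /** Verify buffers.get(i) against crcs.get(i); one buffer per checksum entry. */
  static void verify(List<ByteString> buffers, List<Integer> crcs) {
    for (int i = 0; i < buffers.size(); i++) {
      ByteBuffer view = buffers.get(i).asReadOnlyByteBuffer(); // no copy
      CRC32 crc = new CRC32();
      crc.update(view); // consumes the read-only view, not the data itself
      if ((int) crc.getValue() != crcs.get(i)) {
        throw new IllegalStateException("Checksum mismatch in buffer " + i);
      }
    }
  }
}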

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4552

How was this patch tested?

Added unit tests; more are in progress.

@hanishakoneru force-pushed the HDDS-4552 branch 2 times, most recently from bdfe9cb to 577ddad on December 10, 2020 at 23:56
@hanishakoneru (Contributor, Author):

Moved the pipeline-refresh read error handling from BlockInputStream to ChunkInputStream.
Say a ChunkInputStream already has 1MB of data in its buffers when the DNs in the pipeline shut down. If the client tries to read 2MB from this ChunkInputStream, it can serve 1MB from the existing buffers but gets a StorageContainerException when reading the next 1MB from the DN. In that case BlockInputStream discards the first 1MB that was already read, yet the ChunkInputStream has advanced its position to 1MB and, after acquiring a new client, would resume reading from offset 1MB instead of 0.
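A rough sketch of the idea, with made-up class and method names rather than Ozone's actual ones: when the chunk-level stream handles the retry itself, the position it advanced for cached bytes stays authoritative and the retried read resumes at the correct offset.

class ChunkStreamSketch {
  private long position; // next offset within the chunk to read

  int read(byte[] dst, int len) {
    int copied = copyFromCachedBuffers(dst, len); // e.g. the buffered 1MB
    position += copied;
    while (copied < len) {
      try {
        int n = readFromDatanode(dst, copied, position, len - copied);
        position += n;
        copied += n;
      } catch (RuntimeException e) { // e.g. StorageContainerException
        // Refresh the pipeline and retry; 'position' already accounts for
        // the cached bytes, so the retry resumes where the failure occurred.
        refreshPipeline();
      }
    }
    return copied;
  }

  // Stubs for illustration only.
  private int copyFromCachedBuffers(byte[] dst, int len) { return 0; }
  private int readFromDatanode(byte[] dst, int off, long pos, int len) { return len; }
  private void refreshPipeline() { }
}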

cc. @adoroszlai

@hanishakoneru (Contributor, Author):

@bshashikant, can you please take a look at this PR when you get a chance. Thanks.

Contributor:

LOG.info -> LOG.warn?

Contributor:

This adds an extra buffer copy here. Let's see whether we can avoid it, for example using:
https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/UnsafeByteOperations
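For reference, a minimal example of the zero-copy wrap (the caveat being that the wrapped buffer must never be mutated afterwards):

import com.google.protobuf.ByteString;
import com.google.protobuf.UnsafeByteOperations;
import java.nio.ByteBuffer;

class UnsafeWrapExample {
  public static void main(String[] args) {
    ByteBuffer chunkData = ByteBuffer.wrap(new byte[]{1, 2, 3});
    // Wraps the buffer without copying; mutating it afterwards would
    // break ByteString's immutability contract.
    ByteString zeroCopy = UnsafeByteOperations.unsafeWrap(chunkData);
    System.out.println(zeroCopy.size()); // 3
  }
}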

Contributor Author:

We get a ByteString from the response, but the returned ByteString does not carry the underlying buffer boundary information. Hence ByteString#asReadOnlyByteBufferList() returns only one ByteBuffer with all the data, irrespective of the backing arrays used to construct the ByteString.
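A small standalone demo of this behavior (piece sizes are kept above protobuf's small-concatenation threshold, below which concat copies into a single piece):

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;

class BoundaryLossExample {
  public static void main(String[] args) {
    ByteString a = ByteString.copyFrom(new byte[256]);
    ByteString b = ByteString.copyFrom(new byte[256]);
    // A concatenation (RopeByteString) still knows its pieces:
    List<ByteBuffer> pieces = a.concat(b).asReadOnlyByteBufferList();
    System.out.println(pieces.size()); // 2
    // After a round trip through contiguous bytes (as on the wire),
    // the boundary information is gone:
    ByteString parsed = ByteString.copyFrom(a.concat(b).toByteArray());
    System.out.println(parsed.asReadOnlyByteBufferList().size()); // 1
  }
}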

Contributor:

What if we read into small buffers on the server side itself and send them across as a list of ByteStrings to the client?
Copying a big buffer on the client read path will slow down the read. We should probably do some benchmarking to understand the effects of all these options.

@bshashikant commented Dec 24, 2020:

In case this turns out to be unavoidable, we can also consider ByteBuffer.compact(), which also does an intrinsic buffer copy to release the buffers, but the logic would be simpler.

@hanishakoneru commented Jan 7, 2021:

What if we read into small buffers on the server side itself and send them across as a list of ByteStrings to the client?

This might work but would require a change in the DN-Client protocol. Would have to analyze the compatibility issues and how to address them.

In case this turns out to be unavoidable, we can also consider ByteBuffer.compact(), which also does an intrinsic buffer copy to release the buffers, but the logic would be simpler.

I am not sure if there is much gain in doing this. The code changes in this PR were required because the logic was inaccurate. It was working because there was always only one ByteBuffer.

Contributor:

The basic problem we are trying to solve here is to minimize the memory overhead in the client. Adding an extra buffer copy overhead (with the patch) does not seem to be a reasonable way to achieve that. Let's discuss in some more detail how to address this.

@hanishakoneru (Contributor, Author):

Changed the design to avoid buffer copying. Instead of copying read chunk data into smaller buffers on client side, readChunk response will return data as a list of smaller ByteStrings. Please refer to the PR description for more details.

@hanishakoneru (Contributor, Author):

@bshashikant can you please take a look at the updated patch.

@bshashikant (Contributor) left a comment:

Thanks @hanishakoneru for the patch. The patch in general looks good. I am still reviewing it, but have two questions:

  1. I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

  2. Can we add a few acceptance tests to test the compatibility?

@hanishakoneru (Contributor, Author) commented Feb 23, 2021:

I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

The problem with that arises while verifying checksums. Say a chunk has a checksum boundary at every 256KB and we set the default read buffer size to 64KB. To verify a checksum, we would need to combine four 64KB buffers into a read-only 256KB buffer to pass to Checksum.verifyChecksum. That results in exactly the buffer copy we were trying to avoid in the previous design.

Can we add a few acceptance tests to test the compatibility?

Yes, I am working on adding more tests.

@bshashikant (Contributor):

I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

The problem with that arises while verifying checksums. Say a chunk has a checksum boundary at every 256KB and we set the default read buffer size to 64KB. To verify a checksum, we would need to combine four 64KB buffers into a read-only 256KB buffer to pass to Checksum.verifyChecksum. That results in exactly the buffer copy we were trying to avoid in the previous design.

@hanishakoneru, can we do ByteString.concat in cases where bytes.per.checksum < read buffer size? It doesn't do a buffer copy, as per the documentation here: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString.
"Concatenation is likewise supported without copying (long strings) by building a tree of pieces in RopeByteString".

@hanishakoneru (Contributor, Author):

can we do ByteString.concat in cases where bytes.per.checksum < read buffer size? It doesn't do a buffer copy, as per the documentation here: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString.
"Concatenation is likewise supported without copying (long strings) by building a tree of pieces in RopeByteString".

ByteString concat could work. But the problem is that ChunkBuffer wraps ByteBuffers, and checksums are calculated on ChunkBuffers. We would have to change the whole ChunkBuffer model (or change checksum computation to use ByteStrings instead). I think that change would get very complicated. Also, with concatenated ByteStrings, we would have to track position, limit, etc. separately to follow checksum boundaries.

@hanishakoneru (Contributor, Author):

The existing xcompat acceptance tests added as part of HDDS-4731 should cover most of the testing required for this change.
Thanks @adoroszlai for adding the compatibility tests. In these tests, an old client (v1.0.0) talks to a new server, testing read/write compatibility in different combinations. Please correct me if this is incorrect.
Also, are there any acceptance tests for new clients talking to old servers?

@adoroszlai (Contributor):

ByteString concat could work. But the problem is that ChunkBuffer wraps ByteBuffers, and checksums are calculated on ChunkBuffers. We would have to change the whole ChunkBuffer model (or change checksum computation to use ByteStrings instead).

Just an idea: we have a ChunkBuffer implementation which wraps a list of ByteBuffers, and ByteString#asReadOnlyByteBufferList provides exactly that list. I'm not sure it fully addresses all your needs.

Also, are there any acceptance tests for new clients talking to old servers?

Yes, the same xcompat suite tests this situation.

@hanishakoneru (Contributor, Author):

Thanks @adoroszlai

Just an idea: we have a ChunkBuffer implementation which wraps a list of ByteBuffers, and ByteString#asReadOnlyByteBufferList provides exactly that list. I'm not sure it fully addresses all your needs.

Yes, this PR uses the ByteBufferList implementation of ChunkBuffer to wrap the list of buffers.
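Roughly, the wiring would look like the sketch below; the ChunkBuffer factory call is Ozone-internal and only assumed here to accept a buffer list, so it is left as a comment:

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;

class ChunkBufferWiringSketch {
  static void wrap(ByteString chunkData) {
    // One read-only view per underlying piece of the (rope) ByteString:
    List<ByteBuffer> views = chunkData.asReadOnlyByteBufferList();
    System.out.println("pieces: " + views.size());
    // Ozone-internal (assumed API): ChunkBuffer buffer = ChunkBuffer.wrap(views);
  }
}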

@bshashikant (Contributor) left a comment:

Thanks @hanishakoneru for the explanation. The changes look good, with a few minor suggestions.

Contributor:

Unintended change?

Contributor Author:

Yup. Reverted.

Contributor:

Let's not change the default for now. We can change it once we run some tests and analyze performance.

Contributor Author:

Sure. Reverted back to 1MB.

@bshashikant (Contributor):

Thanks @hanishakoneru. Can you please rebase? Also, do we need to add this to ozone documentation somewhere?

1. Add a ReadChunk version to the request to indicate the response format:
   V0 for returning data as a single ByteString (old format).
   V1 for returning data as a list of ByteStrings, with each ByteString length = number of bytes per checksum.
2. If a chunk does not have checksums, set the buffer capacity to a default (64KB).
3. Return data from a chunk as a list of ByteBuffers instead of a single ByteBuffer.
@hanishakoneru (Contributor, Author) commented Mar 17, 2021:

Also, do we need to add this to ozone documentation somewhere?

Sure, we can open a doc Jira to get this documented. Do you know where we can document this?

@bshashikant (Contributor):

Thanks @hanishakoneru. For documentation, you can refer to https://issues.apache.org/jira/browse/HDDS-4948 for an example.

@hanishakoneru (Contributor, Author):

Thank you @bshashikant. I will merge this shortly.

We currently do not have docs explaining the client read/write path. Do you propose we add docs for that? As this is an internal feature (not configurable), do we want to add it to the docs, or would the javadocs in the code suffice?

@hanishakoneru merged commit 4136d47 into apache:master on Mar 18, 2021
@adoroszlai (Contributor) left a comment:

Thanks @hanishakoneru for working on this, and sorry for the late review. I have a few comments. Would it be possible to address them in a follow-up issue?

throw ex;
}
data.clear();
dataBuffers = null;
Contributor:

If the read from the first location fails and we have to fall back to the temp chunk file, this would cause an exception.

Comment on lines +232 to +254
long bufferCapacity = 0;
if (info.isReadDataIntoSingleBuffer()) {
  // Older client - read all chunk data into one single buffer.
  bufferCapacity = len;
} else {
  // Set buffer capacity to checksum boundary size so that each buffer
  // corresponds to one checksum. If checksum is NONE, then set buffer
  // capacity to default (OZONE_CHUNK_READ_BUFFER_DEFAULT_SIZE_KEY = 64KB).
  ChecksumData checksumData = info.getChecksumData();

  if (checksumData != null) {
    if (checksumData.getChecksumType() ==
        ContainerProtos.ChecksumType.NONE) {
      bufferCapacity = defaultReadBufferCapacity;
    } else {
      bufferCapacity = checksumData.getBytesPerChecksum();
    }
  }
}
// If the buffer capacity is 0, set all the data into one ByteBuffer
if (bufferCapacity == 0) {
  bufferCapacity = len;
}
Contributor:

This block seems to be duplicated from FilePerBlock.... Can it be extracted?
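One possible extraction, as a sketch (the method name and its home are hypothetical, and the types are those of the snippet above), preserving the fallback behavior of the duplicated block:

static long chooseBufferCapacity(ChunkInfo info, long len,
    long defaultReadBufferCapacity) {
  if (info.isReadDataIntoSingleBuffer()) {
    return len; // older client: one single buffer
  }
  long bufferCapacity = 0;
  ChecksumData checksumData = info.getChecksumData();
  if (checksumData != null) {
    bufferCapacity =
        checksumData.getChecksumType() == ContainerProtos.ChecksumType.NONE
            ? defaultReadBufferCapacity // no checksum boundaries to honor
            : checksumData.getBytesPerChecksum(); // one buffer per checksum
  }
  // Fall back to a single buffer when nothing determined a capacity.
  return bufferCapacity == 0 ? len : bufferCapacity;
}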

Comment on lines +412 to +414
return buffersList.stream()
.map(ByteString::asReadOnlyByteBuffer)
.collect(Collectors.toList());
Contributor:

I think we should avoid streams on the read/write path; they were earlier found to cause CPU usage hotspots. See e.g. HDDS-3702.

(Also in a few other instances below.)
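For example, the conversion above could be rewritten as a plain loop (a sketch, keeping the same semantics):

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class NoStreamConversion {
  static List<ByteBuffer> asReadOnlyBuffers(List<ByteString> buffersList) {
    List<ByteBuffer> buffers = new ArrayList<>(buffersList.size());
    for (ByteString byteString : buffersList) {
      buffers.add(byteString.asReadOnlyByteBuffer()); // plain loop, no streams
    }
    return buffers;
  }
}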

@hanishakoneru (Contributor, Author):

Thanks for the review @adoroszlai. I will address them in HDDS-4553 (#2062), which is a follow-up of this Jira.
