
Conversation

@hanishakoneru (Contributor) commented Dec 10, 2020

What changes were proposed in this pull request?

When a ReadChunk operation is performed, all the data to be read from one chunk is read into a single ByteBuffer.

ChunkUtils#readData():

public static void readData(File file, ByteBuffer buf,
    long offset, long len, VolumeIOStats volumeIOStats)
    throws StorageContainerException {
  .....
  try {
    bytesRead = processFileExclusively(path, () -> {
      try (FileChannel channel = open(path, READ_OPTIONS, NO_ATTRIBUTES);
           FileLock ignored = channel.lock(offset, len, true)) {
        // The entire requested range is read into the single 'buf'.
        return channel.read(buf, offset);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  } catch (UncheckedIOException e) {
    throw wrapInStorageContainerException(e.getCause());
  }
  .....

This Jira proposes to read the data from the channel into an array of ByteBuffers to optimize reads. Currently we hold onto the buffer until the ChunkInputStream is closed or the last chunk byte is read, which can leave up to 4MB of data cached in memory per ChunkInputStream. With smaller buffers, each one can be released sooner, helping optimize client memory utilization (HDDS-4553). This Jira is a prerequisite for that optimization.

We propose to add a ReadChunk version to the ReadChunkRequestProto to determine whether the response should carry all the chunk data as a single ByteString (V0) or as a list of ByteStrings (V1). The default version will be V0, so older clients get data back as a single ByteString and wire compatibility is maintained.

For new clients, data will be returned as a list of ByteStrings, each sized to the chunk's bytes-per-checksum. This keeps checksum verification simple and avoids extra buffer copying. For chunks with no checksum, the buffer capacity will be set to a configurable default.
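To illustrate why this sizing helps, here is a minimal sketch (not the actual Ozone client code; CRC32 stands in for whatever checksum type the chunk was written with): each returned ByteString lines up with exactly one checksum entry, so it can be verified through a read-only view with no extra copy.

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.zip.CRC32;

final class V1ReadSketch {
  /** Verify buffers.get(i) against crcs.get(i); one buffer per checksum entry. */
  static void verify(List<ByteString> buffers, List<Integer> crcs) {
    for (int i = 0; i < buffers.size(); i++) {
      ByteBuffer view = buffers.get(i).asReadOnlyByteBuffer(); // no copy
      CRC32 crc = new CRC32();
      crc.update(view); // consumes the read-only view, not the data itself
      if ((int) crc.getValue() != crcs.get(i)) {
        throw new IllegalStateException("Checksum mismatch in buffer " + i);
      }
    }
  }
}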

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4552

How was this patch tested?

Added unit tests; more are in progress.

@hanishakoneru force-pushed the HDDS-4552 branch 2 times, most recently from bdfe9cb to 577ddad on December 10, 2020 at 23:56
@hanishakoneru (Contributor, Author):

Moved the pipeline-refresh read error handling from BlockInputStream to ChunkInputStream.
Say a ChunkInputStream already has 1MB of data in its buffers when the DNs in the pipeline shut down. If the client tries to read 2MB from this ChunkInputStream, it can serve 1MB from the existing buffers but gets a StorageContainerException when reading the next 1MB from the DN. In that case BlockInputStream discards the first 1MB that was already read, yet the ChunkInputStream has advanced its position to 1MB and, after acquiring a new client, would resume reading from offset 1MB instead of 0.
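A rough sketch of the idea, with made-up class and method names rather than Ozone's actual ones: when the chunk-level stream handles the retry itself, the position it advanced for cached bytes stays authoritative and the retried read resumes at the correct offset.

class ChunkStreamSketch {
  private long position; // next offset within the chunk to read

  int read(byte[] dst, int len) {
    int copied = copyFromCachedBuffers(dst, len); // e.g. the buffered 1MB
    position += copied;
    while (copied < len) {
      try {
        int n = readFromDatanode(dst, copied, position, len - copied);
        position += n;
        copied += n;
      } catch (RuntimeException e) { // e.g. StorageContainerException
        // Refresh the pipeline and retry; 'position' already accounts for
        // the cached bytes, so the retry resumes where the failure occurred.
        refreshPipeline();
      }
    }
    return copied;
  }

  // Stubs for illustration only.
  private int copyFromCachedBuffers(byte[] dst, int len) { return 0; }
  private int readFromDatanode(byte[] dst, int off, long pos, int len) { return len; }
  private void refreshPipeline() { }
}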

cc. @adoroszlai

@hanishakoneru (Contributor, Author):

@bshashikant, can you please take a look at this PR when you get a chance. Thanks.

Contributor:

LOG.info -> LOG.warn?

Contributor:

This adds an extra buffer copy here. Let's see whether we can avoid it, for example using:
https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/UnsafeByteOperations
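For reference, a minimal example of the zero-copy wrap (the caveat being that the wrapped buffer must never be mutated afterwards):

import com.google.protobuf.ByteString;
import com.google.protobuf.UnsafeByteOperations;
import java.nio.ByteBuffer;

class UnsafeWrapExample {
  public static void main(String[] args) {
    ByteBuffer chunkData = ByteBuffer.wrap(new byte[]{1, 2, 3});
    // Wraps the buffer without copying; mutating it afterwards would
    // break ByteString's immutability contract.
    ByteString zeroCopy = UnsafeByteOperations.unsafeWrap(chunkData);
    System.out.println(zeroCopy.size()); // 3
  }
}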

Contributor Author:

We get a ByteString from the response, but the returned ByteString does not carry the underlying buffer boundary information. Hence ByteString#asReadOnlyByteBufferList() returns only one ByteBuffer with all the data, irrespective of the backing arrays used to construct the ByteString.
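A small standalone demo of this behavior (piece sizes are kept above protobuf's small-concatenation threshold, below which concat copies into a single piece):

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;

class BoundaryLossExample {
  public static void main(String[] args) {
    ByteString a = ByteString.copyFrom(new byte[256]);
    ByteString b = ByteString.copyFrom(new byte[256]);
    // A concatenation (RopeByteString) still knows its pieces:
    List<ByteBuffer> pieces = a.concat(b).asReadOnlyByteBufferList();
    System.out.println(pieces.size()); // 2
    // After a round trip through contiguous bytes (as on the wire),
    // the boundary information is gone:
    ByteString parsed = ByteString.copyFrom(a.concat(b).toByteArray());
    System.out.println(parsed.asReadOnlyByteBufferList().size()); // 1
  }
}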

Contributor:

What if we read into small buffers on the server side itself and send them across as a list of ByteStrings to the client?
Copying a big buffer on the client read path will slow down the read. We should probably do some benchmarking to understand the effects of all these options.

@bshashikant commented Dec 24, 2020:

In case this turns out to be unavoidable, we can also consider ByteBuffer.compact(), which also does an intrinsic buffer copy to release the buffers, but the logic would be simpler.

@hanishakoneru commented Jan 7, 2021:

What if we read into small buffers on the server side itself and send them across as a list of ByteStrings to the client?

This might work but would require a change in the DN-Client protocol. Would have to analyze the compatibility issues and how to address them.

In case this turns out to be unavoidable, we can also consider ByteBuffer.compact(), which also does an intrinsic buffer copy to release the buffers, but the logic would be simpler.

I am not sure if there is much gain in doing this. The code changes in this PR were required because the logic was inaccurate. It was working because there was always only one ByteBuffer.

Contributor:

The basic problem we are trying to solve here is to minimize the memory overhead in the client. Adding an extra buffer copy overhead (with the patch) does not seem to be a reasonable way to achieve that. Let's discuss in some more detail how to address this.

@hanishakoneru (Contributor, Author):

Changed the design to avoid buffer copying. Instead of copying read chunk data into smaller buffers on client side, readChunk response will return data as a list of smaller ByteStrings. Please refer to the PR description for more details.

@hanishakoneru (Contributor, Author):

@bshashikant can you please take a look at the updated patch.

@bshashikant (Contributor) left a comment:

Thanks @hanishakoneru for the patch. The patch in general looks good. I am still reviewing it, but have two questions:

  1. I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

  2. Can we add a few acceptance tests to test the compatibility?

@hanishakoneru (Contributor, Author) commented Feb 23, 2021:

I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

The problem with that arises while verifying checksums. Say a chunk has a checksum boundary at every 256KB and we set the default read buffer size to 64KB. To verify a checksum, we would need to combine four 64KB buffers into a read-only 256KB buffer to pass to Checksum.verifyChecksum. That results in exactly the buffer copy we were trying to avoid in the previous design.

Can we add a few acceptance tests to test the compatibility?

Yes, I am working on adding more tests.

@bshashikant (Contributor):

I think we should use the default read buffer size irrespective of whether checksum is disabled or not. It can be the same as the checksum boundary by default.

The problem with that arises while verifying checksums. Say a chunk has a checksum boundary at every 256KB and we set the default read buffer size to 64KB. To verify a checksum, we would need to combine four 64KB buffers into a read-only 256KB buffer to pass to Checksum.verifyChecksum. That results in exactly the buffer copy we were trying to avoid in the previous design.

@hanishakoneru, can we do ByteString.concat in cases where bytes.per.checksum < read buffer size? It doesn't do a buffer copy, as per the documentation here: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString.
"Concatenation is likewise supported without copying (long strings) by building a tree of pieces in RopeByteString".

@hanishakoneru (Contributor, Author):

can we do ByteString.concat in cases where bytes.per.checksum < read buffer size? It doesn't do a buffer copy, as per the documentation here: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString.
"Concatenation is likewise supported without copying (long strings) by building a tree of pieces in RopeByteString".

ByteString concat could work. But the problem is that ChunkBuffer wraps ByteBuffers, and checksums are calculated on ChunkBuffers. We would have to change the whole ChunkBuffer model (or change checksum computation to use ByteStrings instead). I think that change would get very complicated. Also, with concatenated ByteStrings, we would have to track position, limit, etc. separately to follow checksum boundaries.

@hanishakoneru (Contributor, Author):

The existing xcompat acceptance tests added as part of HDDS-4731 should cover most of the testing required for this change.
Thanks @adoroszlai for adding the compatibility tests. In these tests, an old client (v1.0.0) talks to a new server, testing read/write compatibility in different combinations. Please correct me if this is incorrect.
Also, are there any acceptance tests for new clients talking to old servers?

@adoroszlai (Contributor):

ByteString concat could work. But the problem is that ChunkBuffer wraps ByteBuffers, and checksums are calculated on ChunkBuffers. We would have to change the whole ChunkBuffer model (or change checksum computation to use ByteStrings instead).

Just an idea: we have a ChunkBuffer implementation which wraps a list of ByteBuffers, and ByteString#asReadOnlyByteBufferList provides exactly that list. I'm not sure it fully addresses all your needs.

Also, are there any acceptance tests for new clients talking to old servers?

Yes, the same xcompat suite tests this situation.

@hanishakoneru (Contributor, Author):

Thanks @adoroszlai

Just an idea: we have a ChunkBuffer implementation which wraps a list of ByteBuffers, and ByteString#asReadOnlyByteBufferList provides exactly that list. I'm not sure it fully addresses all your needs.

Yes, this PR uses the ByteBufferList implementation of ChunkBuffer to wrap the list of buffers.
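Roughly, the wiring would look like the sketch below; the ChunkBuffer factory call is Ozone-internal and only assumed here to accept a buffer list, so it is left as a comment:

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;

class ChunkBufferWiringSketch {
  static void wrap(ByteString chunkData) {
    // One read-only view per underlying piece of the (rope) ByteString:
    List<ByteBuffer> views = chunkData.asReadOnlyByteBufferList();
    System.out.println("pieces: " + views.size());
    // Ozone-internal (assumed API): ChunkBuffer buffer = ChunkBuffer.wrap(views);
  }
}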

@bshashikant (Contributor) left a comment:

Thanks @hanishakoneru for the explanation. The changes look good, with a few minor suggestions.

Contributor:

Unintended change?

Contributor Author:

Yup. Reverted.

Contributor:

Let's not change the default for now. We can change it once we run some tests and analyze performance.

Contributor Author:

Sure. Reverted back to 1MB.

@bshashikant (Contributor):

Thanks @hanishakoneru. Can you please rebase? Also, do we need to add this to ozone documentation somewhere?

1. Add a ReadChunk version to the request to indicate the response format:
   V0 for returning data as a single ByteString (old format).
   V1 for returning data as a list of ByteStrings, with each ByteString length = number of bytes per checksum.
2. If a chunk does not have checksums, set the buffer capacity to a default (64KB).
3. Return data from a chunk as a list of ByteBuffers instead of a single ByteBuffer.
@hanishakoneru (Contributor, Author) commented Mar 17, 2021:

Also, do we need to add this to ozone documentation somewhere?

Sure, we can open a doc Jira to get this documented. Do you know where we can document this?

@bshashikant (Contributor):

Thanks @hanishakoneru. For documentation, you can refer to https://issues.apache.org/jira/browse/HDDS-4948 for an example.

@hanishakoneru (Contributor, Author):

Thank you @bshashikant. I will merge this shortly.

We currently do not have docs explaining the client read/write path. Do you propose we add docs for that? As this is an internal feature (not configurable), do we want to add it to the docs, or would the javadocs in the code suffice?

@hanishakoneru merged commit 4136d47 into apache:master on Mar 18, 2021
@adoroszlai (Contributor) left a comment:

Thanks @hanishakoneru for working on this, and sorry for the late review. I have a few comments. Would it be possible to address them in a follow-up issue?

throw ex;
}
data.clear();
dataBuffers = null;
Contributor:

If the read from the first location fails and we have to fall back to the temp chunk file, this would cause an exception.

Comment on lines +232 to +254
long bufferCapacity = 0;
if (info.isReadDataIntoSingleBuffer()) {
  // Older client - read all chunk data into one single buffer.
  bufferCapacity = len;
} else {
  // Set buffer capacity to checksum boundary size so that each buffer
  // corresponds to one checksum. If checksum is NONE, then set buffer
  // capacity to default (OZONE_CHUNK_READ_BUFFER_DEFAULT_SIZE_KEY = 64KB).
  ChecksumData checksumData = info.getChecksumData();

  if (checksumData != null) {
    if (checksumData.getChecksumType() ==
        ContainerProtos.ChecksumType.NONE) {
      bufferCapacity = defaultReadBufferCapacity;
    } else {
      bufferCapacity = checksumData.getBytesPerChecksum();
    }
  }
}
// If the buffer capacity is 0, set all the data into one ByteBuffer
if (bufferCapacity == 0) {
  bufferCapacity = len;
}
Contributor:

This block seems to be duplicated from FilePerBlock.... Can it be extracted?
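One possible extraction, as a sketch (the method name and its home are hypothetical, and the types are those of the snippet above), preserving the fallback behavior of the duplicated block:

static long chooseBufferCapacity(ChunkInfo info, long len,
    long defaultReadBufferCapacity) {
  if (info.isReadDataIntoSingleBuffer()) {
    return len; // older client: one single buffer
  }
  long bufferCapacity = 0;
  ChecksumData checksumData = info.getChecksumData();
  if (checksumData != null) {
    bufferCapacity =
        checksumData.getChecksumType() == ContainerProtos.ChecksumType.NONE
            ? defaultReadBufferCapacity // no checksum boundaries to honor
            : checksumData.getBytesPerChecksum(); // one buffer per checksum
  }
  // Fall back to a single buffer when nothing determined a capacity.
  return bufferCapacity == 0 ? len : bufferCapacity;
}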

Comment on lines +412 to +414
return buffersList.stream()
.map(ByteString::asReadOnlyByteBuffer)
.collect(Collectors.toList());
Contributor:

I think we should avoid streams on the read/write path; they were earlier found to cause CPU usage hotspots. See e.g. HDDS-3702.

(Also in a few other instances below.)
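For example, the conversion above could be rewritten as a plain loop (a sketch, keeping the same semantics):

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class NoStreamConversion {
  static List<ByteBuffer> asReadOnlyBuffers(List<ByteString> buffersList) {
    List<ByteBuffer> buffers = new ArrayList<>(buffersList.size());
    for (ByteString byteString : buffersList) {
      buffers.add(byteString.asReadOnlyByteBuffer()); // plain loop, no streams
    }
    return buffers;
  }
}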

@hanishakoneru (Contributor, Author):

Thanks for the review @adoroszlai. I will address them in HDDS-4553 (#2062), which is a follow-up of this Jira.
