
Conversation

@ChenSammi
Contributor

@ChenSammi ChenSammi commented Mar 5, 2024

What changes were proposed in this pull request?

When using TestDFSIO to compare the random read performance of HDFS and Ozone, Ozone is much slower than HDFS. Here is the data from tests in a YCloud cluster.

Test Suite: TestDFSIO

Number of files: 64

File Size: 1024MB
[image: TestDFSIO execution times, HDFS vs Ozone]

And for Ozone itself, sequential read is much faster than random read:
[image: Ozone sequential vs random read execution times]

While for HDFS, there is not much of a gap between its sequential read and random read execution times:
[image: HDFS sequential vs random read execution times]

After some investigation, it was found that the total bytes read from the DN in the TestDFSIO random read test is almost double the data size. The total data to read is 64 * 1024MB = 64GB, while the aggregated DN bytesReadChunk metric increases by 128GB after one test run. The root cause is that when the client reads data, it aligns the requested range to "ozone.client.bytes.per.checksum" boundaries, which is 1MB currently. For example, when reading 1 byte, the client sends a request to the DN to fetch 1MB of data. When reading 2 bytes whose offsets cross a 1MB boundary, the client fetches the first 1MB for the first byte and the second 1MB for the second byte. In random read mode, TestDFSIO uses a read buffer of 1000000 bytes (976.5KB), so most reads straddle a 1MB boundary, which is why the total bytes read from the DN is double the data size.
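To make the alignment effect concrete, here is a minimal sketch (the helper below is hypothetical, not the actual client code) that rounds a requested range out to bytes.per.checksum boundaries, the way the client must fetch whole checksum units in order to verify the data:

public class ChecksumAlignmentSketch {
  // Hypothetical helper: widen [offset, offset + length) to bytesPerChecksum
  // boundaries, since whole checksum units must be fetched for verification.
  static long alignedBytesToRead(long offset, long length, long bytesPerChecksum) {
    long start = (offset / bytesPerChecksum) * bytesPerChecksum;            // round down
    long end = ((offset + length + bytesPerChecksum - 1) / bytesPerChecksum)
        * bytesPerChecksum;                                                 // round up
    return end - start;
  }

  public static void main(String[] args) {
    long oneMB = 1024L * 1024;     // current default bytes.per.checksum
    long sixteenKB = 16L * 1024;   // proposed default
    long buffer = 1_000_000;       // TestDFSIO random read buffer (~976.5KB)

    // A 1,000,000-byte read starting just before a 1MB boundary spans two
    // checksum units, so 2MB is fetched to return ~1MB of data.
    System.out.println(alignedBytesToRead(oneMB - 1, buffer, oneMB));       // 2097152
    // With 16KB units the same read fetches only ~3% more than requested.
    System.out.println(alignedBytesToRead(oneMB - 1, buffer, sixteenKB));   // 1032192
  }
}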

For comparison, HDFS uses the property "file.bytes-per-checksum", which is 512 bytes by default.

To improve the Ozone random read performance, a straightforward idea is to use a smaller "ozone.client.bytes.per.checksum" default value. Here we tested 1MB, 16KB and 8KB, collecting the data with TestDFSIO (64 files, each 512MB).

 
[image: TestDFSIO results for 1MB, 16KB and 8KB bytes.per.checksum]

From the above data, we can see that for the same amount of data:

  • write: the execution time shows no obvious difference across all three cases
  • sequential read: 1MB bytes.per.checksum has the best execution time; 16KB and 8KB have close execution times
  • random read: 1MB has the worst execution time; 16KB and 8KB have close execution times
  • for either 16KB or 8KB bytes.per.checksum, sequential read and random read have close execution times, similar to HDFS behavior

Changing bytes.per.checksum from 1MB to 16KB makes sequential read performance drop a bit, but the performance gain in random read is much bigger than that loss. Applications that rely heavily on random read, such as HBase, Impala and Iceberg (Parquet), will all benefit from this.

So this task proposes to change the ozone.client.bytes.per.checksum default value from the current 1MB to 16KB, and to lower the minimum limit of the property from 16KB to 8KB, to improve the overall read performance.
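For reference, a client that prefers the old trade-off can still override the property per application; below is a minimal sketch using OzoneConfiguration (it assumes the key accepts a storage-size string such as "1MB"; cluster-wide the same key would normally be set in ozone-site.xml):

import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class BytesPerChecksumOverride {
  public static void main(String[] args) {
    // Override the client-side checksum unit for this application only,
    // e.g. keep the old 1MB value for a purely sequential-read workload.
    OzoneConfiguration conf = new OzoneConfiguration();
    conf.set("ozone.client.bytes.per.checksum", "1MB");
    System.out.println(conf.get("ozone.client.bytes.per.checksum"));
  }
}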

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10465

How was this patch tested?

  • existing unit tests
  • performance test with TestDFSIO (find the details in the JIRA)

@jojochuang
Contributor

@duongkame @kerneltime thoughts? Making the smaller size the default across the board seems a little aggressive.

@kerneltime
Contributor

This can cause the payload in RocksDB to increase by 64x. I understand the need to keep the client from doing wasteful work.
Should we benchmark single-stream performance for 16, 32, 64 and 128 KB?

@jojochuang
Contributor

jojochuang commented Mar 5, 2024

This can cause the payload in RocksDB to increase by 64x.

On a side note, we should remove ChecksumType and bytesPerChecksum from ChunkInfo. There's absolutely no reason to have different kinds of checksums within a block. Checksum type and bytes-per-checksum should be a block-level property.

@ChenSammi
Contributor Author

@duongkame @kerneltime thoughts? Making the smaller size the default across the board seems a little aggressive.

Currently there is a large gap between Ozone's sequential read performance and its random read performance with the 1MB bytes.per.checksum value, which only favors applications that rely heavily on sequential read. Other applications, such as HBase, Impala, and Spark with Parquet files, suffer from the poor random read performance.

I think we need a default value for this property that gives balanced sequential read and random read performance. With 16KB as the default bytes.per.checksum, the execution time for sequential read increased from ~50s to ~60s, while the random read execution time improved from ~100s to ~60s (see the tables in the PR description). If an application mixes sequential and random reads, its overall performance will improve.

And it will not only benefit HBase, but most other applications too.

@ChenSammi
Contributor Author

ChenSammi commented Mar 7, 2024

This can cause the payload in RocksDB to increase by 64x. I understand the need to keep the client from doing wasteful work. Should we benchmark single-stream performance for 16, 32, 64 and 128 KB?

@kerneltime, thanks for the review. The RocksDB data size change is a good point. After further investigation, here is the data. Currently the following data structures are used to store block data in RocksDB:

message BlockData {
  required DatanodeBlockID blockID = 1;
  optional int64 flags = 2; // for future use.
  repeated KeyValue metadata = 3;
  repeated ChunkInfo chunks = 4;
  optional int64 size = 5;
}

message ChunkInfo {
  required string chunkName = 1;
  required uint64 offset = 2;
  required uint64 len = 3;
  repeated KeyValue metadata = 4;
  required ChecksumData checksumData = 5;
  optional bytes stripeChecksum = 6;
}

message ChecksumData {
  required ChecksumType type = 1;
  required uint32 bytesPerChecksum = 2;
  repeated bytes checksums = 3;
}

Without ChunkInfo, the serialized BlockData size is 30 bytes. A ChunkInfo with one checksum, like the following, is 53 bytes.

chunks {
  chunkName: "113750153625600001_chunk_1"
  offset: 0
  len: 1048576
  checksumData {
    type: CRC32
    bytesPerChecksum: 1048576
    checksums: "H\271n\332"
  }
} 

A ChunkInfo with two checksums is 59 bytes:

chunks {
  chunkName: "113750153625600001_chunk_1"
  offset: 0
  len: 1048576
  checksumData {
    type: CRC32
    bytesPerChecksum: 524288
    checksums: "\243\245\351\333"
    checksums: "\243\245\351\333"
  }
}

So each checksum takes 6 bytes.
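Based on the sizes measured above (30 bytes of BlockData overhead, 53 bytes for a ChunkInfo carrying one checksum, 6 bytes per additional checksum), the per-block metadata growth can be sketched roughly as below; the 4MB chunk size is an assumption about the write layout, not something measured in this PR:

public class BlockDataSizeEstimate {
  // Rough per-block BlockData size in RocksDB, derived from the figures above:
  // 30 bytes of BlockData overhead, 53 bytes for a ChunkInfo with one checksum,
  // and 6 bytes for each additional checksum in that ChunkInfo.
  static long estimateBlockDataBytes(long blockSize, long chunkSize, long bytesPerChecksum) {
    long chunks = (blockSize + chunkSize - 1) / chunkSize;
    long checksumsPerChunk = Math.max(1, (chunkSize + bytesPerChecksum - 1) / bytesPerChecksum);
    return 30 + chunks * (53 + (checksumsPerChunk - 1) * 6);
  }

  public static void main(String[] args) {
    long MB = 1024L * 1024;
    long assumedChunkSize = 4 * MB; // assumed chunk size, not taken from this PR
    // Full 256MB block: roughly 22x more metadata at 16KB than at 1MB
    // bytes.per.checksum, consistent with the big file case quoted below.
    System.out.println(estimateBlockDataBytes(256 * MB, assumedChunkSize, MB));        // 4574
    System.out.println(estimateBlockDataBytes(256 * MB, assumedChunkSize, 16 * 1024)); // 101342
  }
}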

With Container Schema V3, a single RocksDB holds all BlockData on one disk. Let's assume a high-density storage case, say a 1TB (1024GB) disk, and consider three extreme cases.

  1. Big file case, every block is 256MB and is full.
  2. Medium file case, every block is 1MB.
  3. Very small file case, every block is 16KB.

[image: estimated total BlockData size per 1TB disk for the three cases]

When changing ozone.client.bytes.per.checksum from 1MB to 16KB,

  1. Big file case: total BlockData size changes from 17.9MB to 395.9MB, a 22x increase.
  2. Medium file case: from 83MB to 461MB, a 5.6x increase.
  3. Very small file case: total BlockData size stays at 5312MB, no change.

If a disk is full of big files or medium files, the BlockData size increases by 22x or 5.6x respectively. But given that the total size is still under 400MB/500MB, it is not a problem for RocksDB to handle efficiently.
If a disk is full of small files, the bytes.per.checksum change has no impact.

A real disk will hold a mix of big, medium and small files. Let's assume their occupied disk space ratio is 6:3:1 (no data backs this, it is just an assumption); then the overall BlockData size works out as below, about a 1.6x increase. I believe a 1.6x RocksDB space increase is acceptable.

  1. 1MB bytes.per.checksum: 17.87 * 0.6 + 83 * 0.3 + 5312 * 0.1 = 566.8MB
  2. 16KB bytes.per.checksum: 395.86 * 0.6 + 461 * 0.3 + 5312 * 0.1 = 907.0MB

@ChenSammi
Contributor Author

Also, regarding the BlockData transferred on the wire, @jojochuang has implemented the incremental ChunkInfo feature. With that feature, the size of BlockData transferred on the wire will be significantly reduced.

@ChenSammi ChenSammi requested a review from szetszwo March 20, 2024 04:16
Contributor

@szetszwo szetszwo left a comment


+1 the change looks good. A smaller bytes-per-checksum makes sense to me.

@jojochuang , @kerneltime, do you have further comments?

@ChenSammi ChenSammi requested a review from arp7 March 21, 2024 04:02
@jojochuang jojochuang added the hbase HBase on Ozone support label Mar 21, 2024
@adoroszlai adoroszlai merged commit d49a2b6 into apache:master May 6, 2024
@adoroszlai
Contributor

Thanks @ChenSammi for the patch, @szetszwo for the review.
