
Conversation

@ChenSammi
Contributor

@ChenSammi ChenSammi commented Mar 5, 2024

What changes were proposed in this pull request?

When using TestDFSIO to compare the random read performance of HDFS and Ozone, Ozone is much slower than HDFS. Here is the data from tests in a YCloud cluster.

Test Suite: TestDFSIO

Number of files: 64

File Size: 1024MB
[image: TestDFSIO execution times, HDFS vs Ozone]

And for Ozone itself, sequential read is much faster than random read:
[image: Ozone sequential vs random read execution times]

While for HDFS, there is not much of a gap between its sequential read and random read execution times:
[image: HDFS sequential vs random read execution times]

After some investigation, it was found that the total bytes read from the DN in the TestDFSIO random read test is almost double the data size. The total data to read is 64 * 1024MB = 64GB, while the aggregated DN bytesReadChunk metric increases by 128GB after one test run. The root cause is that when the client reads data, it aligns the requested range to "ozone.client.bytes.per.checksum" boundaries, which is 1MB currently. For example, when reading 1 byte, the client sends a request to the DN to fetch 1MB of data. When reading 2 bytes whose offsets cross a 1MB boundary, the client fetches the first 1MB for the first byte and the second 1MB for the second byte. In random read mode, TestDFSIO uses a read buffer of 1000000 bytes (976.5KB), so most reads straddle a 1MB boundary, which is why the total bytes read from the DN is double the data size.
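To make the alignment effect concrete, here is a minimal sketch (the helper below is hypothetical, not the actual client code) that rounds a requested range out to bytes.per.checksum boundaries, the way the client must fetch whole checksum units in order to verify the data:

public class ChecksumAlignmentSketch {
  // Hypothetical helper: widen [offset, offset + length) to bytesPerChecksum
  // boundaries, since whole checksum units must be fetched for verification.
  static long alignedBytesToRead(long offset, long length, long bytesPerChecksum) {
    long start = (offset / bytesPerChecksum) * bytesPerChecksum;            // round down
    long end = ((offset + length + bytesPerChecksum - 1) / bytesPerChecksum)
        * bytesPerChecksum;                                                 // round up
    return end - start;
  }

  public static void main(String[] args) {
    long oneMB = 1024L * 1024;     // current default bytes.per.checksum
    long sixteenKB = 16L * 1024;   // proposed default
    long buffer = 1_000_000;       // TestDFSIO random read buffer (~976.5KB)

    // A 1,000,000-byte read starting just before a 1MB boundary spans two
    // checksum units, so 2MB is fetched to return ~1MB of data.
    System.out.println(alignedBytesToRead(oneMB - 1, buffer, oneMB));       // 2097152
    // With 16KB units the same read fetches only ~3% more than requested.
    System.out.println(alignedBytesToRead(oneMB - 1, buffer, sixteenKB));   // 1032192
  }
}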

For comparison, HDFS uses the property "file.bytes-per-checksum", which is 512 bytes by default.

To improve the Ozone random read performance, a straightforward idea is to use a smaller "ozone.client.bytes.per.checksum" default value. Here we tested 1MB, 16KB and 8KB, collecting the data with TestDFSIO (64 files, each 512MB).

 
[image: TestDFSIO results for 1MB, 16KB and 8KB bytes.per.checksum]

From the above data, we can see that for the same amount of data:

  • write: the execution time shows no obvious difference across all three cases
  • sequential read: 1MB bytes.per.checksum has the best execution time; 16KB and 8KB have close execution times
  • random read: 1MB has the worst execution time; 16KB and 8KB have close execution times
  • for either 16KB or 8KB bytes.per.checksum, sequential read and random read have close execution times, similar to HDFS behavior

Changing bytes.per.checksum from 1MB to 16KB makes sequential read performance drop a bit, but the performance gain in random read is much bigger than that loss. Applications that rely heavily on random read, such as HBase, Impala and Iceberg (Parquet), will all benefit from this.

So this task proposes to change the ozone.client.bytes.per.checksum default value from the current 1MB to 16KB, and to lower the minimum limit of the property from 16KB to 8KB, to improve the overall read performance.
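For reference, a client that prefers the old trade-off can still override the property per application; below is a minimal sketch using OzoneConfiguration (it assumes the key accepts a storage-size string such as "1MB"; cluster-wide the same key would normally be set in ozone-site.xml):

import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class BytesPerChecksumOverride {
  public static void main(String[] args) {
    // Override the client-side checksum unit for this application only,
    // e.g. keep the old 1MB value for a purely sequential-read workload.
    OzoneConfiguration conf = new OzoneConfiguration();
    conf.set("ozone.client.bytes.per.checksum", "1MB");
    System.out.println(conf.get("ozone.client.bytes.per.checksum"));
  }
}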

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10465

How was this patch tested?

  • existing unit tests
  • performance test with TestDFSIO (find the details in the JIRA)

@jojochuang
Contributor

@duongkame @kerneltime thoughts? Making the smaller size the default across the board seems a little aggressive.

@kerneltime
Contributor

This can cause the payload in RocksDB to increase by 64x. I understand the need to keep the client from doing wasteful work.
Should we benchmark single-stream performance for 16, 32, 64 and 128 KB?

@jojochuang
Contributor

jojochuang commented Mar 5, 2024

This can cause the payload in RocksDB to increase by 64x.

On a side note, we should remove ChecksumType and bytesPerChecksum from ChunkInfo. There's absolutely no reason to have different kinds of checksums within a block. Checksum type and bytes-per-checksum should be a block-level property.

@ChenSammi
Contributor Author

@duongkame @kerneltime thoughts? Making the smaller size the default across the board seems a little aggressive.

Currently there is a large gap between Ozone's sequential read performance and its random read performance with the 1MB bytes.per.checksum value, which only favors applications that rely heavily on sequential read. Other applications, such as HBase, Impala, and Spark with Parquet files, suffer from the poor random read performance.

I think we need a default value for this property that gives balanced sequential read and random read performance. With 16KB as the default bytes.per.checksum, the execution time for sequential read increased from ~50s to ~60s, while the random read execution time improved from ~100s to ~60s (see the tables in the PR description). If an application mixes sequential and random reads, its overall performance will improve.

And it will not only benefit HBase, but most other applications too.

@ChenSammi
Contributor Author

ChenSammi commented Mar 7, 2024

This can cause the payload in RocksDB to increase by 64x. I understand the need to keep the client from doing wasteful work. Should we benchmark single-stream performance for 16, 32, 64 and 128 KB?

@kerneltime, thanks for the review. The RocksDB data size change is a good point. After further investigation, here is the data. Currently the following data structures are used to store block data in RocksDB:

message BlockData {
  required DatanodeBlockID blockID = 1;
  optional int64 flags = 2; // for future use.
  repeated KeyValue metadata = 3;
  repeated ChunkInfo chunks = 4;
  optional int64 size = 5;
}

message ChunkInfo {
  required string chunkName = 1;
  required uint64 offset = 2;
  required uint64 len = 3;
  repeated KeyValue metadata = 4;
  required ChecksumData checksumData = 5;
  optional bytes stripeChecksum = 6;
}

message ChecksumData {
  required ChecksumType type = 1;
  required uint32 bytesPerChecksum = 2;
  repeated bytes checksums = 3;
}

Without ChunkInfo, the serialized BlockData size is 30 bytes. A ChunkInfo with one checksum, like the following, is 53 bytes.

chunks {
  chunkName: "113750153625600001_chunk_1"
  offset: 0
  len: 1048576
  checksumData {
    type: CRC32
    bytesPerChecksum: 1048576
    checksums: "H\271n\332"
  }
} 

A ChunkInfo with two checksums is 59 bytes:

chunks {
  chunkName: "113750153625600001_chunk_1"
  offset: 0
  len: 1048576
  checksumData {
    type: CRC32
    bytesPerChecksum: 524288
    checksums: "\243\245\351\333"
    checksums: "\243\245\351\333"
  }
}

So each checksum takes 6 bytes.
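Based on the sizes measured above (30 bytes of BlockData overhead, 53 bytes for a ChunkInfo carrying one checksum, 6 bytes per additional checksum), the per-block metadata growth can be sketched roughly as below; the 4MB chunk size is an assumption about the write layout, not something measured in this PR:

public class BlockDataSizeEstimate {
  // Rough per-block BlockData size in RocksDB, derived from the figures above:
  // 30 bytes of BlockData overhead, 53 bytes for a ChunkInfo with one checksum,
  // and 6 bytes for each additional checksum in that ChunkInfo.
  static long estimateBlockDataBytes(long blockSize, long chunkSize, long bytesPerChecksum) {
    long chunks = (blockSize + chunkSize - 1) / chunkSize;
    long checksumsPerChunk = Math.max(1, (chunkSize + bytesPerChecksum - 1) / bytesPerChecksum);
    return 30 + chunks * (53 + (checksumsPerChunk - 1) * 6);
  }

  public static void main(String[] args) {
    long MB = 1024L * 1024;
    long assumedChunkSize = 4 * MB; // assumed chunk size, not taken from this PR
    // Full 256MB block: roughly 22x more metadata at 16KB than at 1MB
    // bytes.per.checksum, consistent with the big file case quoted below.
    System.out.println(estimateBlockDataBytes(256 * MB, assumedChunkSize, MB));        // 4574
    System.out.println(estimateBlockDataBytes(256 * MB, assumedChunkSize, 16 * 1024)); // 101342
  }
}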

With Container Schema V3, a single RocksDB holds all BlockData on one disk. Let's assume a high-density storage case, say a 1TB (1024GB) disk, and consider three extreme cases.

  1. Big file case, every block is 256MB and is full.
  2. Medium file case, every block is 1MB.
  3. Very small file case, every block is 16KB.

[image: estimated total BlockData size per 1TB disk for the three cases]

When changing ozone.client.bytes.per.checksum from 1MB to 16KB,

  1. Big file case: total BlockData size changes from 17.9MB to 395.9MB, a 22x increase.
  2. Medium file case: from 83MB to 461MB, a 5.6x increase.
  3. Very small file case: total BlockData size stays at 5312MB, no change.

If a disk is full of big files or medium files, the BlockData size increases by 22x or 5.6x respectively. But given that the total size is still under 400MB/500MB, it is not a problem for RocksDB to handle efficiently.
If a disk is full of small files, the bytes.per.checksum change has no impact.

A real disk will hold a mix of big, medium and small files. Let's assume their occupied disk space ratio is 6:3:1 (no data backs this, it is just an assumption); then the overall BlockData size works out as below, about a 1.6x increase. I believe a 1.6x RocksDB space increase is acceptable.

  1. 1MB bytes.per.checksum: 17.87 * 0.6 + 83 * 0.3 + 5312 * 0.1 = 566.8MB
  2. 16KB bytes.per.checksum: 395.86 * 0.6 + 461 * 0.3 + 5312 * 0.1 = 907.0MB

@ChenSammi
Contributor Author

Also, regarding the BlockData transferred on the wire, @jojochuang has implemented the incremental ChunkInfo feature. With that feature, the size of BlockData transferred on the wire will be significantly reduced.

@ChenSammi ChenSammi requested a review from szetszwo March 20, 2024 04:16
Contributor

@szetszwo szetszwo left a comment


+1 the change looks good. A smaller bytes-per-checksum makes sense to me.

@jojochuang , @kerneltime, do you have further comments?

@ChenSammi ChenSammi requested a review from arp7 March 21, 2024 04:02
@jojochuang jojochuang added the hbase HBase on Ozone support label Mar 21, 2024
@adoroszlai adoroszlai merged commit d49a2b6 into apache:master May 6, 2024
@adoroszlai
Contributor

Thanks @ChenSammi for the patch, @szetszwo for the review.
