HDDS-10465. Change ozone.client.bytes.per.checksum default to 16KB #6331
Conversation
@duongkame @kerneltime thoughts? Making the smaller size the default across the board seems a little aggressive.
This can cause the payload in RocksDB to increase by 64x. I understand the need to keep the client from doing wasteful work.
On a side note, we should remove ChecksumType and bytesPerChecksum from ChunkInfo. There's absolutely no reason to have different kinds of checksums within a block. Checksum type and bytes-per-checksum should be a block-level property.
Currently there is a large gap between Ozone's sequential read performance and its random read performance with the 1MB bytes.per.checksum value, which only favors applications that rely heavily on sequential reads. Other applications, such as HBase, Impala, and Spark with Parquet files, suffer from the poor random read performance. I think we need a default value for this property that balances sequential read and random read performance. With 16KB as the default bytes.per.checksum, the sequential read execution time increased from ~50s to ~60s, while the random read execution time improved from ~100s to ~60s (see the tables in the PR description). For applications that mix sequential and random reads, the overall performance will improve. This benefits not only HBase but most other applications as well.
@kerneltime , thanks for the review. The RocksDB data size change is a good point. After further investigation, here is the data. Currently the following data structures are used to save block data in RocksDB. Without ChunkInfo, the serialized BlockData size is 30 bytes. A ChunkInfo with one checksum, like the following, is 53 bytes; a ChunkInfo with two checksums is 59 bytes, so each additional checksum costs 6 bytes. With Container Schema V3, a single RocksDB instance holds all the BlockData on one disk. Let's assume a high-density storage case, say a 1TB (1024GB) disk, and consider three extreme cases.
When changing ozone.client.bytes.per.checksum from 1MB to 16KB, if a disk is full of big files or medium files, the BlockData size increases by 22 times or 5.6 times respectively. But given that the total size is still less than 400MB/500MB, it is not a big problem for RocksDB to handle efficiently. A real disk will always hold a mix of big, medium, and small files. If we assume their occupied disk space ratio is 6:3:1 (no data backs this, it is just an assumption), the overall BlockData size increase comes out to about 1.6 times. I believe a 1.6x RocksDB space increase is acceptable.
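For what it's worth, the 22x figure for big files can be reproduced with a rough back-of-the-envelope sketch like the one below. It assumes Ozone's default 256MB block and 4MB chunk sizes and the byte counts quoted above (30-byte BlockData base, 53 bytes for a ChunkInfo with one checksum, ~6 bytes per additional checksum); the real serialized sizes depend on the protobuf encoding, so treat the numbers as estimates only.

```java
// Back-of-the-envelope estimate of serialized per-block metadata size for a
// given bytes.per.checksum; the constants are the sizes quoted in this thread
// plus assumed Ozone defaults, not values read from the actual code.
public class BlockMetaSizeEstimate {
  static final long BLOCK_SIZE = 256L << 20;  // assumed default block size, 256MB
  static final long CHUNK_SIZE = 4L << 20;    // assumed default chunk size, 4MB
  static final int BLOCK_DATA_BASE = 30;      // BlockData without any ChunkInfo
  static final int CHUNK_INFO_BASE = 53 - 6;  // ChunkInfo minus its first checksum
  static final int PER_CHECKSUM = 6;          // each checksum adds ~6 bytes

  static long estimate(long bytesPerChecksum) {
    long chunksPerBlock = BLOCK_SIZE / CHUNK_SIZE;
    long checksumsPerChunk =
        (CHUNK_SIZE + bytesPerChecksum - 1) / bytesPerChecksum;
    return BLOCK_DATA_BASE
        + chunksPerBlock * (CHUNK_INFO_BASE + PER_CHECKSUM * checksumsPerChunk);
  }

  public static void main(String[] args) {
    long before = estimate(1L << 20);   // current default: 1MB
    long after = estimate(16L << 10);   // proposed default: 16KB
    System.out.printf("1MB: ~%d bytes/block, 16KB: ~%d bytes/block (~%.0fx)%n",
        before, after, (double) after / before);
  }
}
```

With these assumptions the ratio comes out to roughly 22x for blocks made of full-size chunks, which lines up with the big-file case above; small files with a single short chunk barely grow at all, which is why the blended increase is so much smaller.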
Also regarding the BlockData transferred on the wire, @jojochuang has implemented the incremental ChunkInfo feature. With this feature, the size of BlockData transferred on the wire will be significantly reduced.
szetszwo left a comment
+1 the change looks good. A smaller bytes-per-checksum makes sense to me.
@jojochuang , @kerneltime, do you have further comments?
Thanks @ChenSammi for the patch, @szetszwo for the review.
…pache#6331) (cherry picked from commit d49a2b6)

What changes were proposed in this pull request?
When using TestDFSIO to compare the random read performance of HDFS and Ozone, Ozone is much slower than HDFS. Here are the data tested in a YCloud cluster.
Test Suite: TestDFSIO
Number of files: 64
File Size: 1024MB

And for Ozone itself, sequential read is much faster than random read:

While for HDFS, there is not much of a gap between sequential read and random read execution time:

After some investigation, it was found that the total bytes read from the DNs in the TestDFSIO random read test is almost double the data size. Here the total data to read is 64 * 1024MB = 64GB, while the aggregated DN bytesReadChunk metric increased by 128GB after one test run. The root cause is that when the client reads data, it aligns the requested range with "ozone.client.bytes.per.checksum", which is currently 1MB. For example, to read 1 byte, the client asks the DN for 1MB of data. To read 2 bytes whose offsets straddle a 1MB boundary, the client fetches the first 1MB for the first byte and the second 1MB for the second byte. In random read mode, TestDFSIO uses a read buffer of 1,000,000 bytes (976.5KB), which is why the total bytes read from the DNs is double the data size.
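The alignment arithmetic can be illustrated with a small standalone sketch (this is not the actual Ozone client code, just the boundary math described above; the 1,000,000-byte buffer and offsets are example values):

```java
// Illustrative sketch: how widening a read range to bytes-per-checksum
// boundaries inflates the bytes fetched from the DN.
public class ChecksumAlignmentDemo {

  /** Bytes actually fetched when [offset, offset+length) is aligned
   *  to bytesPerChecksum boundaries. */
  static long alignedReadSize(long offset, long length, long bytesPerChecksum) {
    long start = (offset / bytesPerChecksum) * bytesPerChecksum;
    long end = ((offset + length + bytesPerChecksum - 1) / bytesPerChecksum)
        * bytesPerChecksum;
    return end - start;
  }

  public static void main(String[] args) {
    long buf = 1_000_000;          // TestDFSIO random-read buffer (~976.5KB)
    long oneMb = 1L << 20;
    long sixteenKb = 16L << 10;

    // Reading 2 bytes across a 1MB boundary pulls two full 1MB units.
    System.out.println(alignedReadSize(oneMb - 1, 2, oneMb));      // 2097152
    // A ~1MB buffer read that straddles a boundary also pulls 2MB.
    System.out.println(alignedReadSize(500_000, buf, oneMb));      // 2097152
    // The same read with 16KB units fetches only slightly more than asked.
    System.out.println(alignedReadSize(500_000, buf, sixteenKb));  // 1015808
  }
}
```

With a 1MB checksum unit, almost every 1,000,000-byte random read straddles a boundary and fetches 2MB from the DN, which matches the observed ~2x bytesReadChunk growth; with 16KB units the over-read shrinks to at most a few tens of KB.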
By comparison, HDFS uses the property "file.bytes-per-checksum", which is 512 bytes by default.
To improve Ozone's random read performance, a straightforward idea is to use a smaller "ozone.client.bytes.per.checksum" default value. Here we tested 1MB, 16KB and 8KB, and collected the data using TestDFSIO (64 files, each 512MB).
From the above data, we can see that for the same amount of data, changing bytes.per.checksum from 1MB to 16KB makes sequential read performance drop a bit, but the gain in random read performance is much larger than that loss. Applications that rely heavily on random reads, such as HBase, Impala, and Iceberg (Parquet), will all benefit from this.
So this task proposes to change the ozone.client.bytes.per.checksum default value from the current 1MB to 16KB, and to lower the property's minimum limit from 16KB to 8KB, to improve the overall read performance.
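For clusters that want to pin the value explicitly rather than rely on the default, a client-side override might look like the sketch below (the property name comes from this PR; the accepted value format and config API should be double-checked against your Ozone release):

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class BytesPerChecksumOverride {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Keep the previous 1MB behaviour for a purely sequential workload,
    // or set "8KB" (the proposed new minimum) for heavily random workloads.
    conf.set("ozone.client.bytes.per.checksum", "1MB");
    System.out.println(conf.get("ozone.client.bytes.per.checksum"));
  }
}
```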
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10465
How was this patch tested?