HDDS-3053. Decrease the number of the chunk writer threads #578
Conversation
adoroszlai
left a comment
Thanks @elek for proposing this change. I think it makes sense to reduce the default, the actual config can be tuned as needed.
@mukul1987 @arp7 @bshashikant @lokeshj1703
@elek, whatever tests we conducted (using teragen for runs > 100GB), we never changed this limit. Also, we observed that the queue holding the pending tasks in the write chunk executor used to fill up fast, indicating that all the threads were actually busy doing write chunk work. Though 60 may not be the ideal value, I would still prefer to do some similar tests before changing the default here.
Correct me if I am wrong, but if the disk is slow, the queue will fill up very fast regardless of the number of threads.
You mean a 100G teragen with and without this change? I can do it if that is the suggestion...
I executed teragen and for me it was slightly faster:
@marton, did you check the pending request queue in the leader? How many mappers were used for the test?
No, I didn't. Why is it interesting?
92 (see the link for this and all the other parameters).
With fewer threads, if the pending tasks keep building up in the chunk executor queue because fewer threads are available for work, then once it hits the max limit the main thread itself starts executing the write call, which degrades performance as per the caller-runs policy set on the writeChunkExecutor, as observed in our tests. @elek, if you really think this is causing issues in tests, we can change it to a lower value.
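The behaviour described above is the standard caller-runs rejection policy of a bounded ThreadPoolExecutor. A minimal, self-contained sketch of the mechanism (the class name, queue size, and thread count are illustrative, not the actual writeChunkExecutor construction):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsSketch {
  public static void main(String[] args) throws InterruptedException {
    // Illustrative sizes only, not the real Ozone defaults.
    int writerThreads = 10;
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        writerThreads, writerThreads,
        60, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(256),               // bounded pending-task queue
        new ThreadPoolExecutor.CallerRunsPolicy());  // when full, the caller runs the task

    for (int i = 0; i < 10_000; i++) {
      // Once all writer threads are busy and the queue is full, execute() runs the
      // task on the submitting thread itself, stalling it until the write finishes.
      executor.execute(CallerRunsSketch::writeChunkToDisk);
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
  }

  private static void writeChunkToDisk() {
    // Placeholder for the actual disk write.
  }
}
```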
I think it causes performance degradation if we use a significantly higher number of threads than the number of cores. AFAIK the context switching / thread handling in Java is quite expensive (compared to Go, for example...). And it makes the problems harder to debug.
Agree with @elek. The chunk executor threads are mostly I/O bound; adding an excessive number of threads won't help much but could introduce expensive context-switch overhead. I'm +1 on this change. We may later add some documentation on setting it properly based on the number of physical disks and the number of CPU cores.
Can we make it a data-driven decision? I looked at the linked document and didn't understand the test methodology. E.g. when you say 60 threads / spinning disk, is that 60 chunk executor threads or 60 client threads? If we don't have fully asynchronous IO - and I don't think we do - then we do need multiple threads per volume to keep more IOs in flight.
I want to add one point here.
Good point, and I think they should be separated anyway. But it seems that even just 10 threads is enough to write and read. At least this is what the data shows for us (teragen + load test). If you have any idea how to reproduce any related problem, just let me know and I will try it out. But I can't see any problem during the tests.
There was no client thread in this test. The command was mentioned later in the doc: this is using the ChunkManagerImpl from 60 threads. If you are interested in an e2e test, please check the teragen results (added in a comment).
100% agree. If you need any more data, just let me know. I think we should keep 60 threads only if we have data, as more threads mean more overhead. But if we have data to keep it, I am fine with it. Based on the existing test, 10 threads should be enough as a default (IMHO).
Any more comments / suggestions?
If I remember correctly, the number of threads is currently shared by all pipelines a datanode has, which may span multiple disks. This is not a per-disk configuration. I feel that with more disks per datanode for writes, we may need more threads depending on the load.
I agree; with 1 hard disk I would use 3-4 threads. I proposed 10, which seems to be fine with ~5 disks and seems to be a good, balanced default. Should we use 20 instead of 10? Is there any reason to have exactly 60?
There is no particular logic behind exactly 60. I am OK with making it 20 for now. We need to test with multiple disks and varying load to see how it really impacts performance. We could even decide to make it a factor of the number of data disks available per datanode, but this can be done in a follow-up JIRA.
bshashikant
left a comment
let's change the value to 20.
Let's hold off committing this until we can discuss it further. 20 threads will be insufficient if the number of disks is large.
Let's discuss it further ;-) (I marked the PR with a red flag to avoid an unexpected merge before the thread is read.) I am happy with any number (even 60), if you can explain how the number is chosen. As you wrote earlier:
(yes) My questions:
My answers:
Another proposal to reach consensus: what about checking the number of disks (the number of configured locations) and using the current configuration value as a per-disk number? (We can use ...) Would it be safer?
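A rough sketch of that idea, purely for illustration (the helper name and the volume list are hypothetical; only the interpretation of the configured value as a per-disk number comes from the proposal above):

```java
import java.util.List;

public final class ChunkWriterThreadSizing {

  /**
   * Interprets the configured thread count as a per-disk value and scales it
   * by the number of configured data volume locations.
   */
  static int totalWriteChunkThreads(int threadsPerVolume, List<String> volumeLocations) {
    // e.g. threadsPerVolume = 10 with 5 configured volumes -> 50 threads total
    return threadsPerVolume * Math.max(1, volumeLocations.size());
  }
}
```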
/pending @arp "Let's hold off committing this until we can discuss it further."
Marking this issue as un-mergeable as requested.
Please use a /ready comment when it's resolved.
@arp "Let's hold off committing this until we can discuss it further."
I like this idea, let's do this.
/ready
Blocking review request is removed.
The change looks good to me. In a multi-RAFT setup, will we end up creating 10 * numDisks * numPipelines threads, or will the total number of threads be the same as single RAFT?
I am not sure about the implementation, but IMHO multi-raft creates multiple StateMachine instances, which means that we will have …
ping @arp7. Can we merge it?
arp7
left a comment
+1
Thanks for the review @arp7 @bshashikant @adoroszlai. Happy to have achieved consensus with the help of a long conversation. Will merge it soon.
Merged with f2b5b10
What changes were proposed in this pull request?
As of now we create 60 threads (dfs.container.ratis.num.write.chunk.threads) to write chunk data to the disk. As the write is limited by the IO, I can't see any benefit in having so many threads. A high number of threads means high context-switch overhead, therefore it seems more reasonable to use only a limited number of threads. For example, 10 threads should be enough even with 5 external disks.
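As a sketch only (the class and constant names are made up for illustration; only the property key and the proposed default of 10 come from this PR), the setting could be read with the standard Hadoop Configuration API:

```java
import org.apache.hadoop.conf.Configuration;

public final class WriteChunkThreadConfig {
  // Property discussed in this PR; the default shown is the proposed lower value.
  static final String WRITE_CHUNK_THREADS_KEY =
      "dfs.container.ratis.num.write.chunk.threads";
  static final int WRITE_CHUNK_THREADS_DEFAULT = 10;

  static int getWriteChunkThreads(Configuration conf) {
    return conf.getInt(WRITE_CHUNK_THREADS_KEY, WRITE_CHUNK_THREADS_DEFAULT);
  }
}
```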
If you know any reason to keep the number 60, please let me know...
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3053
How was this patch tested?
It was tested with a custom Freon test (HDDS-3052). The detailed test results are available here: https://hackmd.io/eYNzRqPSRyiMfWQ6mAYSSg