
Conversation

@elek
Member

@elek elek commented Feb 21, 2020

What changes were proposed in this pull request?

As of now we create 60 threads (dfs.container.ratis.num.write.chunk.threads) to write chunk data to disk. Since the writes are limited by IO, I can't see any benefit in having so many threads. A high number of threads means high context-switch overhead, so it seems more reasonable to use only a limited number of threads.

For example, 10 threads should be enough even with 5 external disks.

If you know any reason to keep the number at 60, please let me know...
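
For illustration, a minimal sketch of what this setting controls: the size of the fixed pool serving write chunk requests on a datanode. The key name is the one discussed in this PR; the helper class and the default value shown here are hypothetical, not the actual Ozone datanode code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;

    public final class ChunkExecutorSketch {
      private ChunkExecutorSketch() { }

      // Illustrative helper: build a fixed pool sized by the config key.
      public static ExecutorService createChunkExecutor(Configuration conf) {
        // 10 is the default proposed in this PR; the old default was 60.
        int numThreads =
            conf.getInt("dfs.container.ratis.num.write.chunk.threads", 10);
        return Executors.newFixedThreadPool(numThreads);
      }
    }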

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3053

How was this patch tested?

It was tested with a custom Freon test (HDDS-3052). The detailed test results are available here: https://hackmd.io/eYNzRqPSRyiMfWQ6mAYSSg

Contributor

@adoroszlai adoroszlai left a comment

Thanks @elek for proposing this change. I think it makes sense to reduce the default; the actual config can be tuned as needed.

@elek
Member Author

elek commented Feb 28, 2020

@mukul1987 @arp7 @bshashikant @lokeshj1703
Let me know if you have any opinion or any suggestion for a test before merging this....

@bshashikant
Contributor

@mukul1987 @arp7 @bshashikant @lokeshj1703
Let me know if you have any opinion or any suggestion for a test before merging this....

@elek, in whatever tests we conducted (using teragen for runs > 100GB), we never changed this limit. It was also observed that the queue used to hold the pending tasks in the write chunk executor filled up fast, indicating that all the threads were actually busy doing write chunk work. 60 may not be the ideal value, but I would still prefer to do some similar tests before changing the default here.

@elek
Member Author

elek commented Mar 4, 2020

It was also observed that the queue used to hold the pending tasks in the write chunk executor filled up fast

Correct me if I am wrong, but if the disk is slow, the queue will fill up very fast regardless of the number of threads.

@elek
Member Author

elek commented Mar 4, 2020

but I would still prefer to do some similar tests before changing the default here

You mean a 100G teragen with and without this change? I can do it, if this is the suggestion...

@elek
Member Author

elek commented Mar 6, 2020

@bshashikant

I executed teragen and for me it was slightly faster:

https://hackmd.io/P3Hkf09fTLS78Ho4W82YbQ

@bshashikant
Contributor

@bshashikant

I executed teragen and for me it was slightly faster:

https://hackmd.io/P3Hkf09fTLS78Ho4W82YbQ

@marton, did you check the pending request queue in the leader? How many mappers were used for the test?

@elek
Member Author

elek commented Mar 10, 2020

did you check the pending request queue in the leader?

No I didn't. Why is it interesting?

How many mappers were used for the test?

92 (see the link for this and all the other parameters).

@bshashikant
Contributor

did you check the pending request queue in the leader?

No I didn't. Why is it interesting?

How many mappers were used for the test?

92 (see the link for this and all the other parameters).

With fewer threads available for work, pending tasks keep building up on the chunk executor queue. Once the queue hits its maximum size, the caller-runs policy set on the writeChunkExecutor makes the submitting thread execute the write call itself, which degrades performance, as observed in our tests.
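
For readers unfamiliar with that policy, here is a small self-contained sketch of the caller-runs behaviour described above; the pool and queue sizes are made up for illustration and are not Ozone's actual values.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CallerRunsDemo {
      public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            2, 2,                          // small fixed pool
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(4),   // bounded pending-task queue
            new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 20; i++) {
          final int task = i;
          // Once the queue is full, the task is not handed to the pool:
          // the submitting (main) thread runs it inline, so submission is
          // throttled by the (simulated) disk latency.
          executor.execute(() -> {
            try {
              Thread.sleep(100);           // stand-in for a slow disk write
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
            System.out.println("task " + task + " ran on "
                + Thread.currentThread().getName());
          });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
      }
    }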

@elek, if you really think this seems to be causing issues in tests, we can change it to a lower value.

@elek
Member Author

elek commented Mar 18, 2020

@elek, if you really think this seems to be causing issues in tests, we can change it to a lower value.

I think it causes performance degradation if we use a significantly higher number of threads than the number of cores. AFAIK context switching / thread handling in Java is quite expensive (compared to Go, for example...). And it makes it harder to debug problems.

@xiaoyuyao
Contributor

Agree with @elek. The chunkexecutor threads are mostly I/O bound, so adding an excessive number of threads won't help much but could introduce expensive context-switch overhead. I'm +1 on this change. We may later add some documentation on how to set it properly based on the number of physical disks and the number of CPU cores.

@arp7
Contributor

arp7 commented Mar 18, 2020

Can we make it a data driven decision? I looked at the linked document. I didn't understand the test methodology. E.g. when you say 60 thread / spinning disk, is that 60 chunk executor threads or 60 client threads?

If we don't have full asynchronous IO - and I don't think we do - then we do need multiple threads per volume to keep more IOs in flight.

@bharatviswa504
Contributor

bharatviswa504 commented Mar 18, 2020

I want to add one point here.
The same chunkExecutor threads are used for readStateMachineData. This is used by the leader when sending append entry logs to followers. So, reducing this will also make request processing slow:

        // readStateMachineData path: the lookup runs on the same executor that
        // serves write chunk calls (getChunkExecutor), so reads and writes share
        // the pool sized by dfs.container.ratis.num.write.chunk.threads.
        CompletableFuture.supplyAsync(() -> {
          try {
            future.complete(
                getCachedStateMachineData(entry.getIndex(), entry.getTerm(),
                    requestProto));
          } catch (IOException e) {
            metrics.incNumReadStateMachineFails();
            future.completeExceptionally(e);
          }
          return future;
        }, getChunkExecutor(requestProto.getWriteChunk()));

@elek
Member Author

elek commented Mar 25, 2020

So, reducing this will also make request processing slow

Good point, and I think they should be separated anyway. But it seems that even 10 threads are enough for both writing and reading. At least that is what the data shows for us (teragen + load test).
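
A rough sketch of that separation idea, assuming nothing about the actual ContainerStateMachine internals (class and field names below are made up): one pool for write chunk calls and a second one for readStateMachineData, so slow writes cannot starve the leader's log reads.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SeparateExecutorsSketch {
      private final ExecutorService writeChunkExecutor;
      private final ExecutorService readStateMachineExecutor;

      public SeparateExecutorsSketch(int writeThreads, int readThreads) {
        // Separate, independently sized pools: slow disk writes queue up on
        // writeChunkExecutor without blocking readStateMachineData calls.
        this.writeChunkExecutor = Executors.newFixedThreadPool(writeThreads);
        this.readStateMachineExecutor = Executors.newFixedThreadPool(readThreads);
      }

      public ExecutorService getWriteChunkExecutor() {
        return writeChunkExecutor;
      }

      public ExecutorService getReadStateMachineExecutor() {
        return readStateMachineExecutor;
      }
    }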

If you have any idea how to reproduce any related problem, just let me know and I will try it out. But I can't see any problem during the tests.

@elek
Member Author

elek commented Mar 25, 2020

Can we make it a data driven decision? I looked at the linked document. I didn't understand the test methodology. E.g. when you say 60 thread / spinning disk, is that 60 chunk executor threads or 60 client threads?

There was no client thread in this test. The command was mentioned later in the doc:

60 thread / spinning disk

Command:

ozone freon --verbose cmdw -n1000000 -s 1024 -t60

This is using the ChunkManagerImpl from 60 threads.

If you are interested in an e2e test, please check the teragen results (added in a comment).

Can we make it a data driven decision?

100% agree. If you need any more data, just let me know. I think we should keep 60 threads only if we have data to justify it, as more threads mean more overhead. But if we have data to keep it, I am fine with it.

Based on the existing test, 10 threads should be enough as a default (IMHO)

@elek
Member Author

elek commented Mar 30, 2020

Any more comments / suggestions?

@bshashikant
Contributor

bshashikant commented Mar 31, 2020

Any more comments / suggestions?

If I remember correctly, the number of threads is currently shared by all pipelines a datanode has, which may span multiple disks. This is not a per-disk configuration.

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

@elek
Member Author

elek commented Apr 1, 2020

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

I agree, with 1 hard disk I would use 3-4 threads. I proposed to use 10 which seems to be fine with ~5 disks and seems to be a good, balanced default.

Should we use 20 instead of 10? Is there any reason to have exactly 60?

@bshashikant
Contributor

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

I agree, with 1 hard disk I would use 3-4 threads. I proposed to use 10 which seems to be fine with ~5 disks and seems to be a good, balanced default.

Should we use 20 instead of 10? Is there any reason to have exactly 60?

There is no logic behind using exactly 60. I am ok with making it 20 for now.

We need to test with multiple disks and varying load to see how it really impacts performance. We could even decide to make it a factor of the number of data disks available per datanode, but that can be done in a follow-up jira.

@bshashikant bshashikant self-requested a review April 1, 2020 10:42
Contributor

@bshashikant bshashikant left a comment

let's change the value to 20.

@arp7
Contributor

arp7 commented Apr 1, 2020

Let's hold off committing this until we can discuss it further. 20 threads will be insufficient if the number of disks is large.

@elek
Member Author

elek commented Apr 2, 2020

Let's hold off committing this until we can discuss it further

Let's discuss it further ;-) (I marked the PR with a red flag to avoid an unexpected merge without reading the thread.)

I am happy with any number (even 60) if you can explain how the number is chosen.

As you wrote earlier:

Can we make it a data driven decision?

(yes)

My questions:

  1. What is the expected number of disks (for a default configuration, something like an average)?
  2. How many threads per disk are required for the best performance?
  3. What is the performance penalty of using too many threads (optimizing for a big pool of disks)?

My answers:

  1. 5 disks
  2. 4 threads per disk (I am convinced to use 20 instead of 10 ;-) )
  3. ~10% slowdown (measured data, see my first link: using 10 threads was ~10% faster than using 60 threads)

@elek
Member Author

elek commented Apr 2, 2020

Other proposal to have a consensus:

What about checking the number of disks (the number of configured locations) and using the current configuration value as a per-disk number? (We could use disk * 10 or disk * 8 threads.)

Would that be safer?
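
A minimal sketch of that proposal, assuming the hdds.datanode.dir key lists the datanode's data volumes; the helper and the per-disk factor are hypothetical and only illustrate the idea.

    import org.apache.hadoop.conf.Configuration;

    public final class PerDiskThreadSketch {
      private PerDiskThreadSketch() { }

      // Size the chunk executor as (number of configured data volumes)
      // multiplied by a per-disk thread count.
      public static int chunkExecutorThreads(Configuration conf) {
        String[] dataDirs = conf.getTrimmedStrings("hdds.datanode.dir");
        int numDisks = Math.max(1, dataDirs.length);
        int threadsPerDisk = 10;   // the per-disk value discussed in this thread
        return numDisks * threadsPerDisk;
      }
    }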

@elek
Member Author

elek commented Apr 2, 2020

/pending @arp "Let's hold off committing this until we can discuss it further."

@github-actions github-actions bot left a comment

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

@arp "Let's hold off committing this until we can discuss it further."

@arp7
Contributor

arp7 commented Apr 6, 2020

What about checking the number of disks (the number of configured locations) and using the current configuration value as a per-disk number? (We could use disk * 10 or disk * 8 threads.)

I like this idea, let's do this.

@elek
Member Author

elek commented Apr 17, 2020

/ready

@github-actions github-actions bot dismissed their stale review April 17, 2020 08:13

Blocking review request is removed.

@arp7
Contributor

arp7 commented Apr 21, 2020

The change looks good to me. In a multi-RAFT setup will we end up creating 10 * numDisks * numPipelines threads, or will the total number of threads be the same as single RAFT?

@elek
Member Author

elek commented Apr 24, 2020

In a multi-RAFT setup will we end up creating 10 * numDisks * numPipelines threads, or will the total number of threads be the same as single RAFT?

I am not sure about the implementation, but IMHO multi-raft creates multiple StateMachines, which means that we will have 10 * disks * number of state machines altogether. (Same as before, but -- long-term -- we can reduce that, too, if we see any performance problems).

@elek
Member Author

elek commented Apr 28, 2020

ping @arp7 . Can we merge it?

Contributor

@arp7 arp7 left a comment

+1

@elek
Member Author

elek commented Apr 30, 2020

Thanks for the review @arp7 @bshashikant @adoroszlai

Happy to have achieved consensus with the help of a long conversation.

Will merge it soon.

@elek
Member Author

elek commented May 4, 2020

Merged with f2b5b10

@elek elek closed this May 4, 2020
