
Conversation

@elek
Member

@elek elek commented Feb 21, 2020

What changes were proposed in this pull request?

As of now we create 60 threads (dfs.container.ratis.num.write.chunk.threads) to write chunk data to disk. Since the writes are limited by IO, I can't see any benefit in having so many threads. A high number of threads means high context-switch overhead, so it seems more reasonable to use only a limited number of threads.

For example, 10 threads should be enough even with 5 external disks.

If you know any reason to keep the number at 60, please let me know...
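
For illustration, a minimal sketch of what this setting controls: the size of the fixed pool serving write chunk requests on a datanode. The key name is the one discussed in this PR; the helper class and the default value shown here are hypothetical, not the actual Ozone datanode code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;

    public final class ChunkExecutorSketch {
      private ChunkExecutorSketch() { }

      // Illustrative helper: build a fixed pool sized by the config key.
      public static ExecutorService createChunkExecutor(Configuration conf) {
        // 10 is the default proposed in this PR; the old default was 60.
        int numThreads =
            conf.getInt("dfs.container.ratis.num.write.chunk.threads", 10);
        return Executors.newFixedThreadPool(numThreads);
      }
    }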

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3053

How was this patch tested?

It was tested with a custom Freon test (HDDS-3052). The detailed test results are available here: https://hackmd.io/eYNzRqPSRyiMfWQ6mAYSSg

Contributor

@adoroszlai adoroszlai left a comment

Thanks @elek for proposing this change. I think it makes sense to reduce the default; the actual config can be tuned as needed.

@elek
Member Author

elek commented Feb 28, 2020

@mukul1987 @arp7 @bshashikant @lokeshj1703
Let me know if you have any opinion or any suggestion for a test before merging this....

@bshashikant
Contributor

@mukul1987 @arp7 @bshashikant @lokeshj1703
Let me know if you have any opinion or any suggestion for a test before merging this....

@elek, in whatever tests we conducted (using teragen for runs > 100GB), we never changed this limit. It was also observed that the queue used to hold the pending tasks in the write chunk executor filled up fast, indicating that all the threads were actually busy doing write chunk work. 60 may not be the ideal value, but I would still prefer to do some similar tests before changing the default here.

@elek
Member Author

elek commented Mar 4, 2020

It was also observed that the queue used to hold the pending tasks in the write chunk executor filled up fast

Correct me if I am wrong, but if the disk is slow, the queue will fill up very fast regardless of the number of threads.

@elek
Member Author

elek commented Mar 4, 2020

but I would still prefer to do some similar tests before changing the default here

You mean a 100G teragen with and without this change? I can do it, if this is the suggestion...

@elek
Member Author

elek commented Mar 6, 2020

@bshashikant

I executed teragen and for me it was slightly faster:

https://hackmd.io/P3Hkf09fTLS78Ho4W82YbQ

@bshashikant
Contributor

@bshashikant

I executed teragen and for me it was slightly faster:

https://hackmd.io/P3Hkf09fTLS78Ho4W82YbQ

@marton, did you check the pending request queue in the leader? How many mappers were used for the test?

@elek
Member Author

elek commented Mar 10, 2020

did you check the pending request queue in the leader?

No I didn't. Why is it interesting?

How many mappers were used for the test?

92 (see the link for this and all the other parameters).

@bshashikant
Contributor

did you check the pending request queue in the leader?

No I didn't. Why is it interesting?

How many mappers were used for the test?

92 (see the link for this and all the other parameters).

With fewer threads available for work, pending tasks keep building up on the chunk executor queue. Once the queue hits its maximum size, the caller-runs policy set on the writeChunkExecutor makes the submitting thread execute the write call itself, which degrades performance, as observed in our tests.
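
For readers unfamiliar with that policy, here is a small self-contained sketch of the caller-runs behaviour described above; the pool and queue sizes are made up for illustration and are not Ozone's actual values.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CallerRunsDemo {
      public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            2, 2,                          // small fixed pool
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(4),   // bounded pending-task queue
            new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 20; i++) {
          final int task = i;
          // Once the queue is full, the task is not handed to the pool:
          // the submitting (main) thread runs it inline, so submission is
          // throttled by the (simulated) disk latency.
          executor.execute(() -> {
            try {
              Thread.sleep(100);           // stand-in for a slow disk write
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
            System.out.println("task " + task + " ran on "
                + Thread.currentThread().getName());
          });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
      }
    }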

@elek, if you really think this seems to be causing issues in tests, we can change it to a lower value.

@elek
Member Author

elek commented Mar 18, 2020

@elek, if you really think this seems to be causing issues in tests, we can change it to a lower value.

I think it causes performance degradation if we use a significantly higher number of threads than the number of cores. AFAIK context switching / thread handling in Java is quite expensive (compared to Go, for example...). And it makes it harder to debug problems.

@xiaoyuyao
Contributor

Agree with @elek. The chunkexecutor threads are mostly I/O bound, so adding an excessive number of threads won't help much but could introduce expensive context-switch overhead. I'm +1 on this change. We may later add some documentation on how to set it properly based on the number of physical disks and the number of CPU cores.

@arp7
Contributor

arp7 commented Mar 18, 2020

Can we make it a data driven decision? I looked at the linked document. I didn't understand the test methodology. E.g. when you say 60 thread / spinning disk, is that 60 chunk executor threads or 60 client threads?

If we don't have full asynchronous IO - and I don't think we do - then we do need multiple threads per volume to keep more IOs in flight.

@bharatviswa504
Contributor

bharatviswa504 commented Mar 18, 2020

I want to add one point here.
The same chunkExecutor threads are used for readStateMachineData. This is used by the leader when sending append entry logs to followers. So, reducing this will also make request processing slow:

        // readStateMachineData path: the lookup runs on the same executor that
        // serves write chunk calls (getChunkExecutor), so reads and writes share
        // the pool sized by dfs.container.ratis.num.write.chunk.threads.
        CompletableFuture.supplyAsync(() -> {
          try {
            future.complete(
                getCachedStateMachineData(entry.getIndex(), entry.getTerm(),
                    requestProto));
          } catch (IOException e) {
            metrics.incNumReadStateMachineFails();
            future.completeExceptionally(e);
          }
          return future;
        }, getChunkExecutor(requestProto.getWriteChunk()));

@elek
Member Author

elek commented Mar 25, 2020

So, reducing this will also make request processing slow

Good point, and I think they should be separated anyway. But it seems that even 10 threads are enough for both writing and reading. At least that is what the data shows for us (teragen + load test).
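
A rough sketch of that separation idea, assuming nothing about the actual ContainerStateMachine internals (class and field names below are made up): one pool for write chunk calls and a second one for readStateMachineData, so slow writes cannot starve the leader's log reads.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SeparateExecutorsSketch {
      private final ExecutorService writeChunkExecutor;
      private final ExecutorService readStateMachineExecutor;

      public SeparateExecutorsSketch(int writeThreads, int readThreads) {
        // Separate, independently sized pools: slow disk writes queue up on
        // writeChunkExecutor without blocking readStateMachineData calls.
        this.writeChunkExecutor = Executors.newFixedThreadPool(writeThreads);
        this.readStateMachineExecutor = Executors.newFixedThreadPool(readThreads);
      }

      public ExecutorService getWriteChunkExecutor() {
        return writeChunkExecutor;
      }

      public ExecutorService getReadStateMachineExecutor() {
        return readStateMachineExecutor;
      }
    }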

If you have any idea how to reproduce any related problem, just let me know and I will try it out. But I can't see any problem during the tests.

@elek
Member Author

elek commented Mar 25, 2020

Can we make it a data driven decision? I looked at the linked document. I didn't understand the test methodology. E.g. when you say 60 thread / spinning disk, is that 60 chunk executor threads or 60 client threads?

There was no client thread in this test. The command was mentioned later in the doc:

60 thread / spinning disk

Command:

ozone freon --verbose cmdw -n1000000 -s 1024 -t60

This is using the ChunkManagerImpl from 60 threads.

If you are interested in an e2e test, please check the teragen results (added in a comment).

Can we make it a data driven decision?

100% agree. If you need any more data, just let me know. I think we should keep 60 threads only if we have data to justify it, as more threads mean more overhead. But if we have data to keep it, I am fine with it.

Based on the existing test, 10 threads should be enough as a default (IMHO)

@elek
Member Author

elek commented Mar 30, 2020

Any more comments / suggestions?

@bshashikant
Contributor

bshashikant commented Mar 31, 2020

Any more comments / suggestions?

If I remember correctly, the number of threads is currently shared by all pipelines a datanode has, which may span multiple disks. This is not a per-disk configuration.

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

@elek
Member Author

elek commented Apr 1, 2020

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

I agree, with 1 hard disk I would use 3-4 threads. I proposed to use 10 which seems to be fine with ~5 disks and seems to be a good, balanced default.

Should we use 20 instead of 10? Is there any reason to have exactly 60?

@bshashikant
Contributor

I feel that with more disks per datanode for writes, we may need more threads depending on the load.

I agree, with 1 hard disk I would use 3-4 threads. I proposed to use 10 which seems to be fine with ~5 disks and seems to be a good, balanced default.

Should we use 20 instead of 10? Is there any reason to have exactly 60?

There is no logic behind using exactly 60. I am ok with making it 20 for now.

We need to test with multiple disks and varying load to see how it really impacts performance. We could even decide to make it a factor of the number of data disks available per datanode, but that can be done in a follow-up jira.

@bshashikant bshashikant self-requested a review April 1, 2020 10:42
Contributor

@bshashikant bshashikant left a comment

let's change the value to 20.

@arp7
Contributor

arp7 commented Apr 1, 2020

Let's hold off committing this until we can discuss it further. 20 threads will be insufficient if the number of disks is large.

@elek
Member Author

elek commented Apr 2, 2020

Let's hold off committing this until we can discuss it further

Let's discuss it further ;-) (I marked the PR with a red flag to avoid an unexpected merge without reading the thread.)

I am happy with any number (even 60) if you can explain how the number is chosen.

As you wrote earlier:

Can we make it a data driven decision?

(yes)

My questions:

  1. What is the expected number of disks (for a default configuration, something like an average)?
  2. How many threads per disk are required for the best performance?
  3. What is the performance penalty of using too many threads (optimizing for a big pool of disks)?

My answers:

  1. 5 disks
  2. 4 threads per disk (I am convinced to use 20 instead of 10 ;-) )
  3. ~10% slowdown (measured data, see my first link: using 10 threads was ~10% faster than using 60 threads)

@elek
Member Author

elek commented Apr 2, 2020

Other proposal to have a consensus:

What about checking the number of disks (the number of configured locations) and using the current configuration value as a per-disk number? (We could use disk * 10 or disk * 8 threads.)

Would that be safer?
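
A minimal sketch of that proposal, assuming the hdds.datanode.dir key lists the datanode's data volumes; the helper and the per-disk factor are hypothetical and only illustrate the idea.

    import org.apache.hadoop.conf.Configuration;

    public final class PerDiskThreadSketch {
      private PerDiskThreadSketch() { }

      // Size the chunk executor as (number of configured data volumes)
      // multiplied by a per-disk thread count.
      public static int chunkExecutorThreads(Configuration conf) {
        String[] dataDirs = conf.getTrimmedStrings("hdds.datanode.dir");
        int numDisks = Math.max(1, dataDirs.length);
        int threadsPerDisk = 10;   // the per-disk value discussed in this thread
        return numDisks * threadsPerDisk;
      }
    }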

@elek
Member Author

elek commented Apr 2, 2020

/pending @arp "Let's hold off committing this until we can discuss it further."

@github-actions github-actions bot left a comment

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

@arp "Let's hold off committing this until we can discuss it further."

@arp7
Contributor

arp7 commented Apr 6, 2020

What about checking the number of disks (the number of configured locations) and using the current configuration value as a per-disk number? (We could use disk * 10 or disk * 8 threads.)

I like this idea, let's do this.

@elek
Member Author

elek commented Apr 17, 2020

/ready

@github-actions github-actions bot dismissed their stale review April 17, 2020 08:13

Blocking review request is removed.

@arp7
Contributor

arp7 commented Apr 21, 2020

The change looks good to me. In a multi-RAFT setup will we end up creating 10 * numDisks * numPipelines threads, or will the total number of threads be the same as single RAFT?

@elek
Member Author

elek commented Apr 24, 2020

In a multi-RAFT setup will we end up creating 10 * numDisks * numPipelines threads, or will the total number of threads be the same as single RAFT?

I am not sure about the implementation, but IMHO multi-raft creates multiple StateMachines, which means that we will have 10 * disks * number of state machines altogether. (Same as before, but -- long-term -- we can reduce that, too, if we see any performance problems).

@elek
Member Author

elek commented Apr 28, 2020

ping @arp7 . Can we merge it?

Contributor

@arp7 arp7 left a comment

+1

@elek
Member Author

elek commented Apr 30, 2020

Thanks for the review @arp7 @bshashikant @adoroszlai

Happy to have achieved consensus with the help of a long conversation.

Will merge it soon.

@elek
Member Author

elek commented May 4, 2020

Merged with f2b5b10

@elek elek closed this May 4, 2020
