
Conversation

@bharatviswa504
Contributor

What changes were proposed in this pull request?

Make container report processing multi-threaded.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5341

How was this patch tested?

Added UT.
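The change can be pictured as handing each incoming full container report (FCR) to a fixed-size executor instead of a single handler thread. A minimal sketch with hypothetical names, not the actual Ozone classes:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: dispatch each datanode's container report to a
// fixed-size pool instead of processing all reports on one thread.
class ContainerReportDispatcher {
  private final ExecutorService pool;
  private final AtomicInteger processed = new AtomicInteger();

  ContainerReportDispatcher(int threads) {
    this.pool = Executors.newFixedThreadPool(threads);
  }

  // Each report is handled independently; a real handler must also think
  // about per-datanode ordering (e.g. hashing a datanode to one executor).
  void onReport(String datanodeId, List<Long> containerIds) {
    pool.execute(() -> {
      // Placeholder for the replica-map updates done by the report handler.
      processed.incrementAndGet();
    });
  }

  // Drain the pool and return how many reports were processed.
  int awaitProcessed() {
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return processed.get();
  }
}
```

The sketch deliberately ignores ordering: if two reports from the same datanode can run concurrently, the newer replica state could be overwritten by the older one, which is why per-datanode serialization matters in a real SCM.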

@GlenGeng-awx
Contributor

cc @GlenGeng @JacksonYao287 @ChenSammi

Contributor

@JacksonYao287 left a comment


Thanks @bharatviswa504 for this work; it does make sense to use a multi-threaded handler for container reports.

@bharatviswa504
Contributor Author

Thank you @JacksonYao287 for the review. I have addressed the review comments.

@sodonnel
Contributor

I've got a couple of questions on this topic:

  1. Does it make sense to thread pool the Incremental Reports too? I know they are much smaller, but they are also much more frequent. I have not seen any evidence of them backing up, but it's worth considering / checking if we can.
  2. I believe the DNs send a FCR every 60 - 90 seconds. Is that frequency really needed? Unknown bugs aside, is there any reason to send a FCR after startup and first registration? On HDFS, datanodes only send full block reports every 6 hours by default. If ICRs carry all the required information for SCM, perhaps we should increase the FCR interval to an hour or more?
  3. Have we been able to capture any profiles (e.g. flame charts) of processing a large FCR to see if there is anything to be optimised in that flow?

@JacksonYao287
Contributor

JacksonYao287 commented Jun 23, 2021

@sodonnel

  1. Does it make sense to thread pool the Incremental Reports too? I know they are much smaller, but they are also much more frequent. I have not seen any evidence of them backing up, but it's worth considering / checking if we can.

As far as I know, when a heartbeat from a datanode arrives at SCM, it is queued for processing with the timestamp of when it arrived. There is a heartbeat processing thread inside SCM that runs at a specified interval. So I think the point is how many reports are queued at SCM within that interval and how fast the report handler can deal with them. The total number of Incremental Reports in the default interval (3s) is not very large, but it makes sense to promote it to a thread pool if needed in the future.
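The queue-and-drain flow described above can be sketched as follows (names are illustrative, not the actual SCM code): heartbeats land in a concurrent queue stamped with their arrival time, and a single processing thread drains the queue once per interval.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of the heartbeat flow: arrivals are queued with a
// timestamp and drained periodically, so the backlog depends on the
// arrival rate versus the handler's throughput per interval.
class HeartbeatQueue {
  static final class Heartbeat {
    final String datanodeId;
    final long arrivalMillis;

    Heartbeat(String datanodeId, long arrivalMillis) {
      this.datanodeId = datanodeId;
      this.arrivalMillis = arrivalMillis;
    }
  }

  private final ConcurrentLinkedQueue<Heartbeat> queue =
      new ConcurrentLinkedQueue<>();

  // Called on the RPC path when a heartbeat arrives.
  void enqueue(String datanodeId) {
    queue.add(new Heartbeat(datanodeId, System.currentTimeMillis()));
  }

  // Called by the processing thread each interval: drain whatever queued.
  int drainAndProcess() {
    int count = 0;
    Heartbeat hb;
    while ((hb = queue.poll()) != null) {
      count++; // a real handler would dispatch the reports carried by hb
    }
    return count;
  }
}
```

With this shape, thread-pooling the drain step only helps when the per-interval backlog outgrows what one thread can process, which matches the observation that small, frequent ICRs may not need it yet.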

  2. I believe the DNs send a FCR every 60 - 90 seconds. Is that frequency really needed? Unknown bugs aside, is there any reason to send a FCR after startup and first registration? On HDFS, datanodes only send full block reports every 6 hours by default. If ICRs carry all the required information for SCM, perhaps we should increase the FCR interval to an hour or more?

Yeah, I think this makes sense. Too many FCRs will increase the burden on SCM.

@bharatviswa504
Contributor Author

bharatviswa504 commented Jun 23, 2021

I've got a couple of questions on this topic:

  1. Does it make sense to thread pool the Incremental Reports too? I know they are much smaller, but they are also much more frequent. I have not seen any evidence of them backing up, but it's worth considering / checking if we can.

Previously with ICRs, we used to send a full container report; that is fixed by HDDS-5111. We shall test this PR together with HDDS-5111 using huge container reports from each DN. If we observe an issue with ICRs, we can add a thread pool for them as well.

  2. I believe the DNs send a FCR every 60 - 90 seconds. Is that frequency really needed? Unknown bugs aside, is there any reason to send a FCR after startup and first registration? On HDFS, datanodes only send full block reports every 6 hours by default. If ICRs carry all the required information for SCM, perhaps we should increase the FCR interval to an hour or more?

During startup/registration we need to send a full container report, as the ContainerSafeMode rule depends on it for validation. We also fire the container report event, where we process container reports and build the container replica set.

But I completely agree that we can change the full container report interval to a larger value. We don't need a value as large as HDFS's, since our container report size should be much smaller compared to HDFS.

From our scale testing: with 9 DNs, each filled with 500 TB of data, we have seen around 350K containers in the cluster, so a total of about 1 million replicas will be reported from all DNs. (Compared with HDFS, our container report is far smaller.)
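As a sanity check on the figures above, assuming the default 3x replication factor: 350K containers give roughly 1.05M replicas, and 9 x 500 TB of raw capacity implies an average logical container size of about 4 GB, consistent with containers filled close to Ozone's typical 5 GB size. The helper below is purely illustrative:

```java
// Back-of-the-envelope check of the scale-test numbers quoted above,
// assuming the default replication factor of three.
class ScaleEstimate {
  static final long REPLICATION_FACTOR = 3;

  // Total replicas reported across all datanodes.
  static long totalReplicas(long containers) {
    return containers * REPLICATION_FACTOR;
  }

  // Average logical container size in GB implied by the quoted numbers:
  // raw capacity / replication factor / container count.
  static long avgContainerSizeGb(long datanodes, long tbPerDatanode,
      long containers) {
    long rawGb = datanodes * tbPerDatanode * 1024;
    return rawGb / REPLICATION_FACTOR / containers;
  }
}
```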

  3. Have we been able to capture any profiles (e.g. flame charts) of processing a large FCR to see if there is anything to be optimized in that flow?

We have not profiled at that level; we shall look into this in future testing.

Contributor

@bshashikant left a comment


Looks good.

@bharatviswa504
Contributor Author

Thank you @JacksonYao287, @bshashikant, and @sodonnel for the review/comments.
I will commit this shortly.

@sodonnel In our scale testing we have not observed an issue with ICR processing; if it is needed in the future, we can add that. As of now, this PR provides the general framework needed to support multi-threaded processing.

@bharatviswa504 merged commit 450c375 into apache:master Jun 25, 2021
errose28 added a commit to errose28/ozone that referenced this pull request Jul 7, 2021
* master: (28 commits)
  HDDS-5332. Add a new column family and a service provider in Recon DB for Namespace Summaries (apache#2366)
  HDDS-5405. Refactor pom files for HadoopRpc and Grpc/Ratis compilation properties. (apache#2386)
  HDDS-5406. add proto version to all the proto files. (apache#2385)
  HDDS-5398. Avoid object creation in ReplicationManger debug log statements (apache#2379)
  HDDS-5396. Fix negligence issue conditional expressions in MockCRLStore.java (apache#2380)
  HDDS-5395. Avoid unnecessary numKeyOps.incr() call in OMMetrics (apache#2374)
  HDDS-5389. Include ozoneserviceid in fs.defaultFS when configuring o3fs (apache#2370)
  HDDS-5383. Eliminate expensive string creation in debug log messages (apache#2372)
  HDDS-5380. Get more accurate space info for DedicatedDiskSpaceUsage. (apache#2365)
  HDDS-5341. Container report processing is single threaded (apache#2338)
  HDDS-5387. ProfileServlet to move the default output location to an ozone specific directory (apache#2368)
  HDDS-5289. Update container's deleteTransactionId on creation of the transaction in SCM. (apache#2361)
  HDDS-5369. Cleanup unused configuration related to SCM HA (apache#2359)
  HDDS-5381. SCM terminated with exit status 1: null. (apache#2362)
  HDDS-5353. Avoid unnecessary executeBatch call in insertAudits (apache#2342)
  HDDS-5350 :  Add allocate block support in MockOmTransport (apache#2341). Contributed by Uma Maheswara Rao G.
  HDDS-4926. Support start/stop for container balancer via command line (apache#2278)
  HDDS-5269. Datandoe with low ratis log volume space should not be considered for new pipeline allocation. (apache#2344)
  HDDS-5367. Update modification time when updating quota/storageType/versioning (apache#2355)
  HDDS-5352. java.lang.ClassNotFoundException: org/eclipse/jetty/alpn/ALPN (apache#2347)
  ...
