HDDS-3082. Refactor recon missing containers task to detect under, over and mis-replicated containers. #994


Merged

sodonnel merged 6 commits into apache:master from sodonnel:HDDS-3082-recon-fsck

Jun 4, 2020

Contributor

sodonnel commented May 29, 2020

What changes were proposed in this pull request?

The current Recon "Missing Containers Task" only highlights missing containers in the cluster.

It should also detect under-replicated, over-replicated and mis-replicated containers.

To support this, the existing database table MISSING_CONTAINERS has been renamed to UNHEALTHY_CONTAINERS, with the following definition:

container_id bigint NOT NULL,
container_state varchar(16) NOT NULL,
in_state_since bigint NOT NULL,
expected_replica_count integer,
actual_replica_count integer,
replica_delta integer NOT NULL,
reason varchar(500)

The container state can be MISSING, UNDER_REPLICATED, OVER_REPLICATED or MIS_REPLICATED.

By design, a container that is MISSING is not recorded in any of the other states.

However, a container can be both under-replicated and mis-replicated, or in theory both over-replicated and mis-replicated, at the same time. In that case there are two rows in the database for a single container.
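
The state rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this patch: the class, enum and method names are hypothetical, the `placementViolated` flag stands in for whatever placement check the task really uses, and treating "zero replicas" as MISSING is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only; all names here are hypothetical and not taken from the patch. */
class ContainerStateSketch {

  enum UnhealthyState { MISSING, UNDER_REPLICATED, OVER_REPLICATED, MIS_REPLICATED }

  /**
   * Returns the states a single container would be recorded under.
   * Assumption: a container with zero replicas is MISSING.
   */
  static Set<UnhealthyState> classify(int expected, int actual,
      boolean placementViolated) {
    Set<UnhealthyState> states = EnumSet.noneOf(UnhealthyState.class);
    if (actual == 0) {
      // MISSING excludes every other state.
      states.add(UnhealthyState.MISSING);
      return states;
    }
    if (actual < expected) {
      states.add(UnhealthyState.UNDER_REPLICATED);
    } else if (actual > expected) {
      states.add(UnhealthyState.OVER_REPLICATED);
    }
    // Mis-replication is independent of the replica count, so it can be
    // reported alongside under- or over-replication (two rows in the table).
    if (placementViolated) {
      states.add(UnhealthyState.MIS_REPLICATED);
    }
    return states;
  }

  public static void main(String[] args) {
    System.out.println(classify(3, 0, true));  // [MISSING]
    System.out.println(classify(3, 2, true));  // [UNDER_REPLICATED, MIS_REPLICATED]
    System.out.println(classify(3, 3, false)); // []
  }
}
```

Each element of the returned set would correspond to one row for that container; an empty set means the container is healthy and has no row.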

Each time the "Container Health task" runs, it scans all the existing records, updates any counts and removes any records that are no longer valid.

Then it processes all other containers, which have no records in the unhealthy_containers table.

The job is split into two parts to avoid querying the database for every single container on each run.
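
The two-part run described above could look roughly like this. It is a simplified sketch under stated assumptions, not the actual ContainerHealthTask: an in-memory map stands in for the database table, only the replica count is tracked (one row per container), and the names `TwoPassScanSketch`, `unhealthy`, `run` and the fixed `EXPECTED` replica count are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Illustrative only; all names here are hypothetical and not taken from the patch. */
class TwoPassScanSketch {

  /** Assumed replication factor for every container in this sketch. */
  static final int EXPECTED = 3;

  /** Stand-in for the table: container ID -> last recorded replica count. */
  final Map<Long, Integer> unhealthy = new HashMap<>();

  /** @param liveReplicaCounts current replica count of every known container */
  void run(Map<Long, Integer> liveReplicaCounts) {
    Set<Long> alreadyHandled = new HashSet<>();

    // Part 1: one scan over the existing records. Update counts that
    // changed and drop records that are no longer valid.
    Iterator<Map.Entry<Long, Integer>> rows = unhealthy.entrySet().iterator();
    while (rows.hasNext()) {
      Map.Entry<Long, Integer> row = rows.next();
      alreadyHandled.add(row.getKey());
      Integer actual = liveReplicaCounts.get(row.getKey());
      if (actual == null || actual == EXPECTED) {
        rows.remove();            // container gone or healthy again
      } else if (!actual.equals(row.getValue())) {
        row.setValue(actual);     // still unhealthy, count changed
      }
    }

    // Part 2: check every container that had no record, without a
    // per-container lookup against the table.
    for (Map.Entry<Long, Integer> c : liveReplicaCounts.entrySet()) {
      if (!alreadyHandled.contains(c.getKey()) && c.getValue() != EXPECTED) {
        unhealthy.put(c.getKey(), c.getValue());
      }
    }
  }

  public static void main(String[] args) {
    TwoPassScanSketch task = new TwoPassScanSketch();
    task.unhealthy.put(1L, 2);    // recorded earlier as under-replicated
    Map<Long, Integer> live = new HashMap<>();
    live.put(1L, 3);              // now healthy: record is removed
    live.put(2L, 4);              // new over-replicated container
    task.run(live);
    System.out.println(task.unhealthy); // {2=4}
  }
}
```

The set of IDs seen in part 1 is what lets part 2 skip containers that already have records, so the table is read once per run rather than once per container.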

This change only adjusts the job and the backend storage. A follow-up change is needed to update the REST endpoints so that they expose the new container states to users and the UI.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3082

How was this patch tested?

New and existing unit tests


          Implement new ContainerHealthTask

a4a2668

sodonnel requested a review from avijayanhwx

May 29, 2020 16:24


          Fix style issue

6765db0

avijayanhwx requested a review from vivekratnavel

May 29, 2020 17:13

vivekratnavel approved these changes

Contributor

vivekratnavel left a comment

+1 LGTM.

Posted a few minor suggestions inline.

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java Outdated

+                      containers.forEach(container ->
+                          processContainer(container, currentTime));
+                      recordSingleRunCompletion();
+                      LOG.info("Missing Container task Thread took {} milliseconds for" +

Contributor

vivekratnavel May 29, 2020

Suggested change

      
-                    LOG.info("Missing Container task Thread took {} milliseconds for" +
+                    LOG.info("Container Health task thread took {} milliseconds for" +

Contributor Author

sodonnel Jun 1, 2020

Well spotted. I have fixed this.

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java Outdated

+                   * already set to. We only need to run a DB update statement if the record
+                   * has really changed. The methods below ensure we do not update the Jooq
+                   * record unless the values have changed and hence save a DB execution
+                   * when

Contributor

vivekratnavel May 29, 2020

Please fix the dangling statement

Contributor Author

sodonnel Jun 1, 2020

Fixed.

avijayanhwx reviewed

S O'Donnell added 4 commits

June 1, 2020 13:06


          Address review comments

1473d03


          Trigger CI checks

20bffd9


          Trigger CI checks

1c52fd1


          Refactor code to avoid passing null containers

d9d7e9e

Contributor

avijayanhwx commented Jun 3, 2020

Thank you @sodonnel. LGTM +1

sodonnel merged commit ac64ab6 into apache:master

isahekmat pushed a commit to isahekmat/hadoop-ozone that referenced this pull request


          HDDS-3082. Refactor recon missing containers task to detect under, ov…

8ec0ca2

…er and mis-replicated containers. (apache#994)
