Skip to content

Conversation

@sodonnel
Copy link
Contributor

What changes were proposed in this pull request?

The current Recon "Missing Containers Task" only highlights missing containers in the cluster.

It is desired for it to also detect under, over and mis-replicated containers.

In order to do that, the existing database table MISSING_CONTAINERS has been renamed to UNHEALTHY_CONTAINERS, with the definition:

container_id bigint NOT NULL,
container_state varchar(16) NOT NULL,
in_state_since bigint not null,
expected_replica_count integer,
actual_replica_count integer,
replica_delta integer not null,
reason varchar(500)

The container state can be MISSING, UNDER_REPLICATED, OVER_REPLICATED or MIS_REPLICATED.

A design decision was made so that if a container is MISSING, then it is not in any of the other states.

However, it can be both under and mis-replicated or in theory over and mis-replicated at the same time and this would result in two rows in the database for a single container.

Each time the "Container Health task" runs, it scans all the existing records, updates any counts and removes any records that are no longer valid.

Then it processes all other containers without any records in the unhealthy_containters table.

The reason the job is split into two parts, is to avoid the need to query the database for every single container on each run.

This change only adjusts the job and the backend storage. An additional change is needed to change the rest endpoints to expose the new container states to the users and UI.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3082

How was this patch tested?

New and existing unit tests

@sodonnel sodonnel requested a review from avijayanhwx May 29, 2020 16:24
@avijayanhwx avijayanhwx requested a review from vivekratnavel May 29, 2020 17:13
Copy link
Contributor

@vivekratnavel vivekratnavel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.

Posted a few minor suggestions inline.

containers.forEach(container ->
processContainer(container, currentTime));
recordSingleRunCompletion();
LOG.info("Missing Container task Thread took {} milliseconds for" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LOG.info("Missing Container task Thread took {} milliseconds for" +
LOG.info("Container Health task thread took {} milliseconds for" +

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted. I have fixed this.

* already set to. We only need to run a DB update statement if the record
* has really changed. The methods below ensure we do not update the Jooq
* record unless the values have changed and hence save a DB execution
* when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the dangling statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@avijayanhwx
Copy link
Contributor

Thank you @sodonnel. LGTM +1

@sodonnel sodonnel merged commit ac64ab6 into apache:master Jun 4, 2020
isahekmat pushed a commit to isahekmat/hadoop-ozone that referenced this pull request Jun 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants