Fix health-check to check actual health #521

sbernauer · 2023-11-02T07:23:06Z

Affected version

0.0.0-dev

Current and expected behavior

We lost data in a demo because NiFi was complaining about not reaching ZooKeeper and the health checks did not notice it.
Simply restarting the pod solved the problem, which would have happened automatically if the livenessProbe had detected it.

Currently the livenessProbe looks like this:

    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: https
      timeoutSeconds: 1

While the numbers themselves are debatable (e.g. why have an initialDelaySeconds when we have a startup probe?) and a readinessProbe is missing, the most important point is that a simple check on the port is not enough.

Possible solution

We should instead use the NiFi REST API (https://nifi.apache.org/docs/nifi-docs/rest-api/) to check the actual node health. I fear the most complicated part will be auth (e.g. adding a static user with an operator-created random secret and putting it in the authentication chain).
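
A rough sketch of what such a probe could look like, not a tested implementation. It assumes a dedicated probe user whose credentials are exposed to the container as the hypothetical environment variables PROBE_USER and PROBE_PASSWORD, that NiFi listens on port 8443 and accepts requests addressed to localhost, and that curl and bash are available in the image. It requests a token from /nifi-api/access/token and then reads /nifi-api/flow/cluster/summary. Whether connectedToCluster actually turns false in the ZooKeeper failure described above still needs to be verified.

    livenessProbe:
      exec:
        command:
          - /bin/bash
          - -c
          - |
            # Assumptions: PROBE_USER / PROBE_PASSWORD are hypothetical env vars
            # filled from an operator-created Secret; localhost:8443 is a
            # placeholder for the node's HTTPS endpoint. --insecure is used only
            # for brevity, a real probe should trust the cluster CA instead.
            token="$(curl --silent --fail --insecure \
              --data-urlencode "username=${PROBE_USER}" \
              --data-urlencode "password=${PROBE_PASSWORD}" \
              https://localhost:8443/nifi-api/access/token)" || exit 1
            # Fail the probe unless the node reports being connected to the cluster.
            curl --silent --fail --insecure \
              --header "Authorization: Bearer ${token}" \
              https://localhost:8443/nifi-api/flow/cluster/summary \
              | grep --quiet --extended-regexp '"connectedToCluster"[[:space:]]*:[[:space:]]*true'
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5

The timeoutSeconds value is raised from 1 to 5 here only because the sketch makes two HTTPS requests per check; the right number is as debatable as the others. Requesting a fresh token on every probe run may also be too expensive, so caching it could be worth considering.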

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

yes

sbernauer added the type/bug label Nov 2, 2023
