Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fluentd tailing check is incorrect - observability, routed logs #2025

Closed
svteb opened this issue May 15, 2024 · 0 comments
Closed

[BUG] Fluentd tailing check is incorrect - observability, routed logs #2025

svteb opened this issue May 15, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@svteb
Copy link
Collaborator

svteb commented May 15, 2024

Describe the bug
There seems to be a small bug in function app_tailed_by_fluentd found in fluentbit.cr, fluentd_bitnami.cr, fluentd.cr (personally I've only ran into the issue with the fluentd_bitnami.cr file, but the other files seem to use the exact same function so they are also likely faulty).

The function in question:

def self.app_tailed_by_fluentd?(pod_name, match=nil)
    Log.info { "bitnami app_tailed_by_fluentd pod_name: #{pod_name} match: #{match}"}
    match = match() unless match
    Log.info { "app_tailed_by_fluentd match: #{match}"}
    found = false
    fluentd_pods = KubectlClient::Get.pods_by_digest(match[:digest])
    Log.info { "fluentd_pods match: #{fluentd_pods}"}
    fluentd_pods.each do |fluentd|
      pod_name = fluentd.dig("metadata","name").as_s        <---- reuse of the variable passed to the function (erroneous)
      logs = KubectlClient.logs(pod_name, namespace: TESTSUITE_NAMESPACE)
      Log.debug { "fluentd logs: #{logs}"}
      found = logs[:output].to_s.includes?(pod_name)
    end
    Log.info { "fluentd found match: #{found}"}
    found
  end

What should presumably be happening is that a cnf pod_name gets passed to the function, in spec tests this is coredns-xyz, afterwards the logs of each fluentd pod should be retrieved and checked for presence of the coredns-xyz string. The issue is that the pod_name variable is rewritten and thus the coredns-xyz string is lost in the process and instead the string that is searched for will be in the format fluentd-xyz. This is likely a mistake, but for some reason the test has a success rate of around 10% on my machine (no idea). And what's more the test is somehow consistently passing on github actions.

Furthermore I managed to get this test to pass by simply changing the variable name in the for loop so that it would not get rewritten and it passes correctly.

To Reproduce
Steps to reproduce the behavior:

  1. export LOG_LEVEL=info
  2. crystal build src/cnf-testsuite.cr
  3. ./cnf-testsuite setup
  4. crystal spec spec/workload/observability_spec.cr:155
  5. Should fail.

Expected behavior
The fluentd logs should be checked for the actual cnfs name, not for their own name.

Additional context
There should really be a big big refactor of the fluentd family of functions/files in the project, I really do not see the reason for this replication across 3 files. What I also noticed is that on older versions of kubernetes (1.23.) the fluentd fails after deploying because of readiness and liveness probes which have some faulty endpoints:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  2m                 default-scheduler  Successfully assigned cnf-testsuite/fluentd-x6fn6 to minikube
  Normal   Pulled     115s               kubelet            Container image "docker.io/bitnami/fluentd:1.17.0-debian-12-r0" already present on machine
  Normal   Created    113s               kubelet            Created container tmp-dir-permissions
  Normal   Started    112s               kubelet            Started container tmp-dir-permissions
  Normal   Pulled     110s               kubelet            Container image "docker.io/bitnami/fluentd:1.17.0-debian-12-r0" already present on machine
  Normal   Created    106s               kubelet            Created container fluentd
  Normal   Started    105s               kubelet            Started container fluentd
  Warning  Unhealthy  39s (x7 over 99s)  kubelet            Readiness probe failed: Get "http://10.244.0.10:9880/fluentd.healthcheck?json=%7B%22ping%22%3A+%22pong%22%7D": dial tcp 10.244.0.10:9880: connect: connection refused
  Warning  Unhealthy  39s                kubelet            Liveness probe failed: Get "http://10.244.0.10:9880/fluentd.healthcheck?json=%7B%22ping%22%3A+%22pong%22%7D": dial tcp 10.244.0.10:9880: connect: connection refused

These endpoints seem to fail on newer versions of kubernetes as well (1.28/1.29), but at least they do not kill the pod and are treated as a warning, there should probably be some additional fluentd configuration.

@svteb svteb added the bug Something isn't working label May 15, 2024
svteb added a commit to svteb/testsuite that referenced this issue May 16, 2024
Ref: cnti-testcatalog#2025 cnti-testcatalog#2016
- The issue explained in cnti-testcatalog#2025 where fluent was tailing itself
instead of the CNF has been resolved.
- The dead helm repo mentioned in cnti-testcatalog#2016 has been replace by link to bitnami repo,
there was a similar occurence in the spec test.
- The family of fluentd functions and files have been refactored into configurable
files (allows for future additions/checking of elastic, aws fluent, etc.).
- The routed_logs test was also modified to conform to the new functions (also shortened).
- Other files have been edited for completion/conformance with naming conventions.
Signed-off-by: svteb <[email protected]>
svteb added a commit to svteb/testsuite that referenced this issue May 17, 2024
svteb added a commit to svteb/testsuite that referenced this issue May 29, 2024
Refs: cnti-testcatalog#2025
- Reworked the config dictionary into classes that inherit
from a base abstract class.
- Renamed FluentManagement module to FluentManager
- Interfaces changed to adhere to the new abstraction

Signed-off-by: svteb <[email protected]>
martin-mat pushed a commit that referenced this issue Jun 9, 2024
Ref: #2025 #2016
- The issue explained in #2025 where fluent was tailing itself
instead of the CNF has been resolved.
- The dead helm repo mentioned in #2016 has been replace by link to bitnami repo,
there was a similar occurence in the spec test.
- The family of fluentd functions and files have been refactored into configurable
files (allows for future additions/checking of elastic, aws fluent, etc.).
- The routed_logs test was also modified to conform to the new functions (also shortened).
- Other files have been edited for completion/conformance with naming conventions.
Signed-off-by: svteb <[email protected]>
martin-mat pushed a commit that referenced this issue Jun 9, 2024
martin-mat pushed a commit that referenced this issue Jun 9, 2024
Refs: #2025
- Reworked the config dictionary into classes that inherit
from a base abstract class.
- Renamed FluentManagement module to FluentManager
- Interfaces changed to adhere to the new abstraction

Signed-off-by: svteb <[email protected]>
martin-mat pushed a commit that referenced this issue Jun 10, 2024
)

* Fix: Routed logs test now correctly checks if CNF is being tailed
Ref: #2025 #2016
- The issue explained in #2025 where fluent was tailing itself
instead of the CNF has been resolved.
- The dead helm repo mentioned in #2016 has been replace by link to bitnami repo,
there was a similar occurence in the spec test.
- The family of fluentd functions and files have been refactored into configurable
files (allows for future additions/checking of elastic, aws fluent, etc.).
- The routed_logs test was also modified to conform to the new functions (also shortened).
- Other files have been edited for completion/conformance with naming conventions.
Signed-off-by: svteb <[email protected]>

* Fix: Corrected a require command in observability_spec.cr
Refs: #2025

Signed-off-by: svteb <[email protected]>

* Refactor: better abstraction of fluent flavors
Refs: #2025
- Reworked the config dictionary into classes that inherit
from a base abstract class.
- Renamed FluentManagement module to FluentManager
- Interfaces changed to adhere to the new abstraction

Signed-off-by: svteb <[email protected]>

---------

Signed-off-by: svteb <[email protected]>
@svteb svteb closed this as completed Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant