Logs are being lost from a service running on Kubernetes. #1545

Open
slopezxrd opened this issue Nov 27, 2024 · 2 comments

slopezxrd commented Nov 27, 2024

Describe the bug

Our environment runs in a private Kubernetes cluster, and fluentd is used to monitor the logs.
The fluentd configuration tails the log files generated by Kubernetes in /var/logs/containers/.
There are several microservices, and a different regex is applied to each one to extract the information into fields;
Kubernetes metadata is then added with @type kubernetes_metadata. Finally, these logs are written to two ELKs and to a
local file. In case of failure, there is a secondary output into which the failed records are inserted.

One of the microservices has high traffic and generates a large volume of logs.
Every so often, log entries are lost both in the ELKs and in the files.

We suspect this is related to the fast rotation of the log files. At some point, a rotated file is finished and,
instead of continuing with the next rotated file, fluentd continues with the current log file.
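
For reference, a minimal in_tail sketch of the rotation-related options that may be relevant here (follow_inodes, rotate_wait and refresh_interval are documented in_tail parameters; the values are illustrative, not our production settings):

    <source>
      @type tail
      path /var/logs/containers/**app-ms**.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      # Track files by inode so a rotated file is read to the end before the
      # watcher moves on (intended for wildcard paths with frequent rotation).
      follow_inodes true
      # Keep watching a rotated file for a while after rotation.
      rotate_wait 30
      # Refresh the watch list against the path glob more often than the 60s default.
      refresh_interval 10
      <parse>
        @type none
      </parse>
    </source>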

To Reproduce

Fast rotation of the log files.

Expected behavior

Log processing keeps up with the rotation of the log files, so that this lag does not occur.
- If the lag does occur (temporarily), all log entries are still inserted, even if they are inserted late.

Your Environment

fluent/fluentd-kubernetes-daemonset:v1.16.2-debian-elasticsearch7-1.0 as the main image.
The following images have also been tested:
 - v1.16.2-debian-elasticsearch7-1.0
 - v1.16.2-debian-elasticsearch7-1.1
 - v1.16.2-debian-elasticsearch8-1.0
 - v1.16.2-debian-elasticsearch8-1.1
 - v1.16.3-debian-elasticsearch8-1.0
 - v1.16.3-debian-elasticsearch8-2.0
 - v1.16.3-debian-elasticsearch8-2.1
 - v1.16.5-debian-elasticsearch7-1.3
 - v1.16.5-debian-elasticsearch8-1.3
 - v1.17-debian-elasticsearch8-1

Your Configuration

    <source>
      @type tail
      @id in_tail_container_logs
      @label @KUBERNETES
      path /var/logs/containers/**app-ms**.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      time_key time
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_type string
          time_format %Y-%m-%dT%H:%M:%S.%N%z
          keep_time_key true
        </pattern>
        <pattern>
          format regexp
          expression /^(?<log>.*)$/
          keep_time_key false
        </pattern>
      </parse>
      emit_unmatched_lines true
    </source>

    <label @KUBERNETES>
      <match kubernetes.var.log.containers.fluentd**>
        @type relabel
        @label @FLUENT_LOG
      </match>

      <match kubernetes.var.log.containers.**_kube-system_**>
        @type null
        @id ignore_kube_system_logs
      </match>

      <match fluent.**>
        # this tells fluentd to not output its own log on stdout
        @type null
      </match>

      <filter kubernetes.**>
        @type concat
        key log
        multiline_start_regexp /^\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3}/
        continuous_line_regexp /^(?!\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3})/
        separator ""
        flush_interval 5
        timeout_label @LOGS
      </filter>
      <match **>
        @type relabel
        @label @LOGS
      </match>
    </label>

    <label @LOGS>
      <filter kubernetes.**app-ms**>
        @type parser
        key_name log
        <parse>
          @type regexp
          expression ^(?<log>((?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3})\s+(?<level>[^\s]+)?\s+(?<pid>\d+).*?\[\s*(?<thread>.*)\]\s+(?<class>.*)\s+:\s+(?<message>((?<idTraza>[\w-]+)\s-\s)?((?<operacion>[\w-]+)\.?\s-\s)?((?<caja>\w+)\.?\s-)?(.*?codigo":"(?<codigo>\w+?)")?(.*?texto":"(?<texto>[^"]+)")?(.*?txStatus":"(?<txStatus>[^"]+)")?(.*tiempo:\s(?<ms>\d+)\sms)?.*))?(\s|.)*)
        </parse>
      </filter>

      <filter kubernetes.**>
        @type kubernetes_metadata
        merge_json_log true
        preserve_json_log true
      </filter>

      <filter kubernetes.**>
        @type grep
        <regexp>
            key $.kubernetes.labels.fluentd
            pattern true
        </regexp>
      </filter>

      <match **>
        @type relabel
        @label @OUTPUT
      </match>
    </label>

  04_outputs.conf: |-
    <label @OUTPUT>
      <match **app-ms**>
        @type copy
        <store>
          @include elastic.conf_common
          data_stream_name log-pods-pro.app-ms
          <secondary>
            @type secondary_file
            directory /var/log/trazas/error/app
            basename error.app.logs
          </secondary>
        </store>
        <store>
          @include new_elastic.conf_common
          data_stream_name log-pods-pro.app-ms
          <secondary>
            @type secondary_file
            directory /var/log/trazas/error/app
            basename error.app.logs
          </secondary>
        </store>
        <store>
          @include file_save.conf_common
          path /var/log/trazas/file/app
        </store>
      </match>
    </label>
  elastic.conf_common: |-
    @type elasticsearch_data_stream
    host "url-elastic.es"
    port 443
    path ""
    request_timeout 20s
    scheme https
    user fluentd_pro
    password *************
    data_stream_name log-pods-pro.app-ms
    ssl_verify false
    include_tag_key true
    reload_connections true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reconnect_on_error true
    reload_on_failure false
    flush_interval 5s
    num_threads 2
    logstash_format false
    utc_index false
    time_key time
    time_key_exclude_timestamp true
    include_timestamp true
    <buffer>
      flush_thread_count 2
      flush_interval 5s
      retry_max_times 1
    </buffer>
  new_elastic.conf_common: |-
    @type elasticsearch_data_stream
    host "new_url-elastic.es"
    port 443
    path ""
    request_timeout 20s
    scheme https
    user fluentd_pro
    password *************
    data_stream_name log-pods-pro.app-ms
    ssl_verify false
    include_tag_key true
    reload_connections true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reconnect_on_error true
    reload_on_failure false
    flush_interval 5s
    num_threads 2
    logstash_format false
    utc_index false
    time_key time
    time_key_exclude_timestamp true
    include_timestamp true
    <buffer>
      flush_thread_count 2
      flush_interval 5s
      retry_max_times 1
    </buffer>
  file_save.conf_common: |-
    @type file
    compress gzip
    <format>
      localtime true
    </format>
    <buffer time>
      timekey_wait 1m
      timekey 10m
      timekey_use_utc false
    </buffer>
    <inject>
      time_format %Y%m%dT%H%M%S%z
      localtime false
    </inject>
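
Note that the buffer blocks above use retry_max_times 1, so a single failed retry already hands the records to the <secondary> file. For reference, a sketch of a more tolerant buffer block (all parameters below are standard Fluentd buffer options; the values are purely illustrative and are not the configuration we run):

    <buffer>
      @type file
      # hypothetical on-disk buffer location
      path /var/log/fluentd-buffers/app-ms
      flush_thread_count 2
      flush_interval 5s
      # allow several retries before records are handed to <secondary>
      retry_max_times 5
      retry_wait 2s
      # cap buffer usage and block the input instead of dropping chunks
      total_limit_size 2GB
      overflow_action block
    </buffer>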

Your Error Log

No error log.

Additional context

Tests performed, without success:

- Inserting only into the local file
- Inserting only into the ELKs
- Managing the log file and its rotation manually
- Modifying the fluentd configuration (the source, and the buffer in the output)

Observations:
When there is a large volume of logs, the insertion into ELK falls behind,
until it stops writing and then resumes with entries from a few minutes later.
The files rotate every few minutes (3-4 minutes).
In one of the tests, we noticed that the position file contained two entries for the same file identifier:
one of the newly generated files had been created with the same identifier (inode) as a file that had been deleted (but was still kept
in the position file).
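
For illustration only (hypothetical file names and values; the in_tail position file stores tab-separated path, byte offset in hex, and inode in hex), a duplicated state of this kind would look roughly like two paths sharing the same inode:

    /var/logs/containers/app-ms-abc123.log	000000000001f4c0	0000000000a1b2c3
    /var/logs/containers/app-ms-def456.log	0000000000000000	0000000000a1b2c3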

@Watson1978

I guess this is related to fluent/fluentd#4693 ...

@Watson1978 Watson1978 self-assigned this Dec 4, 2024
@Watson1978 Watson1978 moved this from Triage to Work-In-Progress in Fluentd Kanban Dec 4, 2024

jicowan commented Dec 12, 2024

@slopezxrd I've experienced a similar issue that you can read about in fluent/fluentd#4693. I have temporarily worked around the issue by increasing the maximum log file size (the default is 10MB) on the kubelet. Increasing the size will reduce the frequency at which logs are rotated. I suspect the issue is related to inode tracking when logs are rotated quickly and often. I've noticed that when logs go missing there's usually a warning in the fluentd logs that says Skip update_watcher because watcher has been already updated by other inotify event. I've tried using different types of buffers, e.g. file, hybrid, and memory; increasing the memory and CPU; using multiple workers; etc. None of them seem to have an impact. At significantly high volumes (>1000 log lines per second per container) fluentd has trouble reading and tracking when logs are rotated (I have had up to 10 of these containers running on 1 instance).
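
For anyone applying the same workaround, a sketch of the relevant kubelet settings (containerLogMaxSize and containerLogMaxFiles are standard KubeletConfiguration fields; the values shown are illustrative, not a recommendation):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Raise the per-container log file size before rotation (default is 10Mi),
    # which lowers the rotation frequency fluentd has to keep up with.
    containerLogMaxSize: 100Mi
    # Number of rotated log files kept per container (default is 5).
    containerLogMaxFiles: 5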
