Is the encoding parameter being used as the Documentation states ? #483

gadiego92 opened this issue Nov 23, 2023 · 2 comments

gadiego92 commented Nov 23, 2023

Describe the bug

The Official Documentation states regarding the encoding for the tail plugin:

encoding, from_encoding

type     default                                version
string   nil (string encoding is ASCII-8BIT)    0.14.0

Specifies the encoding of reading lines.

By default, in_tail emits string value as ASCII-8BIT encoding.

These options change it:

  • If encoding is specified, in_tail changes string to encoding. This uses Ruby's String#force_encoding.

  • If encoding and from_encoding both are specified, in_tail tries to encode string from from_encoding to encoding. This uses Ruby's String#encode.

source: tail#encoding-from_encoding
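
The difference between the two Ruby methods named in the quoted documentation can be seen in plain Ruby. The following is a minimal sketch that does not involve Fluentd at all; the sample strings and variable names are assumptions chosen for illustration.

```ruby
# Bytes of "Zürich" encoded as UTF-8, tagged as binary (ASCII-8BIT).
raw = "Z\xC3\xBCrich".b

# String#force_encoding only changes the encoding tag; the bytes are untouched.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabeled.encoding            # UTF-8
puts relabeled.bytes == raw.bytes  # true

# String#encode transcodes: the bytes are rewritten for the target encoding.
# Here the same word is stored as ISO-8859-1 (a single byte for "ü").
latin1 = "Z\xFCrich".b
transcoded = latin1.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
puts transcoded.bytes == latin1.bytes  # false
puts transcoded == relabeled           # true
```

In short, force_encoding is only correct when the bytes are already in the target encoding, while encode is needed when the bytes must actually change.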

I have been checking the Fluentd source code and found the following:

  1. Regarding the first bullet.
    I think the encoding parameter is not being used as the documentation states.
    I cannot find any call to String#force_encoding that uses the encoding parameter.
    On the other hand, I have found String#force_encoding called with the from_encoding parameter in a few places.
    I think line 992 might be wrong:
    https://github.com/fluent/fluentd/blob/74db9477f445ef83384eca6da8d6c2049945d8cd/lib/fluent/plugin/in_tail.rb#L992
    If the documentation is correct, String#force_encoding should use the encoding value, not the from_encoding value.

  2. Regarding the second bullet.
    The documentation states that String#encode is used when the from_encoding parameter is set. However, it seems String#encode is also used whenever you set the encoding parameter to something other than ASCII-8BIT, because from_encoding defaults to ASCII-8BIT. For example, String#encode is used if you set encoding to UTF-8, whereas according to the documentation String#force_encoding should be used when only the encoding parameter is set.
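
The behaviour described in the two points above can be sketched in standalone Ruby. This is not the in_tail source: `convert_line` is a hypothetical helper, and the replacement of unconvertible bytes is an assumption made here because it would match the `��` seen in the output of this report.

```ruby
# Hypothetical helper, NOT the in_tail implementation.
# It models the behaviour described in this report:
#   * from_encoding falls back to ASCII-8BIT when it is not set
#   * the raw bytes are tagged with from_encoding (String#force_encoding)
#   * String#encode runs only when the two encodings differ
def convert_line(bytes, encoding: nil, from_encoding: nil)
  from = from_encoding || Encoding::ASCII_8BIT
  to   = encoding || Encoding::ASCII_8BIT
  line = bytes.dup.force_encoding(from)
  return line if from == to

  # Assumption: bytes without a mapping are replaced instead of raising.
  line.encode(to, from, invalid: :replace, undef: :replace)
end

raw = "Z\xC3\xBCrich".b # UTF-8 bytes of "Zürich" as read from the file

# Only encoding set: transcoding from ASCII-8BIT cannot map the two
# non-ASCII bytes, so each one becomes U+FFFD.
only_encoding = convert_line(raw, encoding: Encoding::UTF_8)

# Both set to UTF-8: the bytes are only relabeled and stay intact.
both_utf8 = convert_line(raw, encoding: Encoding::UTF_8,
                              from_encoding: Encoding::UTF_8)

puts only_encoding
puts both_utf8
```

Under these assumptions, the first call yields the damaged string and the second yields the intact one, which is consistent with the two outputs shown in this report.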

To Reproduce

Start a Fluentd container with the Grok parser plugin (fluent-plugin-grok-parser) installed.

Then run the command:

td-agent --config /home/td-agent/fluentd.conf

Expected behavior

2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}

Your Environment

- Fluentd version: 1.11.2
- TD Agent version: 1.11.2
- Operating system: Alma Linux 9
- Kernel version: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64

Your Configuration

# /home/td-agent/patterns.conf

CUSTOM_LOG_WORKS %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
# HTTPDATE has ä character
# Source: https://github.com/fluent/fluent-plugin-grok-parser/blob/903dfe222984b90c4e1c1151530038d1f242157d/patterns/legacy/grok-patterns#L51
CUSTOM_LOG_FAILS %{HTTPDATE:timestamp} %{NUMBER:response}

# /tmp/encoding-test.log

2023-11-22 18:18:09.823+0100 Testing Zürich
2023-11-22 18:18:09.823+0100 Testing Geneva

# /home/td-agent/fluentd.conf

<source>

  @type tail

  path /tmp/encoding-test.log
  read_from_head true
  encoding UTF-8
  tag encoding

  <parse>

    @type grok

    grok_failure_key grokfailure
    custom_pattern_path /home/td-agent/patterns.conf

    <grok>
       pattern %{CUSTOM_LOG_FAILS:message}
    </grok>

    <grok>
       pattern %{CUSTOM_LOG_WORKS:message}
    </grok>

  </parse>

</source>

<match encoding>

    @type stdout

</match>

Your Error Log

[td-agent@dc60c1c5967e ~]$ /opt/td-agent/bin/fluentd --config /home/td-agent/fluentd.conf
2023-11-23 15:58:45 +0100 [info]: parsing config file is succeeded path="/home/td-agent/fluentd.conf"
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.1.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-kafka' version '0.14.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus' version '1.8.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-td' version '1.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5'
2023-11-23 15:58:45 +0100 [info]: gem 'fluentd' version '1.11.2'
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:45 +0100 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:45 +0100 [warn]: this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:45 +0100 [info]: using configuration file: <ROOT>
  <source>
    @type tail
    path "/tmp/encoding-test.log"
    tag "encoding"
    read_from_head true
    encoding "UTF-8"
    <parse>
      @type "grok"
      grok_failure_key "grokfailure"
      custom_pattern_path "/home/td-agent/patterns.conf"
      unmatched_lines
      <grok>
        pattern "%{CUSTOM_LOG_FAILS:message}"
      </grok>
      <grok>
        pattern "%{CUSTOM_LOG_WORKS:message}"
      </grok>
    </parse>
  </source>
  <match encoding>
    @type stdout
  </match>
</ROOT>
2023-11-23 15:58:45 +0100 [info]: starting fluentd-1.11.2 pid=715 ruby="2.7.1"
2023-11-23 15:58:45 +0100 [info]: spawn command to main:  cmdline=["/opt/td-agent/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/bin/fluentd", "--config", "/home/td-agent/fluentd.conf", "--under-supervisor"]
2023-11-23 15:58:45 +0100 [info]: adding match pattern="encoding" type="stdout"
2023-11-23 15:58:45 +0100 [info]: adding source type="tail"
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:46 +0100 [warn]: #0 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:46 +0100 [warn]: #0 this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:46 +0100 [info]: #0 starting fluentd worker pid=720 ppid=715 worker=0
2023-11-23 15:58:46 +0100 [info]: #0 following tail of /tmp/encoding-test.log
2023-11-23 15:58:46.005131856 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005146527 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005152826 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005157747 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46 +0100 [info]: #0 fluentd worker is now running worker=0

Additional details

If I set both encoding parameters to UTF-8, I get a warning in the Fluentd logs, but the special characters are displayed correctly.
I don't know if this is the proper way to handle special characters, since I get a warning. Shouldn't this warning be changed to info?

Configuration

    @type tail
    path "/tmp/encoding-test.log"
    tag "encoding"
    read_from_head true
    from_encoding "UTF-8"
    encoding "UTF-8"

Warning

2023-11-23 14:44:12 +0100 [warn]: #0 fluent/log.rb:348:warn: 'encoding' and 'from_encoding' are same encoding. No effect

Output

2023-11-23 14:44:12.044957269 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 14:44:12.044962081 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}

Documentation not clear or wrong

Alternatively, Fluentd may be working as intended and the documentation is either unclear or wrong.

gadiego92 changed the title from "Is encoding value being used as the Documentation states ?" to "Is the encoding parameter being used as the Documentation states ?" on Nov 23, 2023
Member

ashie commented Dec 8, 2023

Thanks for your report!
Obviously the documentation is wrong.

  • The first bullet is incorrect: when only the encoding parameter is set, in_tail tries to convert the input string from ASCII-8BIT to encoding.
    • Ruby tries to convert the original string from ASCII-8BIT to UTF-8 before converting it to encoding.
  • The second bullet is correct.
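
One possible reading of the sub-bullet above is that Ruby's transcoder pivots through UTF-8. The following minimal sketch inspects the conversion path with Ruby's standard `Encoding::Converter.search_convpath`; the ISO-8859-1 target is only an example chosen for illustration and does not come from this report.

```ruby
# Conversion steps Ruby would take from binary to a non-Unicode target.
# ISO-8859-1 is an arbitrary example target for this sketch.
path = Encoding::Converter.search_convpath("ASCII-8BIT", "ISO-8859-1")
path.each { |src, dst| puts "#{src.name} -> #{dst.name}" }

# A non-ASCII byte already fails in the first step (ASCII-8BIT to UTF-8),
# before the target encoding is ever reached.
begin
  "Z\xC3\xBCrich".b.encode("ISO-8859-1", "ASCII-8BIT")
  failed_step = nil
rescue Encoding::UndefinedConversionError => e
  failed_step = [e.source_encoding, e.destination_encoding]
  puts "failed: #{e.message}"
end
```

When the target itself is UTF-8, the path has a single step, so only one conversion takes place.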

ashie transferred this issue from fluent/fluentd on Dec 8, 2023
Author

gadiego92 commented Dec 8, 2023

What do you mean when you say the following ?

  • Ruby tries to convert the original string from ASCII-8BIT to UTF-8 before converting it to encoding.

Do you mean there are two encoding processes?
If I'm not mistaken, both from_encoding and encoding default to ASCII-8BIT, so by default the encode function is not called.
