Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new parser for parser_syslog #2599

Merged
merged 8 commits into from
Sep 9, 2019
Merged

Conversation

repeatedly
Copy link
Member

Signed-off-by: Masahiro Nakagawa [email protected]

Which issue(s) this PR fixes:
Fixes #2585

What this PR does / why we need it:
Changing regexp is difficult because it may break existing user environment, so
add non-regexp based parser to support more syslog message.
This parser is also 2x faster.

require 'benchmark/ips'
require 'fluent/engine'
require_relative 'lib/fluent/plugin/parser_syslog'

parser = Fluent::Plugin::SyslogParser.new
parser.configure(Fluent::Config::Element.new('ROOT', '', {'keep_time_key' => true}, []))
log = 'Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'
parser2 = Fluent::Plugin::SyslogParser.new
parser2.configure(Fluent::Config::Element.new('ROOT', '', {'with_priority' => true, 'keep_time_key' => true}, []))
log2 = '<1>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'

Benchmark.ips do |x|
  x.report "regexp" do
    parser.parse_plain(log) { |t, r| }
  end
  x.report "string" do
    parser.parse_rfc3164(log) { |t, r| }
  end
  x.report "regexp with pri" do
    parser2.parse_plain(log2) { |t, r| }
  end
  x.report "string with pri" do
    parser2.parse_rfc3164(log2) { |t, r| }
  end
end

Warming up --------------------------------------
              regexp    20.246k i/100ms
              string    42.531k i/100ms
     regexp with pri    18.060k i/100ms
     string with pri    37.281k i/100ms
Calculating -------------------------------------
              regexp    218.482k (± 1.7%) i/s -      1.093M in   5.005516s
              string    488.549k (± 1.5%) i/s -      2.467M in   5.050341s
     regexp with pri    195.467k (± 2.1%) i/s -    993.300k in   5.083917s
     string with pri    408.422k (± 3.7%) i/s -      2.050M in   5.028522s

Docs Changes:
Add new parameters and log example.

Release Note:
Same as title.

@repeatedly repeatedly added the enhancement Feature request or improve operations label Sep 3, 2019
@repeatedly repeatedly marked this pull request as ready for review September 3, 2019 11:55
lib/fluent/plugin/parser_syslog.rb Show resolved Hide resolved
if @with_priority
if text.start_with?(PRI_START_CHAR)
i = text.index('>'.freeze, 1)
pri = text.slice(1, i - 1).to_i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://tools.ietf.org/html/rfc3164#section-4.1.1

The PRI part MUST have three, four, or five characters

Need it check i is less than 5?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

What kind of format do we want to support?
If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

The problem is existing products uses rfc3164 for it. rfc3164 describes the collection of existing message format and BSD spec but many existing tools doesn't follow rfc3164 strictly. This parser can support strict rfc3164 format, so it doesn't block user's usecase for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

Is it okay?

if time_end == SPLIT_CHAR
time_str = text.slice(start, diff)
start += 16 # time + ' '
elsif time_end == '.'.freeze
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rfc3164 seems not to support subsecond time https://tools.ietf.org/html/rfc3164#section-4.1.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but some products send rfc3164 syslog message with subsecond time.
regexp version also supports subsecond time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, why don't you change the method name? this is not parse_rfc3164.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment it on above.

lib/fluent/plugin/parser_syslog.rb Outdated Show resolved Hide resolved
lib/fluent/plugin/parser_syslog.rb Outdated Show resolved Hide resolved
end
text.slice(i + 1, text.bytesize)
else
text.slice(i - diff, text.bytesize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, ident(TAG in rfc) does not exist, does message follow the rfc?

Copy link
Member Author

@repeatedly repeatedly Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure but some product sends such message. And this is good for capturing all MSG broken syslog messages.

lib/fluent/plugin/parser_syslog.rb Show resolved Hide resolved
msg.chomp!
record['message'] = msg

time = @time_parser.parse(time_str.squeeze(SPLIT_CHAR))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time_str was sliced with size(15), right? why do we need to call String#squeeze?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember the actual product name but users hit the problem with the number of ' '.
regexp version also calls squeeze for avoiding this problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value.gsub(/ +/, "") is introduced by #68 ce1c410
But we don't need using String#squeeze to parse time string including redundant spaces.
Now we can remove String#squeeze here.

>> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  05    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> strptime = Strptime.new("%b %d %H:%M:%S")
=> #<Strptime:0x000055e8adc01110>
>> strptime.exec("Feb 5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05     11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5     11:11:11")
=> 2019-02-05 11:11:11 +0900

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing regex does not specify the size of chars. but here specifies the fixed size(15).
So if the character which needs to call squeeze is passed to this filter plugin, the parser will fail to parse it in any case, right?.

host = text.slice(start, diff)
start += (diff + 1)

i = text.index(SPLIT_CHAR, start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the text is ${TIMESTAMP} ${HOST} test:val, this parser won't work. and I think test:val is valid syslog message.
ident should be test and msg should be val

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very difficult problem because rfc3164 doesn't define how ident and message are separated. For the experience, here is an actual examples:

HEADER ident msg  # No problem
HEADER ident: msg # No problem
HEADER ident:msg  # Existing products send this as only msg, not ident:msg.

So to support both cases, adding option is better.

test/plugin/test_parser_syslog.rb Outdated Show resolved Hide resolved
end

# header part
time_diff = 15 # skip Mmm dd hh:mm:ss
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about time_width or time_size or time_str_size?

cursor += 16 # time + ' '
elsif time_end == '.'.freeze
# support subsecond time
i = text.index(SPLIT_CHAR, time_diff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about time_end_pos or something instead of i?

return
end

i = text.index(SPLIT_CHAR, cursor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about host_end_pos instead of i?

yield nil, nil
return
end
host_diff = i - cursor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about host_width or host_size?

end
host_diff = i - cursor
host = text.slice(cursor, host_diff)
cursor += (host_diff + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant brackets?

else
if text[i - 1] == ':'.freeze
if text[i - 2] == ']'.freeze
j = text.index('['.freeze, cursor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about left_bracket_pos instead of j?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why .freeze for ":", "[", "]"? Shoud we extract them to constants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can extract it. This is used at only one point so I use literal instead.

if text[i - 1] == ':'.freeze
if text[i - 2] == ']'.freeze
j = text.index('['.freeze, cursor)
record['ident'] = text.slice(cursor, j - cursor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a local value ident_size = j - cursor?

if text[i - 2] == ']'.freeze
j = text.index('['.freeze, cursor)
record['ident'] = text.slice(cursor, j - cursor)
record['pid'] = text.slice(j + 1, i - j - 3) # remove '[' / ']:'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding local values pid_start_pos = j + 1 and pid_size = i - j - 3?
j may be changed to left_bracket_pos.

record['ident'] = text.slice(cursor, j - cursor)
record['pid'] = text.slice(j + 1, i - j - 3) # remove '[' / ']:'
else
record['ident'] = text.slice(cursor, i - cursor - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

msg.chomp!
record['message'] = msg

time = @time_parser.parse(time_str.squeeze(SPLIT_CHAR))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value.gsub(/ +/, "") is introduced by #68 ce1c410
But we don't need using String#squeeze to parse time string including redundant spaces.
Now we can remove String#squeeze here.

>> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  05    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> strptime = Strptime.new("%b %d %H:%M:%S")
=> #<Strptime:0x000055e8adc01110>
>> strptime.exec("Feb 5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05     11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5     11:11:11")
=> 2019-02-05 11:11:11 +0900

@parser.configure('parser_type' => 'string')
@parser.instance.parse("1990 Oct 22 10:52:01 TZ-6 scapegoat.dmz.example.org 10.1.2.32 sched[0]: That's All Folks!") { |time, record|
expected = {'host' => 'scapegoat.dmz.example.org', 'ident' => 'sched', 'pid' => '0', 'message' => "That's All Folks!"}
assert_not_equal(expected, record)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the result of parsing it? I think assert_equal should be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual generated record is not important for this case.
This test shows parser can't parse non-standard HEADER/MSG syslog message correctly, so assert_not_equal seems no problem for me.

@repeatedly
Copy link
Member Author

If no other concerns, I will merge this patch.

@repeatedly repeatedly merged commit 1e4e0c1 into master Sep 9, 2019
@repeatedly repeatedly deleted the add-new-parser-to-syslog branch September 9, 2019 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature request or improve operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fluentd input plugin syslog is truncating syslog entry
3 participants