Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new parser for parser_syslog #2599

Merged
merged 8 commits into from
Sep 9, 2019
107 changes: 104 additions & 3 deletions lib/fluent/plugin/parser_syslog.rb
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,10 @@ class SyslogParser < Parser
config_param :message_format, :enum, list: [:rfc3164, :rfc5424, :auto], default: :rfc3164
desc 'Specify time format for event time for rfc5424 protocol'
config_param :rfc5424_time_format, :string, default: "%Y-%m-%dT%H:%M:%S.%L%z"
desc 'The parser type used to parse syslog message'
config_param :parser_type, :enum, list: [:regexp, :string], default: :regexp
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
desc 'support colonless ident in string parser'
config_param :support_colonless_ident, :bool, default: true

def initialize
super
Expand All @@ -50,10 +54,17 @@ def configure(conf)
@time_parser_rfc3164 = @time_parser_rfc5424 = nil
@time_parser_rfc5424_without_subseconds = nil
@support_rfc5424_without_subseconds = false
@regexp_parser = @parser_type == :regexp
@regexp = case @message_format
when :rfc3164
class << self
alias_method :parse, :parse_plain
if @regexp_parser
class << self
alias_method :parse, :parse_plain
end
else
class << self
alias_method :parse, :parse_rfc3164
end
end
@with_priority ? REGEXP_WITH_PRI : REGEXP
when :rfc5424
Expand Down Expand Up @@ -88,11 +99,16 @@ def parse_auto(text, &block)
@regexp = @with_priority ? REGEXP_RFC5424_WITH_PRI : REGEXP_RFC5424
@time_parser = @time_parser_rfc5424
@support_rfc5424_without_subseconds = true
parse_plain(text, &block)
else
@regexp = @with_priority ? REGEXP_WITH_PRI : REGEXP
@time_parser = @time_parser_rfc3164
if @regexp_parser
parse_plain(text, &block)
else
parse_rfc3164(text, &block)
end
end
parse_plain(text, &block)
end

def parse_plain(text, &block)
Expand Down Expand Up @@ -137,6 +153,91 @@ def parse_plain(text, &block)

yield time, record
end

SPLIT_CHAR = ' '.freeze
PRI_START_CHAR = '<'.freeze

def parse_rfc3164(text, &block)
pri = nil
start = 0
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
if @with_priority
if text.start_with?(PRI_START_CHAR)
i = text.index('>'.freeze, 1)
pri = text.slice(1, i - 1).to_i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://tools.ietf.org/html/rfc3164#section-4.1.1

The PRI part MUST have three, four, or five characters

Need it check i is less than 5?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

What kind of format do we want to support?
If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

The problem is existing products uses rfc3164 for it. rfc3164 describes the collection of existing message format and BSD spec but many existing tools doesn't follow rfc3164 strictly. This parser can support strict rfc3164 format, so it doesn't block user's usecase for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

Is it okay?

start = i + 1
else
yield nil, nil
return
end
end

# header part
diff = 15 # skip Mmm dd hh:mm:ss
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
time_end = text[start + diff]
if time_end == SPLIT_CHAR
time_str = text.slice(start, diff)
start += 16 # time + ' '
elsif time_end == '.'.freeze
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rfc3164 seems not to support subsecond time https://tools.ietf.org/html/rfc3164#section-4.1.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but some products send rfc3164 syslog message with subsecond time.
regexp version also supports subsecond time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, why don't you change the method name? this is not parse_rfc3164.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment it on above.

# support subsecond time
i = text.index(SPLIT_CHAR, diff)
time_str = text.slice(start, i - start)
start = i + 1
else
yield nil, nil
return
end

i = text.index(SPLIT_CHAR, start)
if i.nil?
yield nil, nil
return
end
diff = i - start
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
host = text.slice(start, diff)
start += (diff + 1)

i = text.index(SPLIT_CHAR, start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the text is ${TIMESTAMP} ${HOST} test:val, this parser won't work. and I think test:val is valid syslog message.
ident should be test and msg should be val

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very difficult problem because rfc3164 doesn't define how ident and message are separated. For the experience, here is an actual examples:

HEADER ident msg  # No problem
HEADER ident: msg # No problem
HEADER ident:msg  # Existing products send this as only msg, not ident:msg.

So to support both cases, adding option is better.

if i.nil?
yield nil, nil
return
end
diff = i - start

record = {'host' => host}
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
record['pri'] = pri if pri

# message part
msg = if text[i - 1] == ':'.freeze
if text[i - 2] == ']'.freeze
j = text.index('['.freeze, start)
record['ident'] = text.slice(start, j - start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://tools.ietf.org/html/rfc3164#section-4.1.3

The TAG is a string of ABNF alphanumeric characters that MUST NOT exceed 32 characters

So , Need it check the size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need because this is not strict parser and strict parser for rfc3164 is useless on the production.

record['pid'] = text.slice(j + 1, i - j - 3) # remove '[' / ']:'
else
record['ident'] = text.slice(start, i - start - 1)
end
text.slice(i + 1, text.bytesize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you use String#bytesize? if text is テキスト(multi bytes chars), text.bytesize is 12.

$ irb                                                                                                                                           
irb(main):001:0> text = "テキスト"
=> "テキスト"
irb(main):002:0> text.bytesize
=> 12

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slice is for getting remaining data from text.
We can't use slice(i + 1, -1), so I use bytesize for fast calculation.
Other good method? slice(i + 1..-1) is slower...

else
if @support_colonless_ident
if text[i - 1] == ']'.freeze
j = text.index('['.freeze, start)
record['ident'] = text.slice(start, j - start)
record['pid'] = text.slice(j + 1, i - j - 2) # remove '[' / ']'
else
record['ident'] = text.slice(start, i - start)
end
text.slice(i + 1, text.bytesize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

else
text.slice(i - diff, text.bytesize)
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, ident(TAG in rfc) does not exist, does message follow the rfc?

Copy link
Member Author

@repeatedly repeatedly Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure but some product sends such message. And this is good for capturing all MSG broken syslog messages.

end
end
msg.chomp!
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
record['message'] = msg

time = @time_parser.parse(time_str.squeeze(SPLIT_CHAR))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time_str was sliced with size(15), right? why do we need to call String#squeeze?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember the actual product name but users hit the problem with the number of ' '.
regexp version also calls squeeze for avoiding this problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value.gsub(/ +/, "") is introduced by #68 ce1c410
But we don't need using String#squeeze to parse time string including redundant spaces.
Now we can remove String#squeeze here.

>> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5 11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb   5    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> Time.strptime("Feb  05    11:11:11", "%b %d %H:%M:%S")
=> 2019-02-05 11:11:11 +0900
>> strptime = Strptime.new("%b %d %H:%M:%S")
=> #<Strptime:0x000055e8adc01110>
>> strptime.exec("Feb 5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05 11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      05     11:11:11")
=> 2019-02-05 11:11:11 +0900
>> strptime.exec("Feb      5     11:11:11")
=> 2019-02-05 11:11:11 +0900

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing regex does not specify the size of chars. but here specifies the fixed size(15).
So if the character which needs to call squeeze is passed to this filter plugin, the parser will fail to parse it in any case, right?.

record['time'] = time_str if @keep_time_key

yield time, record
end
end
end
end
113 changes: 94 additions & 19 deletions test/plugin/test_parser_syslog.rb
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,9 @@ def setup
}
end

def test_parse
@parser.configure({})
data('regexp' => 'regexp', 'string' => 'string')
def test_parse(param)
@parser.configure({'parser_type' => param})
@parser.instance.parse('Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test') { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected, record)
Expand All @@ -24,17 +25,28 @@ def test_parse
assert_equal("%b %d %H:%M:%S", @parser.instance.patterns['time_format'])
end

def test_parse_with_time_format
@parser.configure('time_format' => '%b %d %M:%S:%H')
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_time_format(param)
@parser.configure('time_format' => '%b %d %M:%S:%H', 'parser_type' => param)
@parser.instance.parse('Feb 28 00:00:12 192.168.0.1 fluentd[11111]: [error] Syslog test') { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected, record)
}
assert_equal('%b %d %M:%S:%H', @parser.instance.patterns['time_format'])
end

def test_parse_with_priority
@parser.configure('with_priority' => true)
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_subsecond_time(param)
@parser.configure('time_format' => '%b %d %H:%M:%S.%N', 'parser_type' => param)
@parser.instance.parse('Feb 28 12:00:00.456 192.168.0.1 fluentd[11111]: [error] Syslog test') { |time, record|
assert_equal(event_time('Feb 28 12:00:00.456', format: '%b %d %H:%M:%S.%N'), time)
assert_equal(@expected, record)
}
end

data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_priority(param)
@parser.configure('with_priority' => true, 'parser_type' => param)
@parser.instance.parse('<6>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test') { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected.merge('pri' => 6), record)
Expand All @@ -43,8 +55,9 @@ def test_parse_with_priority
assert_equal("%b %d %H:%M:%S", @parser.instance.patterns['time_format'])
end

def test_parse_without_colon
@parser.configure({})
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_without_colon(param)
@parser.configure({'parser_type' => param})
@parser.instance.parse('Feb 28 12:00:00 192.168.0.1 fluentd[11111] [error] Syslog test') { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected, record)
Expand All @@ -53,35 +66,85 @@ def test_parse_without_colon
assert_equal("%b %d %H:%M:%S", @parser.instance.patterns['time_format'])
end

def test_parse_with_keep_time_key
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_keep_time_key(param)
@parser.configure(
'time_format' => '%b %d %M:%S:%H',
'keep_time_key'=>'true',
'parser_type' => param
)
text = 'Feb 28 00:00:12 192.168.0.1 fluentd[11111]: [error] Syslog test'
@parser.instance.parse(text) do |time, record|
assert_equal "Feb 28 00:00:12", record['time']
end
end

def test_parse_various_characters_for_tag
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_various_characters_for_tag(param)
ident = '~!@#$%^&*()_+=-`]{};"\'/?\\,.<>'
@parser.configure({})
@parser.configure({'parser_type' => param})
@parser.instance.parse("Feb 28 12:00:00 192.168.0.1 #{ident}[11111]: [error] Syslog test") { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected.merge('ident' => ident), record)
}
end

def test_parse_various_characters_for_tag_with_priority
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_various_characters_for_tag_with_priority(param)
ident = '~!@#$%^&*()_+=-`]{};"\'/?\\,.<>'
@parser.configure('with_priority' => true)
@parser.configure('with_priority' => true, 'parser_type' => param)
@parser.instance.parse("<6>Feb 28 12:00:00 192.168.0.1 #{ident}[11111]: [error] Syslog test") { |time, record|
assert_equal(event_time('Feb 28 12:00:00', format: '%b %d %H:%M:%S'), time)
assert_equal(@expected.merge('pri' => 6, 'ident' => ident), record)
}
end

sub_test_case 'Check the difference of regexp and string parser' do
# examples from rfc3164
data('regexp' => 'regexp', 'string' => 'string')
test 'wrong result with no ident message by default' do |param|
@parser.configure('parser_type' => param)
@parser.instance.parse('Feb 5 17:32:18 10.0.0.99 Use the BFG!') { |time, record|
assert_equal({'host' => '10.0.0.99', 'ident' => 'Use', 'message' => 'the BFG!'}, record)
}
end

test "proper result with no ident message by 'support_colonless_ident false'" do
@parser.configure('parser_type' => 'string', 'support_colonless_ident' => false)
@parser.instance.parse('Feb 5 17:32:18 10.0.0.99 Use the BFG!') { |time, record|
assert_equal({'host' => '10.0.0.99', 'message' => 'Use the BFG!'}, record)
}
end

data('regexp' => 'regexp', 'string' => 'string')
test "both parsers can't parse broken syslog message" do |param|
@parser.configure('parser_type' => param)
if param == 'string'
ganmacs marked this conversation as resolved.
Show resolved Hide resolved
@parser.instance.parse("1990 Oct 22 10:52:01 TZ-6 scapegoat.dmz.example.org 10.1.2.32 sched[0]: That's All Folks!") { |time, record|
expected = {'host' => 'scapegoat.dmz.example.org', 'ident' => 'sched', 'pid' => '0', 'message' => "That's All Folks!"}
assert_not_equal(expected, record)
}
else
assert_raise(Fluent::TimeParser::TimeParseError) {
@parser.instance.parse("1990 Oct 22 10:52:01 TZ-6 scapegoat.dmz.example.org 10.1.2.32 sched[0]: That's All Folks!") { |time, record| }
}
end
end

data('regexp' => 'regexp', 'string' => 'string')
test "':' included message breaks regexp parser" do |param|
@parser.configure('parser_type' => param)
@parser.instance.parse('Aug 10 12:00:00 127.0.0.1 test foo:bar') { |time, record|
expected = {'host' => '127.0.0.1', 'ident' => 'test', 'message' => 'foo:bar'}
if param == 'string'
assert_equal(expected, record)
else
assert_not_equal(expected, record)
end
}
end
end

class TestRFC5424Regexp < self
def test_parse_with_rfc5424_message
@parser.configure(
Expand Down Expand Up @@ -273,10 +336,12 @@ def test_parse_with_rfc5424_message_both_timestamp
end

class TestAutoRegexp < self
def test_auto_with_legacy_syslog_message
data('regexp' => 'regexp', 'string' => 'string')
def test_auto_with_legacy_syslog_message(param)
@parser.configure(
'time_format' => '%b %d %M:%S:%H',
'message_format' => 'auto',
'parser_type' => param
)
text = 'Feb 28 00:00:12 192.168.0.1 fluentd[11111]: [error] Syslog test'
@parser.instance.parse(text) do |time, record|
Expand All @@ -286,11 +351,13 @@ def test_auto_with_legacy_syslog_message
assert_equal(Fluent::Plugin::SyslogParser::REGEXP, @parser.instance.patterns['format'])
end

def test_auto_with_legacy_syslog_priority_message
data('regexp' => 'regexp', 'string' => 'string')
def test_auto_with_legacy_syslog_priority_message(param)
@parser.configure(
'time_format' => '%b %d %M:%S:%H',
'with_priority' => true,
'message_format' => 'auto',
'parser_type' => param
)
text = '<6>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'
@parser.instance.parse(text) do |time, record|
Expand All @@ -300,11 +367,13 @@ def test_auto_with_legacy_syslog_priority_message
assert_equal(Fluent::Plugin::SyslogParser::REGEXP_WITH_PRI, @parser.instance.patterns['format'])
end

def test_parse_with_rfc5424_message
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_rfc5424_message(param)
@parser.configure(
'time_format' => '%Y-%m-%dT%H:%M:%S.%L%z',
'message_format' => 'auto',
'with_priority' => true,
'parser_type' => param
)
text = '<16>1 2017-02-06T13:14:15.003Z 192.168.0.1 fluentd - - - Hi, from Fluentd!'
@parser.instance.parse(text) do |time, record|
Expand All @@ -318,11 +387,13 @@ def test_parse_with_rfc5424_message
@parser.instance.patterns['format'])
end

def test_parse_with_rfc5424_structured_message
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_rfc5424_structured_message(param)
@parser.configure(
'time_format' => '%Y-%m-%dT%H:%M:%S.%L%z',
'message_format' => 'auto',
'with_priority' => true,
'parser_type' => param
)
text = '<16>1 2017-02-06T13:14:15.003Z 192.168.0.1 fluentd 11111 ID24224 [exampleSDID@20224 iut="3" eventSource="Application" eventID="11211"] Hi, from Fluentd!'
@parser.instance.parse(text) do |time, record|
Expand All @@ -337,12 +408,14 @@ def test_parse_with_rfc5424_structured_message
@parser.instance.patterns['format'])
end

def test_parse_with_both_message_type
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_both_message_type(param)
@parser.configure(
'time_format' => '%b %d %M:%S:%H',
'rfc5424_time_format' => '%Y-%m-%dT%H:%M:%S.%L%z',
'message_format' => 'auto',
'with_priority' => true,
'parser_type' => param
)
text = '<1>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'
@parser.instance.parse(text) do |time, record|
Expand Down Expand Up @@ -382,12 +455,14 @@ def test_parse_with_both_message_type
@parser.instance.patterns['format'])
end

def test_parse_with_both_message_type_and_priority
data('regexp' => 'regexp', 'string' => 'string')
def test_parse_with_both_message_type_and_priority(param)
@parser.configure(
'time_format' => '%b %d %M:%S:%H',
'rfc5424_time_format' => '%Y-%m-%dT%H:%M:%S.%L%z',
'with_priority' => true,
'message_format' => 'auto',
'parser_type' => param
)
text = '<6>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'
@parser.instance.parse(text) do |time, record|
Expand Down