Skip to content

Conversation

@wiibaa
Copy link
Contributor

@wiibaa wiibaa commented Sep 26, 2014

As reported in LOGSTASH-2246, xml filter xpath fails with non-ascii content,
here is a test case and fix.

It is due to Nokogiri to_s calling to_xhml or to_xml without encoding parameters.
using the to_str method extract the content differently (hidden in the C or Java implementation) and do not suffer encoding issue

@elasticsearch-release
Copy link

Can one of the admins verify this patch?

@jordansissel
Copy link
Contributor

I can confirm that without the patch to filters/xml.rb, the test you added fails (as expected):

  1) LogStash::Filters::Xml parse correctly non ascii content with xpath "{"xmldata":"<foo><key>Français</key></foo>"}" when processed
     Failure/Error: pipeline.filter(e) {|new_event| results << new_event }
     Encoding::UndefinedConversionError:
       ""\xC3\xA7"" from UTF-8 to US-ASCII

@jordansissel
Copy link
Contributor

I am reviewing to see if the mysterious to_str is the correct way to do this.

@jordansissel
Copy link
Contributor

Nokogiri::XML is shorthand for Nokogiri::XML::Document.parse(), which has this method signature:

parse(string_or_io, url = nil, encoding = nil, options = ParseOptions::DEFAULT_XML, &block) Show Source
(docs here)

@jordansissel
Copy link
Contributor

>> t = Nokogiri::XML("<foo><key>Français</key></foo>").xpath("/foo/key/text()")
>> t.to_s
Encoding::UndefinedConversionError: ""\xC3\xA7"" from UTF-8 to US-ASCII

Definitely fails! Nasty.

I think the parse encoding is the way to fix this:

>> t = Nokogiri::XML("<foo><key>Français</key></foo>", nil, "UTF-8").xpath("/foo")[0].to_s
=> "<foo><key>Français</key></foo>"
>> t = Nokogiri::XML("<foo><key>Français</key></foo>", nil, "UTF-8").xpath("/foo/key/text()")[0].to_s
=> "Français"

jordansissel added a commit to jordansissel/logstash that referenced this pull request Sep 29, 2014
@jordansissel
Copy link
Contributor

jordansissel@d5dfc5a is my proposal to fix this. It appears to work and in testing this still passes your new specs. Woo!

jordansissel added a commit that referenced this pull request Sep 30, 2014
jordansissel added a commit that referenced this pull request Sep 30, 2014
@jordansissel
Copy link
Contributor

Closing in favor of #1803

@jordansissel
Copy link
Contributor

@wiibaa thank you for figuring out the problem and solving it! <3

@jordansissel jordansissel added this to the v1.5.0 milestone Sep 30, 2014
@jordansissel jordansissel self-assigned this Sep 30, 2014
@wiibaa
Copy link
Contributor Author

wiibaa commented Sep 30, 2014

@jordansissel thanks you but the credits should go to the original reporter in jira.
While you are toying with xml filter filter, any chance to look at #1255 ;)

@samoin
Copy link

samoin commented Sep 30, 2014

thanks you. I've tried to submit it to #1804.

------------------ 原始邮件 ------------------
发件人: "Wiibaa";[email protected];
发送时间: 2014年9月30日(星期二) 中午11:32
收件人: "elasticsearch/logstash"[email protected];

主题: Re: [logstash] filter/xml fix for LOGSTASH-2246: extract non-asciicontent with xpath (#1790)

@jordansissel thanks you but the credits should go to the original reporter in jira.
While you are toying with xml filter filter, any chance to look at #1255 ;)


Reply to this email directly or view it on GitHub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants