EmojiFilter doesn't work on strings that don't contain HTML #133

wideopenspaces · 2014-07-15T00:21:42Z

When I pass this string...

"I can do this.\r\n:scream: Juice 3: Whoa, that's a LOT of cayenne!"

...to a pipeline containing EmojiFilter, it does not replace the emoji-cheat-sheet code with the Emoji as expected.

I tracked the problem down to here:

irb(main):204:0> doc.search('text()')
=> []

What does happen is that the DocumentFragment in doc contains one child Nokogiri::XML::Text node, and doc.text contains the same text that html contains. So....

Armed with that knowledge, I made the following changes:

def call
- doc.search('text()').each do |node|
+ nodes(doc).each do |node|
    content = node.to_html
    next if !content.include?(':')
    next if has_ancestor?(node, %w(pre code))
    html = emoji_image_filter(content)
    next if html == content
    node.replace(html)
  end
  doc
end

# Look for text nodes in the DocumentFragment
# 
# If doc's text is the same as original string,
# just nab its children to get the proper nodes.
# Otherwise do a search for text nodes.
+ def nodes(doc)
+   doc.text == html ? doc.children : doc.search('text()')
+ end

... and that fixed it for me.

Anyone see any problems with that fix? If not, I'll work up a PR as soon as I can.


jch · 2014-07-15T04:42:55Z

I wonder why doc is a DocumentFragment instead of a Nokogiri::HTML::Document. The line that parses it is https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline.rb#L53. When I search your sample with a Document, it works as expected:

irb(main):018:0> Nokogiri::HTML('hi').search("text()")
=> [#<Nokogiri::XML::Text:0x3fd03a089798 "hi">]

Your implementation works, but I'd be worried about the performance of doc.text == html: it has to create a string object from the doc and then compare it against the existing value. Another implementation would be to iterate through all the child nodes and only work on text nodes:

doc.children.each do |node|
  next unless node.text?
  # snip...
end

Thanks for digging in on this bug. Could you open a PR with a test and we can continue the discussion from there?

wideopenspaces · 2014-07-16T04:23:30Z

@jch I will dig deeper. When this wasn't working, I created a test pipeline with only EmojiFilter in it, so I know it wasn't any of the custom filters I built, but it's quite possible I did do something wrong.

If it's not an ID10T error on my part, I'll certainly work up a PR!

jch · 2014-07-28T21:45:59Z

@wideopenspaces any luck?

wideopenspaces · 2014-07-28T21:47:57Z

Work got in the way the last two weeks. I'll see if I can set aside a few hours this week to tackle this. Thanks for reminding me!

Razer6 · 2014-09-08T10:35:04Z

@wickedshimmy Hitting the same issue. Had you any success?

Razer6 · 2014-09-08T10:46:55Z

@jch Your approach works. I could provide a PR if that would be acceptable.

jch · 2014-09-08T16:03:25Z

@Razer6 👍 a PR would be awesome. I'd be happy to review and test it for compatibility.


jch · 2014-09-15T16:58:14Z

Fixed by #146

aroben · 2014-09-25T14:25:28Z

I ran into this problem too. I think some versions of libxml2 don't return top-level text nodes inside a DocumentFragment when using .search("text()"). I was finding that things worked fine on my Mac laptop, which has a new-ish version of libxml2, but not on a Linux server with an older version.

HTML::Pipeline normally avoids this by wrapping everything inside a <div> in PlainTextInputFilter. If you use PlainTextInputFilter this problem never occurs because there are no top-level text nodes.

@wideopenspaces Were you using PlainTextInputFilter? If you weren't, then you're probably opening yourself to XSS attacks or at least bad parsing/rendering (e.g., if your input string happens to contain HTML).
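
As a rough illustration of the wrapping described above, the sketch below escapes plain text and puts it inside a single `<div>`. It is an assumption-laden stand-in, not the filter's actual source: `wrap_plain_text` is a hypothetical name, and `ERB::Util.html_escape` from the standard library is used in place of whatever escaping PlainTextInputFilter really performs.

```ruby
require 'erb'

# Hypothetical stand-in for the behavior described in this thread:
# escape the plain text, then wrap it in one root <div> so that no
# text node is left at the top level of the parsed fragment.
def wrap_plain_text(text)
  "<div>#{ERB::Util.html_escape(text)}</div>"
end

# Plain text input: the emoji code survives and now has a parent element.
plain = wrap_plain_text("Juice 3: Whoa, that's a LOT of cayenne! :scream:")

# HTML input: the tags are escaped, so they would be shown literally.
markup = wrap_plain_text('<b>bold</b>')

puts plain
puts markup
```

The second call shows why this approach only suits plain-text input: markup passed through it comes out as visible tags rather than rendered HTML.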

aroben · 2014-09-25T14:26:13Z

So I guess what I'm saying is that #146 seems unnecessary. It seems like the correct fix is "Use PlainTextInputFilter". (Unless you were using it, in which case you and I weren't seeing the same bug.)

jch · 2014-09-25T16:29:36Z

Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.

@aroben is there a downside to inlining PlainTextInputFilter behavior into all pipelines to avoid this gotcha? It would add some overhead to all pipelines, but it feels like an implicit dependency as it is.

aroben · 2014-09-25T16:32:48Z

Well if you are in fact starting with HTML, not plaintext, you should not use PlainTextInputFilter.

jch · 2014-09-25T16:44:31Z

Ya, then I guess you'd add unnecessary overhead. I'll document this better and revert #146 then.

cc @Razer6

aroben · 2014-09-25T17:03:19Z

Not just overhead. All your HTML would get escaped and thus rendered as plain text. I.e. you'd see HTML tags in your output.

jch · 2014-09-25T17:47:14Z

Ooo ya. Good point.

wideopenspaces · 2014-09-25T18:28:41Z

@aroben No, we are sanitizing separately. I will look into whether or not PlainTextInputFilter will work for us.

aroben · 2014-09-25T18:31:31Z

@wideopenspaces If you already have HTML-escaped text on your hands then you could manually wrap it in a <div> like PlainTextInputFilter does.
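
A minimal sketch of that manual wrap, assuming the text really has been HTML-escaped upstream. `wrap_escaped` is a hypothetical helper name, and the second half only illustrates what escaping the same text again would produce.

```ruby
require 'erb'

# Hypothetical helper: the input is assumed to be HTML-escaped already,
# so it is only wrapped in a root <div>, never escaped a second time.
def wrap_escaped(escaped_text)
  "<div>#{escaped_text}</div>"
end

escaped = 'Tom &amp; Jerry :scream:'

# Wrapping alone keeps the existing entities intact.
wrapped = wrap_escaped(escaped)

# Escaping again before wrapping would turn &amp; into &amp;amp;.
double = wrap_escaped(ERB::Util.html_escape(escaped))

puts wrapped
puts double
```

Wrapping without escaping is only safe when the escaping step upstream can be trusted; otherwise raw markup would pass straight into the pipeline.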

wideopenspaces · 2014-09-25T18:44:15Z

Yep, that may be the best solution. Thanks for chiming in! And @Razer6 thanks for picking up my slack!

Razer6 · 2014-09-27T08:02:35Z

> HTML::Pipeline normally avoids this by wrapping everything inside a <div>

I also proposed this solution in #144. I didn't see that there is a dedicated filter doing that. Thanks for pointing that out 👍

> Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.

This is right.

@jch @aroben Thanks for finding the correct solution 👍

Razer6 added a commit to Razer6/html-pipeline that referenced this issue Sep 10, 2014

Do a second xpath search to match elements without parent node

7b2b4ea

Fixes gjtorikian#133

Razer6 added a commit to Razer6/html-pipeline that referenced this issue Sep 10, 2014

Union two queries to match elements without tags

6c023a5

Fixes gjtorikian#133

Razer6 mentioned this issue Sep 14, 2014

Search for text nodes on DocumentFragments without root tags #146

Merged

jch closed this as completed Sep 15, 2014

jch mentioned this issue Sep 27, 2014

Readme tweaks: add FAQ section, refresh 3rd party extensions #150

Merged

jch mentioned this issue Oct 7, 2014

Revert "Search for text nodes on DocumentFragments without root tags" #158

Merged
