Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EmojiFilter doesn't work on strings that don't contain HTML #133

Closed
wideopenspaces opened this issue Jul 15, 2014 · 19 comments
Closed

EmojiFilter doesn't work on strings that don't contain HTML #133

wideopenspaces opened this issue Jul 15, 2014 · 19 comments

Comments

@wideopenspaces
Copy link

When I pass this string...

"I can do this.\r\n:scream: Juice 3: Whoa, that's a LOT of cayenne!"

...to a pipeline containing EmojiFilter, it does not replace the emoji-cheat-sheet code with the Emoji as expected.

I tracked the problem down to here:

irb(main):204:0> doc.search('text()')
=> []

What does happen is that the DocumentFragment in doc contains one child Nokogiri::XML::Text node, and doc.text contains the same text that html contains. So....

Armed with that knowledge, I made the following changes:

def call
- doc.search('text()').each do |node|
+ nodes(doc).each do |node|
    content = node.to_html
    next if !content.include?(':')
    next if has_ancestor?(node, %w(pre code))
    html = emoji_image_filter(content)
    next if html == content
    node.replace(html)
  end
  doc
end

# Look for text nodes in the DocumentFragment
# 
# If doc's text is the same as original string,
# just nab its children to get the proper nodes.
# Otherwise do a search for text nodes.
+ def nodes(doc)
+   doc.text == html ? doc.children : doc.search('text()')
+ end

... and that fixed it for me.

Anyone see any problems with that fix? If not, I'll work up a PR as soon as I can.

@jch
Copy link
Contributor

jch commented Jul 15, 2014

I wonder why doc is a DocumentFragment instead of a Nokogiri::HTML::Document. The line that parses it is https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline.rb#L53. When I search your sample with a Document, it works as expected:

irb(main):018:0> Nokogiri::HTML('hi').search("text()")
=> [#<Nokogiri::XML::Text:0x3fd03a089798 "hi">]

You implementation works, but I'd be worried about the performance of doc.text == html. Both to create a string object from the doc, and to compare it against the existing value. Another implementation would be to iterate through all the child nodes and only work upon text nodes:

doc.children.each do |node|
  next unless node.text?
  # snip...
end

Thanks for digging in on this bug. Could you open a PR with a test and we can continue the discussion from there?

@wideopenspaces
Copy link
Author

@jch I will dig deeper. When this wasn't working, I created a test pipeline with only EmojiFilter in it, so I know it wasn't any of the custom filters I built, but it's quite possible I did do something wrong.

If it's not an ID10T error on my part, I'll certainly work up a PR!

@jch
Copy link
Contributor

jch commented Jul 28, 2014

@wideopenspaces any luck?

@wideopenspaces
Copy link
Author

Work got in the way the last two weeks. I'll see if I can set aside a few hours this week to tackle this. Thanks for reminding me!

@Razer6
Copy link
Contributor

Razer6 commented Sep 8, 2014

@wickedshimmy Hitting the same issue. Had you any success?

@Razer6
Copy link
Contributor

Razer6 commented Sep 8, 2014

@jch Your approach works. I could provide PR if would acceptable.

@jch
Copy link
Contributor

jch commented Sep 8, 2014

@Razer6 👍 a PR would be awesome. I'd be happy to review and test it for compatibility.

@jch
Copy link
Contributor

jch commented Sep 15, 2014

Fixed by #146

@jch jch closed this as completed Sep 15, 2014
@aroben
Copy link
Contributor

aroben commented Sep 25, 2014

I ran into this problem too. I think some versions of libxml2 don't return top-level text nodes inside a DocumentFragment when using .search("text()"). I was finding that things worked fine on my Mac laptop, which as a new-ish version of libxml2, but not on a Linux server with an older version.

HTML::Pipeline normally avoids this by wrapping everything inside a <div> in PlainTextInputFilter. If you use PlainTextInputFilter this problem never occurs because there are no top-level text nodes.

@wideopenspaces Were you using PlainTextInputFilter? If you weren't, then you're probably opening yourself to XSS attacks or at least bad parsing/rendering (e.g., if your input string happens to contain HTML).

@aroben
Copy link
Contributor

aroben commented Sep 25, 2014

So I guess what I'm saying is that #146 seems unnecessary. It seems like the correct fix is "Use PlainTextInputFilter". (Unless you were using it, in which case you and I weren't seeing the same bug.)

@jch
Copy link
Contributor

jch commented Sep 25, 2014

Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to the EmojiFilter, but to all filters that work on text without a root node.

@aroben is there a downside to inlining PlainTextInputFilter behavior into all pipelines to avoid this gotcha? It would add a some overhead to all pipelines, but it feels like an implicit dependency as it is.

@aroben
Copy link
Contributor

aroben commented Sep 25, 2014

Well if you are in fact starting with HTML, not plaintext, you should not use PlainTextInputFilter.

@jch
Copy link
Contributor

jch commented Sep 25, 2014

Ya, then I guess you'd add unnecessary overhead. I'll document this better and revert #146 then.

cc @Razer6

@aroben
Copy link
Contributor

aroben commented Sep 25, 2014

Not just overhead. All your HTML would get escaped and thus rendered as plain text. I.e. you'd see HTML tags in your output.

@jch
Copy link
Contributor

jch commented Sep 25, 2014

Ooo ya. Good point.

@wideopenspaces
Copy link
Author

@aroben No, we are sanitizing separately. I will look into whether or not PlainTextFilter will work for us.

@aroben
Copy link
Contributor

aroben commented Sep 25, 2014

@wideopenspaces If you already have HTML-escaped text on your hands then you could manually wrap it in a <div> like PlainTextInputFilter does.

@wideopenspaces
Copy link
Author

Yep, that may be the best solution. Thanks for chiming in! And @Razer6 thanks for picking up my slack!

@Razer6
Copy link
Contributor

Razer6 commented Sep 27, 2014

HTML::Pipeline normally avoids this by wrapping everything inside a <div>

I proposed this solution also in #144. Didn't see that there is a dedicated filter doing that. Thanks for pointing out 👍

Eeeenteresting. I had forgotten about PlainTextInputFilter. I suppose this problem doesn't just apply to > the EmojiFilter, but to all filters that work on text without a root node.

This is right.

@jch @aroben Thanks for finding the correct solution 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants