Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RubyConf] Create scrubber for replacing double breakpoints into paragraph nodes #284

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

josecolella
Copy link

@josecolella josecolella commented May 8, 2024

Why?

  • Users would like to have functionality within loofah that provides the ability to replace double breakpoints into paragraph nodes

What?

  • Creating a new scrubber (:double_breakpoint) that replaces double breakpoints into paragraph nodes (thank you @flavorjones )

How did we test?

  • Created a new integration test case.

Important

There is a failing test right now where the expectation and actual result match except for newline characters. In discussing with @flavorjones, this might be related to minitest and how it formats html

(ruby) doc.xpath("/html/body").inner_html
"<p>Some text here in a logical paragraph.</p><p>Some more text, apparently a second paragraph.</p><p>Et cetera...</p>"
(ruby) BREAKPOINT_RESULT
"<p>Some text here in a logical paragraph.</p><p>Some more text, apparently a second paragraph.</p><p>Et cetera...</p>"
(ruby) BREAKPOINT_RESULT == doc.xpath("/html/body").inner_html

References #279

@josecolella josecolella changed the title Jc th add breakpoint scrubber [RubyConf] Create scrubber for replacing double breakpoints into paragraph nodes May 8, 2024
@josecolella josecolella marked this pull request as ready for review May 8, 2024 19:08
@flavorjones
Copy link
Owner

@torihuang @josecolella Thanks again for this PR!

Looking at the failing tests, it seems like the HTML4 and the HTML5 parser handle newlines differently, and that's causing a failure one way or the other depending on whether there are newlines in the expected result or not.

Demonstration:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "loofah"
end

input = "<html><body><h1>Hello</h1><div>World</div></body></html>"

doc = Nokogiri::HTML4::Document.parse(input)
doc.at_css("h1").add_next_sibling("<b>there</b>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1>\n" + "<b>there</b><div>World</div>"

doc = Nokogiri::HTML4::Document.parse(input)
doc.at_css("h1").add_next_sibling("<p>there</p>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1>\n" + "<p>there</p>\n" + "<div>World</div>"

doc = Nokogiri::HTML5::Document.parse(input)
doc.at_css("h1").add_next_sibling("<b>there</b>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1><b>there</b><div>World</div>"

doc = Nokogiri::HTML5::Document.parse(input)
doc.at_css("h1").add_next_sibling("<p>there</p>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1><p>there</p><div>World</div>"

In any case, now that I understand what's going on, I'll wrap this up by tomorrow!

@josecolella
Copy link
Author

Sounds good. Thanks @flavorjones

@josecolella
Copy link
Author

@flavorjones were you still interested in getting this merged?

@flavorjones
Copy link
Owner

@josecolella Totally! I've been really distracted the last few weeks, but I will absolutely circle back on this.

@josecolella
Copy link
Author

@flavorjones Any update here?

@flavorjones
Copy link
Owner

@josecolella Really sorry for the delay. This was harder than I expected to wrap up (at least in a way that I didn't think was gross). I will be spending a few weeks (at my new job!) on the sanitizer stack starting in late October and will do my best to get this merged then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants