parse_html ignoring white-spaces and newlines for <pre><code> ... </pre></code> html #107
Comments
Sauron uses parse_html, and with this input:
let html = r#"<div><p> test </p>
<pre><code>
0
1
<p>foo</p>
2
3</code></pre>
</div>"#;
In this code passage:
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let doc = Doc::parse(
        html,
        ParseOptions {
            case_sensitive_tagname: false,
            allow_self_closing: true,
            auto_fix_unclosed_tag: true,
            auto_fix_unexpected_endtag: true,
            auto_fix_unescaped_lt: true,
        },
    )?;
    println!("xxxx: {}", doc.render(&Default::default()));
    process_node(doc.get_root_node().borrow().deref())
}
The xxxx line, i.e. doc.render(&Default::default()), prints:
<div><p> test </p>
<pre><code>
0
1
<p>foo</p>
2
3</code></pre>
</div>
This looks good.
Node AST (parse_html output):
node: Element(
    Element {
        namespace: None,
        tag: "div",
        attrs: [],
        children: [
            Element(
                Element {
                    namespace: None,
                    tag: "p",
                    attrs: [],
                    children: [
                        Leaf(
                            Text(
                                " test ",
                            ),
                        ),
                    ],
                    self_closing: false,
                },
            ),
            Element(
                Element {
                    namespace: None,
                    tag: "pre",
                    attrs: [],
                    children: [
                        Element(
                            Element {
                                namespace: None,
                                tag: "code",
                                attrs: [],
                                children: [
                                    Leaf(
                                        Text(
                                            "\n0\n 1\n ",
                                        ),
                                    ),
                                    Element(
                                        Element {
                                            namespace: None,
                                            tag: "p",
                                            attrs: [],
                                            children: [
                                                Leaf(
                                                    Text(
                                                        "foo",
                                                    ),
                                                ),
                                            ],
                                            self_closing: false,
                                        },
                                    ),
                                    Leaf(
                                        Text(
                                            "\n 2\n3",
                                        ),
                                    ),
                                ],
                                self_closing: false,
                            },
                        ),
                    ],
                    self_closing: false,
                },
            ),
        ],
        self_closing: false,
    },
)
Took me quite some time to understand this issue, but here is what I know now:
rphtml is at fault! It parses
#[test]
fn test_childs() -> HResult {
    let code = r##"<pre><p>aaa</p></pre>"##;
    let doc = parse(code)?;
    let root = doc.get_root_node();
    let childs = &root.borrow().childs;
    let childs = childs.as_ref().unwrap();
    for child in childs {
        println!(" - child: {:#?}\n", child);
    }
    // Fails on purpose, presumably so that `cargo test` shows the captured println! output.
    assert_eq!(1, 2);
    Ok(())
}
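One way to turn the visual inspection above into an assertion is to concatenate all text leaves under an element and compare the result with the expected string. The sketch below does that for the `<code>` subtree from the AST dump earlier in this thread. The `Node` enum and the helpers `find`, `text_content`, `el` and `text` are hypothetical stand-ins, not the real Sauron or rphtml types.

```rust
// Hypothetical stand-in for a parsed tree; NOT the real Sauron or rphtml types.
#[derive(Debug)]
enum Node {
    Element { tag: String, children: Vec<Node> },
    Text(String),
}

// Small constructors to keep the sample tree readable.
fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

fn text(t: &str) -> Node {
    Node::Text(t.to_string())
}

/// Depth-first search for the first element with the given tag name.
fn find<'a>(node: &'a Node, wanted: &str) -> Option<&'a Node> {
    match node {
        Node::Element { tag, children } => {
            if tag.as_str() == wanted {
                Some(node)
            } else {
                children.iter().find_map(|c| find(c, wanted))
            }
        }
        Node::Text(_) => None,
    }
}

/// Concatenate every text leaf below `node`, in document order.
fn text_content(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { children, .. } => children.iter().map(text_content).collect(),
    }
}

/// The tree from the AST dump, with the text strings copied verbatim.
fn sample_tree() -> Node {
    el("div", vec![
        el("p", vec![text(" test ")]),
        el("pre", vec![el("code", vec![
            text("\n0\n 1\n "),
            el("p", vec![text("foo")]),
            text("\n 2\n3"),
        ])]),
    ])
}

fn main() {
    let tree = sample_tree();
    let code = find(&tree, "code").expect("sample tree contains <code>");
    // Debug-print so that newlines and spaces are visible as escapes.
    println!("{:?}", text_content(code));
}
```

If the string printed for a real parse result differs from the text between `<code>` and `</code>` in the input (with tags removed), whitespace was lost somewhere between parsing and node conversion.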
When using parse_html(), it seems that <pre><code> sections are only parsed correctly when they contain only text nodes and no nested tags. As soon as a nested HTML element such as <span>...</span> is used, the formatting is lost: spaces and newlines (probably tabs, too).
After checking at fefit/rphtml#4 that the parser works according to the HTML specification, I now think that the error I'm seeing comes from process_node(...).
I had added these to html_parser_tests.rs
After a bunch of tests I discovered:
1. If the content of <pre><code> is a string, formatting works correctly.
2. (so any tag) it still works.
3. again, it fails to indent
test 1
result:
test 2
result
test 3
result
test 4
result
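For illustration only, and separate from the test results listed above: the behaviour this issue asks for (whitespace inside `<pre>` kept verbatim, even next to nested elements) could be sketched as below. This is not Sauron's actual process_node. The `Node` enum, the `normalize` function, and the policy of dropping whitespace-only text nodes outside `<pre>` are all assumptions made for the example.

```rust
// Hypothetical tree type; the real parser and Sauron node types differ.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Element { tag: String, children: Vec<Node> },
    Text(String),
}

fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

fn text(t: &str) -> Node {
    Node::Text(t.to_string())
}

/// Copy the tree, dropping whitespace-only text nodes (assumed policy),
/// except under <pre>, where every text node is kept unchanged.
/// `in_pre` is passed down to all descendants, so a nested <code>,
/// <span> or <p> does not switch whitespace handling back on.
fn normalize(node: &Node, in_pre: bool) -> Option<Node> {
    match node {
        Node::Text(t) => {
            if in_pre || !t.trim().is_empty() {
                Some(node.clone())
            } else {
                None
            }
        }
        Node::Element { tag, children } => {
            // Once inside <pre>, stay in "preserve" mode for the whole subtree.
            let in_pre = in_pre || tag.eq_ignore_ascii_case("pre");
            let children: Vec<Node> = children
                .iter()
                .filter_map(|c| normalize(c, in_pre))
                .collect();
            Some(Node::Element { tag: tag.clone(), children })
        }
    }
}

fn main() {
    let input = el("div", vec![
        el("p", vec![text(" test ")]),
        text("\n"), // whitespace-only, outside <pre>: dropped
        el("pre", vec![el("code", vec![
            text("\n0\n  1\n  "),
            el("span", vec![text("x")]),
            text("\n"), // whitespace-only, inside <pre>: kept
            text("  2"),
        ])]),
    ]);
    println!("{:#?}", normalize(&input, false));
}
```

The point of the design is that the "inside `<pre>`" flag is inherited rather than recomputed per element, which is the case the nested-tag tests in this issue exercise.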