Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_html ignoring white-spaces and newlines for <pre><code> ... </pre></code> html #107

Open
qknight opened this issue Mar 4, 2025 · 2 comments

Comments

@qknight
Copy link

qknight commented Mar 4, 2025

When using parse_html() it seems that <pre><code> sections are only parsed correctly when no nested tag(s) are used but instead only text nodes. as soon as a nested html element as a <span>...</span> is used, it looses the formatting as spaces and newlines (probably tabs, too).

After checking that the parser works according to html specification at fefit/rphtml#4 I think that the error I'm seeing comes from process_node(...) now.

I had added these to html_parser_tests.rs

After a bunch of tests I discoverd:

  • when the first element in the <pre><code> is a string, formatting works correctly.
  • If the next element is a

    2

    (so any tag) it still works.
  • However, if the next one afterwards is a tag like

    3

    again, it fails to indent
  • but works with a text node. So the tags parser introduces a state somehow.

test 1

#[test]
fn test_pre_code() {
    let html = r#"<div><p> test </p>
<pre><code>
0
  1
  <p>foo</p>
  2
3</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
0
  1
  <p>foo</p>
  2
3</code></pre></div>"#;
    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

result:

cargo test -p sauron --test html_parser_test
test test_pre_code ... ok

test 2

#[test]
fn test_pre_code_2() {
    let html = r#"<pre><code>
0
<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>
</code></pre>"#;
let expected = r#"<pre><code>
<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>
</code></pre>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

result

cargo test -p sauron --test html_parser_test
test test_pre_code2 ... FAILED

failures:

---- test_pre_code_2 stdout ----
html: <pre><code>
0
<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>
</code></pre>
render: <pre><code>
0
<span>asdf</span><span>asdf</span><span>asdf</span></code></pre>
thread 'test_pre_code_2' panicked at tests/html_parser_test.rs:97:5:
assertion `left == right` failed
  left: "<pre><code>\n<span>asdf</span>\n  <span>asdf</span>\n  <span>asdf</span>\n</code></pre>"
 right: "<pre><code>\n0\n<span>asdf</span><span>asdf</span><span>asdf</span></code></pre>"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

test 3

#[test]
fn test_pre_code3() {
    let html = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre></div>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

result

cargo test -p sauron --test html_parser_test
test test_pre_code3 ... ok

test 4

#[test]
fn test_pre_code3_paragraphs_mix() {
    let html = r#"<div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre></div>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
    // right "<div><p> test </p><pre><code>\n  0\n  <p>1</p><p>2</p><p>3</p></code></pre></div>"
}

result

test test_pre_code3_paragraphs_mix ... ok
@qknight
Copy link
Author

qknight commented Mar 6, 2025

Sauron uses parse_html and with this input:

let html = r#"<div><p> test </p>
<pre><code>
0
  1
  <p>foo</p>
  2
3</code></pre>
</div>"#;

In this code passage:

pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let doc = Doc::parse(
        html,
        ParseOptions {
            case_sensitive_tagname: false,
            allow_self_closing: true,
            auto_fix_unclosed_tag: true,
            auto_fix_unexpected_endtag: true,
            auto_fix_unescaped_lt: true,
        },
    )?;
    println!("xxxx: {}", doc.render(&Default::default()));
    process_node(doc.get_root_node().borrow().deref())

xxx doc.render(&Default::default()));

<div><p> test </p>
<pre><code>
0
  1
  <p>foo</p>
  2
3</code></pre>
</div>

this looks good

node AST: parse_html output

node: Element(
    Element {
        namespace: None,
        tag: "div",
        attrs: [],
        children: [
            Element(
                Element {
                    namespace: None,
                    tag: "p",
                    attrs: [],
                    children: [
                        Leaf(
                            Text(
                                " test ",
                            ),
                        ),
                    ],
                    self_closing: false,
                },
            ),
            Element(
                Element {
                    namespace: None,
                    tag: "pre",
                    attrs: [],
                    children: [
                        Element(
                            Element {
                                namespace: None,
                                tag: "code",
                                attrs: [],
                                children: [
                                    Leaf(
                                        Text(
                                            "\n0\n  1\n  ",
                                        ),
                                    ),
                                    Element(
                                        Element {
                                            namespace: None,
                                            tag: "p",
                                            attrs: [],
                                            children: [
                                                Leaf(
                                                    Text(
                                                        "foo",
                                                    ),
                                                ),
                                            ],
                                            self_closing: false,
                                        },
                                    ),
                                    Leaf(
                                        Text(
                                            "\n  2\n3",
                                        ),
                                    ),
                                ],
                                self_closing: false,
                            },
                        ),
                    ],
                    self_closing: false,
                },
            ),
        ],
        self_closing: false,
    },
)

Took me quite some time to understand this issue, but here is what I know now:

  • the Element { namespace: None, tag: "code", is already wrong, it must be a text node and contain everything up to (and including the </code>
  • rphtml gets the output right but i'm not yet sure if it also parses the tags inside <pre>...</pre>

@qknight
Copy link
Author

qknight commented Mar 7, 2025

rphtml is at fault! It parses <p> as node_type: Tag and it should be of type node_type: Text because of the <pre>

#[test]
fn test_childs() -> HResult {
	let code = r##"<pre><p>aaa</p></pre>"##;
	let doc = parse(code)?;
	let root = doc.get_root_node();
	let childs = &root.borrow().childs;
	let childs = childs.as_ref().unwrap();

  for child in childs {
    println!(" - child: {:#?}\n", child);
  }

  assert_eq!(1,2);
	Ok(())
}
---- test_childs stdout ----
 - child: RefCell {
    value: Node {
        index: 0,
        node_type: Tag,
        begin_at: 0,
        end_at: 0,
        content: None,
        childs: Some(
            [
                RefCell {
                    value: Node {
                        index: 0,
                        node_type: Tag,
                        begin_at: 5,
                        end_at: 5,
                        content: None,
                        childs: Some(
                            [
                                RefCell {
                                    value: Node {
                                        index: 0,
                                        node_type: Text,
                                        begin_at: 8,
                                        end_at: 11,
                                        content: Some(
                                            [
                                                'a',
                                                'a',
                                                'a',
                                            ],
                                        ),
                                        childs: None,
                                        meta: None,
                                        end_tag: None,
                                        parent: true,
                                        root: true,
                                        document: false,
                                    },
                                },
                            ],
                        ),
                        meta: Some(
                            RefCell {
                                value: TagMeta {
                                    code_in: Wait,
                                    is_void: false,
                                    self_closed: false,
                                    auto_fix: false,
                                    name: [
                                        'p',
                                    ],
                                    attrs: [],
                                    lc_name_map: {},
                                },
                            },
                        ),
                        end_tag: Some(
                            RefCell {
                                value: Node {
                                    index: 0,
                                    node_type: TagEnd,
                                    begin_at: 11,
                                    end_at: 15,
                                    content: Some(
                                        [
                                            'p',
                                        ],
                                    ),
                                    childs: None,
                                    meta: None,
                                    end_tag: None,
                                    parent: true,
                                    root: false,
                                    document: false,
                                },
                            },
                        ),
                        parent: true,
                        root: true,
                        document: false,
                    },
                },
            ],
        ),
        meta: Some(
            RefCell {
                value: TagMeta {
                    code_in: Wait,
                    is_void: false,
                    self_closed: false,
                    auto_fix: false,
                    name: [
                        'p',
                        'r',
                        'e',
                    ],
                    attrs: [],
                    lc_name_map: {},
                },
            },
        ),
        end_tag: Some(
            RefCell {
                value: Node {
                    index: 0,
                    node_type: TagEnd,
                    begin_at: 15,
                    end_at: 21,
                    content: Some(
                        [
                            'p',
                            'r',
                            'e',
                        ],
                    ),
                    childs: None,
                    meta: None,
                    end_tag: None,
                    parent: true,
                    root: false,
                    document: false,
                },
            },
        ),
        parent: true,
        root: true,
        document: false,
    },
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant