Fix accuracy of sourcepos for inline elements #452

digitalmoksha · 2024-08-10T00:26:33Z

This issue is for discussing how we might fix the accuracy of sourcepos for inline elements.

Simple recreation

A
 B

This should generate an AST that looks like this

<document sourcepos="1:1-2:2" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-2:2">
    <text sourcepos="1:1-1:1" xml:space="preserve">A</text>
    <softbreak sourcepos="1:2-1:2" />
    <text sourcepos="2:2-2:2" xml:space="preserve">B</text>
  </paragraph>
</document>

Instead, the last text has sourcepos="2:1-2:1"

Related issue Unexpected sourcepos and PR Inline sourcepos fixes which moved outputting of inline sourcepos into an option, experimental_inline_sourcepos

Two other examples captured in tests are

comrak/src/tests/core.rs

Lines 580 to 603 in 9f4d391

    
           // Ignored per https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960. 
        
           #[ignore] 
        
           #[test] 
        
           fn link_sourcepos_truffle_bergamot() { 
        
               assert_ast_match!( 
        
                   [], 
        
                   "- A\n   [![B](/B.png)](/B)\n", 
        
                   (document (1:1-2:21) [ 
        
                       (list (1:1-2:21) [ 
        
                           (item (1:1-2:21) [ 
        
                               (paragraph (1:3-2:21) [ 
        
                                   (text (1:3-1:3) "A") 
        
                                   (softbreak (1:4-1:4)) 
        
                                   (link (2:4-2:21) [ 
        
                                       (image (2:5-2:16) [ 
        
                                           (text (2:7-2:7) "B") 
        
                                       ]) 
        
                                   ]) 
        
                               ]) 
        
                           ]) 
        
                       ]) 
        
                   ]) 
        
               ); 
        
           }

and

comrak/src/tests/core.rs

Lines 532 to 555 in 9f4d391

    
           // Ignored per https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960. 
        
           #[ignore] 
        
           #[test] 
        
           fn link_sourcepos_truffle() { 
        
               assert_ast_match!( 
        
                   [], 
        
                   "- A\n[![B](/B.png)](/B)\n", 
        
                   (document (1:1-2:18) [ 
        
                       (list (1:1-2:18) [ 
        
                           (item (1:1-2:18) [ 
        
                               (paragraph (1:3-2:18) [ 
        
                                   (text (1:3-1:3) "A") 
        
                                   (softbreak (1:4-1:4)) 
        
                                   (link (2:1-2:18) [ 
        
                                       (image (2:2-2:13) [ 
        
                                           (text (2:4-2:4) "B") 
        
                                       ]) 
        
                                   ]) 
        
                               ]) 
        
                           ]) 
        
                       ]) 
        
                   ]) 
        
               ); 
        
           }

The text was updated successfully, but these errors were encountered:

digitalmoksha · 2024-08-10T00:29:19Z

Including the comment from @kivikakk in the PR, as a starting point of discussion:

How awkward. In all these cases:

- A
  B

- A
B

- A
   B

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

This will require something of a rearchitecting — sourcepos info needs to be associated and carried through from much earlier in the block processing stage, with the possibility of multiple spans per content block[1] — and I have no real inclination to do such work at present.

In light of all this, for now I'm going to adjust to match upstream (#301 (comment)) and not output inline-level sourcepos in HTML in the same cases as cmark. Block-level sourcepos will be unaffected, as it's not known to have issues.

[1] In the examples above, we still need to be passing through the same inline Subject::input of "A\nB", but with sourcepos marker collections [(1:3-1:4),(2:3-2:3)], [(1:3-1:4),(2:1-2:1)] and [(1:3-1:4),(2:4-2:4)] respectively (where (1:4) aligns with the \n which is turned into a soft break). Right now we only pass a single point (1:3) and infer the rest incrementally.

This changes a lot of the bookkeeping on the block parser side, and a complete rework of sourcepos handling on the inline parser side, so I can't even say for sure that I won't hit an even bigger roadblock. Also worth noting in conjunction is this comment about autolink sourcepos unsoundness.

digitalmoksha · 2024-08-10T00:42:57Z

I've started researching this a little, trying to get familiar again with how this works, and play around with a few things.

When I take simple text, such as

I get

    <text sourcepos="1:3-1:3" xml:space="preserve">A</text>

So it doesn't count any leading spaces. So it seems like we would want the same behavior for a following line in the same paragraph. Meaning

  A
   B

should yield

    <text sourcepos="1:3-1:3" xml:space="preserve">A</text>
    <softbreak sourcepos="1:4-1:4" />
    <text sourcepos="2:4-2:4" xml:space="preserve">A</text>

But right now it seems like we're always adding the block_offset for the second line, rather than calculating a new start position like we do for the first line. I'm wondering if we can fix that without larger changes 🤔

digitalmoksha · 2024-08-10T00:45:30Z

As an additional researching angle, I've been looking at the output from https://github.com/jgm/commonmark-hs.

For example

√ commonmark-hs/commonmark-cli (master= [])
> cabal run exes -- --sourcepos
Up to date
  A
   B

<p data-sourcepos="stdin@1:3-2:1;2:4-3:1"><span data-sourcepos="stdin@1:3-1:4">A</span>
<span data-sourcepos="stdin@2:4-2:5">B</span></p>

digitalmoksha · 2024-08-12T15:51:18Z

I know you said this

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

but it wasn't until I stepped through the code that it clicked into place. Any leading spaces have been stripped off, so we have no idea how many spaces on each line were stripped off.

kivikakk · 2024-08-12T16:14:30Z

Yes, indeed; the necessary bookkeeping must start at the block parser, since it's the one doing the normalising. Both parsers would need to be adjusted to communicate that information.

What's likely to be simplest would be having the block parser give the inline parser a list of tuples like (nchars, startrow, startcol) alongside the subject string, where the sum of nchars is the length of the subject string. Thus spans of text are associated with a start location, and the existing bookkeeping would handle the intermediate locations.

This would probably remove some work from the inline parser, such as the bookkeeping done in Subject::handle_newline, itself of course being the site of the current woes!

But the changes in the block parser itself to enable this would be significant, in a way that's hard to estimate without first looking carefully (and probably trying a speculative approach or two). It's not set up for this kind of thing at all — presumably why it wasn't added to cmark upstream in the first place.

digitalmoksha · 2024-08-12T17:38:36Z

My initial thought was maybe having a vector of offsets, say line_offsets, in the AST for each line that gets added in add_line in the block parser. It looks like spaces are only trimmed from the beginning of the line. Then as we process each line in the inline parser, rather than just relying on the block_offset and the column_offset, we can add in the correct offset for the line.

kivikakk · 2024-08-13T11:00:03Z

If it works, it works! 😸

digitalmoksha · 2024-08-13T22:46:41Z

I spun up #453 to play with some ideas

digitalmoksha · 2024-08-18T22:50:30Z

I think we can close this with #453 merged. Will open another issue if any other problems are found.

There may come a day when we can move inline sourcepos out of experimental status, but it certainly needs more battle testing before that can happen.

@kivikakk would you have time to release a 0.27.0 version? I can certainly do it if needed, but I defer to you.

kivikakk · 2024-08-19T10:32:36Z

Will do.

digitalmoksha mentioned this issue Aug 13, 2024

Track line offsets for better accuracy of inline sourcepos #453

Merged

digitalmoksha closed this as completed Aug 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix accuracy of sourcepos for inline elements #452

Fix accuracy of sourcepos for inline elements #452

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 12, 2024

kivikakk commented Aug 12, 2024

digitalmoksha commented Aug 12, 2024 •

edited

Loading

kivikakk commented Aug 13, 2024

digitalmoksha commented Aug 13, 2024

digitalmoksha commented Aug 18, 2024

kivikakk commented Aug 19, 2024

Fix accuracy of sourcepos for inline elements #452

Fix accuracy of sourcepos for inline elements #452

Comments

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 10, 2024

digitalmoksha commented Aug 12, 2024

kivikakk commented Aug 12, 2024

digitalmoksha commented Aug 12, 2024 • edited Loading

kivikakk commented Aug 13, 2024

digitalmoksha commented Aug 13, 2024

digitalmoksha commented Aug 18, 2024

kivikakk commented Aug 19, 2024

digitalmoksha commented Aug 12, 2024 •

edited

Loading