Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Add option to allow users to turn off escaping special character #51

Open
ytmimi opened this issue Dec 26, 2022 · 6 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@ytmimi
Copy link
Contributor

ytmimi commented Dec 26, 2022

Summary

First, thanks for the project!

Currently, I'm trying to add support for formatting rust code blocks in markdown files to rustfmt (rust-lang/rustfmt#2036), and it's been nice to use cmark_resume_with_options to handle the markdown rendering. If you're curious you can check out my work on the proof of concept here.

Ideally the feature I'm trying to add to rustfmt would only format the content of code blocks and leave the rest of the file as is (apart from some nice standardization of newlines between markdown items that cmark_resume_with_options provides 😁).

My main concern is that linking to types like [`Vec`] when reformatted will turn into \[`Vec`\], and I'm not sure if that will prevent rustdoc from generating links when rustdoc renders the docs as HTML. The proof of concept linked above hacks around this (See the TypeLinkFormatter if you're curious about the workaround.).

I'm not sure if there are other scenarios where something the user typed would be escaped, but it would be great to have the option to turn off all escaping so that we can emit the same content the user originally wrote.

I'm not as versed in the world of generating markdown so please let me know if there are technical limitations that I don't fully understand.

Design

I think this could be implemented by adding another field to Options. Maybe call it escape_special_characters and set it to true by default.

Then we could make the following change to escape_leading_special_characters:

    fn escape_leading_special_characters<'a>(
        t: &'a str,
        is_in_block_quote: bool,
        options: &Options<'a>,
    ) -> Cow<'a, str> {

-       if is_in_block_quote || t.is_empty() {
+       if !options.escape_special_characters || is_in_block_quote || t.is_empty() {
            return Cow::Borrowed(t);
        }

        let first = t.chars().next().expect("at least one char");
        if options.special_characters().contains(first) {
            let mut s = String::with_capacity(t.len() + 1);
            s.push('\\');
            s.push(first);
            s.push_str(&t[1..]);
            Cow::Owned(s)
        } else {
            Cow::Borrowed(t)
        }
    }

Happy to open a PR for this!

@Byron Byron added enhancement New feature or request help wanted Extra attention is needed labels Dec 27, 2022
@Byron
Copy link
Owner

Byron commented Dec 27, 2022

Awesome! Yes, I see why this would be useful and I'd love to see this crate help making rustfmt better ❤️.

To me it seems absolutely fair to have a special option to support this requirement, and I wouldn't mind to keep adding those to make this crate more suitable for rustfmt.

A PR is definitely welcome, and you should find everything you need to write some unit tests for that as well.

Regarding the suggested solution, I'd hope we can get around adding an option at all (which admittedly might not be possible, but here I go): this crate should ideally not alter any input, and I wonder why it feels the need to escape these characters anyway - they have not been escaped before and should automatically not be escaped when printing them.

Doing so might be possible if the event in question provides enough information explicitly or implicitly, and it's definitely something you could study in greater detail one you have setup a test or two. Having an option for this specifically feels like a crutch that we can put on but only if there truly is no other way.

@ytmimi
Copy link
Contributor Author

ytmimi commented Dec 27, 2022

Thanks for the quick reply! I also appreciate your willingness to accommodating some of rustfmt's needs 😁.

I'll continue to look into what's going on with the special character escaping to see if we can avoid adding the additional option. I've written two test and here's what I've managed to figure out so far:

    #[test]
    fn rustdoc_type_link_shortcut_code_links() {
        assert_eq!(
            fmts("[`Vec`]"),
            (
                "\\[`Vec`\\]".into(),
                State {
                    newlines_before_start: 2,
                    ..Default::default()
                }
            )
        )
    }

And the input generates the following events:

Event: Start(Paragraph)
Event: Text(Borrowed("["))
Event: Code(Borrowed("Vec"))
Event: Text(Borrowed("]"))
Event: End(Paragraph)

When we modify the test to use \\['Vec'\\] as the input we surprisingly get the same events:

    #[test]
    fn start_with_escaped_rustdoc_type_link_shortcut_code_links() {
        assert_eq!(
            fmts("\\[`Vec`\\]"),
            (
                "\\[`Vec`\\]".into(),
                State {
                    newlines_before_start: 2,
                    ..Default::default()
                }
            )
        )
    }
Event: Start(Paragraph)
Event: Text(Borrowed("["))
Event: Code(Borrowed("Vec"))
Event: Text(Borrowed("]"))
Event: End(Paragraph)

If the Parser is able to produce the same events in either case do we need to escape the values at all?

For my own curiosity, I was wondering if you could help me better understand the comment in Options::special_characters. Why do those characters always need to be escaped?

  impl<'a> Options<'a> {
      pub fn special_characters(&self) -> Cow<'static, str> {
          // These always need to be escaped, even if reconfigured.
          const BASE: &str = "#\\_*<>`|[]";

When it comes to adding other features that I'd like to use for rustfmt would you prefer that I open an issue first so we can have a discussion about it or would you prefer I just submit a PR?

@Byron
Copy link
Owner

Byron commented Dec 28, 2022

Great research right there! It looks like with the current version of the parser escaping of these isn't necessary, which might mean we can just remove the [] portion and be done with it. When doing so, it might be interesting to check the baseline test and how many more or less tests pass with that.

cargo test spec -- --nocapture

# left: `649`,
# right: `425`', tests/spec.rs:92:5

Please feel free to submit a PR for making this work.

Why do those characters always need to be escaped?

It's a great question. In theory these have a special meaning and if they appear as text they must should be escaped. At least with [] it's flexible enough when parsed with the Rust parser. You can try and see what happens with other parsers as well.

* [hi]
* <hi>
* #hi#
* _hi_
* *hi*
* `hi`
* |hi|
  • [hi]
  • #hi#
  • hi
  • hi
  • hi
  • |hi|

It seems the formatting characters really have to be escaped along with <>, but the others are less sensitive to be interpreted, or need a more specific context to.

Maybe you can conjure up more tests that break the assumption that [] is always fine in that regard, and if there is no way to do that even better.


When it comes to adding other features that I'd like to use for rustfmt would you prefer that I open an issue first so we can have a discussion about it or would you prefer I just submit a PR?

Submitting a PR will be preferable even as it allows to look at facts via tests from the start. Ideally GitHub would allow to turn issues into PRs easily when it's not your own repository.

Screenshot 2022-12-28 at 07 13 30

If this shows up for you, you might be able to convert this issue right away, otherwise, please feel free to submit a PR for it and start with PRs in the future. Whatever suits you best.

@ytmimi
Copy link
Contributor Author

ytmimi commented Dec 28, 2022

I really appreciate the detailed explanation 🙏🏼. I'll continue to investigate and see what I can find, and I'll submit some PRs when I think they're ready for a review!

@pacak
Copy link

pacak commented Aug 26, 2023

Got bitten by the same issue, rustdoc doesn't like escapes.

@SichangHe
Copy link
Contributor

It is beneficial to preserve the original escapes (and not add more) if using as a formatter.

I guess, by comparing the length of Event::Text's content and the Range length, we should be able to tell whether the [] were originally escaped, or. But, this gets complicated because we have to then count other characters that could be escaped.

One idea would be at least to not escape Event::Text(Borrowed('a str)) because they are clearly not escaped originally.

SichangHe added a commit to SichangHe/Byron--pulldown-cmark-to-cmark that referenced this issue May 25, 2024
Byron added a commit that referenced this issue Jun 10, 2024
Option to preserve special character escapes (for #51)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants