Character encoding issue with autolinking #388

Open
whatupdave opened this issue Jun 10, 2014 · 16 comments


@whatupdave

Not sure what's causing this:

> ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('[email protected]ü')"
<p><a href="mailto:[email protected]%C3">[email protected]�</a>�</p>

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('[email protected]ü').inspect"
"<p><a href=\"mailto:[email protected]%C3\">[email protected]\xC3</a>\xBC</p>\n"

It's fine without autolinking:

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: false).render('[email protected]ü')"
<p>[email protected]ü</p>
@neilmiddleton

Yup, I've just hit this same issue.

@david50407

I've hit this same issue, too.

It splits my UTF-8 character: the first byte ends up inside the link and the remaining bytes are left outside it.

For example, [email protected]\u300D turns into <a href="mailto:[email protected]%E3">mailto:[email protected]\xE3</a>\x80\x8D
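To make the split concrete: in UTF-8, every byte after the first byte of a multibyte character has the bit pattern 10xxxxxx. The sketch below checks whether a cut at a given byte offset lands inside a character. It assumes well-formed UTF-8 input, and `cut_splits_utf8_char` and `run_split_examples` are hypothetical names for illustration, not anything from redcarpet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Continuation bytes in UTF-8 have the bit pattern 10xxxxxx. */
static int is_utf8_continuation(uint8_t c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper (not from redcarpet): nonzero when cutting
 * data[0..size) at byte offset pos would start the right-hand part
 * with a continuation byte, i.e. the cut falls inside a character.
 * Assumes the input is well-formed UTF-8. */
int cut_splits_utf8_char(const uint8_t *data, size_t size, size_t pos)
{
    return pos < size && is_utf8_continuation(data[pos]);
}

void run_split_examples(void)
{
    /* U+300D is encoded in UTF-8 as E3 80 8D. */
    const uint8_t bracket[] = { 0xE3, 0x80, 0x8D };

    assert(!cut_splits_utf8_char(bracket, sizeof bracket, 0)); /* before the char */
    assert(cut_splits_utf8_char(bracket, sizeof bracket, 1));  /* after E3: split */
    assert(cut_splits_utf8_char(bracket, sizeof bracket, 2));  /* after 80: split */
    assert(!cut_splits_utf8_char(bracket, sizeof bracket, 3)); /* after the char */
}
```

The output in this report corresponds to the cut at offset 1: \xE3 lands inside the link and \x80\x8D lands outside it.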

@ericgoodwin

I'm having the same issue as well. Any ideas of a fix for this @vmg?

@david50407

I think the problem is the same as #358

But why can a UTF-8 char be split...

@david50407

I traced the code and extracted the sd_autolink__email function into my own test code, but it works fine there.

It's so weird, because after copying the link into the buffer, sd_autolink__email calls the autolink callback and passes it that link.

So if sd_autolink__email were working correctly, the callback wouldn't receive the wrong link.

@david50407

BTW, Rinku has the same issue.
https://github.com/vmg/rinku

@david50407

I found the cause here: https://github.com/vmg/redcarpet/blob/master/ext/redcarpet/autolink.c#L227

    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];

        if (isalnum(c)) /* HERE */
            continue;

        if (c == '@')
            nb++;
        else if (c == '.' && link_end < size - 1)
            np++;
        else if (c != '-' && c != '_')
            break;
    }

When (\xE3\x80\x8D) is passed in, isalnum(0xE3) returns TRUE, so the first byte is treated as part of the address.

After I changed the if statement to if (isalnum(c) && c < 0x7f), it works fine.
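A minimal sketch of the same idea, for anyone who wants to try it outside the extension. What isalnum() returns for bytes above 0x7F depends on the platform and the current locale, so this version uses an ASCII-only check instead. `ascii_isalnum`, `email_scan_end` and `run_scan_examples` are hypothetical names, not redcarpet functions, and the nb/np counters from the original loop are left out because only the scan length matters here. The address in the example is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only replacement for isalnum(): never true for bytes >= 0x80,
 * whatever the process locale is. */
static int ascii_isalnum(uint8_t c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

/* Hypothetical helper (not from redcarpet): returns how many leading
 * bytes of data[0..size) the scanning loop accepts as part of an
 * email address once the alnum check is restricted to ASCII. */
size_t email_scan_end(const uint8_t *data, size_t size)
{
    size_t link_end;

    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];

        if (ascii_isalnum(c))
            continue;

        if (c == '@' || c == '-' || c == '_')
            continue;

        /* As in the original loop, a '.' is only accepted when it is
         * not the last byte of the input. */
        if (c == '.' && link_end < size - 1)
            continue;

        break;
    }

    return link_end;
}

void run_scan_examples(void)
{
    /* Made-up address "a@b.c" followed by U+300D (E3 80 8D). */
    const uint8_t input[] = { 'a', '@', 'b', '.', 'c', 0xE3, 0x80, 0x8D };

    /* The scan stops before the first byte of the multibyte character. */
    assert(email_scan_end(input, sizeof input) == 5);
    assert(!ascii_isalnum(0xE3));
}
```

With the ASCII-only check the link ends at the last ASCII byte, so the three bytes of the trailing character stay together outside the link.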

@ryrych

ryrych commented Mar 19, 2016

Not sure if it is redcarpet related (or upstream-kramdown), but I have the same problem when a header contains a UTF-8 character:

# dupa
## dópa
redcarpet --render with_toc_data test.md
<h1 id="dupa">dupa</h1>
<h2 id="d�pa">dópa</h2>

When jekyll makes a build I get the following exception:

Liquid Exception: invalid byte sequence in UTF-8 in feed.xml
jekyll 2.4.0 | Error:  invalid byte sequence in UTF-8

Normally I'd use an urlify implementation like this one: https://github.com/beastaugh/urlify, but it seems that the escaping is done in C… well, I don't have the slightest idea how to debug it ;)

@vmg hope it helps someway :)

catphish pushed a commit to krystal/redcarpet that referenced this issue Apr 26, 2017
@MadPositron

MadPositron commented Sep 6, 2018

I'm getting invalid byte sequence in UTF-8, trying to render markdown w/ redcarpet on the following char, but only if it's in the (bash) code block. Outside of the codeblock it works fine. The char is on the first line of the code block.

```bash
¢
```

@mdchaney

mdchaney commented Oct 9, 2019

I'm still getting this issue when using autolinking. UTF-8 characters are being split apart when they appear after a piece of text that will be autolinked. For instance:

Email me at “[email protected]

Is going to cause problems. Is there a fix for this?

@david50407

@mdchaney patch is already here... #463

@mdchaney

Okay, I'll just pull from repo then. Are there plans of another release?

@david50407

I have no idea whether this repo is going to merge the patch or not.
So just apply the patch yourself. lol

@mdchaney

Yeah, I realized that. Ugh. Looks like redcarpet has been abandoned - one of us should probably fork it and apply the outstanding merge requests. This particular one is a biggie.

@jstewart

@vmg - Any chance of a fix for this? This one is biting me as well. The bug can be easily reproduced like this:

renderer = Redcarpet::Render::HTML.new(with_toc_data: true)
md = Redcarpet::Markdown.new(renderer, no_intra_emphasis: true, tables: true, autolink: true, quote: true)
md.render("“[email protected]“")

# => "<p>“<a href=\"mailto:[email protected]%E2\">[email protected]\xE2</a>\x80\x9C</p>\n"
# irb(main):008:0> md.render("“[email protected]“").valid_encoding?
# => false

david50407 added a commit to CatCafe/redcarpet that referenced this issue Feb 18, 2022
@fwolfst

fwolfst commented Jul 19, 2024

Just checked why we are maintaining our own fork as well. @robin850 thanks for your last merges and releases. Do you see any chance of merging this one? Do you need any help?
