Error "incompatible character encodings: UTF-8 and ASCII-8BIT" when combined with a rails app #9

Closed
oboxodo opened this issue Jul 7, 2015 · 28 comments

Comments

@oboxodo

oboxodo commented Jul 7, 2015

I think this might not be a commonmarker problem, BUT the error is not raised when using pandoc-ruby nor redcarpet, so it has something to do with commonmarker.

Here you can see a test run from the command line with both cmark and commonmarker and there's no problem:

$ cat test-curly-quotes.md
This curly quote “makes commonmarker throw an exception”.

$ cmark --version
cmark 0.20.0 - CommonMark converter
(C) 2014, 2015 John MacFarlane

$ cmark test-curly-quotes.md
<p>This curly quote “makes commonmarker throw an exception”.</p>

$ gem list --local commonmarker

*** LOCAL GEMS ***

commonmarker (0.2.0)

$ cat test-curly-quotes.md | ruby -r commonmarker -e "puts CommonMarker.render_html(gets)"
<p>This curly quote “makes commonmarker throw an exception”.</p>

That said, I'm testing different markdown parsers/renderers for our rails 4.1.12 (ruby 2.2.2) based app and I'm getting the following error:

ActionView::Template::Error (incompatible character encodings: UTF-8 and ASCII-8BIT):
    12:       - if user_signed_in?
    13:         .outline-content
    14:           = commonmarker_markdown(@quimbee_outline.source)
  app/views/outlines/show.html.slim:15:in `_app_views_outlines_show_html_slim___3317075370232322437_70158621096300'


  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_trace.html.erb (2.9ms)
  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_request_and_response.html.erb (1.7ms)
  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/template_error.html.erb within rescues/layout (69.1ms)

I have these helpers:

# encoding: UTF-8
module ApplicationHelper
  def commonmarker_markdown(text)
    CommonMarker.render_html(text, :smart).html_safe
  end

  def pandoc_markdown(text)
    converter = PandocRuby.new(text, from: :markdown, to: :html)
    converter.convert.html_safe
  end

  def redcarpet_markdown(text)
    # ...
  end
end

Changing the call to commonmarker_markdown to either pandoc_markdown or redcarpet_markdown renders the expected result with no errors.

It's not a DB (PostgreSQL) encoding problem either, as hardcoding the test phrase in place of the text variable (no DB involved) causes the same problem.

Any ideas about what could be happening?

@gjtorikian
Owner

Amazing write up, thank you. I'll take a look at this within the day. There might need to be a forced UTF-8 encoding.

@gjtorikian
Owner

I have bad news and good news.

The bad news is, I cannot get the exception to throw. I started a new rails project, jumped into console, and tried to see what would happen if I passed the same data:

irb(main):007:0> require 'commonmarker'
=> false
irb(main):008:0> CommonMarker::VERSION
=> "0.2.0"
irb(main):009:0> c = "This curly quote “makes commonmarker throw an exception”."
=> "This curly quote “makes commonmarker throw an exception”."
irb(main):010:0> CommonMarker.render_html(c, :smart).html_safe
=> "<p>This curly quote \xE2\x80\x9Cmakes commonmarker throw an exception\xE2\x80\x9D.</p>\n"

The "good" news is that there's definitely something weird going on with those escape codes. I would expect “...” to come back. It does worry me that I can't reproduce the exception, though.

I wonder if this is specific to ActionView::Template. The "quick" answer would be to append .force_encoding('UTF-8'):

irb(main):011:0> CommonMarker.render_html(c, :smart).force_encoding('utf-8')
=> "<p>This curly quote “makes commonmarker throw an exception”.</p>\n"

But that seems wrong/unfair/not the responsibility of the consumer.
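
For reference, the same class of error can be reproduced in plain Ruby, without commonmarker or Rails. This is a minimal sketch, assuming the renderer hands back a string whose bytes are valid UTF-8 but whose encoding tag is ASCII-8BIT (simulated here with `String#b`); the variable names are illustrative only, and the "buffer" is just a stand-in for whatever the template concatenates onto. Ruby raises only when both operands contain non-ASCII bytes, which may be why rendering alone in the console did not raise.

```ruby
# Simulated renderer output: UTF-8 bytes for curly quotes, but tagged
# ASCII-8BIT (binary). String#b returns such a copy.
rendered = "<p>\u201Cquoted\u201D</p>".b

# A UTF-8 buffer that already holds a non-ASCII character, standing in
# for a template's output buffer (an assumption about the Rails side).
buffer = "caf\u00E9 "

# An ASCII-only UTF-8 string is compatible with the binary string.
ascii_ok = "prefix: " + rendered

# Two strings with non-ASCII bytes and different encodings are not.
error = begin
  buffer + rendered
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Re-tagging the bytes as UTF-8 (no transcoding) makes them compatible.
fixed = buffer + rendered.dup.force_encoding("UTF-8")

puts rendered.encoding
puts error.class
puts fixed
```

Note that `force_encoding` only changes the tag on the string; it does not convert any bytes, so it is only safe when the bytes really are UTF-8.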

@gjtorikian
Owner

> But that seems wrong/unfair/not the responsibility of the consumer.

To finish my thought: probably this library should do the force_encoding. Could you verify that force_encoding fixes the problem for you? If so I'll do a patch release for this.

@oboxodo
Author

oboxodo commented Jul 7, 2015

You nailed it! It works. Thanks.

@oboxodo oboxodo closed this as completed Jul 7, 2015
@oboxodo
Author

oboxodo commented Jul 7, 2015

BTW... I'm using slim. Maybe it's related?

@gjtorikian gjtorikian mentioned this issue Jul 7, 2015
@duhaime

duhaime commented Mar 10, 2022

@gjtorikian we're seeing this same issue. Is there a way to traverse all nodes and convert each to utf8? Any pointers you can provide would be greatly appreciated!

@gjtorikian
Owner

Which version of commonmarker are you using?

@duhaime

duhaime commented Mar 12, 2022

@gjtorikian We're on version 0.23.4

@gjtorikian
Owner

So you can absolutely walk the AST tree: https://github.com/gjtorikian/commonmarker#example-walking-the-ast

But that's very slow/time-consuming, and ideally shouldn't be necessary. Are you able to share your markdown doc or create a small (failing) test to show the error?

@duhaime

duhaime commented Mar 14, 2022

@gjtorikian thank you for your response. I'm trying to paste a minimal example but it appears Github's editor is stripping out the problematic character from the following:

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end

You may need to insert the missing 0x200b character locally so as to achieve:

[Screenshot: Screen Shot 2022-03-13 at 8 50 47 PM]

We solved this problem with:

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.force_encoding("UTF-8").include?(text_node.string_content.force_encoding("UTF-8"))
      puts(node.url)
    end
  end
end

but it would be great if commonmarker gave us an option to treat the whole document's tree as utf-8, so we don't need to force all encodings. Would that be feasible?
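
Until such an option exists, one way to keep the re-tagging in a single place is a small helper. This is only a sketch: `as_utf8` is a hypothetical name, not part of commonmarker, and the two strings below merely stand in for `node.url` and `text_node.string_content`, assuming the former comes back tagged ASCII-8BIT and the latter UTF-8.

```ruby
# Hypothetical helper (not a commonmarker API): return a UTF-8-tagged copy
# of a string without mutating the original, replacing any byte sequences
# that are not valid UTF-8.
def as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub
end

# Stand-ins for node.url and text_node.string_content; assumed here to
# carry different encoding tags.
url  = "https://world.com\u200B".b
text = "https://world.com\u200B"

puts as_utf8(url).include?(as_utf8(text))
```

Unlike calling `force_encoding` directly on the node's strings, this leaves the originals untouched, which avoids surprising other code that reads the same values.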

@gjtorikian
Owner

Yes, it should be. I agree that forcing the encoding is not ideal!

@gjtorikian
Owner

@duhaime Hm. One thing that's different here is that when I run your code, with the encoded character placed, my tree doesn't recognize any link nodes at all:

#<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>]>
#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>
#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">

Could you change your sample code to

doc.walk do |node|
  puts node
  # ...
end

And list the walked nodes as I've done here?

@gjtorikian gjtorikian reopened this Mar 14, 2022
@duhaime

duhaime commented Mar 15, 2022

Hmm, the plot thickens!

I get:

=> "hello: <https://world.com​>"
=> #<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>7}, string_content="hello: ">, #<CommonMarker::Node(link): sourcepos={:start_line=>1, :start_column=>8, :end_line=>1, :end_column=>29}, url="https://world.com\xE2\x80\x8B", title="" children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>9, :end_line=>1, :end_column=>28}, string_content="https://world.com​">]>]>]>

=> ""
#<CommonMarker::Node:0x000000010e422980>
#<CommonMarker::Node:0x000000010e4daf30>
#<CommonMarker::Node:0x000000010e6ae730>
#<CommonMarker::Node:0x000000010e6ae140>
Traceback (most recent call last):
        3: from (irb):31
        2: from (irb):36:in `block in irb_binding'
        1: from (irb):36:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)

And:

> CommonMarker::VERSION
=> "0.23.4"

Why would these results look so different? I'm using the Rails console instead of irb above--is that relevant?

@gjtorikian
Owner

Ah, I think I misread your example. The string is literally <https://world.com​0x200b>, not <https://world.com​<0x200b>>. Is that right?

@duhaime

duhaime commented Mar 15, 2022

Oh no, sorry, it should be exactly as it appears in the image above (the latter in your comment above).

@gjtorikian
Owner

Strange!

What version of Ruby do you have running?

@duhaime

duhaime commented Mar 16, 2022

2.6.8 via rbenv:

(base) % ruby -v
ruby 2.6.8p205 (2021-07-07 revision 67951) [arm64-darwin21]
(base) % which ruby
/Users/doug/.rbenv/shims/ruby

@gjtorikian
Owner

I simply can't reproduce this, not even in CI, which runs Ruby 2.6.6 on Windows/Ubuntu/macOS. I booted a Rails 7 app to test the logic in the console, and it worked fine, too.

Just to be extra explicit, this is the code I'm using to test:

    str = "hello: <https://world.com<0x200b>>"
    doc = CommonMarker.render_doc(str, :DEFAULT)

    doc.walk do |node|
      puts node.type
    end

A couple of things to note:

  1. GitHub's editor isn't stripping out that character, so I'm not sure why it is for you
  2. This still doesn't detect a link, and causes no "incompatible character encoding" errors.

I'm afraid without more information I'm not sure what I can do to solve this.

@duhaime

duhaime commented Mar 16, 2022

Ah I think your example just needs to be updated. Your snippet has:

str = "hello: <https://world.com<0x200b>>"

In this case, your string literally contains the characters <0x200b> typed out, rather than the Unicode character that's causing the issue. I think we just need to update the string you're using. As it turns out, the string I posted initially (s = "hello: <https://world.com​>") does contain the character--you should see it if you paste it in your Rails console:

[Screenshot: Screen Shot 2022-03-16 at 7 14 24 PM]

You can see the codepoints of the string if you use s.unpack('U*') [and can combine the codepoints back into a string like so: s.unpack('U*').pack("U*")].
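
As a quick illustration of that check (a sketch with made-up variable names), the two candidate strings differ clearly once unpacked: a real zero-width space, written here with Ruby's `\u200B` escape, shows up as the single code point 0x200B, while typing out `<0x200b>` yields eight ordinary ASCII characters.

```ruby
with_zwsp    = "world.com\u200B>"     # real zero-width space before ">"
with_literal = "world.com<0x200b>>"   # the eight characters "<0x200b>" typed out

zwsp_points    = with_zwsp.unpack("U*")
literal_points = with_literal.unpack("U*")

# Only the first string contains the zero-width space code point.
puts zwsp_points.include?(0x200B)
puts literal_points.include?(0x200B)

# Round trip: packing the code points rebuilds the original string.
puts zwsp_points.pack("U*") == with_zwsp
```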

Does this help you reproduce the situation?

@gjtorikian
Owner

gjtorikian commented Mar 17, 2022

Got it. In Ruby the convention is to use \u to write a Unicode code point in hexadecimal:

irb(main):006:0> str = "hello: <https://world.com\u200b>"
=> "hello: <https://world.com​ >"

I can now reproduce the problem; now we're getting somewhere.

@gjtorikian
Owner

Oh, and what's the code snippet you're using to render the string? CommonMarker.render_doc(str, :DEFAULT).to_html ?

@duhaime

duhaime commented Mar 17, 2022

Yes, or CommonMarker.render_doc(str, :DEFAULT).to_plaintext

@gjtorikian
Owner

@duhaime Can you try pointing the gem to the encodaroni branch? I believe this will have the fix, and if so, I will push out a new bug release.

@duhaime

duhaime commented Mar 21, 2022

Hmm, the change looks good but I'm still getting the same error. This must be user error. Here's what I'm doing:

gem uninstall commonmarker
git clone https://github.com/gjtorikian/commonmarker
cd commonmarker && gem build commonmarker.gemspec
gem install commonmarker-0.23.4.gem
irb

Then in the irb console:

require 'commonmarker'

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end

Which throws:

Traceback (most recent call last):
       13: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `<main>'
       12: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `load'
       11: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
       10: from (irb):7
        9: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
        8: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
        7: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
        6: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
        5: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
        4: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
        3: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:16:in `walk'
        2: from (irb):11:in `block in irb_binding'
        1: from (irb):11:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)

Should I be doing something differently to test?

@gjtorikian
Owner

With the repo cloned, try:

  • script/bootstrap
  • bundle exec rake clean compile test

@duhaime

duhaime commented Mar 21, 2022

Interesting, I ran those steps on a fresh rbenv env, and I still get the same result. Do you get a different result with the code block I posted above?

@gjtorikian
Owner

> Interesting, I ran those steps on a fresh rbenv env, and I still get the same result. Do you get a different result with the code block I posted above?

Oh shoot, I do. Ok. I'll make time for this today.

@gjtorikian
Owner

Due to #186, walking over nodes has been removed in v1.0.0. Users can use https://github.com/gjtorikian/html-pipeline if they wish to iterate over HTML after the fact.
