Incompatible character encodings: ASCII-8BIT and UTF-8 in EmailReplyFilter #229

Closed
rymohr opened this issue Oct 23, 2015 · 4 comments

@rymohr
Contributor

rymohr commented Oct 23, 2015

We've been running into random encoding errors with email replies for a while now. While I still haven't been able to get to the bottom of it, I've at least been able to reduce it down to a simple test case:

# reply.txt
wtf…

> On Oct 22, 2015, at 2:36 PM, Ryan wrote:
> Test
> —
> Ryan

# test case
require "html/pipeline"

filter = HTML::Pipeline::EmailReplyFilter.new(IO.read("reply.txt"))
filter.call

Encoding::CompatibilityError:
  incompatible character encodings: ASCII-8BIT and UTF-8
# /Users/ryan/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/html-pipeline-2.2.1/lib/html/pipeline/email_reply_filter.rb:64:in `join'
# /Users/ryan/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/html-pipeline-2.2.1/lib/html/pipeline/email_reply_filter.rb:64:in `call'

The incoming emails are delivered by Mandrill as JSON and the content appears to be valid UTF-8. Not sure what's going on.

It's worth noting that the error goes away if I inline the reply in the test case instead of reading it from disk. In that case I get the following difference in byte sequence:

# inline string
119,116,102,226,128,166,10,10,32,32,32,32,62,32,79,110,32,79,99,116,32,50,50,44,32,50,48,49,53,44,32,97,116,32,50,58,51,54,32,80,77,44,32,82,121,97,110,32,119,114,111,116,101,58,10,32,32,32,32,62,32,84,101,115,116,10,32,32,32,32,62,32,226,128,148,10,32,32,32,32,62,32,82,121,97,110,10

# read from disk
119,116,102,226,128,166,10,10,62,32,79,110,32,79,99,116,32,50,50,44,32,50,48,49,53,44,32,97,116,32,50,58,51,54,32,80,77,44,32,82,121,97,110,32,119,114,111,116,101,58,10,62,32,84,101,115,116,10,62,32,226,128,148,10,62,32,82,121,97,110,10
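Comparing the two listings, the only difference is a run of 32,32,32,32 (four spaces) before each 62 (`>`) in the inline version. Here is a rough sketch of how the listings can be reproduced with `String#bytes`. It assumes the inline string was the same text with each quoted line indented four spaces; the variable names are made up for illustration.

```ruby
# Text as read from disk, matching the second byte listing above.
disk = "wtf\u2026\n\n" \
       "> On Oct 22, 2015, at 2:36 PM, Ryan wrote:\n" \
       "> Test\n" \
       "> \u2014\n" \
       "> Ryan\n"

# Assumption: the inline version is the same text with every quoted
# line indented by four spaces.
inline = disk.gsub(/^>/, "    >")

# Print both as comma-separated byte values, like the listings above.
puts inline.bytes.join(",")
puts disk.bytes.join(",")

# Stripping the indentation makes the two strings byte-for-byte identical.
same_after_dedent = inline.gsub(/^ {4}/, "").bytes == disk.bytes
puts same_after_dedent
```

If that assumption holds, the two test cases are not feeding the filter the same input: in the indented version the quoted lines no longer start with `>`.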
@jch
Contributor

jch commented Oct 27, 2015

I asked around and I suspect it may be due to email_reply_parser forcing encoding to binary.
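For anyone following along, the same class of error can be triggered in plain Ruby without either gem. This is only a sketch of the general mechanism, not the code path inside email_reply_parser: it joins one UTF-8 string with one binary-tagged string, both containing non-ASCII bytes. `try_join` is a hypothetical helper.

```ruby
# Two strings that both contain non-ASCII bytes but carry different encoding tags.
utf8_part   = "wtf\u2026"   # "wtf" plus an ellipsis, tagged UTF-8
binary_part = "> \u2014".b  # "> " plus a dash, same bytes but tagged ASCII-8BIT (binary)

# Hypothetical helper: returns the joined string, or the exception if joining fails.
def try_join(parts)
  parts.join("\n")
rescue Encoding::CompatibilityError => e
  e
end

result = try_join([utf8_part, binary_part])
puts result.class
puts result.message if result.is_a?(Exception)
```

The exact wording of the message varies between Ruby versions, but the exception class is the same one reported in the backtrace above.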

> We've been running into random encoding errors with email replies for a while now.

It seems strange that it would affect some emails, but not others.

cc @brianmario in case I missed anything.

@rymohr
Contributor Author

rymohr commented Oct 27, 2015

@jch thanks for asking around. Most of the replies we're working with only use standard ASCII chars, so there aren't any issues. This only seems to happen when replies include non-ASCII characters (multi-byte in UTF-8), like the ellipsis above.
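That would match how Ruby decides whether two strings can be combined: a binary-tagged string that is ASCII-only is still compatible with UTF-8, while one holding non-ASCII bytes is not. A small sketch using `Encoding.compatible?` follows. The re-tagging at the end is just one possible workaround under the assumption that the bytes really are valid UTF-8; it is not necessarily what github/email_reply_parser#36 does.

```ruby
utf8_text       = "wtf\u2026"    # UTF-8 with a non-ASCII character
ascii_binary    = "> Test".b     # binary-tagged, but ASCII-only
nonascii_binary = "> \u2014".b   # binary-tagged, contains non-ASCII bytes

# ASCII-only binary strings remain compatible with UTF-8...
ascii_ok = Encoding.compatible?(utf8_text, ascii_binary)
joined_ascii = [utf8_text, ascii_binary].join("\n")

# ...but binary strings with non-ASCII bytes are not (compatible? returns nil).
nonascii_ok = Encoding.compatible?(utf8_text, nonascii_binary)

# Possible workaround (assumption: the bytes are valid UTF-8):
# re-tag a copy as UTF-8 and verify it before joining.
retagged = nonascii_binary.dup.force_encoding(Encoding::UTF_8)
joined_retagged = [utf8_text, retagged].join("\n") if retagged.valid_encoding?

puts ascii_ok.inspect
puts nonascii_ok.inspect
puts joined_retagged.encoding
```

`force_encoding` only changes the tag, not the bytes, so the `valid_encoding?` check matters if the input might not actually be UTF-8.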

I think you're right about forcing the encoding to binary being the problem. What's the status on github/email_reply_parser#36?

@jch
Contributor

jch commented Oct 27, 2015

@rymohr nice. Thanks for finding the PR. I think it's stalled, but I'm not clear who the owner is on it.

@rymohr
Contributor Author

rymohr commented Aug 9, 2016

Fixed by github/email_reply_parser#36

rymohr closed this as completed Aug 9, 2016