-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decoding issue (Used in Mailpit) #340
Comments
I have finally worked out why and where it is happening (for reference), I just don't understand the reasoning. Basically the logic here states that the character set for content should be "auto detected" from the content itself if, among other things, the length of the content is greater than 100 characters. This 100 characters happened to be the difference in your working and broken examples. I'm not sure I agree with this logic, because, as far as I'm concerned, if an email part claims it is UTF-8-encoded then that is what should be used. There may be good reason the enmime relies more on auto-detection than what is stated in the email itself. @jhillyerd - are you able to shed any light on why this is the case? |
Thanks for looking into this, I suspected that chardet would be the cause but didn't have time right now to dig into it. #81 provides the original reasoning (and #132 sets the 100 rune threshold) -- essentially here is a lot of poorly encoded email out there, and enmime get used often in systems parsing mail archives and the like where the source material cannot be fixed. Spam is also often intentionally encoded poorly to try and bypass filters. Now that we support decoding options, it would probably make sense to allow the character set detection to be disabled entirely, and the 100 character threshold to be modified by the caller. |
Thanks for the response @jhillyerd - that makes a lot of sense. Ironically in my case (ie: Mailpit) I actually want it to decode with the supplied encoding (assuming there is one) to help identify issues (ie: the point of Mailpit). Detection would probably be helpful if none was provided, but that is really secondary. If an email says a part is What would you suggest is my best approach?
I'd rather not fork if I don't have to, but I also realise that your use-case for enmime differs from mine (you're doing your best to decode things despite what the encoding tells you, whereas I want it to break when it's incorrect) ;-) Open to suggestions / thoughts. |
@axllent the Options start out with Go's default value (ie false for bool), to maintain backwards compatibility, we'd want |
Thanks @jhillyerd, your direction here was very helpful. I have added a PR to add this option - hopefully it's OK? It seems to resolve the issue outlines above and, provided there is a character set in the message part, enforces decoding with whatever is set. I added a test too which works as expected. I had to add a couple of extra characters in the test content to beet the Please let me know if you require any other information and/or changes. Thanks. |
A tool we use at work, uses your library. It's creator forwarded me to your library, so here I am. The original issue is at axllent/mailpit#348
@axllent also commented a go example for smth, you might wanna check it out.
Basically, an email shown by this tool formats the text with wrong characters.
Here are a working case (any length shorter or equal to this will work)
example-works.eml.txt
And an example where anything longer or equal to this length (not sure if this is actually true) will break
example-broken.eml.txt
when being sent to any other mailcatcher software or opening in MS Outlook, it is displayed correctly
I hope this can help you identify the issue :)
The text was updated successfully, but these errors were encountered: