Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding issue (Used in Mailpit) #340

Closed
D3strukt0r opened this issue Aug 18, 2024 · 5 comments · Fixed by #342
Closed

Decoding issue (Used in Mailpit) #340

D3strukt0r opened this issue Aug 18, 2024 · 5 comments · Fixed by #342

Comments

@D3strukt0r
Copy link

D3strukt0r commented Aug 18, 2024

A tool we use at work, uses your library. It's creator forwarded me to your library, so here I am. The original issue is at axllent/mailpit#348

@axllent also commented a go example for smth, you might wanna check it out.

Basically, an email shown by this tool formats the text with wrong characters.

Here are a working case (any length shorter or equal to this will work)

example-works.eml.txt

image

And an example where anything longer or equal to this length (not sure if this is actually true) will break

example-broken.eml.txt

image

when being sent to any other mailcatcher software or opening in MS Outlook, it is displayed correctly

image

I hope this can help you identify the issue :)

@axllent
Copy link
Contributor

axllent commented Aug 27, 2024

I have finally worked out why and where it is happening (for reference), I just don't understand the reasoning.

Basically the logic here states that the character set for content should be "auto detected" from the content itself if, among other things, the length of the content is greater than 100 characters. This 100 characters happened to be the difference in your working and broken examples.

I'm not sure I agree with this logic, because, as far as I'm concerned, if an email part claims it is UTF-8-encoded then that is what should be used. There may be good reason the enmime relies more on auto-detection than what is stated in the email itself. @jhillyerd - are you able to shed any light on why this is the case?

@jhillyerd
Copy link
Owner

jhillyerd commented Aug 27, 2024

Thanks for looking into this, I suspected that chardet would be the cause but didn't have time right now to dig into it. #81 provides the original reasoning (and #132 sets the 100 rune threshold) -- essentially here is a lot of poorly encoded email out there, and enmime get used often in systems parsing mail archives and the like where the source material cannot be fixed. Spam is also often intentionally encoded poorly to try and bypass filters.

Now that we support decoding options, it would probably make sense to allow the character set detection to be disabled entirely, and the 100 character threshold to be modified by the caller.

@axllent
Copy link
Contributor

axllent commented Aug 28, 2024

Thanks for the response @jhillyerd - that makes a lot of sense. Ironically in my case (ie: Mailpit) I actually want it to decode with the supplied encoding (assuming there is one) to help identify issues (ie: the point of Mailpit). Detection would probably be helpful if none was provided, but that is really secondary. If an email says a part is ISO-8859-1 then I really want to decode it as that, regardless if the detection is 100% sure it's not.

What would you suggest is my best approach?

  1. Wait for an update in enmime in which I can possibly set this via the decoding options, or
  2. Create a "Mailpit fork" specifically to handle this

I'd rather not fork if I don't have to, but I also realise that your use-case for enmime differs from mine (you're doing your best to decode things despite what the encoding tells you, whereas I want it to break when it's incorrect) ;-) Open to suggestions / thoughts.

@jhillyerd
Copy link
Owner

@axllent the Part struct where this is happening already contains the Parser with the options on it, so if you have the time to fork and fix, it shouldn't require much to convert it into a PR with an option to control chardet.

Options start out with Go's default value (ie false for bool), to maintain backwards compatibility, we'd want func DisableCharacterDetection(v bool) Option in options.go, and a similarly named field added to the Parser struct.

@axllent
Copy link
Contributor

axllent commented Aug 29, 2024

Thanks @jhillyerd, your direction here was very helpful. I have added a PR to add this option - hopefully it's OK? It seems to resolve the issue outlines above and, provided there is a character set in the message part, enforces decoding with whatever is set.

I added a test too which works as expected. I had to add a couple of extra characters in the test content to beet the minCharsetRuneLength threshold for the test ("1233"). I tried adding a random sentence which caused the character detection (gogs/chardet) to correctly detect the content as UTF-8, so it seems this issue is quite an edge case. For this reason I did not add a test to compare with and without the disableCharacterDetection option set as I don't want the tests breaking if there was an upstream change in gogs/chardet for instance.

Please let me know if you require any other information and/or changes. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants