Decoding issue (Used in Mailpit) #340

D3strukt0r · 2024-08-18T14:54:40Z

A tool we use at work uses your library. Its creator forwarded me to your library, so here I am. The original issue is at axllent/mailpit#348

@axllent also commented with a Go example there; you might want to check it out.

Basically, an email shown by this tool is rendered with the wrong characters.

Here is a working case (any length shorter than or equal to this will work):

example-works.eml.txt

And here is an example where anything longer than or equal to this length (not sure if this is actually true) will break:

example-broken.eml.txt

When the same email is sent to any other mail catcher software or opened in MS Outlook, it is displayed correctly.

I hope this can help you identify the issue :)

axllent · 2024-08-27T05:53:35Z

I have finally worked out why and where it is happening (for reference); I just don't understand the reasoning.

Basically the logic here states that the character set for content should be "auto detected" from the content itself if, among other things, the length of the content is greater than 100 characters. That 100-character threshold happened to be the difference between your working and broken examples.

I'm not sure I agree with this logic because, as far as I'm concerned, if an email part claims it is UTF-8-encoded then that is what should be used. There may be a good reason why enmime relies more on auto-detection than on what is stated in the email itself. @jhillyerd - are you able to shed any light on why this is the case?

jhillyerd · 2024-08-27T15:58:40Z

Thanks for looking into this. I suspected that chardet would be the cause but didn't have time right now to dig into it. #81 provides the original reasoning (and #132 sets the 100 rune threshold). Essentially, there is a lot of poorly encoded email out there, and enmime often gets used in systems parsing mail archives and the like, where the source material cannot be fixed. Spam is also often intentionally encoded poorly to try to bypass filters.

Now that we support decoding options, it would probably make sense to allow the character set detection to be disabled entirely, and the 100 character threshold to be modified by the caller.

axllent · 2024-08-28T05:04:21Z

Thanks for the response @jhillyerd - that makes a lot of sense. Ironically, in my case (i.e. Mailpit) I actually want it to decode with the supplied encoding (assuming there is one) to help identify issues (i.e. the point of Mailpit). Detection would probably be helpful if none was provided, but that is really secondary. If an email says a part is ISO-8859-1 then I really want to decode it as that, regardless of whether the detection is 100% sure it's not.

What would you suggest is my best approach?

1. Wait for an update in enmime in which I can possibly set this via the decoding options, or
2. Create a "Mailpit fork" specifically to handle this

I'd rather not fork if I don't have to, but I also realise that your use-case for enmime differs from mine (you're doing your best to decode things despite what the encoding tells you, whereas I want it to break when it's incorrect) ;-) Open to suggestions / thoughts.

jhillyerd · 2024-08-28T15:58:56Z

@axllent the Part struct where this is happening already contains the Parser with the options on it, so if you have the time to fork and fix, it shouldn't require much to convert it into a PR with an option to control chardet.

Options start out with Go's default value (i.e. false for bool), so to maintain backwards compatibility we'd want func DisableCharacterDetection(v bool) Option in options.go, and a similarly named field added to the Parser struct.
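A self-contained sketch of that shape is below. The `Parser`, `Option` and `NewParser` definitions are simplified stand-ins written for this example, and enmime's real types may be declared differently; only the `DisableCharacterDetection(v bool) Option` signature comes from the comment above.

```go
package main

import "fmt"

// Parser is a stand-in for enmime's Parser, reduced to the one field
// relevant to this discussion. The zero value (false) keeps character
// detection enabled, which preserves the existing behaviour.
type Parser struct {
	disableCharacterDetection bool
}

// Option configures a Parser. Assumption: modelled here as a plain
// function; the library's own Option type may look different.
type Option func(*Parser)

// DisableCharacterDetection returns an Option that turns charset
// detection off when v is true.
func DisableCharacterDetection(v bool) Option {
	return func(p *Parser) {
		p.disableCharacterDetection = v
	}
}

// NewParser applies the given options on top of the defaults.
func NewParser(opts ...Option) *Parser {
	p := &Parser{}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	fmt.Println(NewParser().disableCharacterDetection)
	fmt.Println(NewParser(DisableCharacterDetection(true)).disableCharacterDetection)
}
```

Naming the option "disable" rather than "enable" is what lets the zero value of the bool field keep today's behaviour for existing callers.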


axllent · 2024-08-29T05:40:31Z

Thanks @jhillyerd, your direction here was very helpful. I have added a PR to add this option - hopefully it's OK? It seems to resolve the issue outlined above and, provided there is a character set in the message part, enforces decoding with whatever is set.

I added a test too, which works as expected. I had to add a couple of extra characters to the test content to beat the minCharsetRuneLength threshold for the test ("1233"). I tried adding a random sentence, which caused the character detection (gogs/chardet) to correctly detect the content as UTF-8, so it seems this issue is quite an edge case. For this reason I did not add a test to compare with and without the disableCharacterDetection option set, as I don't want the tests breaking if there was an upstream change in gogs/chardet, for instance.
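The intended effect can be summarised with a small sketch. `chooseCharset` is a hypothetical function written for this thread, not code from the PR: `declared` is the charset named in the message part, and `detected` stands for whatever detection produced when it ran and was trusted (empty otherwise).

```go
package main

import "fmt"

// chooseCharset is a hypothetical illustration of the behaviour described
// above; it is not taken from enmime. When detection is disabled and the
// part declares a charset, the declared value always wins. Otherwise a
// non-empty detection result takes precedence over the declaration.
func chooseCharset(declared, detected string, disableDetection bool) string {
	if disableDetection && declared != "" {
		return declared
	}
	if detected != "" {
		return detected
	}
	return declared
}

func main() {
	// Detection enabled: a trusted detection result overrides the header.
	fmt.Println(chooseCharset("utf-8", "iso-8859-1", false))
	// Detection disabled: the declared charset is enforced.
	fmt.Println(chooseCharset("utf-8", "iso-8859-1", true))
}
```

In this sketch a part with no declared charset still falls back to detection even when the option is set, in line with the "provided there is a character set in the message part" condition.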

Please let me know if you require any other information and/or changes. Thanks.

rfay mentioned this issue Aug 19, 2024

Update Mailpit to latest version 1.20.2 ddev/ddev#6495

Closed

1 task

jhillyerd added enhancement decoding labels Aug 27, 2024

axllent added a commit to axllent/enmime that referenced this issue Aug 29, 2024

feat: add option to disable character detection

3986ec7

Resolves jhillyerd#340

axllent mentioned this issue Aug 29, 2024

feat: add option to disable character detection #342

Merged

jhillyerd closed this as completed in #342 Aug 31, 2024

jhillyerd closed this as completed in a9fae7a Aug 31, 2024
