Failing to detect ISO-8859 encodings #8

donlencho · 2018-08-03T11:58:20Z

I get good results detecting Unicode encodings and Asian code pages, but really poor results with files in common European languages saved in the ISO-8859 family. These encodings are very common, and this problem makes compact_enc_det unusable for me.
Encoding is always detected as ASCII (and reliable is set to true) for these encodings.
ISO-8859-6 for Arabic is OK.
Am I the only one?
Thanks for letting me know, so I can check if there is a problem or just look for an alternative.
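
One quick way to confirm that an "ASCII" verdict is wrong for such a file is to check whether it contains any byte at or above 0x80, which pure ASCII never does. Here is a minimal Python sketch of that check; it does not use compact_enc_det, and the helper name `non_ascii_summary` is hypothetical, chosen only for illustration:

```python
from collections import Counter


def non_ascii_summary(data: bytes) -> Counter:
    """Count each byte value >= 0x80 found in data."""
    return Counter(b for b in data if b >= 0x80)


# "café €" encoded as ISO-8859-15: é becomes 0xE9 and € becomes 0xA4.
sample = "café €".encode("iso-8859-15")
summary = non_ascii_summary(sample)

# An empty summary means every byte is 7-bit, i.e. the data could be ASCII.
is_pure_ascii = not summary
print(is_pure_ascii, {hex(b): n for b, n in summary.items()})
```

If the summary is non-empty, the file cannot be plain ASCII, whatever the detector reports.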

ghost · 2018-09-12T14:14:51Z

Same problem here.

JinsukKim · 2018-09-13T01:15:14Z

Unfortunately, I don't have plans to improve the detection quality. Could you share the data you get poor results with? I can take a look. Hopefully there is a way to work around the issue.

Thanks.

ghost · 2018-09-13T07:07:07Z

See the attached file, which is encoded in Windows-1252 but detected as GB18030. Thanks for helping!

ansi.txt

donlencho · 2018-09-13T09:28:55Z

I'm attaching test files and the results I get. As you can see, I'm satisfied with the Unicode detections (and mostly with the Asian encodings), but really disappointed by ISO-8859, particularly for Western European languages (ISO-8859-1 and ISO-8859-15, for instance, which are really widespread; see
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html ).
Thank you in any case!

big5-hkscs.txt		→	BIG5_HKSCS, reliable: false		→ OK
big5.txt		→	BIG5, reliable: false			→ OK
BIG5.txt		→	BIG5, reliable: true			→ OK
euc-jp.txt		→	GB (=GB18030?), reliable: false		→ ~OK
euc-kr.txt		→	KSC (=?), reliable: false		→ OK
gbk.txt			→	GB (=GB18030?), reliable: false		→ OK
IBM855.txt		→	CP-1256, reliable: false		→ Not OK
ISO-8859-15-CRLF.srt	→	ASCII, reliable: true			→ Not OK
ISO-8859-15 euro.txt	→	ASCII, reliable: true			→ Not OK
ISO-8859-15 petit test.txt	→	CP1250, reliable: true		→ Not OK
ISO-8859-15.srt		→	ASCII, reliable: true			→ Not OK
ISO-8859-1.srt		→	ASCII, reliable: true			→ Not OK
ISO-8859-6.srt		→	Arabic, reliable: true			→ OK
shift_jis.txt		→	SJC, reliable: true			→ OK
UTF16BE.srt		→	UTF16BE, reliable: false		→ OK
UTF16LE.srt		→	UTF16LE, reliable: false		→ OK
UTF-7.txt		→	ASCII-7 bits, reliable: true		→ OK
UTF8BOM.srt		→	UTF8, reliable: true			→ OK
utf-8 CN.txt		→	UTF8, reliable: true			→ OK
UTF8CRLF.srt		→	UTF8, reliable: true			→ OK
UTF8CR.srt		→	UTF8, reliable: true			→ OK
UTF8LF.srt		→	UTF8, reliable: true			→ OK

encodings.tar.gz
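
For context on why the Western European cases are hard to tell apart: ISO-8859-1, ISO-8859-15 and Windows-1252 agree on most byte values and differ in only a few positions, so a detector has little evidence to work with. The sketch below uses Python's standard codecs (not compact_enc_det) to show a few of those positions; the probe bytes and the helper name `decode_table` are my own choices for illustration:

```python
# Three bytes where the encodings disagree, plus 0xE9 where they all agree.
PROBE_BYTES = [0x80, 0xA4, 0xBD, 0xE9]
CODECS = ["iso-8859-1", "iso-8859-15", "cp1252"]


def decode_table(probe=PROBE_BYTES, codecs=CODECS):
    """Map each probe byte to the character it decodes to under each codec."""
    table = {}
    for b in probe:
        table[b] = {c: bytes([b]).decode(c) for c in codecs}
    return table


table = decode_table()
for b, row in table.items():
    print(f"0x{b:02X}", {c: repr(ch) for c, ch in row.items()})
```

For example, 0xA4 is the euro sign in ISO-8859-15 but the generic currency sign in ISO-8859-1 and Windows-1252, while 0xE9 is "é" in all three.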

Lord-Kamina · 2019-01-26T03:53:04Z

Just on the off-chance: do you have any idea how this might be tackled, in case somebody else wants to take a crack at it?

hsivonen mentioned this issue Apr 23, 2019

Reassess the fallback encoding for Greek whatwg/html#4558

Open

ArkadiuszMichalski mentioned this issue Dec 20, 2021

Incorrect character set detection notepad-plus-plus/notepad-plus-plus#10916

Closed
