Failing to detect ISO-8859 encodings #8

Open
donlencho opened this issue Aug 3, 2018 · 5 comments

Comments

@donlencho

I get good results detecting Unicode encodings and Asian code pages, but very poor results for files in common European languages saved in the ISO-8859 family. These encodings are widespread, so this problem makes compact_enc_det unusable for me.
For these files the encoding is always detected as ASCII (and reliable is set to true).
ISO-8859-6 for Arabic is OK.
Am I the only one?
Please let me know, so I can check whether there is a problem or just look for an alternative.
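
A quick way to confirm that an "ASCII" verdict is wrong for a given file is to look for bytes outside the 7-bit range. The sketch below is independent of compact_enc_det: `has_non_ascii_bytes` is a hypothetical helper name, and the sample text is made up for illustration.

```python
def has_non_ascii_bytes(data: bytes) -> bool:
    """Return True if any byte is >= 0x80, i.e. the data cannot be pure ASCII."""
    return any(b >= 0x80 for b in data)


# "déjà vu" saved as ISO-8859-1: é is byte 0xE9 and à is byte 0xE0.
sample = "déjà vu".encode("iso-8859-1")

print(has_non_ascii_bytes(sample))      # True
print(has_non_ascii_bytes(b"deja vu"))  # False
```

If this check returns True for a file, an "ASCII, reliable: true" result cannot be correct for it.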

@ghost

ghost commented Sep 12, 2018

Same problem here.

@JinsukKim
Collaborator

Unfortunately, I don't have plans to improve the detection quality. Could you share the data you get poor results with? I can take a look. Hopefully there is something we can do to work around the issue.

Thanks.

@ghost

ghost commented Sep 13, 2018

See the attached file, which is encoded in Windows-1252 but detected as GB18030. Thanks for helping!

ansi.txt
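
For context on why such a mix-up is possible at all: the same bytes can be structurally valid in more than one encoding, so a detector has to rank candidates statistically. A minimal sketch using only Python's built-in codecs follows. `codecs_that_decode` is a hypothetical helper, and the sample word is an assumption chosen because its two consecutive 0xE9 bytes also form a valid two-byte GB18030 sequence. It does not reproduce what compact_enc_det does internally.

```python
def codecs_that_decode(data: bytes, codecs: list[str]) -> list[str]:
    """Return the codecs (in the given order) that decode `data` without error."""
    accepted = []
    for name in codecs:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


raw = "créée".encode("cp1252")  # b'cr\xe9\xe9e'
print(codecs_that_decode(raw, ["ascii", "utf-8", "cp1252", "gb18030"]))
```

Strict decoding rules out ASCII and UTF-8 here, but both cp1252 and gb18030 accept the bytes, so validity alone cannot pick the right one.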

@donlencho
Author

I'm attaching test files and the results I get. As you can see, I'm satisfied with the Unicode detection (and mostly with the Asian encodings), but really disappointed by the ISO results, particularly for Western European languages (ISO-8859-1 and ISO-8859-15, for instance, which are really widespread; see
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html ).
Thank you in any case!

big5-hkscs.txt              →  BIG5_HKSCS, reliable: false      →  OK
big5.txt                    →  BIG5, reliable: false            →  OK
BIG5.txt                    →  BIG5, reliable: true             →  OK
euc-jp.txt                  →  GB (=GB18030?), reliable: false  →  ~OK
euc-kr.txt                  →  KSC (=?), reliable: false        →  OK
gbk.txt                     →  GB (=GB18030?), reliable: false  →  OK
IBM855.txt                  →  CP-1256, reliable: false         →  Not OK
ISO-8859-15-CRLF.srt        →  ASCII, reliable: true            →  Not OK
ISO-8859-15 euro.txt        →  ASCII, reliable: true            →  Not OK
ISO-8859-15 petit test.txt  →  CP1250, reliable: true           →  Not OK
ISO-8859-15.srt             →  ASCII, reliable: true            →  Not OK
ISO-8859-1.srt              →  ASCII, reliable: true            →  Not OK
ISO-8859-6.srt              →  Arabic, reliable: true           →  OK
shift_jis.txt               →  SJC, reliable: true              →  OK
UTF16BE.srt                 →  UTF16BE, reliable: false         →  OK
UTF16LE.srt                 →  UTF16LE, reliable: false         →  OK
UTF-7.txt                   →  ASCII-7 bits, reliable: true     →  OK
UTF8BOM.srt                 →  UTF8, reliable: true             →  OK
utf-8 CN.txt                →  UTF8, reliable: true             →  OK
UTF8CRLF.srt                →  UTF8, reliable: true             →  OK
UTF8CR.srt                  →  UTF8, reliable: true             →  OK
UTF8LF.srt                  →  UTF8, reliable: true             →  OK

encodings.tar.gz
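
As background for the ISO-8859-1 and ISO-8859-15 files in the list above: the two encodings assign different characters to only a handful of byte values, so a file can only be told apart from the other encoding if one of those bytes occurs in it. The sketch below lists them using Python's built-in codecs. `differing_bytes` is a hypothetical helper, not part of compact_enc_det.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """Return the byte values that two single-byte codecs map to different characters."""
    return [
        b for b in range(256)
        if bytes([b]).decode(codec_a) != bytes([b]).decode(codec_b)
    ]


diff = differing_bytes("iso-8859-1", "iso-8859-15")
for b in diff:
    one = bytes([b]).decode("iso-8859-1")
    fifteen = bytes([b]).decode("iso-8859-15")
    print(f"0x{b:02X}: {one!r} in ISO-8859-1, {fifteen!r} in ISO-8859-15")
```

Among the differences is byte 0xA4, which is the euro sign in ISO-8859-15 and the generic currency sign in ISO-8859-1.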

@Lord-Kamina

Just on the off chance: do you have any idea how this might be tackled, in case somebody else wants to take a crack at it?
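
One possible stopgap, sketched outside the library rather than as a fix to it, is to post-check the detector's answer: when it reports ASCII but the data contains bytes >= 0x80, try strict UTF-8 and otherwise fall back to a legacy single-byte encoding. `refine_detection` is a hypothetical helper, and the windows-1252 default is an assumption suited to Western European text. Nothing here was proposed by the maintainers.

```python
def refine_detection(data: bytes, detected: str, fallback: str = "windows-1252") -> str:
    """Second-guess an 'ASCII' verdict when the data contains high bytes.

    `detected` is whatever name the detector reported; `fallback` is an
    assumed default for legacy Western European text.
    """
    if detected.upper() != "ASCII":
        return detected        # only the ASCII verdict is re-examined
    if all(b < 0x80 for b in data):
        return detected        # genuinely 7-bit, keep the verdict
    try:
        data.decode("utf-8")   # strict decode; legacy 8-bit text rarely passes
        return "UTF-8"
    except UnicodeDecodeError:
        return fallback


print(refine_detection("été".encode("iso-8859-15"), "ASCII"))  # windows-1252
print(refine_detection("été".encode("utf-8"), "ASCII"))        # UTF-8
print(refine_detection(b"plain text", "ASCII"))                # ASCII
```

This does not distinguish between members of the ISO-8859 family. It only avoids returning ASCII for data that cannot be ASCII.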
