UTF-16 without BOM not detected correctly #5

Open
GoogleCodeExporter opened this issue Mar 17, 2015 · 1 comment

Comments

@GoogleCodeExporter

What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Open it in a hex editor and remove the BOM. Yes, this purposely modifies 
the file to cause a problem, but I have encountered many UTF-16 encoded 
files lacking a BOM that other applications provided to me, and not having 
a BOM does not invalidate the file.
3. Test Ude.Example by passing the path to this BOM-less UTF-16LE file.
4. When UniversalDetector is called, the first check is to look for a BOM.
5. With no BOM present, the evaluation passes to the deeper analysis, which 
returns a result of encoding = ANSI 1252, which is wrong.
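
For anyone who wants to reproduce steps 1 and 2 without a hex editor, here is a minimal sketch in Python (not part of Ude). The function name `make_bomless_utf16le` and the sample text are made up for illustration. It relies on Python's `utf-16-le` codec, which writes no BOM, whereas the plain `utf-16` codec does.

```python
import codecs


def make_bomless_utf16le(text: str) -> bytes:
    """Encode text as UTF-16 little endian without a leading BOM."""
    # The explicit-endianness codec 'utf-16-le' never emits a BOM;
    # the generic 'utf-16' codec prepends one in native byte order.
    return text.encode("utf-16-le")


# Hypothetical sample text, used only for illustration.
sample = make_bomless_utf16le("plain ASCII text")

# For ASCII-range characters every second byte is 0x00, which is the
# pattern a BOM-less detector could look for.
high_bytes = sample[1::2]
```

Writing `sample` to disk with `open(path, "wb")` should give a file equivalent to the hex-edited one described above.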

What is the expected output? 

Expected output is encoding = "UTF-16"

What do you see instead?

"Charset: ASCII, confidence: 1"


What version of the product are you using? On what operating system?

Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit

Please provide any additional information below.

Larger files (1000 KB+) lacking the BOM tend to show a result of "Charset: 
windows-1252, confidence: 0.5"

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 10:52

@marcussacana

I tried this file and the detector said it is code page 1252, but it is UTF-16.
I think it is obviously UTF-16 because it contains a sequence of null bytes.
this is a test.txt
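
The null-byte observation can be turned into a rough pre-check that runs before the statistical analysis. The sketch below is Python and is not Ude's code. The function name `guess_bomless_utf16` and the `threshold` value are assumptions chosen for illustration. It only helps when the text is mostly Latin-range characters, since UTF-16 text in scripts such as CJK contains few null bytes.

```python
from typing import Optional


def guess_bomless_utf16(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess 'utf-16-le' or 'utf-16-be' from null-byte positions, else None.

    Heuristic only: the threshold is an assumed value, not a tuned one.
    """
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    if usable < 4:
        return None  # too little data to say anything

    pairs = usable // 2
    even_ratio = data[0:usable:2].count(0) / pairs  # nulls at offsets 0, 2, 4...
    odd_ratio = data[1:usable:2].count(0) / pairs   # nulls at offsets 1, 3, 5...

    # Latin-range UTF-16LE puts the zero high byte at odd offsets,
    # UTF-16BE puts it at even offsets.
    if odd_ratio >= threshold and even_ratio < threshold / 4:
        candidate = "utf-16-le"
    elif even_ratio >= threshold and odd_ratio < threshold / 4:
        candidate = "utf-16-be"
    else:
        return None

    # Reject data that is not valid UTF-16, e.g. unpaired surrogates.
    try:
        data[:usable].decode(candidate)
    except UnicodeDecodeError:
        return None
    return candidate
```

A detector could call something like this after the BOM check and fall back to its normal probing when the result is `None`.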
