UTF-16 without BOM not detected correctly #5

GoogleCodeExporter · 2015-03-17T15:17:01Z

What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Open the file in a hex editor and remove the BOM.  Yes, this deliberately 
modifies the file to cause a problem, but I have encountered many UTF-16 
encoded files lacking a BOM that were provided to me by other applications.  
The absence of a BOM does not invalidate the file.
3. Run Ude.Example, passing the path to this BOM-less UTF-16LE file.
4. When UniversalDetector is called, the first check is to look for a BOM.
5. With no BOM present, evaluation passes to the deeper analysis, which returns 
a result of encoding = ANSI 1252, which is wrong.
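
For anyone who wants to reproduce this without a hex editor, here is a minimal sketch in Python (not part of Ude) that writes a BOM-less UTF-16LE file. The helper name `write_bomless_utf16le`, the file name, and the sample text are assumptions made up for illustration; the resulting file can then be passed to Ude.Example as in step 3.

```python
import codecs
import os
import tempfile


def write_bomless_utf16le(path, text):
    """Write text as UTF-16 little endian with no byte order mark."""
    # The "utf-16-le" codec never emits a BOM, unlike plain "utf-16".
    with open(path, "wb") as f:
        f.write(text.encode("utf-16-le"))


# Hypothetical file name and content, chosen only for this example.
sample_path = os.path.join(tempfile.gettempdir(), "bomless_utf16le.txt")
write_bomless_utf16le(sample_path, "This is a test.\r\n")

with open(sample_path, "rb") as f:
    data = f.read()

# Confirm the file really starts with text bytes rather than FF FE.
has_bom = data.startswith(codecs.BOM_UTF16_LE)
print(has_bom, data[:4])
```

The file is written in binary mode so that the line ending is not translated by the platform.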

What is the expected output? 

Expected output is encoding = "UTF-16"

What do you see instead?

"Charset: ASCII, confidence: 1"


What version of the product are you using? On what operating system?

Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit

Please provide any additional information below.

Larger files (1000 KB+) lacking the BOM tend to produce the result "Charset: 
windows-1252, confidence: 0.5"

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 10:52


marcussacana · 2019-04-14T00:13:49Z

I tried this file and the detector reported it as codepage 1252, but it is UTF-16.
I think it's obviously UTF-16 because it contains a regular sequence of null bytes.
this is a test.txt
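
The null-byte observation can be turned into a rough heuristic. The sketch below is in Python and is not how Ude works; the function name `guess_bomless_utf16` and both thresholds are assumptions picked for illustration. It only helps for text that is mostly Latin-range characters, since UTF-16 text in other scripts (CJK, for example) may contain few or no null bytes.

```python
def guess_bomless_utf16(data, min_null_ratio=0.4, max_other_ratio=0.1):
    """Guess "UTF-16LE", "UTF-16BE", or None from where null bytes fall.

    Assumption: mostly Latin-range text, so one byte of each 16-bit
    code unit is 0x00. Both thresholds are arbitrary illustrative values.
    """
    # Drop a trailing odd byte so the buffer splits into whole 2-byte units.
    data = data[: len(data) - (len(data) % 2)]
    if len(data) < 4:
        return None

    units = len(data) // 2
    even_nulls = data[0::2].count(0)  # nulls at offsets 0, 2, 4, ...
    odd_nulls = data[1::2].count(0)   # nulls at offsets 1, 3, 5, ...

    # Little endian stores the low byte first, so the nulls sit at odd offsets.
    if odd_nulls / units >= min_null_ratio and even_nulls / units <= max_other_ratio:
        return "UTF-16LE"
    # Big endian stores the high byte first, so the nulls sit at even offsets.
    if even_nulls / units >= min_null_ratio and odd_nulls / units <= max_other_ratio:
        return "UTF-16BE"
    return None


sample = "This is a test."
le_guess = guess_bomless_utf16(sample.encode("utf-16-le"))
be_guess = guess_bomless_utf16(sample.encode("utf-16-be"))
ascii_guess = guess_bomless_utf16(sample.encode("ascii"))
print(le_guess, be_guess, ascii_guess)
```

A check like this would have to run before the single-byte analysis, because plain ASCII or windows-1252 probing treats the interleaved null bytes as harmless.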

GoogleCodeExporter added Type-Defect Priority-Medium auto-migrated labels Mar 17, 2015

304NotModified mentioned this issue Apr 7, 2017

UTF-16 without BOM not detected correctly CharsetDetector/UTF-unknown#5

Closed
