Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use UTF-8 encoding for Windows? #652

Open
brodycj opened this issue Mar 19, 2017 · 4 comments
Open

Use UTF-8 encoding for Windows? #652

brodycj opened this issue Mar 19, 2017 · 4 comments

Comments

@brodycj
Copy link
Contributor

brodycj commented Mar 19, 2017

I discovered that the Windows platform version uses the UTF-16le internal database encoding while the other platform versions use the UTF-8 database encoding. The results of using the HEX function on TEXT string values indicate that Android/iOS WebKit Web SQL uses the UTF-8 encoding as well. I found the following official descriptions:

It is very clear in those and other places that the necessary conversions are done automatically and there should be no difference between UTF-8 and UTF-16 database encoding at the API level. However I discovered some hidden Gotchas:

  • The result of using the sqlite HEX function with a string value is different depending on which internal database encoding is used.
  • According to http://www.sqlite.org/pragma.html#pragma_encoding:
    • It is not possible to change the internal encoding of an existing sqlite database.
    • There is no way to ATTACH to a database created with a different encoding.
  • According to https://www.sqlite.org/version3.html it should be possible for a developer to store and retrieve TEXT strings with ISO-8859 (Latin-1) encoded characters in case of UTF-8 database encoding.

From some research I discovered that it is generally more efficient to store the data in UTF-8 format:

For the reasons above I think it would be beneficial to fix the Windows version to use the UTF-8 encoding by default. (The easy way is to use PRAGMA encoding right after opening the database.) The user can then change the internal database encoding using PRAGMA encoding before writing any data. (See http://www.sqlite.org/pragma.html#pragma_encoding.)

ADDITIONAL IMPORTANT READING: http://sqlite.1065341.n5.nabble.com/UTF-16-API-a-second-class-citizen-td46048.html links to https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ which looks like essential reading for all serious sqlite users.

@brodycj
Copy link
Contributor Author

brodycj commented Nov 8, 2017

@brodycj
Copy link
Contributor Author

brodycj commented Dec 12, 2017

General solution:

  • convert database file name from UTF-16 wstring to UTF-8 std::string (to get UTF-8 char *) needed to call sqlite3_open (or sqlite3_open_v2) instead of sqlite3_open16, as needed to create the database with UTF-8 TEXT encoding if it does not already exist ref: https://www.sqlite.org/c3ref/open.html

How to convert the database file name:

@brodycj
Copy link
Contributor Author

brodycj commented Jun 5, 2018

The test suite was updated to check the result of PRAGMA encoding, currently expected to report encoding value of UTF-16le on Windows, UTF-8 on all other platforms. In addition there is a number of HEX encoding tests that expect result of SQLite3 HEX function to show UTF-16le encoding on Windows plugin and (WebKit) Web SQL on Android 4.1-4.3, UTF-8 otherwise.

@brodycj
Copy link
Contributor Author

brodycj commented Jun 5, 2018

I am a bit concerned by the thread at http://sqlite.1065341.n5.nabble.com/Filename-encoding-on-Unix-platforms-td102210.html, especially the comment by Richard Hipp in http://sqlite.1065341.n5.nabble.com/Filename-encoding-on-Unix-platforms-td102210.html#a102215.

My concern is that using various forms of conversion to deal with locale-specific or platform-specific character conversions, especially on the database file name, may lead to an unpleasant surprise down the road (when someone least expects it and is least prepared for it).

Further analysis may be needed to reduce the risk of such a surprise as much as possible on Windows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant