Support for globalization invariant mode #2784
-
The second point certainly sounds like the expected behavior... If the user has explicitly opted into globalization invariant mode, it definitely makes sense for SqlDataReader to use the invariant culture. For the first point, an additional alternative to bundling mappings would be to allow globalization invariant mode only if the user's database is configured with an invariant locale on the SQL Server side (note that I know very little about this in SQL Server, so I may be getting things wrong). In other words, it may be OK to have globalization invariant mode work in the narrow cases where the database doesn't actually contain data that would require ICU (and to throw otherwise, similar to today). Otherwise I'd still be interested in exactly why SqlClient requires a code page here... At least on the PostgreSQL/Npgsql side, PostgreSQL simply sends data in some encoding (UTF8), and that's all Npgsql ever needs to know about.
-
I've been thinking about what SqlClient would require to support globalization invariant mode. We currently throw an exception if we detect this, but I think we can handle this in a way which enables support.
Background: Globalization invariant mode
This was first introduced in .NET Core 2.0, with its design documented here. When this is enabled, .NET Core runs without accessing the ICU information. This in turn means that these packages can be removed from the underlying containers, saving ~28MB. The Alpine and the chiselled Ubuntu images for .NET are published without these packages to reduce their size.
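As a rough illustration of what "enabled" means from a library's point of view, the sketch below checks the two configuration sources the runtime documents for this mode: the System.Globalization.Invariant AppContext switch and the DOTNET_SYSTEM_GLOBALIZATION_INVARIANT environment variable. The class and method names are made up for this example, and this is not necessarily how SqlClient performs its own detection; it only reports what was requested through configuration.

```csharp
using System;

// Hypothetical helper, not part of SqlClient.
public static class GlobalizationModeProbe
{
    // Documented runtime configuration knobs for invariant mode.
    private const string SwitchName = "System.Globalization.Invariant";
    private const string EnvVarName = "DOTNET_SYSTEM_GLOBALIZATION_INVARIANT";

    // Returns true if invariant mode was requested via configuration.
    public static bool IsInvariantModeRequested()
    {
        // Set from runtimeconfig.json (for example via the
        // InvariantGlobalization MSBuild property) or AppContext.SetSwitch.
        if (AppContext.TryGetSwitch(SwitchName, out bool enabled))
        {
            return enabled;
        }

        // Fall back to the environment variable ("1" or "true").
        var value = Environment.GetEnvironmentVariable(EnvVarName);
        return value == "1"
            || string.Equals(value, "true", StringComparison.OrdinalIgnoreCase);
    }
}
```

A caller could use GlobalizationModeProbe.IsInvariantModeRequested() to choose between an ICU-backed code path and a bundled-data code path instead of throwing.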
In .NET 6.0, a breaking change enabled case-insensitive comparisons for all Unicode-defined characters.
Globalization invariant mode has recently seen more use in ASP.NET Core: one PR enabled it by default in the API template, and this was then partially reverted because SqlClient didn't support it.
Background: SqlClient's compatibility with globalization invariant mode
As far as I can see, the first issue here was recorded in #81, and support was requested in #1249. That request was closed on the grounds that SqlClient needs the ICU data to correctly handle the decoding/encoding of data in remote collations.
Although that's broadly accurate, it's worth being precise. Text encoding and decoding works in globalization invariant mode, and various codepages function as users would expect. The base class for all codepage-based encodings loads certain encoding-based data from an embedded resource stream (here.)
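A minimal sketch of the point above, assuming a .NET Core / .NET 5+ target where the Windows code pages come from CodePagesEncodingProvider: given a code page number, decoding needs only the encoding tables, not culture data. The class name and the choice of code page 1251 are illustrative, not taken from SqlClient.

```csharp
using System.Text;

// Hypothetical demo class, not part of SqlClient.
public static class CodePageDecodingDemo
{
    static CodePageDecodingDemo()
    {
        // Makes the Windows code pages (1250-1258, 932, ...) available
        // through Encoding.GetEncoding on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes bytes stored in code page 1251 (Cyrillic).
    public static string DecodeWindows1251(byte[] bytes)
    {
        Encoding cyrillic = Encoding.GetEncoding(1251);
        return cyrillic.GetString(bytes);
    }
}
```

For example, the bytes CF F0 E8 E2 E5 F2 decode to "Привет" under code page 1251. The only culture-dependent step left is working out which code page number to ask for.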
Current state and proposed changes
At present, I think the only places SqlClient actually uses ICU data are:
- TdsParser.GetCodePage, which maps a Windows LCID to a code page
- FieldNameLookup.LinearIndexOf, which is used to allow clients of SqlDataReader to look up column names using the case, kana and width sensitivity rules of the database's LCID
I think we can make SqlClient compatible with globalization invariant mode by bundling the LCID-to-codepage mappings, and forcing SqlDataReader to use the invariant culture's case/kana/width sensitivity rules when in globalization invariant mode. There's some prior art here - we already special-case some cultures in TdsParser by mapping LCID 0x827 to the codepage for LCID 0x427 and hardcoding LCID 0x43F to codepage 1251.
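The column-name half of this could look roughly like the sketch below. It is not the existing FieldNameLookup code: the class name, the method shape, and the two-pass order (exact match first, then an insensitive match) are assumptions made for illustration. The only fixed idea is that the comparison goes through the invariant culture's CompareInfo rather than one built from the database's LCID.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for a column-name lookup, not SqlClient's code.
public static class InvariantFieldNameLookup
{
    // Ignore case, kana type and character width when matching names.
    private const CompareOptions LookupOptions =
        CompareOptions.IgnoreCase |
        CompareOptions.IgnoreKanaType |
        CompareOptions.IgnoreWidth;

    // Returns the index of the first matching column name, or -1.
    public static int IndexOf(string[] fieldNames, string name)
    {
        // Pass 1: prefer an exact ordinal match.
        for (int i = 0; i < fieldNames.Length; i++)
        {
            if (string.Equals(fieldNames[i], name, StringComparison.Ordinal))
            {
                return i;
            }
        }

        // Pass 2: insensitive match using the invariant culture's rules.
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        for (int i = 0; i < fieldNames.Length; i++)
        {
            if (compareInfo.Compare(fieldNames[i], name, LookupOptions) == 0)
            {
                return i;
            }
        }

        return -1;
    }
}
```

With columns "Id" and "Name", a lookup of "name" resolves to index 1. How much kana and width folding is still applied without ICU is something that would need testing; this sketch doesn't demonstrate it.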
Essentially, we're replacing a ~28MB ICU package with a few KB of largely static data (which can be auto-generated if necessary.)
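To give a sense of what that static data might look like, here is a small sketch of a bundled LCID-to-codepage table. The class and method names are hypothetical, and the table holds only a handful of well-known Windows ANSI code page assignments rather than the full list; the 0x827 and 0x43F entries mirror the special cases mentioned above.

```csharp
using System.Collections.Generic;

// Hypothetical bundled mapping, not SqlClient's actual table.
public static class LcidCodePageMap
{
    // A few illustrative entries; a real table would cover every
    // LCID that SQL Server can report as a collation.
    private static readonly Dictionary<int, int> s_codePages =
        new Dictionary<int, int>
        {
            { 0x0409, 1252 }, // English (United States)
            { 0x0411, 932 },  // Japanese
            { 0x0419, 1251 }, // Russian
            { 0x0427, 1257 }, // Lithuanian
            { 0x043F, 1251 }, // hardcoded special case from the post
        };

    // Looks up the code page for an LCID without touching CultureInfo.
    public static bool TryGetCodePage(int lcid, out int codePage)
    {
        // Special case from the post: 0x827 shares 0x427's code page.
        if (lcid == 0x0827)
        {
            lcid = 0x0427;
        }

        return s_codePages.TryGetValue(lcid, out codePage);
    }
}
```

Returning false for an unknown LCID leaves the caller free to decide whether to throw (as happens today) or fall back to another strategy.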
Risks/impacts
Reading data
Historically, we've said that a lack of ICU data could potentially lead to corruption/misinterpretation. With the LCID-to-codepage mapping in place though, I don't think this is true any more. To consider the possible variations when SqlClient is in globalization invariant mode:
Appendix A of the MS-LCID specification includes a list of language IDs and their supported Windows versions. We can see that this list of languages changes very rarely, and there may be an even longer delay before SQL Server adds a collation for a new language. If we wanted to further reduce the risk of this LCID-to-codepage mapping drifting then I think it'd be possible to use a source generator, but that might be overkill.
Writing data
Unless specified in the SqlParameter class, the collation of string parameter values is set to the current collation of the connection. It's then written out to the network in UTF16. I don't think this would be affected by the proposed changes - UTF16 is available whether we're in globalization invariant mode or not, and we're not changing the way we determine the current collation, just the way it's mapped to a codepage.
This looks like a fairly simple set of public-facing code changes - most of the work required is to confirm that this doesn't introduce any data corruption bugs and to expand the tests to handle code points which aren't encoded in the same way across collations. There might also be a case to allow an AppContext switch to force the use of ICU data (for troubleshooting purposes.)
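If an AppContext switch were added for troubleshooting, reading it could be as simple as the sketch below. The switch name is invented for this example (it only imitates the naming pattern of existing SqlClient switches) and no such switch exists today.

```csharp
using System;

// Hypothetical opt-in switch, not an existing SqlClient switch.
public static class IcuOverrideSwitch
{
    // Invented name, for illustration only.
    public const string SwitchName =
        "Switch.Microsoft.Data.SqlClient.ForceIcuData";

    // True only when the switch has been explicitly set to true.
    public static bool IsEnabled =>
        AppContext.TryGetSwitch(SwitchName, out bool enabled) && enabled;
}
```

An application would opt in with AppContext.SetSwitch(IcuOverrideSwitch.SwitchName, true) at startup; leaving the switch unset keeps the bundled mappings in use.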
Are there any other areas where SqlClient needs ICU data to avoid data corruption, or anything else which might need to be considered?