Support for globalization invariant mode #2784

edwardneal · 2024-08-17T16:51:45Z

I've been thinking about what SqlClient would require to support globalization invariant mode. We currently throw an exception if we detect this, but I think we can handle this in a way which enables support.

Background: Globalization invariant mode

This was first introduced in .NET Core 2.0, with its design documented here. When this is enabled, .NET Core runs without accessing the ICU data. This in turn means that the ICU packages can be removed from the underlying container images, saving ~28MB. The Alpine and chiselled Ubuntu images for .NET are published without these packages to reduce their size.

In .NET 6.0, a breaking change enabled case-insensitive comparisons for all Unicode-defined characters.

Globalization invariant mode has recently seen more use in ASP.NET Core - one PR enabled it by default in the API template, and this was then partially reverted because it wasn't supported by SqlClient.

Background: SqlClient's compatibility with globalization invariant mode

As far as I can see, the first issue here was recorded in #81, and support was requested in #1249. This was closed, on the grounds that SqlClient needs the ICU data to correctly handle the decoding/encoding of data in remote collations.

Although that's broadly accurate, it's worth being precise. Text encoding and decoding works in globalization invariant mode, and various codepages function as users would expect. The base class for all codepage-based encodings loads certain encoding-based data from an embedded resource stream (here.)

Current state and proposed changes

At present, I think the only places SqlClient actually uses ICU data are:

TdsParser.GetCodePage, which maps a Windows LCID to a code page
FieldNameLookup.LinearIndexOf, which is used to allow clients of SqlDataReader to look up column names using the case, kana and width sensitivity rules of the database's LCID

I think we can make SqlClient compatible with globalization invariant mode by bundling the LCID-to-codepage mappings, and forcing SqlDataReader to use the invariant culture's case/kana/width sensitivity rules when in globalization invariant mode. There's some prior art here - we already special-case some cultures in TdsParser by mapping LCID 0x827 to the codepage for LCID 0x427 and hardcoding LCID 0x43F to codepage 1251.

Essentially, we're replacing a ~28MB ICU package with a few KB of largely static data (which can be auto-generated if necessary.)
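
A minimal sketch of what such a bundled mapping might look like, written in Python for brevity rather than SqlClient's C#. The names LCID_TO_CODEPAGE, LCID_ALIASES and get_code_page are hypothetical. The 0x0409, 0x0419, 0x0827 and 0x043F entries come from this discussion; 0x0427 (Lithuanian, code page 1257) is included only so that the alias has a target. A real table would need to cover every LCID that SQL Server can return.

```python
# Hypothetical bundled table: Windows LCID -> ANSI code page.
LCID_TO_CODEPAGE = {
    0x0409: 1252,  # English (United States)
    0x0419: 1251,  # Russian
    0x0427: 1257,  # Lithuanian
    0x043F: 1251,  # Kazakh, already hardcoded in TdsParser
}

# LCIDs resolved through another LCID's entry, like the existing 0x827 special case.
LCID_ALIASES = {0x0827: 0x0427}


def get_code_page(lcid: int) -> int:
    """Return the code page for an LCID, or raise if it is not bundled."""
    lcid = LCID_ALIASES.get(lcid, lcid)
    try:
        return LCID_TO_CODEPAGE[lcid]
    except KeyError:
        # Fail before any bytes are decoded, rather than guessing a code page.
        raise ValueError(f"No bundled code page for LCID 0x{lcid:04X}") from None
```

Raising on an unknown LCID, instead of falling back to a default code page, is what keeps a missing entry from turning into silently misdecoded data.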

Risks/impacts

Reading data

Historically, we've said that a lack of ICU data could potentially lead to corruption/misinterpretation. With the LCID-to-codepage mapping in place though, I don't think this is true any more. To consider the possible variations when SqlClient is in globalization invariant mode:

New LCID allocated, added to ICU/Windows but not SqlClient: LCID can't be mapped, exception is thrown before data can be interpreted. New issue is raised.
LCID removed from ICU/Windows but not SqlClient: no impact.
LCID-to-codepage mapping changed: this is the core risk - but we already face this risk. The ICU data on the client could be different to the server in exactly the same way, and we'd face the same risk of data corruption. I'm not sure what to look for to test this, but I wouldn't personally be surprised if this sort of change required a new LCID.

Appendix A of the MS-LCID specification includes a list of language IDs and their supported Windows versions. We can see that this list of languages changes very rarely, and SQL Server probably adds support for a new language as a collation even later than Windows does. If we wanted to further reduce the risk of this LCID-to-codepage mapping drifting then I think it'd be possible to use a source generator, but that might be overkill.

Writing data

Unless specified in the SqlParameter class, the collation of string parameter values is set to the current collation of the connection. It's then written out to the network in UTF-16. I don't think this would be affected by the proposed changes - UTF-16 is available whether we're in globalization invariant mode or not, and we're not changing the way we determine the current collation, just the way it's mapped to a codepage.
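
As a small illustration of why the write path should be unaffected (Python standing in for the .NET encoder, with an arbitrary sample string): the UTF-16 little-endian bytes of a string are defined by Unicode alone, whereas the single-byte form depends on which code page is chosen.

```python
text = "€uro"

# UTF-16LE needs no locale or code page data.
utf16 = text.encode("utf-16-le")
print(utf16.hex(" "))

# By contrast, the single-byte encoding differs per code page.
cp1251 = text.encode("cp1251")
cp1252 = text.encode("cp1252")
print(cp1251.hex(" "))
print(cp1252.hex(" "))
```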

This looks like a fairly simple set of public-facing code changes - most of the work required is to confirm that this doesn't introduce any data corruption bugs and to expand the tests to handle code points which aren't encoded in the same way across collations. There might also be a case to allow an AppContext switch to force the use of ICU data (for troubleshooting purposes.)

Are there any other areas where SqlClient needs ICU data to avoid data corruption, or anything else which might need to be considered?

roji (Collaborator) · 2024-08-19T12:36:53Z

> I think we can make SqlClient compatible with globalization invariant mode by bundling the LCID-to-codepage mappings, and forcing SqlDataReader to use the invariant culture's case/kana/width sensitivity rules when in globalization invariant mode.

The 2nd point certainly sounds like the expected behavior... If the user has explicitly opted into globalization invariant mode, it definitely makes sense for SqlDataReader to use the invariant culture.

For the former, an additional alternative to bundling mappings would be to allow globalization invariant mode only if the user's database is configured with an invariant locale on the SQL Server side (note that I know very little about this in SQL Server so I may be getting things wrong). In other words, it may be OK to have global invariant mode work in the narrow cases where the database doesn't actually contain data that would require ICU (and to throw otherwise, similar to today).

Otherwise I'd still be interested in exactly why SqlClient requires a code page here... At least on the PostgreSQL/Npgsql side, PostgreSQL simply sends data in some encoding (UTF8), and that's all Npgsql ever needs to know about.

4 replies

edwardneal Aug 19, 2024
Author

I agree on the latter point - I don't think it's a significant breaking change, and it doesn't bear the same risk of data corruption as the former.

> For the former, an additional alternative to bundling mappings would be to allow globalization invariant mode only if the user's database is configured with an invariant locale on the SQL Server side (note that I know very little about this in SQL Server so I may be getting things wrong). In other words, it may be OK to have global invariant mode work in the narrow cases where the database doesn't actually contain data that would require ICU (and to throw otherwise, similar to today).
>
> Otherwise I'd still be interested in exactly why SqlClient requires a code page here... At least on the PostgreSQL/Npgsql side, PostgreSQL simply sends data in some encoding (UTF8), and that's all Npgsql ever needs to know about.

PostgreSQL's approach sounds much better than SQL Server's! This part of the TDS protocol is... unique. If SQL Server needs to send the results of a simple query with one result set, two varchar columns and one row, it'll send three tokens:

COLMETADATA (column 1)
ROW
DONE

The query I choose to run may be:

select '€' collate Cyrillic_General_CI_AI as [Column1], '€' collate Latin1_General_100_CI_AI as [Column2]

In such a case, the ROW and DONE tokens will be:

ROW: D1 01 00 88 01 00 80
DONE: FD 10 00 C1 00 01 00 00 00 00 00 00 00

We don't need to care too much about the DONE token - the ROW token is the one we care about here. We see a token type of D1 (as defined in the TDS spec), then we see two varchars with an endian-flipped length of 00 01 and character values of 88 and 80. This is because € is character 0x88 in Windows-1251 and 0x80 in Windows-1252.
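
To make that concrete, here is a rough Python sketch that walks the ROW token bytes quoted above. The helper read_varchar is hypothetical and deliberately minimal: it assumes each value is a 2-byte little-endian length followed by that many bytes, and it ignores NULLs and every other data type.

```python
import struct

row = bytes.fromhex("D1 01 00 88 01 00 80")


def read_varchar(buf: bytes, offset: int, codepage: str):
    """Read one length-prefixed varchar value; return (text, next offset)."""
    (length,) = struct.unpack_from("<H", buf, offset)  # little-endian length
    start = offset + 2
    raw = buf[start:start + length]
    return raw.decode(codepage), start + length


assert row[0] == 0xD1  # ROW token type
col1, pos = read_varchar(row, 1, "cp1251")    # Cyrillic_General_CI_AI
col2, pos = read_varchar(row, pos, "cp1252")  # Latin1_General_100_CI_AI
print(f"U+{ord(col1):04X} U+{ord(col2):04X}")
```

Both columns come back as U+20AC, the euro sign. Decoding with the code pages swapped does not fail, it just yields different characters (0x88 is ˆ in Windows-1252 and 0x80 is Ђ in Windows-1251), which is the kind of silent misinterpretation at stake here.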

The COLMETADATA token tells us which codepage to use to decode this, in the form of a collation. In this case, its value for the first column is:
00 00 00 00 00 00 20 00 A7 01 00 19 04 F0 00 00 07 [7-character column name]

The five bytes preceding the column name's length describe the collation for this column's data: 19 04 F0 00 00. We parse this according to the relevant part of the MS-TDS spec, and derive an LCID of 0x0419.

This is where TdsParser.GetCodePage comes in. It takes the collation data from the COLMETADATA token, extracts the collation's LCID, then looks up that LCID's code page (Windows-1251 for 0x0419, Windows-1252 for 0x0409). This code page is used to decode the field contents of the ROW token.
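
The collation-to-codepage step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SqlClient's implementation: lcid_from_collation and CODEPAGE_BY_LCID are hypothetical names, the table holds only the two LCIDs mentioned above, and the trailing sort ID byte of the collation is ignored. The LCID is taken as the low 20 bits of the first four bytes, read little-endian.

```python
collation = bytes.fromhex("19 04 F0 00 00")


def lcid_from_collation(raw: bytes) -> int:
    """Extract the LCID from a 5-byte TDS collation."""
    if len(raw) != 5:
        raise ValueError("a TDS collation is 5 bytes")
    info = int.from_bytes(raw[:4], "little")  # LCID, flag and version bits
    return info & 0xFFFFF                     # low 20 bits hold the LCID


# Hypothetical subset of an LCID -> Python codec table.
CODEPAGE_BY_LCID = {0x0419: "cp1251", 0x0409: "cp1252"}

lcid = lcid_from_collation(collation)
codec = CODEPAGE_BY_LCID[lcid]
decoded = bytes([0x88]).decode(codec)  # first column's value from the ROW token
print(f"LCID 0x{lcid:04X} -> {codec} -> U+{ord(decoded):04X}")
```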

If we wanted to whitelist specific LCIDs, we'd essentially want to know which LCIDs' codepages are a strict subset of the invariant codepage. That's a workable way to implement it, but it'd rapidly boil back down to a mapping between LCID and encoding, to ensure we don't try to use the invariant codepage to encode data which the SQL Server collation can't handle.

It's also worth keeping in mind that collation exists at several levels:

Statement (the COLLATE clause in a query)
Column
Database
Instance

If any of these were out-of-bounds then we'd throw an exception. The default database of our SQL Server login might be a database with a Windows-1251-based collation; we wouldn't be able to complete the login because part of the login process involves receiving an ENVCHANGE token with the default collation details. Changing the collation of a database is a painful process, and changing the collation of a SQL Server instance requires the installation media. I'm personally in favour of having a larger LCID/codepage mapping because this situation is difficult to get out of.

roji Aug 20, 2024
Collaborator

Thanks for the above low-level details - that does shed some light on the situation... In short, it seems that the SQL Server "collation" concept also determines on-wire encoding of text... In PostgreSQL, the collation is purely responsible for how text is compared/sorted in the database, whereas the on-wire encoding is completely distinct. In fact, the encoding used by PG when sending data to the client can be configured by the client (SET client_encoding='...'), and is almost always UTF8 in modern applications.

If in SQL Server, the on-wire text encoding is determined by the database (Windows-1251-based by default), and there's no way for SqlClient to request another encoding (e.g. as part of the login process?), then yeah, I agree it makes sense to include LCID/codepage mappings, as anything else basically sounds unfeasible...

edwardneal Aug 20, 2024
Author

That's correct, yes. It also controls how the data is stored in database pages, which is why changing the database collation only impacts newly-created columns.

SQL Server can come slightly closer to the PostgreSQL behaviour: nvarchars are handled a little more consistently for most purposes, and there are specific collations which use UTF8 to store varchars. There's nothing in the prelogin/login handshake though, so it's not close enough to work with unfortunately...

roji Aug 20, 2024
Collaborator

Right, all makes sense. Thanks again for all the detail!
