Encapsulate encoding information in string conversion #955

Open
jtv opened this issue Feb 25, 2025 · 9 comments
@jtv
Owner

jtv commented Feb 25, 2025

For array conversions in particular, the string conversion API may need to know the client encoding group.

I'd prefer not to expose encoding groups, since they're not set in stone and we may later discover that we need more detailed information; and also, I'm not completely confident that they correspond exactly to something that the client would want to know. They're more of an internal data type. So, it would seem to make sense to encapsulate the information in a small class.

We then need to pass an instance of that class to some of the string conversions (only to_buf() and into_buf() I think), which means a further extension of the conversion API.

Unresolved question: is there a way to pass this only when needed? It seems like a stupid question, but there's no such thing as a default encoding really. Perhaps there should be a way of expressing "don't care" as an encoding group, ideally at compile time.

@jtv jtv added the 8.0 label Feb 25, 2025
@tt4g
Contributor

tt4g commented Feb 25, 2025

In my experience, the most popular pattern is to have the environment information accessed through an object named “context”.
The advantage of using “contexts” is that when you want to make new information accessible, you do not have to add new arguments to the function (the function signature remains unchanged).
Also, the cost of creating a “context” can be kept low by retrieving each piece of information lazily, the first time it is accessed through the “context”.
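
The pattern could be sketched roughly like this (a hypothetical illustration, not libpqxx's actual API; all names here are made up):

```cpp
// Sketch of the "context" pattern: environment information travels in one
// struct, so adding a new member later does not change any function signature.
#include <cstdio>
#include <string>

enum class encoding_group { unknown, monobyte, utf8 };

struct conversion_context
{
  // Only the few conversions that care inspect this; "unknown" = don't care.
  encoding_group enc = encoding_group::unknown;
};

// The context parameter defaults, so most call sites stay unchanged even as
// conversion_context grows more members over time.
std::string fake_to_string(int value, conversion_context const &ctx = {})
{
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%d", value);
  return buf;
}
```

Callers that don't care about encoding simply omit the argument; only encoding-sensitive conversions need to construct a context explicitly.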

@jtv
Owner Author

jtv commented Feb 25, 2025

Thanks @tt4g, that sounds like a good idea. Perhaps the std::source_location ought to go in there as well.

I guess then if we want to add more optional members to the context, we can just default-construct those.

For my other question, about what the default value for the encoding group should be... I think we need an enum value for "unknown." That way, the few conversions that care about the encoding can just check the encoding and throw an error if the conversion is dangerous.

Frankly, if the entire SJIS/GB/BIG5 families of encodings had been deprecated in favour of UTF-8 a decade ago, that would probably make my life a lot easier!
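
The "unknown means don't care" idea might look like this (a hedged sketch with illustrative names, not libpqxx's real API):

```cpp
// Encoding-sensitive conversions refuse to run without encoding information
// rather than risk corrupting data.
#include <stdexcept>
#include <string>

enum class encoding_group { unknown, monobyte, utf8, sjis };

std::string escape_for_array(std::string const &value, encoding_group enc)
{
  // A multibyte-aware escape must know the encoding: in SJIS, for example,
  // the trailing byte of a character can look like an ASCII backslash.
  if (enc == encoding_group::unknown)
    throw std::runtime_error{
      "Encoding-sensitive conversion needs a known client encoding."};
  // (Real escaping elided; pass through for the sketch.)
  return value;
}
```

Conversions that don't care never check the value, so the "unknown" default costs them nothing.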

@tt4g
Contributor

tt4g commented Feb 25, 2025

I think the encoding returned by PQclientEncoding(const PGconn *) (https://www.postgresql.org/docs/17/libpq-control.html) should be the default.
However, that would mean supporting every encoding libpq supports.
Since that would be endless work, I think it is acceptable for libpqxx to define its own “encoding groups” and support only the popular character encodings.

EDIT: I didn't explain why I think PQclientEncoding(const PGconn *) should be the default encoding.
The reason is that SQL which is not processed by to_buf() is handled by libpq in the character encoding reported by PQclientEncoding(const PGconn *).
An example of SQL not processed by to_buf() is a simple statement like SELECT VERSION() (e.g. tx.query1<std::string>("SELECT VERSION()");).
Since such SQL is passed directly to libpq, libpq will interpret it in the PQclientEncoding(const PGconn *) character encoding and send it to the server.
If libpqxx used a different character encoding than PQclientEncoding(const PGconn *), data would be sent in inconsistent encodings, which would cause confusion.

@jtv
Owner Author

jtv commented Feb 26, 2025

@tt4g does libpq support any encodings that libpqxx does not?

(Bearing in mind of course that libpqxx really only cares about encoding groups — a custom abstraction that lets us treat encodings with the same encoding mechanism as identical for our purposes.)

I'd like to avoid the call to obtain the client encoding where possible, since there's going to be a small performance cost to calling out to the C library. Very very few conversions will need the information, and we do so many conversions that I want them to be really fast.

Caching could be an option but then we need to worry about invalidation.
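
One way to frame the trade-off: cache the result per connection, and expose an explicit invalidation hook for the (rare) events that can change the client encoding. This is only a sketch under that assumption; the class and member names are invented for illustration:

```cpp
// Cache an expensive per-connection lookup; query the source only on a
// cache miss, and reset the cache when the client encoding may have changed.
#include <functional>
#include <optional>
#include <utility>

enum class encoding_group { unknown, monobyte, utf8, sjis };

class connection_state
{
public:
  // "query" stands in for the real call out to libpq.
  explicit connection_state(std::function<encoding_group()> query) :
      m_query{std::move(query)} {}

  encoding_group enc_group()
  {
    if (not m_cached) m_cached = m_query();
    return *m_cached;
  }

  // Call after anything that may change client_encoding, e.g. executing
  // "SET client_encoding" or reconnecting.
  void invalidate_encoding() { m_cached.reset(); }

private:
  std::function<encoding_group()> m_query;
  std::optional<encoding_group> m_cached;
};
```

The invalidation burden then falls on the handful of code paths that can change the encoding, rather than on every conversion.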

@tt4g
Contributor

tt4g commented Feb 26, 2025

A list is available on https://www.postgresql.org/docs/17/multibyte.html

I am not familiar with all encodings, but perhaps support for the following encodings is missing:

  • KOI8R
  • KOI8U
  • UHC
  • WIN874
  • WIN1250
  • WIN1251
  • WIN1252
  • WIN1253
  • WIN1254
  • WIN1255
  • WIN1256
  • WIN1257
  • WIN1258

Tip

The WIN*** encodings may be character encodings specific to Windows.
It is debatable whether they need to be supported.

enum class encoding_group
{
  // Handles all single-byte fixed-width encodings.
  MONOBYTE,
  // Multibyte encodings.
  // Many of these can embed ASCII-like bytes inside multibyte characters,
  // notably Big5, SJIS, SHIFT_JIS_2004, GB18030, GBK, JOHAB, UHC.
  BIG5,
  EUC_CN,
  EUC_JP,
  EUC_KR,
  EUC_TW,
  GB18030,
  GBK,
  JOHAB,
  MULE_INTERNAL,
  SJIS,
  UHC,
  UTF8,
};

It may also need to support behavior compatible with SQL_ASCII.

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
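
Tying this together, the mapping from libpq's reported encoding name to a group might look like the sketch below. This is purely illustrative (the function name and the SQL_ASCII choice are assumptions, not libpqxx's actual code); the key point is that everything not explicitly multibyte, including SQL_ASCII, collapses into MONOBYTE:

```cpp
// Hypothetical name -> encoding_group mapping, using the enum above.
#include <string_view>

enum class encoding_group
{ MONOBYTE, BIG5, EUC_CN, EUC_JP, EUC_KR, EUC_TW,
  GB18030, GBK, JOHAB, MULE_INTERNAL, SJIS, UHC, UTF8 };

encoding_group group_for(std::string_view name)
{
  if (name == "UTF8") return encoding_group::UTF8;
  if (name == "SJIS" or name == "SHIFT_JIS_2004") return encoding_group::SJIS;
  if (name == "BIG5") return encoding_group::BIG5;
  if (name == "GB18030") return encoding_group::GB18030;
  if (name == "GBK") return encoding_group::GBK;
  if (name == "JOHAB") return encoding_group::JOHAB;
  if (name == "UHC") return encoding_group::UHC;
  if (name == "EUC_CN") return encoding_group::EUC_CN;
  if (name == "EUC_JP" or name == "EUC_JIS_2004") return encoding_group::EUC_JP;
  if (name == "EUC_KR") return encoding_group::EUC_KR;
  if (name == "EUC_TW") return encoding_group::EUC_TW;
  if (name == "MULE_INTERNAL") return encoding_group::MULE_INTERNAL;
  // Everything else -- LATIN*, WIN*, KOI8*, and SQL_ASCII -- is treated as
  // single-byte: no byte of a character can be mistaken for ASCII.
  return encoding_group::MONOBYTE;
}
```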

@jtv
Owner Author

jtv commented Mar 7, 2025

@tt4g it looks to me like libpqxx currently supports all encodings in that list! Most of them fall under MONOBYTE.

@tt4g
Contributor

tt4g commented Mar 7, 2025

@jtv That's good.

@jtv
Owner Author

jtv commented Mar 7, 2025

Oh and by the way @tt4g I'm using a "context" struct for the new string conversion API, like you suggested.

@tt4g
Contributor

tt4g commented Mar 8, 2025

@jtv Yeah. I saw it added in a new commit. Glad to see my opinion was adopted.
