Encapsulate encoding information in string conversion #955

Open
jtv opened this issue Feb 25, 2025 · 9 comments
@jtv
Owner

jtv commented Feb 25, 2025

For array conversions in particular, the string conversion API may need to know the client encoding group.

I'd prefer not to expose encoding groups, since they're not set in stone and we may later discover that we need more detailed information; and also, I'm not completely confident that they correspond exactly to something that the client would want to know. They're more of an internal data type. So, it would seem to make sense to encapsulate the information in a small class.

We then need to pass an instance of that class to some of the string conversions (only to_buf() and into_buf() I think), which means a further extension of the conversion API.

Unresolved question: is there a way to pass this only when needed? It seems like a stupid question, but there's no such thing as a default encoding really. Perhaps there should be a way of expressing "don't care" as an encoding group, ideally at compile time.

@jtv jtv added the 8.0 label Feb 25, 2025
@tt4g
Contributor

tt4g commented Feb 25, 2025

In my experience, the most popular pattern is to have the environment information accessed through an object named “context”.
The advantage of using “contexts” is that when you want to make new information accessible, you do not have to add new arguments to the function (the function signature remains unchanged).
Also, the cost of creating a “context” can be kept low by retrieving each piece of information lazily, the first time it is accessed through the “context”.
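
The pattern could be sketched roughly like this (a hypothetical illustration, not libpqxx's actual API; all names here are made up):

```cpp
// Sketch of the "context" pattern: environment information travels in one
// struct, so adding a new member later does not change any function signature.
#include <cstdio>
#include <string>

enum class encoding_group { unknown, monobyte, utf8 };

struct conversion_context
{
  // Only the few conversions that care inspect this; "unknown" = don't care.
  encoding_group enc = encoding_group::unknown;
};

// The context parameter defaults, so most call sites stay unchanged even as
// conversion_context grows more members over time.
std::string fake_to_string(int value, conversion_context const &ctx = {})
{
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%d", value);
  return buf;
}
```

Callers that don't care about encoding simply omit the argument; only encoding-sensitive conversions need to construct a context explicitly.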

@jtv
Owner Author

jtv commented Feb 25, 2025

Thanks @tt4g, that sounds like a good idea. Perhaps the std::source_location ought to go in there as well.

I guess then if we want to add more optional members to the context, we can just default-construct those.

For my other question, about what the default value for the encoding group should be... I think we need an enum value for "unknown." That way, the few conversions that care about the encoding can just check the encoding and throw an error if the conversion is dangerous.

Frankly, if the entire SJIS/GB/BIG5 families of encodings had been deprecated in favour of UTF-8 a decade ago, that would probably make my life a lot easier!
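
The "unknown means don't care" idea might look like this (a hedged sketch with illustrative names, not libpqxx's real API):

```cpp
// Encoding-sensitive conversions refuse to run without encoding information
// rather than risk corrupting data.
#include <stdexcept>
#include <string>

enum class encoding_group { unknown, monobyte, utf8, sjis };

std::string escape_for_array(std::string const &value, encoding_group enc)
{
  // A multibyte-aware escape must know the encoding: in SJIS, for example,
  // the trailing byte of a character can look like an ASCII backslash.
  if (enc == encoding_group::unknown)
    throw std::runtime_error{
      "Encoding-sensitive conversion needs a known client encoding."};
  // (Real escaping elided; pass through for the sketch.)
  return value;
}
```

Conversions that don't care never check the value, so the "unknown" default costs them nothing.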

@tt4g
Contributor

tt4g commented Feb 25, 2025

I think the encoding returned by PQclientEncoding(const PGconn *) (https://www.postgresql.org/docs/17/libpq-control.html) should be the default.
However, that would mean supporting every encoding libpq supports.
Since that would be endless work, I think it is acceptable for libpqxx to define its own “encoding groups” and support only the popular character encodings.

EDIT: I didn't explain why I think PQclientEncoding(const PGconn *) should be the default encoding.
The reason is that SQL which is not processed by to_buf() is handled by libpq in the character encoding reported by PQclientEncoding(const PGconn *).
An example of SQL not processed by to_buf() is a simple statement like SELECT VERSION() (e.g. tx.query1<std::string>("SELECT VERSION()");).
Since such SQL is passed directly to libpq, libpq will interpret it in the PQclientEncoding(const PGconn *) character encoding and send it to the server.
If libpqxx used a different character encoding than PQclientEncoding(const PGconn *), data would be sent in inconsistent encodings, which would cause confusion.

@jtv
Owner Author

jtv commented Feb 26, 2025

@tt4g does libpq support any encodings that libpqxx does not?

(Bearing in mind of course that libpqxx really only cares about encoding groups — a custom abstraction that lets us treat encodings with the same encoding mechanism as identical for our purposes.)

I'd like to avoid the call to obtain the client encoding where possible, since there's going to be a small performance cost to calling out to the C library. Very very few conversions will need the information, and we do so many conversions that I want them to be really fast.

Caching could be an option but then we need to worry about invalidation.
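
One way to frame the trade-off: cache the result per connection, and expose an explicit invalidation hook for the (rare) events that can change the client encoding. This is only a sketch under that assumption; the class and member names are invented for illustration:

```cpp
// Cache an expensive per-connection lookup; query the source only on a
// cache miss, and reset the cache when the client encoding may have changed.
#include <functional>
#include <optional>
#include <utility>

enum class encoding_group { unknown, monobyte, utf8, sjis };

class connection_state
{
public:
  // "query" stands in for the real call out to libpq.
  explicit connection_state(std::function<encoding_group()> query) :
      m_query{std::move(query)} {}

  encoding_group enc_group()
  {
    if (not m_cached) m_cached = m_query();
    return *m_cached;
  }

  // Call after anything that may change client_encoding, e.g. executing
  // "SET client_encoding" or reconnecting.
  void invalidate_encoding() { m_cached.reset(); }

private:
  std::function<encoding_group()> m_query;
  std::optional<encoding_group> m_cached;
};
```

The invalidation burden then falls on the handful of code paths that can change the encoding, rather than on every conversion.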

@tt4g
Contributor

tt4g commented Feb 26, 2025

A list is available on https://www.postgresql.org/docs/17/multibyte.html

I am not familiar with all encodings, but perhaps support for the following encodings is missing:

  • KOI8R
  • KOI8U
  • UHC
  • WIN874
  • WIN1250
  • WIN1251
  • WIN1252
  • WIN1253
  • WIN1254
  • WIN1255
  • WIN1256
  • WIN1257
  • WIN1258

Tip

The WIN*** encodings may be character encodings specific to Windows.
It is debatable whether they need to be supported.

enum class encoding_group
{
  // Handles all single-byte fixed-width encodings.
  MONOBYTE,
  // Multibyte encodings.
  // Many of these can embed ASCII-like bytes inside multibyte characters,
  // notably Big5, SJIS, SHIFT_JIS_2004, GB18030, GBK, JOHAB, UHC.
  BIG5,
  EUC_CN,
  EUC_JP,
  EUC_KR,
  EUC_TW,
  GB18030,
  GBK,
  JOHAB,
  MULE_INTERNAL,
  SJIS,
  UHC,
  UTF8,
};

It may also need to support behavior compatible with SQL_ASCII.

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
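
Tying this together, the mapping from libpq's reported encoding name to a group might look like the sketch below. This is purely illustrative (the function name and the SQL_ASCII choice are assumptions, not libpqxx's actual code); the key point is that everything not explicitly multibyte, including SQL_ASCII, collapses into MONOBYTE:

```cpp
// Hypothetical name -> encoding_group mapping, using the enum above.
#include <string_view>

enum class encoding_group
{ MONOBYTE, BIG5, EUC_CN, EUC_JP, EUC_KR, EUC_TW,
  GB18030, GBK, JOHAB, MULE_INTERNAL, SJIS, UHC, UTF8 };

encoding_group group_for(std::string_view name)
{
  if (name == "UTF8") return encoding_group::UTF8;
  if (name == "SJIS" or name == "SHIFT_JIS_2004") return encoding_group::SJIS;
  if (name == "BIG5") return encoding_group::BIG5;
  if (name == "GB18030") return encoding_group::GB18030;
  if (name == "GBK") return encoding_group::GBK;
  if (name == "JOHAB") return encoding_group::JOHAB;
  if (name == "UHC") return encoding_group::UHC;
  if (name == "EUC_CN") return encoding_group::EUC_CN;
  if (name == "EUC_JP" or name == "EUC_JIS_2004") return encoding_group::EUC_JP;
  if (name == "EUC_KR") return encoding_group::EUC_KR;
  if (name == "EUC_TW") return encoding_group::EUC_TW;
  if (name == "MULE_INTERNAL") return encoding_group::MULE_INTERNAL;
  // Everything else -- LATIN*, WIN*, KOI8*, and SQL_ASCII -- is treated as
  // single-byte: no byte of a character can be mistaken for ASCII.
  return encoding_group::MONOBYTE;
}
```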

@jtv
Owner Author

jtv commented Mar 7, 2025

@tt4g it looks to me like libpqxx currently supports all encodings in that list! Most of them fall under MONOBYTE.

@tt4g
Contributor

tt4g commented Mar 7, 2025

@jtv That's good.

@jtv
Owner Author

jtv commented Mar 7, 2025

Oh and by the way @tt4g I'm using a "context" struct for the new string conversion API, like you suggested.

@tt4g
Contributor

tt4g commented Mar 8, 2025

@jtv Yeah. I saw it added in a new commit. Glad to see my opinion was adopted.
