Add built-in functions UNICODE_CHAR and UNICODE_VAL to convert between Unicode code point and character #6798

Closed
mrotteveel opened this issue May 10, 2021 · 9 comments

Comments

@mrotteveel
Member

Currently, Firebird has ASCII_CHAR and ASCII_VAL, which allow conversion between ASCII code points and ASCII characters. It would be helpful to have equivalent functions UNICODE_CHAR and UNICODE_VAL to convert between Unicode code points and CHAR(1) CHARACTER SET UTF8 characters.

The input of UNICODE_CHAR would be an integer value in the range 0x00 to 0x10FFFF, and the result would be a CHAR(1) CHARACTER SET UTF8 with the equivalent character.

The input of UNICODE_VAL would be any string type (including blobs) with character set UTF8 (character strings of other character sets should be converted to UTF8), and the function would return the Unicode code point of the first character of the string.
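
To make the proposed semantics concrete, here is a minimal Python model of the two functions. It is only a sketch of the behaviour described above, not Firebird code: the names `unicode_char` and `unicode_val` are illustrative, and the handling of NULL and of the empty string is an assumption that the request does not specify.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound given in the request


def unicode_char(code_point):
    """Model of the proposed UNICODE_CHAR: code point -> one-character string."""
    if code_point is None:
        return None  # assumption: NULL in, NULL out
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {code_point:#x}")
    return chr(code_point)


def unicode_val(text):
    """Model of the proposed UNICODE_VAL: first character -> code point."""
    if text is None:
        return None  # assumption: NULL in, NULL out
    if text == "":
        return 0  # assumption: not specified in the request
    # Python strings are sequences of code points, so text[0] is a whole character.
    return ord(text[0])
```

For example, `unicode_char(0x20AC)` gives the euro sign, and `unicode_val` of a string starting with the euro sign gives `0x20AC` again.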

@aafemt
Contributor

aafemt commented May 10, 2021

It would be nice if UNICODE_VAL internally worked a little more cleverly than any charset -> unicode -> UTF-8 -> unicode -> take the first codepoint...

@mrotteveel
Member Author

> It would be nice if UNICODE_VAL internally worked a little more cleverly than any charset -> unicode -> UTF-8 -> unicode -> take the first codepoint...

What exactly do you mean by "a little more clever"?

@aafemt
Contributor

aafemt commented May 10, 2021

At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.
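
As an illustration of "take only one character from the source string", the sketch below reads just the first character of a UTF-8 byte string instead of converting the whole value. It is a hypothetical Python helper (`first_code_point_utf8` is not a Firebird function) and it covers only input that is already UTF-8; other character sets would need their own rule for finding the length of the first character.

```python
def first_code_point_utf8(data: bytes):
    """Return the code point of the first UTF-8 character, touching at most 4 bytes."""
    if not data:
        return None  # assumption: empty-input behaviour is not specified in the thread
    lead = data[0]
    # The lead byte alone tells how many bytes the first character occupies.
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("invalid UTF-8 lead byte")
    # Decode only that prefix; the built-in decoder validates the continuation bytes.
    return ord(data[:length].decode("utf-8"))
```

The cost is independent of the total length of the input, which is the relevant property for large values such as blobs.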

@asfernandes asfernandes self-assigned this May 10, 2021
@asfernandes
Member

It would also be good if we had some form of string supporting Unicode-escaped character numbers.

If we do that with a function, that function may transform things at compile time for fixed strings.

BTW, the idea of functions doing things at compile time could be applied to ASCII_CHAR / ASCII_VAL too, and to some operators (just think of an expression using strings, concatenations and ASCII_CHAR being resolved purely at compile time).

@mrotteveel
Member Author

> At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.

That sounds good to me as well.

I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The latter might be 'more correct', but might be a bit of a hassle.

@mrotteveel
Member Author

> It would also be good if we had some form of string supporting Unicode-escaped character numbers.

For that we would need to add support for Unicode string literals (see 5.3 <literal>, <Unicode character string literal> in the SQL:2016-2 standard).
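
For readers unfamiliar with that syntax: a standard Unicode string literal looks like `U&'d\0061t\+000061'`, where `\XXXX` is a four-digit and `\+XXXXXX` a six-digit hexadecimal code point, and `\\` is a literal backslash. The Python sketch below decodes the body of such a literal. It is illustrative only: `decode_unicode_literal` is a hypothetical helper, it assumes the default backslash escape character (no UESCAPE clause), and it leaves malformed escapes untouched where a real parser would reject them.

```python
import re

# A doubled backslash, "\+" followed by six hex digits, or "\" followed by four.
_ESCAPE = re.compile(r"\\(?:\\|\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))")


def decode_unicode_literal(body):
    """Decode the text between the quotes of a U&'...' literal."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return "\\"  # doubled escape character stands for one backslash
        return chr(int(hex_digits, 16))

    # Anything that is not a well-formed escape is copied through unchanged.
    return _ESCAPE.sub(replace, body)
```

With this helper, the body `d\0061t\+000061` decodes to `data`.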

> If we do that with a function, that function may transform things at compile time for fixed strings.
>
> BTW, the idea of functions doing things at compile time could be applied to ASCII_CHAR / ASCII_VAL too, and to some operators (just think of an expression using strings, concatenations and ASCII_CHAR being resolved purely at compile time).

That would be an interesting optimization, but I think it should be done separately, outside the scope of this request.

@asfernandes
Member

> > At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.
>
> That sounds good to me as well.
>
> I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The latter might be 'more correct', but might be a bit of a hassle.

Firebird considers a surrogate pair as a single character in SUBSTRING, so it should be considered as a single character in these functions.

@mrotteveel
Member Author

> > I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The latter might be 'more correct', but might be a bit of a hassle.
>
> Firebird considers a surrogate pair as a single character in SUBSTRING, so it should be considered as a single character in these functions.

Ok, then UNICODE_VAL should follow the same pattern.
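
Treating a surrogate pair as a single character means combining the two UTF-16 code units into one code point. A minimal Python sketch of that rule follows, assuming the input is available as a sequence of 16-bit code units. The helper name `first_code_point_utf16` is hypothetical, and returning an unpaired surrogate unchanged is an assumption, since the thread does not say how such input should be handled.

```python
def first_code_point_utf16(units):
    """Return the first code point of a sequence of UTF-16 code units."""
    if not units:
        return None  # assumption: empty-input behaviour is not specified in the thread
    first = units[0]
    is_high = 0xD800 <= first <= 0xDBFF
    if is_high and len(units) > 1 and 0xDC00 <= units[1] <= 0xDFFF:
        # High surrogate carries the upper 10 bits, low surrogate the lower 10 bits.
        return 0x10000 + ((first - 0xD800) << 10) + (units[1] - 0xDC00)
    # BMP character, or an unpaired surrogate returned as-is (assumption).
    return first
```

For example, the code units `0xD83D, 0xDE00` combine to `0x1F600`.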

@aafemt
Contributor

aafemt commented May 16, 2021

Did you test it with a 1 GB BLOB?
