Add built-in functions UNICODE_CHAR and UNICODE_VAL to convert between Unicode code point and character #6798

mrotteveel · 2021-05-10T10:47:50Z

Currently, Firebird has ASCII_CHAR and ASCII_VAL, which allow conversion between ASCII code points and ASCII characters. It would be helpful to have equivalent functions UNICODE_CHAR and UNICODE_VAL to convert between Unicode code points and CHAR(1) CHARACTER SET UTF8 characters.

The input of UNICODE_CHAR would be an integer value in the range 0x00 to 0x10FFFF, and the result would be a CHAR(1) CHARACTER SET UTF8 with the equivalent character.

The input of UNICODE_VAL would be any string type (including blobs) with character set UTF8 (character strings of other character sets should be converted to UTF8), and the result would be the Unicode code point of the first character of the string.
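To make the proposed behaviour concrete, here is a minimal Python sketch of the two conversions. It only models the semantics described above and is not Firebird code. The names `unicode_char` and `unicode_val` are illustrative stand-ins, and returning 0 for an empty string is an assumption, since the request does not say what should happen in that case.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the range given in the request


def unicode_char(code_point: int) -> str:
    """Model of the proposed UNICODE_CHAR: code point -> one character."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {code_point}")
    return chr(code_point)


def unicode_val(text: str) -> int:
    """Model of the proposed UNICODE_VAL: code point of the first character."""
    if not text:
        return 0  # assumption: the request does not define the empty-string case
    return ord(text[0])


if __name__ == "__main__":
    print(unicode_char(0x20AC))      # the euro sign
    print(hex(unicode_val("€uro")))  # code point of the first character only
```

The two functions are inverses for any single character: `unicode_val(unicode_char(n)) == n` for every `n` in the accepted range.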


aafemt · 2021-05-10T11:02:08Z

It would be nice if UNICODE_VAL internally worked a little more cleverly than any charset -> unicode -> UTF-8 -> unicode -> take the first codepoint...

mrotteveel · 2021-05-10T11:04:16Z

> It would be nice if UNICODE_VAL internally worked a little more cleverly than any charset -> unicode -> UTF-8 -> unicode -> take the first codepoint...

What exactly do you mean by "a little more cleverly"?

aafemt · 2021-05-10T11:11:46Z

At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.
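As an illustration of "take only one character from the source string", the following Python sketch decodes just the first UTF-8 sequence of a byte string and never looks at the rest. It assumes the input is already UTF-8 (other character sets would need their own first-character logic), and the helper name `first_code_point_utf8` is hypothetical, not something from the Firebird code base.

```python
def first_code_point_utf8(data: bytes) -> int:
    """Decode only the first UTF-8 sequence of `data` and return its code point."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # The lead byte alone determines how many bytes the first character uses.
    if lead < 0x80:
        length = 1
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")
    # Decoding the short slice validates the continuation bytes;
    # the remainder of the input is never touched.
    return ord(data[:length].decode("utf-8"))


if __name__ == "__main__":
    big = "€".encode("utf-8") + b"x" * 1_000_000
    print(hex(first_code_point_utf8(big)))
```

The cost depends only on the first character, not on the total length of the value, which is the property being asked for here.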

asfernandes · 2021-05-10T11:18:56Z

It would also be good if we had some form of string supporting Unicode escaped character numbers.

If we do that with a function, that function may transform things at compile time for fixed strings.

BTW, the idea of functions doing things at compile time could be applied to ASCII_CHAR / ASCII_VAL too, and to some operators (just think of an expression using strings, concatenations and ASCII_CHAR being resolved purely at compile time).

mrotteveel · 2021-05-10T11:19:16Z

> At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.

That sounds good to me as well.

I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The last one might be 'more correct', but might be a bit of a hassle.

mrotteveel · 2021-05-10T11:23:03Z

> It would also be good if we had some form of string supporting Unicode escaped character numbers.

For that we would need to add support for Unicode string literals (see 5.3 <literal>, <Unicode character string literal> in the SQL:2016-2 standard).

> If we do that with a function, that function may transform things at compile time for fixed strings.
>
> BTW, the idea of functions doing things at compile time could be applied to ASCII_CHAR / ASCII_VAL too, and to some operators (just think of an expression using strings, concatenations and ASCII_CHAR being resolved purely at compile time).

That would be an interesting optimization, but I think that should be done separately, outside the scope of this request.

asfernandes · 2021-05-10T11:23:39Z

> > At least "any charset -> unicode -> take the first codepoint". Best of all would be to take only one character from the source string, while taking surrogate pairs into account.
>
> That sounds good to me as well.
>
> I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The last one might be 'more correct', but might be a bit of a hassle.

Firebird considers a surrogate pair as a single character in SUBSTRING, so it should be considered as a single character in these functions.

mrotteveel · 2021-05-10T11:25:34Z

> > I'm not sure how you want to account for surrogate pairs, because you can either resolve the code point of the individual surrogate, or the code point of the combined surrogates. The last one might be 'more correct', but might be a bit of a hassle.
>
> Firebird considers a surrogate pair as a single character in SUBSTRING, so it should be considered as a single character in these functions.

Ok, then UNICODE_VAL should follow the same pattern.
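For reference, treating a surrogate pair as one character means combining the two UTF-16 code units into a single code point. A minimal Python sketch of that arithmetic follows. It works on a sequence of UTF-16 code units, the function name `first_code_point_utf16` is hypothetical, and rejecting unpaired surrogates with an error is an assumption rather than anything decided in this thread.

```python
from typing import Sequence


def first_code_point_utf16(units: Sequence[int]) -> int:
    """Return the code point of the first character in a UTF-16 code unit sequence."""
    if not units:
        raise ValueError("empty input")
    first = units[0]
    if 0xD800 <= first <= 0xDBFF:
        # High surrogate: it must be followed by a low surrogate.
        if len(units) < 2 or not 0xDC00 <= units[1] <= 0xDFFF:
            raise ValueError("unpaired high surrogate")
        # Combine the pair into one supplementary-plane code point.
        return 0x10000 + ((first - 0xD800) << 10) + (units[1] - 0xDC00)
    if 0xDC00 <= first <= 0xDFFF:
        raise ValueError("unpaired low surrogate")
    return first  # Basic Multilingual Plane: the unit is the code point


if __name__ == "__main__":
    # U+1F600 is encoded in UTF-16 as the pair D83D DE00.
    print(hex(first_code_point_utf16([0xD83D, 0xDE00, 0x0041])))
```

With this reading, the pair `D83D DE00` yields `0x1F600` rather than `0xD83D`, which matches the "combined surrogates" option discussed above.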


aafemt · 2021-05-16T21:15:42Z

Did you test it with a 1 GB BLOB?

mrotteveel added the component: engine, priority: minor, and type: new feature labels May 10, 2021

asfernandes self-assigned this May 10, 2021

asfernandes added a commit that referenced this issue May 14, 2021

Feature #6798 - Add built-in functions UNICODE_CHAR and UNICODE_VAL to convert between Unicode code point and character.

3b37219

asfernandes added the fix-version: 5.0 Beta 1 label May 14, 2021

asfernandes closed this as completed May 14, 2021

asfernandes added a commit that referenced this issue May 14, 2021

Correction for #6798 docs as noted by Mark.

225b014

asfernandes added a commit that referenced this issue May 14, 2021

Check negative argument in UNICODE_CHAR - #6798.

4cd4649

pavel-zotov added the qa: done successfully label May 15, 2021
