Online DDL: avoid SQL's CONVERT(...), convert programmatically if needed #16597
…ify column's charset or collation Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…onvert for vplayer Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
go/vt/vttablet/onlineddl/vrepl.go
		sb.WriteString(fmt.Sprintf("CONCAT(%s)", escapeName(name)))
	case sourceCol.Type() == "json":
		sb.WriteString(fmt.Sprintf("convert(%s using utf8mb4)", escapeName(name)))
	case targetCol.Type() == "json" && sourceCol.Type() != "json":
This moves up from below so as to eliminate a case before we compare charsets for JSONs, which is not required and not beneficial.
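The ordering point can be sketched as follows — a simplified, hypothetical model of the case evaluation (plain strings stand in for the real column types; `expressionFor` is an invented name, not the actual function in vrepl.go). Because the cases are evaluated top to bottom, handling JSON sources first means the charset/collation comparison is never reached for them:

```go
package main

import "fmt"

// expressionFor is a simplified sketch: cases run top to bottom, so
// handling a JSON source first skips any charset comparison for it.
func expressionFor(sourceType, targetType, name string) string {
	switch {
	case sourceType == "json":
		// JSON source: convert to utf8mb4 regardless of collations.
		return fmt.Sprintf("convert(`%s` using utf8mb4)", name)
	case targetType == "json":
		// Non-JSON source going into a JSON target.
		return fmt.Sprintf("convert(`%s` using utf8mb4)", name)
	default:
		// Only non-JSON columns ever reach the charset/collation logic.
		return fmt.Sprintf("`%s`", name)
	}
}

func main() {
	fmt.Println(expressionFor("json", "json", "j1"))
	fmt.Println(expressionFor("varchar", "varchar", "name"))
}
```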
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files@@ Coverage Diff @@
## main #16597 +/- ##
==========================================
- Coverage 68.85% 68.84% -0.02%
==========================================
Files 1557 1557
Lines 199891 200003 +112
==========================================
+ Hits 137644 137697 +53
- Misses 62247 62306 +59
☔ View full report in Codecov by Sentry.
@@ -646,6 +654,24 @@ func appendFromRow(pq *sqlparser.ParsedQuery, buf *bytes2.Buffer, fields []*quer
			buf.WriteString(sqltypes.NullStr)
		} else {
			vv := sqltypes.MakeTrusted(typ, row.Values[col.offset:col.offset+col.length])
Does this also allocate and later on too? Is it worth avoiding creating this if we overwrite it later?
Done. No double allocation. Also, converged the two codepaths that do charset.Convert() into a single convertStringCharset() function.
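A converged conversion helper of the kind described could look roughly like this — a minimal sketch only, using the standard library in place of Vitess's charset package, with ISO-8859-1 standing in for "latin1" (every ISO-8859-1 byte maps to the same Unicode code point). The function name matches the one mentioned above, but the body and supported conversions are illustrative assumptions:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// convertStringCharset is a simplified stand-in for a single shared
// conversion helper. It only knows two conversions; the real code
// delegates to Vitess's charset machinery.
func convertStringCharset(val []byte, from, to string) ([]byte, error) {
	switch {
	case from == "latin1" && to == "utf8mb4":
		// Each ISO-8859-1 byte is the identically numbered code point.
		out := make([]byte, 0, len(val)*2)
		for _, b := range val {
			out = utf8.AppendRune(out, rune(b))
		}
		return out, nil
	case from == "utf8mb4" && to == "ascii":
		for _, r := range string(val) {
			if r > 127 {
				// Mimic MySQL's ERROR 1366 for unconvertible data.
				return nil, fmt.Errorf("Incorrect string value: %q", r)
			}
		}
		return val, nil
	}
	return nil, fmt.Errorf("unsupported conversion %s -> %s", from, to)
}

func main() {
	out, err := convertStringCharset([]byte{0xE9}, "latin1", "utf8mb4") // é
	fmt.Println(string(out), err)
	_, err = convertStringCharset([]byte("smiley 😀"), "utf8mb4", "ascii")
	fmt.Println(err != nil)
}
```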
	sqlbuffer.WriteString(", ")
}
- if err := appendFromRow(tp.BulkInsertValues, sqlbuffer, tp.Fields, row, tp.FieldsToSkip); err != nil {
+ if err := tp.appendFromRow(tp.BulkInsertValues, sqlbuffer, tp.Fields, row, tp.FieldsToSkip); err != nil {
If we make this change, which I'm OK with, then we don't need to pass in the other tp struct values:
tp.appendFromRow(sqlbuffer, row)
Good catch! Fixed.
mattlord left a comment:
💅 This is great! I think that this will solve so many edge cases we've seen in production. ❤️ Just a couple of minor points so far.
go/vt/vttablet/onlineddl/vrepl.go
if trivialCharset(fromCollation) && trivialCharset(toCollation) && targetCol.Type() != "json" {
	sb.WriteString(escapeName(name))
} else if fromCollation == toCollation && targetCol.Type() != "json" {
We don't want && targetCol.Type() != "json" here and just above, do we? We already handle the non-JSON to JSON case above. We'd fall into the else case below where we'd say there's a collation conversion necessary even though there isn't. No?
In any event, I don't think this is a major issue as the primary issue we've seen on the target/vplayer side is where we were unable to use the desired index because of the CONVERT usage and you can't add indexes directly on JSON columns anyway.
We already handle the non-JSON to JSON case above.
You're right! We changed the case ordering and now we don't need this check. Fixed: removed three unnecessary checks in total.
Yes, we're still left with a few CONVERT(...)s in the code: for JSONs and for ENUMs. For JSONs it's as you say - not something you can even put in a primary key or any unique key; for ENUMs it's more complex. I'll take it to another PR.
case sourceCol.Type() == "json":
	sb.WriteString(fmt.Sprintf("convert(%s using utf8mb4)", escapeName(name)))
@dbussink do you think this is still needed? I don't think so anymore, now that we have native JSON type support.
(against a v21 vtgate here)
❯ mysql commerce -e "create table json_test (id int not null primary key, j1 json); insert into json_test values (1, '{\"name\":\"Matt\"}')"
❯ mysql commerce -e "insert into json_test select id+10, j1 from json_test"
❯ mysql commerce -e "select * from json_test" --column-type-info
Field 1: `id`
Catalog: `def`
Database: `commerce`
Table: `json_test`
Org_table: `json_test`
Type: LONG
Collation: binary (63)
Length: 11
Max_length: 2
Decimals: 0
Flags: NOT_NULL PRI_KEY NO_DEFAULT_VALUE NUM PART_KEY
Field 2: `j1`
Catalog: `def`
Database: `commerce`
Table: `json_test`
Org_table: `json_test`
Type: JSON
Collation: binary (63)
Length: 4294967295
Max_length: 16
Decimals: 0
Flags: BLOB BINARY
+----+------------------+
| id | j1 |
+----+------------------+
| 1 | {"name": "Matt"} |
| 11 | {"name": "Matt"} |
+----+------------------+
I expect this to be bytes we pass on to MySQL "on the other side" and they are interpreted there as either a JSON field or serialized as a utf8mb4 string if some other type on the target.
Either way, I don't think it's a major deal on the source/vcopier side as the primary problems we've seen there are when these CONVERT calls then preclude us from using the desired index in the rowstreamer query and you can't add indexes directly on JSON columns anyway.
Let's leave it like so for now.
JSON is a bit special anyway, since we can't use the direct textual representation, but we turn it into a sql expression using JSON_OBJECT so we lose as little type information as possible.
if conversion, ok := tp.ConvertCharset[col.field.Name]; ok && col.length >= 0 {
	// Non-null string value, for which we have a charset conversion instruction
	fromCollation := tp.CollationEnv.DefaultCollationForCharset(conversion.FromCharset)
Do we have to rely on the default collation for the charset (on from and to side)? If we take utf8mb4 for example:
mysql> show collation where charset = 'utf8mb4';
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD |
| utf8mb4_bg_0900_ai_ci | utf8mb4 | 318 | | Yes | 0 | NO PAD |
| utf8mb4_bg_0900_as_cs | utf8mb4 | 319 | | Yes | 0 | NO PAD |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE |
...
| utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
| utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | PAD SPACE |
| utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 0 | NO PAD |
| utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | | Yes | 0 | NO PAD |
| utf8mb4_zh_0900_as_cs | utf8mb4 | 308 | | Yes | 0 | NO PAD |
+----------------------------+---------+-----+---------+----------+---------+---------------+
89 rows in set (0.00 sec)
If you're up for squeezing another change in here... I think we might want to make it ConvertCollation that we use in OnlineDDL — or if we leave the field name the same, just use the collation name when possible rather than the charset name. The collation is specific, and it implies the character set. Perhaps we truly only care about the character set in this scenario though... 🤔
Do we have to rely on the default collation for the charset (on from and to side)? If we take utf8mb4 for example:
It's a bit moot. We only use Collation as an intermediate step to get from the named charset (e.g. "latin1") into a Charset object. So we may as well use the default collation to get there.
Perhaps we truly only care about the character set in this scenario though... 🤔
This is worth digging into. If we do end up using collation rather than charset, then there's a few proto changes to make, so this will be outside the scope of this PR.
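The "default collation as a bridge" idea, together with the unsupported-charset validation, can be sketched like so — a hypothetical, self-contained model where `Unknown` mimics the `collations.Unknown` sentinel and the map stands in for the real collation environment (the IDs and names are illustrative, not authoritative):

```go
package main

import "fmt"

// CollationID mimics a numeric collation identifier; Unknown mimics
// the collations.Unknown sentinel used for unsupported charsets.
type CollationID int

const Unknown CollationID = 0

// defaultCollationForCharset is a stand-in for the collation
// environment's charset-name -> default-collation lookup.
var defaultCollationForCharset = map[string]CollationID{
	"latin1":  8,
	"utf8mb4": 255,
}

// lookup resolves a charset name via its default collation, erroring
// out when the charset is not supported — the collation is only an
// intermediate step toward the charset object.
func lookup(charsetName string) (CollationID, error) {
	id, ok := defaultCollationForCharset[charsetName]
	if !ok {
		return Unknown, fmt.Errorf("character set %s not supported", charsetName)
	}
	return id, nil
}

func main() {
	id, err := lookup("latin1")
	fmt.Println(id, err)
	_, err = lookup("klingon")
	fmt.Println(err != nil)
}
```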
// Non-null string value, for which we have a charset conversion instruction
fromCollation := tp.CollationEnv.DefaultCollationForCharset(conversion.FromCharset)
if fromCollation == collations.Unknown {
	return vterrors.Errorf(vtrpcpb.Code_INVALID_ARGUMENT, "Character set %s not supported for column %s", conversion.FromCharset, col.field.Name)
Nit, but errors aren't supposed to be capitalized (due to wrapping). That applies throughout the new code in the PR.
Fixed! One place where I did leave the message capitalized is in "Incorrect string value" - this string mimics the error message MySQL would have given for the equivalent SQL CONVERT(...) function, and I think we should keep this as it promotes consistency.
} else {
	vv := sqltypes.MakeTrusted(typ, row.Values[col.offset:col.offset+col.length])
	if conversion, ok := tp.ConvertCharset[col.field.Name]; ok && col.length >= 0 {
We don't want col.length > 0 here? If there are no chars/bytes then I wouldn't think we need to do anything in this regard.
Due to my bad English, I'm not sure if you mean we should use col.length >= 0 or if you mean we shouldn't use col.length >= 0.
Just in case you mean the former, we do have col.length >= 0 at the end of this line, in case you've missed it.
If you meant the latter, then col.length >= 0 in this context is an indicator that the value is not NULL, and we should test this or otherwise the conversion will break.
@dbussink pointed out that you meant to highlight > 0 rather than >= 0. Agreed, and fixed!
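The `>= 0` vs `> 0` point discussed here can be sketched in miniature — a hypothetical model in which, as in the discussion, a negative length encodes SQL NULL, so `length >= 0` means "not NULL" while `length > 0` additionally skips empty strings, which need no conversion (`colInfo` and `needsConversion` are invented names for illustration):

```go
package main

import "fmt"

// colInfo sketches the row-value layout under discussion: each column
// carries an offset and length into a shared values buffer, and a
// negative length encodes SQL NULL.
type colInfo struct {
	offset, length int
}

// needsConversion returns true only for non-NULL, non-empty values:
// NULLs must not be converted, and empty strings gain nothing from it.
func needsConversion(c colInfo) bool {
	return c.length > 0
}

func main() {
	fmt.Println(needsConversion(colInfo{0, -1})) // NULL
	fmt.Println(needsConversion(colInfo{0, 0}))  // empty string
	fmt.Println(needsConversion(colInfo{0, 5}))  // five-byte value
}
```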
I'm backporting this to all supported versions as I see this as an important bugfix.
Description
Fixes #16023
We have a clear picture and a fix to #16023. The original reason why we needed convert() in the first place is that vreplication and vstreamer both issue a SET NAMES binary. We will want to change that in the future, but this PR in the meantime conforms to the binary connection charset. So we used convert() to turn textual values into utf8mb4. On the other side, vplayer reads events from the binary log. It used programmatic conversion (charset.Convert()) of the data to utf8mb4 to align with vcopier.

What we are doing now:
- We no longer use convert(), solving the sorting issue described in Bug Report: OnlineDDL PK conversion results in table scans #16023 (comment)
- When vcopier reads data, we introduce programmatic conversion of non-UTF columns into their designated charsets.
- In vplayer, we do not convert at all if both source and target have the same charset.
- Otherwise, in vplayer, we apply programmatic conversion of non-UTF columns into their designated charsets, with similar logic as for vcopier.
- On a charset.Convert() error, we translate it into ERROR 1366 ("Incorrect string value ..."), which is a terminal error in vreplication, and so the migration bails out as soon as that happens. This can happen if e.g. we're converting a UTF column into ASCII and the UTF column contains a smiley emoji.

Because we do not convert the original charset to utf8mb4, we get to programmatically convert it to the specific target column. Previously (and this is perhaps the last piece of magic I have not dug into yet, and again likely to be caused by the binary charset) we did not need to convert into the target charset.

All the tests remain the same, and we introduce a couple of new ones.
Related Issue(s)
Backport
I wish to backport this to all supported versions, seeing that this is a bugfix: without this fix some migrations will slow down to a near halt.
Checklist
Deployment Notes