Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ORM] Emojis and utf8mb4 #8062

Closed
sminnee opened this issue May 9, 2018 · 13 comments · Fixed by silverstripe/recipe-core#28
Closed

[ORM] Emojis and utf8mb4 #8062

sminnee opened this issue May 9, 2018 · 13 comments · Fixed by silverstripe/recipe-core#28

Comments

@sminnee
Copy link
Member

sminnee commented May 9, 2018

MySQL by default allows a maximum of 3-bytes per UTF8 character, meaning that emoji don't work. If you try and insert a a string with an emoji, it will be truncated just before the first emoji. Presumably this would occur with any UTF character requiring 4 bytes in UTF8, but emojis are where I have come across this.

On a project basis you can enable this pretty easily:

SilverStripe\ORM\Connect\MySQLDatabase:
  connection_charset: utf8mb4
  connection_collation: utf8mb4_unicode_ci
  charset: utf8mb4
  collation: utf8mb4_unicode_ci

However, given the increasing prevalence of emojis in everyday application usage, I think that this is a confusing pitfall, and we should enable this as a default setting.

Caveats

There's a caveat - since the maximum number of bytes in a character is now 4 bytes, rather than 3, checks for the maximum length of a row / index / etc will sometimes blow out. For example, a string based index will have a maximum of 192 characters (4 x 192 = 768) rather than 256 (3 x 256 = 768). This will sometimes lead to errors when you run dev/build after making this change. These errors also seem to depend on MySQL version.

However, if this is in place from the beginning of a project, that's not just a big deal, but it is a pain if the error is triggered during a minor-release upgrade.

Recommended release

So I would recommend:

  • Put the above config in silverstripe-installer for 4 so that new projects have this applied
  • Alter the default in master.
@frankmullenger
Copy link
Contributor

This would become a high impact issue for us if it affects 4 byte Chinese characters for example, anecdotally that sounds to be the case.

@chillu
Copy link
Member

chillu commented Jun 25, 2018

We should also mention this both in the fluent README and our i18n docs. It's not fully related to i18n of course (emojis can be used in the "base language"), but it's the most common place people will look for this info.

@madmatt
Copy link
Member

madmatt commented Jun 25, 2018

Wholeheartedly agree with this, and there's a lot of evidence to support this [1, 2, 3].

I guess this is a breaking change so would need to be made in master for SS5, but I really like @sminnee's solution of putting it in silverstripe-installer now so that new projects get the benefit of being correct from the start.

edit: I think from MySQL 5.7 onwards it no longer complains about the 768b limit to string lengths as the default row and table format changed to dynamic/barracuda - but I haven't researched that extensively, and we'll need to support older versions of MySQL anyway.

edit2: [4] is interesting, showing that the original purpose of utf8 (aka utf8mb3) was for performance reasons, however as of newer MySQL versions it's now a slower collation.

[1] https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
[2] https://mathiasbynens.be/notes/mysql-utf8mb4
[3] https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/
[4] http://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/
[5] http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/ (different problem related to choosing incorrect collations, with no good solutions until MySQL 8 I believe)

@Cheddam
Copy link
Member

Cheddam commented Jun 26, 2018

These errors also seem to depend on MySQL version.

Some clarity around this:

  • MySQL 5.5 and lower cannot support indexes larger than 768 bytes
  • MySQL 5.6 supports larger indexes (3072 bytes) if the innodb_large_prefix setting is enabled (not by default)
  • MySQL 5.7 and newer have innodb_large_prefix enabled by default
  • MariaDB ~10.1 matches MySQL 5.6's behaviour, >10.2 matches 5.7's.

sminnee pushed a commit to sminnee/recipe-core that referenced this issue Oct 5, 2018
This will switch for new projects without breaking APIs in 
upgrades.

Fixes silverstripe/silverstripe-framework#8062
@sminnee sminnee changed the title Emojis and utf8mb4 [ORM] Emojis and utf8mb4 Oct 7, 2018
@sminnee sminnee removed the type/bug label Oct 24, 2018
@dnsl48
Copy link
Contributor

dnsl48 commented Aug 2, 2019

MySQLiConnector doesn't always properly switch connection_collation (for more details see #9160).

As such, we cannot recommend using utf8mb4_unicode_ci at this moment. We may have to fall back to utf8mb4_general_ci until we find a solution.

SilverStripe\ORM\Connect\MySQLDatabase:
  connection_charset: utf8mb4
  connection_collation: utf8mb4_general_ci
  charset: utf8mb4
  collation: utf8mb4_general_ci

@JorisDebonnet
Copy link
Contributor

JorisDebonnet commented Mar 1, 2020

I encountered this issue again last night, when I made a long edit on a blog post including an emoticon near the start... so when I saved it, all my changes were lost because it truncated before the emoticon.

If this is complicated to solve, can we add a graceful degradation in such a way that it removes any 4-byte characters before feeding it to the database, if it sees that a non-mb4 character set is used for that field? If could go something like this:

onsave:
if (has_4_byte_character($value)) {
  $charset = get_db_field_charset($field);
  if ($charset.max_bytes() < 4) {
    // see https://stackoverflow.com/questions/8491431/how-to-replace-remove-4-byte-characters-from-a-utf-8-string-in-php
    $value = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
  }
}

And then once this is in place, we can implement the better default for new projects, and start worrying about how to migrate existing applications.

@dnsl48
Copy link
Contributor

dnsl48 commented Mar 1, 2020

I believe a switch from utf8 to utf8mb4 should solve the problem.

AFAIK utf8mb4_general_ci supports emojis just fine. The only downside of it is it has some weird ordering rules in some situations (for non-latin languages). The initial problem mentioned here would only happen for utf8 encoding (which is not a real UTF).

@JorisDebonnet
Copy link
Contributor

@dnsl48, I would be perfectly happy with utf8mb4_general_ci, too. But the 'caveat' in the original post is real, so we can't just force a migration.

I guess we could

  1. Enable it for new projects
  2. Do a migration for existing projects if a DB and system check reveals there will be no issues (this will be a complicated one, so maybe we'll only do 3.)
  3. Document a manual migration for those that can't do 2, and mention it prominently in release notes.
  4. For those that can't do 2. and won't do (or forget to do) 3., we could implement the 'graceful degradation' I mentioned.

This way, it's backwards compatible with everything, and solves or improves the original issue in all situations.

@sminnee
Copy link
Member Author

sminnee commented Mar 1, 2020

I'd be hesitant to add code that risks slowing down every DataObject write. #1 and #3 makes sense. We could do something like put a warning message in dev/build for people who haven't done a manual migration.

@chillu
Copy link
Member

chillu commented Mar 3, 2020

@JorisDebonnet Do you want to help with 1 and 3 through pull requests to https://github.com/silverstripe/silverstripe-installer/tree/4/app/_config and docs?

@chillu chillu self-assigned this Jul 30, 2020
@chillu
Copy link
Member

chillu commented Jul 30, 2020

@JorisDebonnet Keen to help reduce the chance that other devs have to go through this, and send a pull request? :) We should also add this to https://github.com/silverstripe/cwp-installer

@theTigerDuck
Copy link

theTigerDuck commented Dec 17, 2020

MySQLiConnector doesn't always properly switch connection_collation (for more details see #9160).

As such, we cannot recommend using utf8mb4_unicode_ci at this moment. We may have to fall back to utf8mb4_general_ci until we find a solution.

SilverStripe\ORM\Connect\MySQLDatabase:
  connection_charset: utf8mb4
  connection_collation: utf8mb4_general_ci
  charset: utf8mb4
  collation: utf8mb4_general_ci

is it possible to change charset for just one Table / DataObject?
Much more important: How to do this in SS 3.x?

@dnsl48
Copy link
Contributor

dnsl48 commented Dec 17, 2020

MySQLiConnector doesn't always properly switch connection_collation (for more details see #9160).

That's not the case. It works fine. See #9160 for more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants