Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emoji skin modifiers don't work #942

Closed
Tyriar opened this issue Sep 4, 2017 · 10 comments
Closed

Emoji skin modifiers don't work #942

Tyriar opened this issue Sep 4, 2017 · 10 comments
Labels
area/parser type/enhancement Features or improvements to existing features

Comments

@Tyriar
Copy link
Member

Tyriar commented Sep 4, 2017

Pasting "👍🏻 👈🏿"

screen shot 2017-09-04 at 10 50 09 am

It looks like the skin modifier is being put into a new cell, when it should actually be added to the single emoji.

@Tyriar Tyriar added the type/bug Something is misbehaving label Sep 4, 2017
@Tyriar Tyriar mentioned this issue Sep 4, 2017
22 tasks
@Tyriar
Copy link
Member Author

Tyriar commented Sep 4, 2017

I got it working with the following code:

diff --git a/src/Parser.ts b/src/Parser.ts
index b047f91..0a4aeb9 100644
--- a/src/Parser.ts
+++ b/src/Parser.ts
@@ -187,7 +187,7 @@ export class Parser {
     let cs;
     let ch;
     let code;
-    let low;
+let skipChars = 0;
 
     const cursorStartX = this._terminal.buffer.x;
     const cursorStartY = this._terminal.buffer.y;
@@ -204,6 +204,10 @@ export class Parser {
     }
 
     for (; this._position < l; this._position++) {
+      if (skipChars > 0) {
+        skipChars--;
+        continue;
+      }
       ch = data[this._position];
 
       // FIXME: higher chars than 0xa0 are not allowed in escape sequences
@@ -212,7 +216,7 @@ export class Parser {
       if (0xD800 <= code && code <= 0xDBFF) {
         // we got a surrogate high
         // get surrogate low (next 2 bytes)
-        low = data.charCodeAt(this._position + 1);
+        let low = data.charCodeAt(this._position + 1);
         if (isNaN(low)) {
           // end of data stream, save surrogate high
           this._terminal.surrogate_high = ch;
@@ -220,10 +224,19 @@ export class Parser {
         }
         code = ((code - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
         ch += data.charAt(this._position + 1);
+        let i = 2;
+        skipChars = 1;
+        while (0xD800 <= data.charCodeAt(this._position + i) && data.charCodeAt(this._position + i) <= 0xDBFF) {
+          ch += data.charAt(this._position + i);
+          ch += data.charAt(this._position + i + 1);
+          i += 2;
+          skipChars++;
+        }
       }
       // surrogate low - already handled above
-      if (0xDC00 <= code && code <= 0xDFFF)
+      if (0xDC00 <= code && code <= 0xDFFF) {
         continue;
+      }
 
       switch (this._state) {
         case ParserState.NORMAL:

However, doing this screws up the prompt as 2 characters (according to the shell) become one, which means you can go left into the $PS1.

screen shot 2017-09-04 at 11 24 24 am

We should fix that before going forward with supporting modifiers and the zero-width joiner. Not sure how best to do this, perhaps characters need some more metadata indicating how many characters are joined into the cell, then left/right arrow repeats the action that number of times?

@jerch
Copy link
Member

jerch commented Sep 4, 2017

@Tyriar Imho this needs a more advanced character evaluation than xterm.js has atm - the current approach can only handle leading combining chars, while this one follows the char to be modified (http://unicode.org/reports/tr51/#Emoji_Modifiers_Table). The info about the correct rendering is in the sequence of chars (+ unicode spec), it is just a nightmare to cover all those edge cases (emoji modifiers were not part of the official spec until 2015).

@jerch
Copy link
Member

jerch commented Sep 5, 2017

Oh I was wrong with my last statement, the input does not support leading but only following combining characters atm (combining chars with zero width will end up in the last active cell). So the emoji problem should be fixable by adding the modifiers to the wcwidth combining char table COMBINING_HIGH.

@jerch
Copy link
Member

jerch commented Sep 5, 2017

Ok the problem gets quite nasty - the correct handling of an emoji + modifier in the browser depends on the system font renderer. There is no platform independent way to solve this, seems this feature is to new.

testing results:

  • Ubuntu 14 prints 2 block symbols with unicode number [][] (no visual glyphs)
  • Ubuntu 16 prints the glyphs separated (thumbup + colored field behind), it does not combine the characters
  • OSX 10.12 correctly does the trick

Thats the point where the wcwidth implementation has to reassemble the system wcwidth to get correct widths. Only way I can think of to solve this is to prerender the emoji+modifier offline and measure the text width to see if it actually combines. If it does not combine the Unicode spec suggests to print them both or only the first one. The latter seems reasonable in a terminal environment to preserve the row width.
Not sure, if emojis with modifiers are worth that trouble, what you think?

@Tyriar
Copy link
Member Author

Tyriar commented Sep 5, 2017

Well it would be good to support it, I'm not exactly to keen on jumping on it atm though.

The offscreen measuring thing looks like it would solve the problem. I'm not sure we would want to touch wcwidth though? Isn't that used to determine the width that the character consumes, not the number of characters in a single "cell"?

@jerch
Copy link
Member

jerch commented Sep 5, 2017

Well wcwidth is part of the deal: currently combining chars end up in the last cell. This is because wcwidth determined it as a char with zero width. With systems treating emoji modifiers differently it does not work that easy anymore:

  • OSX: modifier combines --> wcwidth should report width of zero
  • Ubuntu: modifier does not combine --> char has its own visual representation and needs its space, wcwidth should report whatever space it takes to draw it (alternatively drop that char at a higher level before rendering)

If you have a look at the Unicode spec above it gets even trickier with the ZERO WIDTH JOINER (http://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences).

Maybe its time to get the hands dirty with splitting up the stream into graphemes (see https://github.com/devongovett/grapheme-breaker) and feed those to wcwidth. wcwidth then needs a way to dynamically determine the width of unkown graphemes, which could be some offscreen thingy.

NOTE - any implementation in that direction will have a big negative performance impact, since every char has to be inspected at least once (like now with wcwidth) and multiple times in case combining chars are following.

@mandel59
Copy link

Relating #1059

@mandel59
Copy link

To align to the behavior of remote wcwidth implementations rather than Unicode specification is one of solutions: for instance, mintty renders a keycap emoji in 1 cell and a rainbow flag emoji in 4 cells. Although it is ugly, it solves the probrem about cursor location.

https://github.com/mintty/mintty/wiki/Tips#installing-emoji-resources

@jerch
Copy link
Member

jerch commented May 11, 2018

@mandel59 Well yes and no. Problem behind is that xterm.js can run on a total different system than the backend due to the browser/network stack in between, and both systems involved can differ about the wcwidth and the final render output width (which is even worse because font capabilities dependent). Currently xterm.js's wcwidth is not frontend aware, it simply applies the same rules for all systems. But even being frontend system aware does not solve the problem:

Imagine the following for char XY:

  • backend: wcwidth 1
  • frontend: wcwidth 2

Which one to apply now? If we take the 1 of the backend the rendered char is still drawn as wide char due to the way the frontend wants to render it - the cursor ends up on top of or "in the middle" of the glyph. If we take the 2 of the frontend we violate the grid layout of the terminal and will run one cell short at the end of the line, because the pty on the backend side just saw the 1 there.

Imho there is no real solution for this problem. In a perfect world wcwidth is the same for backend and frontend and we dont face this problem at all. For systems where the backend runs on the same system as the frontend (like a local terminal) it can be fixed this way, yes. For remote systems, it is a matter of priority - go with the backend for correct cell alignment with possible over/under draws or go with the frontend for correct draw but broken cell alignment?

Note that most of the unicode chars did not change the wcwidth spec for years and do not create these hassles. The emojies were changed recently from a width of 1 to 2, therefore the problem.

@jerch jerch mentioned this issue May 23, 2018
@Tyriar Tyriar added type/enhancement Features or improvements to existing features and removed type/bug Something is misbehaving labels Sep 17, 2018
@jerch
Copy link
Member

jerch commented Nov 15, 2019

Background on this: Those composed characters are part of the grapheme parsing rules in Unicode, same as country flags and others.

We will not see grapheme support anytime soon in xterm.js, as all cmdline programs get them wrong currently. Also note that they are specced to produce different outputs, technically the hand symbol + the box is correct. If the font renderer has a merged symbol for the codepoints it is free to use that prolly shorter symbol. That ambiguity is a nightmare for monospaced grids, but Unicode does not care much in this regard.

@Tyriar Should we close this issue as wont fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/parser type/enhancement Features or improvements to existing features
Projects
None yet
Development

No branches or pull requests

3 participants