Using libMultiMarkdown for syntax highlighting (on Mac) #95

Closed
DivineDominion opened this issue Oct 19, 2017 · 13 comments

@DivineDominion
Contributor

We're experimenting with using the library for live syntax highlighting in macOS/iOS apps. A basic setup is indeed functional, but there are some issues I'd like to point out and discuss.

First, @fletcher do you think the library is fit at all to make this feasible?

The CommonMark library's tokenizer didn't do well with only portions of a file (e.g. paragraphs), but MMD6 so far seems to cope with it okay.

So here are the 2 coding road bumps that seem most irritating.

String Encoding

MMD6 operates on UTF-8 encoded strings. Swift strings are backed by UTF-16, and NSString-based NSRanges don't map onto UTF-8 byte offsets once the text contains e.g. emoji. The count is off from that point on, and MMD ends up highlighting the wrong portion if you don't convert the string location and length.

But you can also obtain a String.UTF8View to calculate indices inside the string. Boils down to this, if anyone is interested:

    let string = self.storage.string
    let paragraphRange = string.paragraphRange(for: Range(self.editedRange, in: string)!)
    let utf8Range = self.utf8Range(from: paragraphRange, in: string)
    let tokenTree = mmd_engine_parse_substring(engine, utf8Range.location, utf8Range.length)

    func utf8Range(from range: Range<String.Index>, in string: String) -> NSRange {

        let utf8 = string.utf8
        let startLocation = range.lowerBound.samePosition(in: utf8)!
        let endLocation = range.upperBound.samePosition(in: utf8)!
        let utf8Location: Int = utf8.distance(from: utf8.startIndex, to: startLocation)
        let utf8Length: Int = utf8.distance(from: startLocation, to: endLocation)

        return NSRange(location: utf8Location, length: utf8Length)
    }

This fixes passing in proper indices, but you'll get back UTF-8 based tokens and have to reverse the calculation:

    // currentToken: UnsafeMutablePointer<token>
    let tokenRange = NSMakeRange(currentToken.pointee.start, currentToken.pointee.len)
    let string = self.storage.string
    let nsRange = nsStringRange(fromUTF8Range: tokenRange, in: string)

    func nsStringRange(fromUTF8Range range: NSRange, in string: String) -> NSRange {

        let utf8 = string.utf8
        let startLocation = utf8.index(utf8.startIndex, offsetBy: range.location)
        let endLocation = utf8.index(utf8.startIndex, offsetBy: NSMaxRange(range))
        let stringRange = startLocation.samePosition(in: string)! ..< endLocation.samePosition(in: string)!
        return NSRange(stringRange, in: string)
    }

This takes an awful lot of time. According to the Time Profiler instrument, the index offset calculations take 33% of the processing time when you type. The lag is noticeable.

Reusing the engine and manipulating the underlying DString could help, too, but the UTF-8/UTF-16 location dance is a waste at the moment. I'm not sure what we can do about that yet.
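
One way to take the repeated index walks out of the hot path, sketched in C under stated assumptions: build a table once per parse that maps every UTF-8 byte offset to its UTF-16 code unit offset, then convert each token's start and len with two array lookups. The names `build_utf16_offset_table` and `utf16_range_for_bytes` are hypothetical and not part of libMultiMarkdown, and the sketch assumes valid UTF-8 input.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libMultiMarkdown.
 * Returns a table with len + 1 entries, where table[b] is the number of
 * UTF-16 code units that precede byte offset b. Offsets that fall inside
 * a multi-byte sequence map to the end of that character.
 * Assumes valid UTF-8. The caller frees the result. */
size_t *build_utf16_offset_table(const char *utf8, size_t len) {
    size_t *table = malloc((len + 1) * sizeof *table);
    if (table == NULL) {
        return NULL;
    }

    size_t units = 0;
    for (size_t b = 0; b < len; b++) {
        unsigned char c = (unsigned char)utf8[b];
        table[b] = units;
        if ((c & 0xC0) == 0x80) {
            continue;                  /* continuation byte: no new unit */
        }
        units += (c >= 0xF0) ? 2 : 1;  /* 4-byte lead = surrogate pair */
    }
    table[len] = units;
    return table;
}

/* Converts a token's byte range into a UTF-16 location and length. */
void utf16_range_for_bytes(const size_t *table,
                           size_t byte_start, size_t byte_len,
                           size_t *out_location, size_t *out_length) {
    *out_location = table[byte_start];
    *out_length = table[byte_start + byte_len] - table[byte_start];
}
```

The table costs one `size_t` per byte of text, so it trades memory for a single linear pass per parse instead of one index walk per token.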

Types Inside the Framework

Working with the library, especially from Swift, is rather painful, e.g.:

    switch token.pointee.type {
    case UInt16(BLOCK_BLOCKQUOTE.rawValue):
        // highlight range
    }

All enum values are exposed as global constants, and the enums don't group these together. I wrote a Ruby code generator script that parses libMultiMarkdown.h for enum definitions and makes Objective-C NS_ENUMs from them:

    typedef NS_ENUM(NSUInteger, MMD6TokenTypes) {
        MMD6TokenTypesDocStartToken = DOC_START_TOKEN,
        MMD6TokenTypesBlockBlockquote = BLOCK_BLOCKQUOTE,
        // ...
    } NS_SWIFT_NAME(TokenType);
    // ... and the other enums, too ...

It's a wrapper, not a redefinition.

That way, you can create a TokenType instance and write:

    switch tokenType {
    case .docStartToken:
        // ...
    case .blockBlockquote:
        // ...
    case .blockCodeFenced:
        // ...
    }

... with all usual type-safe benefits. And you can write extensions for TokenType that take a range and do the highlighting itself, so the NSTextStorage code file only contains tokenType.applyHighlighting(storage: self, range: tokenRange) or similar instead of the large switch-case statement.

I can put together the Ruby scripts into a self-explanatory CLI script and open a PR if that fits the vision.

@fletcher
Owner

On 10/19/17 5:08 AM, Christian Tietze wrote:

We're experimenting with using the library for live syntax highlighting in macOS/iOS apps. A basic setup is indeed functional, but there are some issues I'd like to point out and discuss.

First, @fletcher do you think the library is fit at all to make this feasible?

I should say so.... I've been doing it for over five years.

The CommonMark library's tokenizer didn't do well with only portions of a file (e.g. paragraphs), but MMD6 so far seems to cope with it okay.

There are a few things you need to be careful about when doing this, particularly metadata, lists, and code blocks (esp. fenced ones), but yes this works.

So here are the 2 coding road bumps that seem most irritating.

String Encoding

MMD6 operates on UTF-8 encoded strings. Swift strings are backed by UTF-16, and NSString-based NSRanges don't map onto UTF-8 byte offsets once the text contains e.g. emoji. The count is off from that point on, and MMD ends up highlighting the wrong portion if you don't convert the string location and length.

Yes. You have to account for the different offsets in one way or another.

This takes an awful lot of time. According to the Time Profiler instrument, the index offset calculations take 33% of the processing time when you type. The lag is noticeable.

I'm not sure how those underlying calculations are done, but that doesn't seem very efficient.

I use a different text handling approach in Composer v4, and part of that is to keep a C string of the text that has UTF-8 continuation bytes stripped out so that byte offsets in the C string match character offsets in the Cocoa framework. This allows everything else to function as if UTF-8 == UTF-16 (for purposes of syntax highlighting).

Because I wrote the underlying text-editor engine in C (which runs "underneath" NSTextStorage), I get this for a virtually negligible cost. But the same could be done by modifying NSTextStorage which could be just as efficient.
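
A minimal C sketch of the shadow-string idea described above, under stated assumptions. The thread only says that continuation bytes are stripped; keeping one extra byte for 4-byte sequences, so that each one occupies two positions like a UTF-16 surrogate pair, is an assumption added here so that the shadow length equals the NSString length. The function name `make_utf16_aligned_shadow` is hypothetical and not part of libMultiMarkdown or Composer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. Returns a newly allocated, NUL-terminated copy of
 * `utf8` in which every byte position corresponds to one UTF-16 code unit:
 *   - 1-, 2- and 3-byte sequences keep only their first byte,
 *   - 4-byte sequences keep their first two bytes (assumption, see above).
 * Assumes valid UTF-8. The caller frees the result. */
char *make_utf16_aligned_shadow(const char *utf8, size_t len, size_t *out_len) {
    char *shadow = malloc(len + 1);  /* never longer than the input */
    if (shadow == NULL) {
        return NULL;
    }

    size_t n = 0;
    size_t keep_next = 0;  /* continuation bytes that still have to be kept */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)utf8[i];
        if ((c & 0xC0) == 0x80) {
            /* Continuation byte: dropped unless it stands in for the
             * second UTF-16 unit of a 4-byte sequence. */
            if (keep_next > 0) {
                shadow[n++] = (char)c;
                keep_next--;
            }
            continue;
        }
        shadow[n++] = (char)c;
        keep_next = (c >= 0xF0) ? 1 : 0;
    }
    shadow[n] = '\0';
    if (out_len != NULL) {
        *out_len = n;
    }
    return shadow;
}
```

Offsets measured on such a copy could be used as NSRange values without conversion. How a parser treats the leftover lead bytes is outside the scope of this sketch.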

Previously I did it differently and a bit more like your approach. But in my testing I found that Apple's methods to convert a character range to a byte range (using something like [NSString lengthOfBytesUsingEncoding:NSUTF8StringEncoding]) were much faster in one direction than the other (potentially by orders of magnitude??) IIRC, converting from a character position to a byte position was fast enough (but not great), but the other way around was too slow to be useful. It looks like you may be using both in the underlying calculations in your code.

My solution was to write an algorithm that used the fast method for both directions by guessing and narrowing in on the answer. That was much faster and worked well (e.g. it was still faster to take 3-4 guesses with the fast method than a single use of the slow method, ESPECIALLY in areas of text without multibyte characters, since the first guess would be correct in those instances).

But the new approach is easier and faster.

A third approach would be to modify MMD so that it worked on UTF-16. I imagine there would be a few "gotchas" while implementing that, but the MMD-6 test suite should help there.

Reusing the engine and manipulating the underlying DString could help, too, but the UTF-8/UTF-16 location dance is a waste at the moment. I'm not sure what we can do about that yet.

Types Inside the Framework

Working with the library, especially from Swift, is rather painful, e.g.:

    switch token.pointee.type {
    case UInt16(BLOCK_BLOCKQUOTE.rawValue):
        // highlight range
    }

My work predates Swift. I've experimented with Swift a bit, but haven't found it compelling enough to start using yet. The Swift<->C interface is one of those reasons. Most of my "heavy-lifting" code is done in C (for performance and for compatibility), and it's easy to plug this into Objective-C.

All that's to say that I don't have any Swift tricks. I went as far as getting MMD itself to work in Swift, but not in a highlighting capacity. Simply as a text->html engine.

All enum values are exposed as global constants, and the enums don't group these together. I wrote a Ruby code generator script that parses libMultiMarkdown.h for enum definitions and makes Objective-C NS_ENUMs from them:

    typedef NS_ENUM(NSUInteger, MMD6TokenTypes) {
        MMD6TokenTypesDocStartToken = DOC_START_TOKEN,
        MMD6TokenTypesBlockBlockquote = BLOCK_BLOCKQUOTE,
        // ...
    } NS_SWIFT_NAME(TokenType);
    // ... and the other enums, too ...

It's a wrapper, not a redefinition.

That way, you can create a TokenType instance and write:

    switch tokenType {
    case .docStartToken:
        // ...
    case .blockBlockquote:
        // ...
    case .blockCodeFenced:
        // ...
    }

... with all usual type-safe benefits. And you can write extensions for TokenType that take a range and do the highlighting itself, so the NSTextStorage code file only contains tokenType.applyHighlighting(storage: self, range: tokenRange) or similar instead of the large switch-case statement.

I can put together the Ruby scripts into a self-explanatory CLI script and open a PR if that fits the vision.

I think that could be useful for others, but would want to make sure it doesn't "clutter" things for everyone else. Maybe if everything ends up inside a top-level "swift" directory?

F-

@DivineDominion
Contributor Author

So using MMD6 for syntax highlighting isn't just a weird hack, cool :)

Keeping the source around as NSString/Swift.String and a C string sounds intriguing. The adapter is a bit of work, though, to do what you have mentioned. Will need to think about it some more.

Regarding the Ruby scripts, so you are okay with having them in the repository, too? Then I'll clean stuff up and create a pull request. I would like to do it in 2 stages:

  1. add the Objective-C enum wrappers (and include them in a separate .h file of the framework, since only ObjC/Swift coders will use the Xcode target)
  2. output Swift extensions into a separate folder for people to copy into their projects

Would you like me to handle things differently?

@fletcher
Owner

MultiMarkdown Composer has been built using MultiMarkdown since v1 (of Composer, not MultiMarkdown). So definitely not just a weird hack. ;)

When you have things packed up, or even mostly so, send me a copy and I can look at it. If it all makes sense, we can shove it in a swift directory. If it doesn’t make sense, maybe it would be best in a separate repo? But together would be best if possible…. We can each think about it.

@DivineDominion
Contributor Author

I polished the script and opened a PR for you to review: #101

@DivineDominion
Contributor Author

DivineDominion commented Jan 12, 2019

I still haven't given up on this :)

The following applies to regular typing and some deletions. I haven't figured out all cases of collapsing multiple blocks.

For paragraphs, headings, and other non-list blocks, I got reasonable results when editing. I run mmd_engine_parse_substring on the affected block, that is, on a level-1 child of the document root. The parsing result comes wrapped in a document token (type == 0), so I inject all of its children into the existing document tree. The result is patchwork. For cleanup, I adjust the length and shift all blocks that follow the replacement so the token positions match the text again.
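
The cleanup step can be sketched in C as follows. This is a rough illustration only: `sketch_token` is a simplified, hypothetical stand-in with just the fields the sketch needs, not the library's own token type, and `fix_up_after_replacement` is an invented name. It assumes the re-parsed block's new children are already in place and only the block length, the root length, and the following siblings still need adjusting.

```c
#include <stddef.h>

/* Simplified stand-in for a parse-tree token (hypothetical, for
 * illustration; the real library type has more fields). */
typedef struct sketch_token {
    size_t start;
    size_t len;
    struct sketch_token *next;
    struct sketch_token *child;
} sketch_token;

/* Shifts `first`, every sibling after it, and all of their descendants
 * by `delta` bytes. */
void shift_tokens(sketch_token *first, ptrdiff_t delta) {
    for (sketch_token *t = first; t != NULL; t = t->next) {
        t->start = (size_t)((ptrdiff_t)t->start + delta);
        shift_tokens(t->child, delta);
    }
}

/* After a top-level block was re-parsed and its length changed to
 * `new_len`: update the block, grow or shrink the root, and move all
 * blocks that follow it. */
void fix_up_after_replacement(sketch_token *root, sketch_token *replaced,
                              size_t new_len) {
    ptrdiff_t delta = (ptrdiff_t)new_len - (ptrdiff_t)replaced->len;
    replaced->len = new_len;
    root->len = (size_t)((ptrdiff_t)root->len + delta);
    shift_tokens(replaced->next, delta);
}
```

The recursion depth follows the nesting depth of the tree, which stays small for typical documents.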

But list items. Oh my.

The first match is the all-encompassing <ul> or <ol>. For large lists, like the MMD changelog, that's basically all the text :) I want to limit scope to the innermost list item and change that. I guess I can figure out how to patch the result in properly with a couple of edge case scenarios. But then I noticed recursive_parse_list_item in the MMD source, which also takes care of calling deindent_block and other useful things. I guess I would want to use that more directly.

So here I wonder: did you cheat for MultiMarkdown Composer and expose more than libMultiMarkdown's public interface, like recursive_parse_list_item or deindent_block, or does MMD Composer work with the current public interface of the library only? :)

@fletcher
Owner

I still haven't given up on this

Glad to hear it!

Yes -- there is an optimization problem you have to "solve".

At one extreme, you can reparse the entire document and be guaranteed
accurate results at the cost of having to rehighlight everything that
has not changed to catch the things that have. The logical extreme is
to rehighlight an entire novel because you inserted a single character
at the very end.

The other extreme is a complex algorithm that figures out the exact
minimum rehighlighting for every possible change one could make. The
logical extreme would be a pile of code that consists of a million edge
cases, each written by hand.

At first it seems easy, as you began in your description, where you
simply reparse the current "block." Then you discover lists, and in
particular long lists, and you want to try to optimize it a bit more.
And that seems doable at first. Just a few edge cases to consider.

And then, after testing, you accidentally start typing a list
(previously you were just modifying an existing list during your
testing), and you realize that is an entirely different monster. (I
actually added empty list items to MMD several years ago for this exact
reason.) You discover that when editing an item, or adding a new item,
that other parts of your list toggle back and forth between list items
and indented code blocks because of the tabs. Huh.... That's a pain....

So you finally figure something out that is acceptable to you, and you
have a celebratory adult beverage.

And then you remember fenced code blocks. And you let out a string of
curses that would make a sailor blush.... (as a thought experiment,
imagine a document consisting of a series of fenced code blocks
intermixed with regular text. And then add or delete an additional
fence at the very beginning of the document...)

In MMDC 3 (was in public beta, but never sold), I had a relatively
complex algorithm that contained a fair number of edge cases. It would
prune and graft the token tree for the entire document in order to make
sure it was up to date for the entire document. In order to catch
exceptions, with every keystroke MMDC would use the algorithm as well as
a complete reparse, and compare both token trees to ensure an accurate
result. I had to figure out exactly what constituted equivalency, bc
they weren't exact matches, but would match in terms of meaning. And
this was slow -- the beta was fine for shorter documents, but
essentially unusable for long documents. A small parse, a long parse,
and walking both complete token trees and comparing each token took time....
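
As a starting point for that kind of self-check, a strict structural comparison of two token trees might look like the C sketch below. The `cmp_token` struct is a simplified, hypothetical stand-in for the library's token type, and exact equality is stricter than the meaning-based equivalency described above, which the thread does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a parse-tree token (hypothetical). */
typedef struct cmp_token {
    unsigned short type;
    size_t start;
    size_t len;
    struct cmp_token *next;
    struct cmp_token *child;
} cmp_token;

/* Returns true when both sibling chains, including all descendants,
 * have the same shape and identical type/start/len values. */
bool token_trees_match(const cmp_token *a, const cmp_token *b) {
    while (a != NULL && b != NULL) {
        if (a->type != b->type || a->start != b->start || a->len != b->len) {
            return false;
        }
        if (!token_trees_match(a->child, b->child)) {
            return false;
        }
        a = a->next;
        b = b->next;
    }
    /* Both chains must end at the same time. */
    return a == NULL && b == NULL;
}
```

A looser notion of equivalency would relax the per-token comparison, for example by ignoring token types that do not affect highlighting.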

In order to catch all those edge cases, however, I absolutely had to use
the beta in that way. Different people write in different ways, and I
would never have caught the strange and wonderful things some people do.
I created a special debugging tool that would generate a version of a
"crash report" that would show the existing text and the change made
that caused the error. This allowed me to recreate the problem and add
another edge case. Those crash reports trickled to a stop as the edge
case coverage became more and more complete. (PS -- if you do it this
way, you absolutely need a test suite of some sort...)

But the code became hard to really understand. Which meant that if I
needed to change it, it was really hard to do.

Enter MMDC 4 (or more precisely, MMD 6). One of the goals of MMD 6 was
improved performance, and it was much faster than prior versions.
This improved performance is a good thing by itself of course, but
another benefit was that I didn't have to worry quite so much about
optimization. I was able to migrate away from edge cases, and back
towards accepting that I would just rehighlight entire blocks, even when
not strictly necessary. The new algorithm only contains a couple of
edge cases (mostly involving lists and fenced code blocks). It's much
easier to maintain. And overall performance is improved, since the
underlying MMD is faster. In fact, it even works just fine on an iPhone.

Over the last 2-3 weeks, I rewrote my SmartText engine mostly from
scratch (the underlying cross-platform C code engine that MMDC 4 uses to
control the text one is editing, handle changes, undo/redo, etc.
Basically anything that adds/removes/changes the text contained in the
editor pane.) This engine then integrates with a highlighting engine,
also in C, that handles the themes and when combined with libMMD allows
performing syntax highlighting.

(PS> I believe from earlier conversations you have jumped on the Swift
bandwagon, and I'm not sure how comfortable you are in C. But I am
increasingly convinced I made the right choice in consolidating some key
parts of the code base in C libraries. I have a toy version of the MMDC
editor pane (nothing else, just the editor itself) using GTK that runs
on Linux. It involved a short amount of glue code to modify the GTK
text buffer to use my SmartText engine and MMD highlighter. This allowed
me to test a few things in different ways (always good for finding
subtle bugs.) And I might eventually take the time to convert the other
elements (Preview, TOC, References, CriticMarkup and the info bar). This
would allow Windows and Linux versions of Composer which others have
asked for over the years, without requiring an unreasonable amount of
effort. This recent rewrite allowed me to migrate some of the "code
creep" from C to Objective-C that had occurred when getting the MMDC 4
GUI back on par with the MMD 2/3 GUI.)

In doing this I had to rewrite the algorithm that determines how much
text to rehighlight. And I really didn't change much (I found one or
two minor bugs and fixed them, but really not big changes. It has not
been exhaustively tested yet though.) I still think, for me, this is
the right approach -- rely on a faster MMD library in order to be a bit
more lenient in how much of the text is rehighlighted.

No one has complained about the new version being slow (except for one
or two true edge cases that led to me discovering bugs in my
implementation.) If I open the mmd-6 DevelopmentNotes.txt file and edit
the change log list, it is slower than normal editing. If I type really
quickly, there is a slight lag. But if I write while
thinking, it's not intrusive. And I never write lists like that in
"real life."

So I guess you'll have to decide what your use case is, and how much
optimization is premature, and how much (and what kind of) optimization
is warranted.

Oh -- but the short answer to your question is that I only used the
public interface of the MMD-6 library, but you do need extra code of
some sort to make decisions about what you're dealing with (e.g. lists,
paragraph blocks, etc.)

(PS> I believe I have told you this before, but to reiterate and for
others who may be reading along -- I highly recommend that any code
you use for syntax highlighting, including the text you pass into MMD
for parsing, be based on a 1:1 ratio between bytes and character counts
(assuming your macOS string functions are using NSString and character
based functions like replaceCharactersInRange, etc.) Converting between
byte offsets and character offsets can be costly under certain
circumstances, and using NSStrings it was slower in one direction than
the other (I don't recall which). Solving that issue (which I had to do
in MMDC 2) took some time. Now my highlighting code is based on
character based offsets from the start, so I don't have to worry about
that as much. Be sure to test using multibyte unicode characters.)

@DivineDominion
Contributor Author

DivineDominion commented Jan 14, 2019

As always, thanks for the time you take to answer in such detail! It's a waste this treasure is buried in GitHub issues :) I bet that over time, there'll be enough material for an action-packed documentary short film.

I love to write in Swift (similar in intensity but not taste to my love for Ruby, the programming language); but I can work in C just fine. It's not feasible to write the highlighter code in Swift, though, because automatic NSString and Swift String conversion still produces a sizeable overhead for every text operation. So I'm sticking with Objective-C for now. Once things really work, I'm considering moving the highlighter core logic to a C library for the exact same reason: eventual portability of the "Zettelkasten" note-taking app I'm working on.

And there you have my use case; it's about note-taking, so I'm fine with my crappy self-made Markdown highlighter at the moment; but some users tend to write long notes like overviews or lists of project tasks, and end up working almost exclusively in outlines. But my app isn't Emacs Org mode, so I don't think that optimizing for list-editing should be my major goal at the moment.

With the default Cocoa text components, I think I should still try to limit highlighting to level-1 list items whenever possible and expand to the block if the item's returned block type changes (i.e. if the user indents the item and it'd be reported as a code block in isolation, or if the user removes the bullet/number). Then performance shouldn't degrade that quickly. And this seems to be more manageable than writing my own text system from scratch right now. :)

Now my highlighting code is based on character based offsets from the start, so I don't have to worry about that as much. Be sure to test using multibyte unicode characters.

I remember that. Now "character based offsets" sounds like what Swift Strings offer, and what Swift does really well in my opinion. There's no easy way to convert NSRange to character offsets, though, because you need to start at 0 and then move forward through the string character by character. That's not a terribly quick operation.

When you make your highlighting code work with character offsets instead of bytes, it sounds like abc🐱efg will be converted to a byte sequence of 7 bytes, instead of 10 (6 ASCII chars + 4 for the emoji in UTF-8). Do you replace all non-ASCII characters with placeholders and send them off to the DString so you get a character-offset representation?

@fletcher
Owner

Happy to discuss issues related to MMD. :) It's worth it to share some of my experiences with others, as well as to learn something myself.

I don't remember the benefits of NSString vs Swift Strings, but trust you when you say that NSString is better (at least for now). And to be clear, the "top layer" of my code uses NSString to apply the highlighting via NSTextStorage's NSAttributedString. Not everything is in C.

But my NSTextStorage subclass is basically a wrapper around a C library that handles accepting changes from the user and processing them, as well as providing the information needed to apply the highlighting.

As for staying within the domain of character based offsets, my NSRange values all use characters as their unit of measurement. So they don't need to be translated when used to highlight the NSString.

And yes, I have a copy of the underlying text with all "continuation bytes" stripped out, so that I have a 1:1 relationship between bytes and characters when processing with MMD for highlighting (not exporting or previewing). It's actually a bit more complicated than that, since my C library provides many more features than just highlighting, and therefore it needs access to all bytes as well. And I have some relatively fast C routines for converting between characters and bytes quickly when necessary.

@fletcher
Owner

I suspect you may be as interested in learning how to figure this out as you are in having a working implementation. But just in case -- as I mentioned, I have recently rewritten the latest version of my text engine/highlighter library to better isolate as much functionality as possible. This rewrite includes NSTextStorage/NSTextView subclasses to function as "glue code". You could then subclass them further to add your own functionality.

I haven't worked through all the details, but if you are interested in licensing the library we could discuss it.

Otherwise, happy to keep answering some questions and offering advice.

@DivineDominion
Contributor Author

Cool, I emailed you regarding licensing options!

As for staying within the domain of character based offsets, my NSRange values all use characters as their unit of measurement

That's interesting. editedRange and other range-based NSTextStorage parameters will be counting differently, so how come you get NSRange values that correspond to visible characters?

Translating back and forth with your C core library would be an option, but you mention that separately, so it seems like these are two different achievements:

And I have some relatively fast C routines for converting between characters and bytes quickly when necessary

One-way conversion, stripping out multi-byte characters and leaving in single-byte characters in their stead, is pretty simple. But how do you do the other way?

I am considering putting all the deltas in a list. That'd be how many -[NSRange length] units are lost by replacing e.g. Emoji; for any given NSRange.location from the Cocoa text system, I would then be able to compute totalDelta = sum of all deltas up to NSRange.location. This seems overly complicated, but apart from recomputing the delta from the original string every time, one needs to have some extra source of information, because the stripped-down, 1-character-as-1-byte string doesn't have the information anymore. It's a lossy conversion.
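
The delta bookkeeping described above could be sketched in C like this. The names `offset_delta`, `total_delta_before` and `collapsed_offset` are made up for illustration, and the sketch assumes the deltas are kept sorted by location. For abc🐱efg, the emoji starts at NSString location 3, takes 2 UTF-16 units and collapses to 1 position, so its delta is 1, and the "e" at location 5 ends up at collapsed offset 4.

```c
#include <stddef.h>

/* One entry per character that loses UTF-16 units when collapsed to a
 * single position (hypothetical type, for illustration only). */
typedef struct {
    size_t location;  /* UTF-16 offset where the character starts */
    size_t delta;     /* UTF-16 units lost by collapsing it */
} offset_delta;

/* Sums the deltas of all entries that start before `utf16_location`.
 * `deltas` must be sorted by location in ascending order. */
size_t total_delta_before(const offset_delta *deltas, size_t count,
                          size_t utf16_location) {
    size_t total = 0;
    for (size_t i = 0; i < count && deltas[i].location < utf16_location; i++) {
        total += deltas[i].delta;
    }
    return total;
}

/* Maps an NSRange.location to the offset in the collapsed string. */
size_t collapsed_offset(const offset_delta *deltas, size_t count,
                        size_t utf16_location) {
    return utf16_location - total_delta_before(deltas, count, utf16_location);
}
```

The linear scan keeps the sketch short; with many entries, a running prefix sum plus binary search would avoid rescanning the list on every lookup.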

That's when the statement I quoted first began to puzzle me. Maybe you don't need to convert back as often because your ranges do count in characters. In any case, this seems to be non-standard behavior. 🤔

@fletcher
Owner

Perhaps we are talking about different things, since we probably have different implementations? (And to be clear to anyone else reading, all of my code discussion refers to Objective-C, not Swift.)

NSRange is simply a data type, the numbers it references can refer to anything, including bytes or characters.

A key function is replaceCharactersInRange:withString, for example. The NSRange there refers to character offsets. My library converts that into a range based on bytes, and then modifies a backing C char * (array of bytes) based on the existing NSTextStorage string, and the new replacement string. This all happens internally using byte measurements.

To perform the MMD highlighting, the C library uses a modified version of the C string such that all measurements are in characters. These can then be applied directly in methods like addAttribute:value:range: that expect character measurements. So, during the highlighting phase, there is no need to convert between character and byte offsets, since everything was measured in characters during the initial MMD parsing, resulting in a token tree based on character offsets.

My old system (MMD Composer version 1 and 2) used NSString to convert character ranges to byte ranges, and backwards. If I remember correctly, NSString can return the byte length of a given string quickly. But there was not a fast direct method to get the character measurement of a given byte. In fact, what I ended up doing was an approximation algorithm that guessed a character offset, converted that back to bytes, and checked the result. It then made adjustments until it had the right answer. This usually happened within a few iterations (I think I capped it at 10???). This was much faster than a more direct approach going the other way. But I believe my current solution is much better, and much easier to use for the developer, leading to fewer errors along the way. And since everything is measured in characters, performance is faster by stripping out the interconversion.
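
The original code is not shown in the thread, so the following C sketch only illustrates the general guess-and-narrow idea under stated assumptions. `utf8_bytes_for_utf16_prefix` is a plain C stand-in for the fast direction (character count to byte count, for which the thread mentions lengthOfBytesUsingEncoding:), both function names are hypothetical, the input must be valid UTF-8, and `byte_offset` must fall on a character boundary. Instead of a fixed cap on the number of guesses, the loop runs until the answer is exact.

```c
#include <stddef.h>

/* Stand-in for the "fast direction": number of UTF-8 bytes occupied by
 * the first `units` UTF-16 code units. A prefix that ends in the middle
 * of a surrogate pair counts the whole 4-byte character. */
static size_t utf8_bytes_for_utf16_prefix(const char *utf8, size_t units) {
    size_t bytes = 0, consumed = 0;
    while (consumed < units && utf8[bytes] != '\0') {
        unsigned char c = (unsigned char)utf8[bytes];
        size_t char_len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        bytes += char_len;
        consumed += (char_len == 4) ? 2 : 1;
    }
    return bytes;
}

/* Finds the UTF-16 offset for a byte offset using only the fast
 * direction. `utf16_length` is the string's total UTF-16 length.
 * Optionally reports how many guesses were needed. */
size_t utf16_offset_for_byte_offset(const char *utf8, size_t utf16_length,
                                    size_t byte_offset, int *guesses_used) {
    /* First guess: one unit per byte. Exact for ASCII, never too small. */
    size_t guess = byte_offset < utf16_length ? byte_offset : utf16_length;
    int guesses = 1;
    size_t bytes = utf8_bytes_for_utf16_prefix(utf8, guess);

    while (bytes > byte_offset) {
        /* One UTF-16 unit covers at most 3 UTF-8 bytes, so stepping back
         * by overshoot / 3 units cannot go past the answer. */
        size_t step = (bytes - byte_offset) / 3;
        if (step == 0) {
            step = 1;
        }
        guess -= step;
        bytes = utf8_bytes_for_utf16_prefix(utf8, guess);
        guesses++;
    }
    if (guesses_used != NULL) {
        *guesses_used = guesses;
    }
    return guess;
}
```

In pure ASCII text the first guess is already correct, which matches the observation above that areas without multibyte characters are the cheap case.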

And yes, the delta approach does get complicated and error-prone, and I don't use it when converting between character and byte counting. I use it when making a series of changes to the text all at once (for example, selecting an entire list and applying Bold, which results in applying bold separately to each item in the list.)

Hopefully this helps a bit?

@DivineDominion
Contributor Author

Sure, NSRange can represent anything, but the Cocoa framework doesn't use it to represent human-recognizable characters as far as I can tell. Here's me spying on the replaceCharactersInRange:withString: method when appending strings:

replaceCharactersInRange:(0, 0) withString:"12345" (length=5)
replaceCharactersInRange:(5, 0) withString:"6👨‍👩‍👦‍👦8" (length=13)
replaceCharactersInRange:(18, 0) withString:"9" (length=1)

You see, the family emoji, which is a combination of 4 single emoji, has a NSString.length of 11. The third method call takes this into account by starting at NSRange.location 18, not 8. To clarify: when you talk about "character offsets", would you expect "6👨‍👩‍👦‍👦8" to be 3? That's how I understood your conversion strategy to C string character-as-bytes.

These can then be applied directly in methods like addAttribute:value:range: that expect character measurements.

"Directly" sounds like the C string values can be used, well, directly :) without converting them back to what the text system reports and how NSString is counted.

The problem with my understanding of your tips is: when I pass "6👨‍👩‍👦‍👦8" to the highlighter as a converted C string with length 3, the token tree measurements will expand by the value 3 in this place, not 13. So -[NSTextStorage addAttribute:value:range:] will be off by 10 for anything that comes after the emoji.

Edit: I checked MultiMarkdown Composer's output. The stats report a length of 11 characters for "👨‍👩‍👦‍👦", like I'd expect from NSTextView.selectedRange. So you didn't seem to trick the system into thinking the "length" is 1 for that. (The family emoji is a combination of 4 person emoji + 3 Zero-width joiners (ZWJ, U+200D); the person emoji are reported as 2 characters each by MMD Composer while the ZWJ is 1 character wide. That's all consistent with the Cocoa defaults.) Doesn't say anything about the C string you're using, of course.

@fletcher
Owner

Like you said, 👨‍👩‍👦‍👦 is 4 separate emoji which appear in the
same location. It's not counted as one character, since it's not. ;)

(BTW -- I'm going from memory since I wrote most of the code related to
this 2-6+ years ago.... So forgive me if I make slight errors or misspeak.)

So my initial feeling would be that 👨‍👩‍👦‍👦 should count as 4
characters (1 for each person), but then I think about the "glue
characters" (ZWJ) that tell the system to overlap the people. So maybe
it should count as 7 characters.

But then, since Apple uses UTF-16 based encoding inside NSTextStorage,
there are some "things" that count as 2 characters instead of just one.
Which, as you note, is why the people inside the family count as 2
characters each (4 * 2 = 8) plus the 3 ZWJ = 11 total.

One could argue as to whether the count should be 1, 4, 7, or 11. But
since NSTextStorage counts in terms of UTF-16, the useful answer is 11.

My thoughts:

  1. Emoji are kind of silly. I'm glad people enjoy using them, but I
    don't particularly care whether the "user character count" in Composer
    is accurate in regards to them. Especially since it's debatable what
    the "correct" character count should be for things like the family emoji.

  2. I do care very much that using emoji doesn't break Composer, or
    cause errors in the syntax highlighting. So I do care that the
    character count is accurate in a UTF-16 sense, such that counting works
    properly on NSTextStorage methods.

  3. Regardless of which counting system one prefers, we can all agree
    that the family emoji is not 25 characters long (the number of bytes in
    UTF-8 encoding.)
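
The candidate counts above can be checked with a few lines of C. This is a small sketch assuming valid UTF-8 input; `text_counts` and `count_text` are illustrative names. The family emoji consists of 4 person emoji with 4 UTF-8 bytes each plus 3 ZWJ with 3 bytes each, so 4 × 4 + 3 × 3 = 25 bytes, 7 code points, and 4 × 2 + 3 = 11 UTF-16 units.

```c
#include <stddef.h>

typedef struct {
    size_t bytes;        /* UTF-8 bytes */
    size_t code_points;  /* Unicode code points */
    size_t utf16_units;  /* UTF-16 code units, i.e. NSString length */
} text_counts;

/* Counts a NUL-terminated, valid UTF-8 string in all three units. */
text_counts count_text(const char *utf8) {
    text_counts counts = {0, 0, 0};
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; p++) {
        counts.bytes++;
        if ((*p & 0xC0) == 0x80) {
            continue;  /* continuation byte: belongs to the previous lead byte */
        }
        counts.code_points++;
        /* A 4-byte sequence lies outside the BMP and needs a surrogate pair. */
        counts.utf16_units += (*p >= 0xF0) ? 2 : 1;
    }
    return counts;
}
```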

So, to your points/questions:

  • In code I interconvert between bytes and character counts, where
    character counts refer to UTF-16 in order to be compatible with
    NSTextStorage. (Using the same routine for roughly accurate user-facing
    character counts is a nice by-product, but not the initial purpose.)

  • "Directly" is correct -- the C string character counts match exactly
    (I haven't found a bug yet) with what NSTextStorage expects. That is by
    design.

  • The key here (whatever terminology we use) is that there are two units
    of measurement that are important when dealing with strings on macOS/iOS
    -- bytes and UTF-16 character offsets/counts. MMD itself works on
    bytes. It expects UTF-8 strings, but is somewhat agnostic to the
    underlying text. The best way to do accurate syntax highlighting of
    Markdown/MultiMarkdown text is a parser of some sort (regular
    expressions will only get you so far.) Since NSTextStorage expects to
    be told what to do using UTF-16 character measures, you have two choices:

    1. Process the text as UTF-8, and then convert each part of the MMD
      parse tree to refer to UTF-16 instead of bytes. This is possible, of
      course, but prone to error in programming since you have to accurately
      keep track of deltas throughout the process. More importantly, however,
      it is slow since you end up repeating a lot of the same work every time
      you highlight something. This is the approach I used years ago when
      first starting out.

    2. Modify the system so that MMD reports a parse tree that is already
      measured in UTF-16 character offsets. This means that the results can
      be directly applied to the text without having to convert between
      counting systems. Which makes the code less complex, less prone to
      error, and faster. This is the approach I use now. It is much less
      prone to strange bugs that used to appear when dealing with non-English
      writers who have more frequent non-ASCII characters, especially CJK writers.

I think we are on the same page, but hopefully this helps explain my
approach better.

Fletcher
