Using libMultiMarkdown for syntax highlighting (on Mac) #95
On 10/19/17 5:08 AM, Christian Tietze wrote:
I should say so.... I've been doing it for over five years.
There are a few things you need to be careful about when doing this, particularly metadata, lists, and code blocks (esp. fenced ones), but yes this works.
Yes. You have to account for the different offsets in one way or another.
I'm not sure how those underlying calculations are done, but that doesn't seem very efficient. I use a different text handling approach in Composer v4, and part of that is to keep a C string of the text that has UTF-8 continuation bytes stripped out so that byte offsets in the C string match character offsets in the Cocoa framework. This allows everything else to function as if UTF-8 == UTF-16 (for purposes of syntax highlighting). Because I wrote the underlying text-editor engine in C (which runs "underneath" NSTextStorage), I get this for a virtually negligible cost. But the same could be done by modifying NSTextStorage, which could be just as efficient.

Previously I did it differently and a bit more like your approach. But in my testing I found that Apple's methods to convert a character range to a byte range (using something like [NSString lengthOfBytesUsingEncoding:NSUTF8StringEncoding]) were much faster in one direction than the other (potentially by orders of magnitude??). IIRC, converting from a character position to a byte position was fast enough (but not great), but the other way around was too slow to be useful. It looks like you may be using both in the underlying calculations in your code.

My solution was to write an algorithm that used the fast method for both directions by guessing and narrowing in on the answer. That was much faster and worked well (e.g. it was still faster to take 3-4 guesses with the fast method than a single use of the slow method, ESPECIALLY in areas of text without multibyte characters, since the first guess would be correct in those instances). But the new approach is easier and faster.

A third approach would be to modify MMD so that it worked on UTF-16. I imagine there would be a few "gotchas" while implementing that, but the MMD-6 test suite should help there.
My work predates Swift. I've experimented with Swift a bit, but haven't found it compelling enough to start using yet. The Swift<->C interface is one of those reasons. Most of my "heavy-lifting" code is done in C (for performance and for compatibility), and it's easy to plug this into Objective-C. All that's to say that I don't have any Swift tricks. I went as far as getting MMD itself to work in Swift, but not in a highlighting capacity. Simply as a text->html engine.
I think that could be useful for others, but would want to make sure it doesn't "clutter" things for everyone else. Maybe if everything ends up inside a top-level "swift" directory?

F-
So using MMD6 for syntax highlighting isn't just a weird hack, cool :) Keeping the source around as NSString/Swift.String and a C string sounds intriguing. The adapter is a bit of work, though, to do what you have mentioned. Will need to think about it some more. Regarding the Ruby scripts, so you are okay with having them in the repository, too? Then I'll clean stuff up and create a pull request. I would like to do it in 2 stages:
Would you like me to handle things differently?
MultiMarkdown Composer has been built using MultiMarkdown since v1 (of Composer, not MultiMarkdown). So definitely not just a weird hack. ;) When you have things packed up, or even mostly so, send me a copy and I can look at it. If it all makes sense, we can shove it in a swift directory. If it doesn’t make sense, maybe it would be best in a separate repo? But together would be best if possible… We can each think about it.
I polished the script and opened a PR for you to review: #101
I still haven't given up on this :) The following applies to regular typing and some deletions. I haven't figured out all cases of collapsing multiple blocks. For paragraphs and headings and other non-list blocks, I got reasonable results when editing. I perform a But list items. Oh my. The first match is the all-encompassing So here I wonder: did you cheat for MultiMarkdown Composer and expose more than libMultiMarkdown's public interface, like |
Glad to hear it! Yes -- there is an optimization problem you have to "solve". At one extreme, you can reparse the entire document and be guaranteed The other extreme is a complex algorithm that figures out the exact At first it seems easy, as you began in your description, where you And then, after testing, you accidentally start typing a list So you finally figure something out that is acceptable to you, and you And then you remember fenced code blocks. And you let out a string of In MMDC 3 (was in public beta, but never sold), I had a relatively In order to catch all those edge cases, however, I absolutely had to use But the code became hard to really understand. Which meant that if I Enter MMDC 4 (or more precisely, MMD 6). One of the goals of MMD 6 was Over the last 2-3 weeks, I rewrote my SmartText engine mostly from (PS> I believe from earlier conversations you have jumped on the Swift In doing this I had to rewrite the algorithm that determines how much No one has complained about the new version being slow (except for one So I guess you'll have to decide what your use case is, and how much Oh -- but the short answer to your question is that I only used the (PS> I believe I have told you this before, but to reiterate and for |
As always, thanks for the time you take to answer in such detail! It's a waste this treasure is buried in GitHub issues :) I bet that over time, there'll be enough material for an action-packed documentary short film. I love to write in Swift (similar in intensity but not taste to my love for Ruby, the programming language); but I can work in C just fine. It's not feasible to write the highlighter code in Swift, though, because automatic NSString and Swift String conversion still produces a sizeable overhead for every text operation. So I'm sticking with Objective-C for now. Once things really work, I'm considering moving the highlighter core logic to a C library for the exact same reason: eventual portability of the "Zettelkasten" note-taking app I'm working on. And there you have my use case; it's about note-taking, so I'm fine with my crappy self-made Markdown highlighter at the moment; but some users tend to write long notes like overviews or lists of project tasks, and end up working almost exclusively in outlines. But my app isn't Emacs Org mode, so I don't think that optimizing for list-editing should be my major goal at the moment. With the default Cocoa text components, I think I should still try to limit highlighting to level-1 list items whenever possible and expand to the block if the item's returned block type changes (i.e. if the user indents the item and it'd be reported as a code block in isolation, or if the user removes the bullet/number). Then performance shouldn't degrade that quickly. And this seems to be more manageable than writing my own text system from scratch right now. :)
I remember that. Now "character based offsets" sounds like what Swift Strings offer, and what Swift does really well in my opinion. There's no easy way to convert When you make your highlighting code work with character offsets instead of bytes, it sounds like
Happy to discuss issues related to MMD. :) It's worth it to share some of my experiences with others, as well as to learn something myself. I don't remember the benefits of NSString vs Swift Strings, but trust you when you say that NSString is better (at least for now). And to be clear, the "top layer" of my code uses NSString to apply the highlighting via NSTextStorage's NSAttributedString. Not everything is in C. But my NSTextStorage subclass is basically a wrapper around a C library that handles accepting changes from the user and processing them, as well as providing the information needed to apply the highlighting.

As for staying within the domain of character based offsets, my NSRange values all use characters as their unit of measurement. So they don't need to be translated when used to highlight the NSString. And yes, I have a copy of the underlying text with all "continuation bytes" stripped out, so that I have a 1:1 relationship between bytes and characters when processing with MMD for highlighting (not exporting or previewing). It's actually a bit more complicated than that, since my C library provides many more features than just highlighting, and therefore it needs access to all bytes as well. And I have some relatively fast C routines for converting between characters and bytes when necessary.
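To make the "continuation bytes stripped out" idea concrete, here is a minimal C sketch. The function name strip_continuation_bytes is hypothetical; it is not part of libMultiMarkdown or Composer, and the real engine is certainly more involved. It only shows that dropping every byte of the form 10xxxxxx leaves one byte per Unicode code point, so byte offsets in the copy equal code point offsets in the original. One assumption to be aware of: code points above U+FFFF take two UTF-16 units in NSString, so matching Cocoa offsets exactly for those would need one extra placeholder byte, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not libMultiMarkdown API): copy src into dst while
 * dropping UTF-8 continuation bytes (bit pattern 10xxxxxx). Every code point
 * is then represented by its lead byte only, so a byte offset in dst equals
 * a code point offset in src. dst needs room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t strip_continuation_bytes(const char *src, char *dst) {
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (const unsigned char *p = (const unsigned char *)src; *p; ++p) {
        if ((*p & 0xC0) != 0x80) {   /* keep ASCII and lead bytes */
            dst[out++] = (char)*p;
        }
    }
    dst[out] = '\0';
    return out;
}
```

For "café *bold*" (12 bytes in UTF-8) the copy is 11 bytes long and the first asterisk sits at offset 5, the same index it has when counting characters. The copy is no longer valid UTF-8, which is acceptable only because the highlighter cares about where the ASCII markup is, not about decoding the text.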
I suspect you may be as interested in learning how to figure this out as you are in having a working implementation. But just in case -- as I mentioned, I have recently rewritten the latest version of my text engine/highlighter library to better isolate as much functionality as possible. This rewrite includes NSTextStorage/NSTextView subclasses to function as "glue code". You could then subclass them further to add your own functionality. I haven't worked through all the details, but if you are interested in licensing the library we could discuss it. Otherwise, happy to keep answering some questions and offering advice.
Cool, I emailed you regarding licensing options!
That's interesting. Translating back and forth with your C core library would be an option, but you mention that separately, so it seems like these are two different achievements:
One-way conversion, stripping out multi-byte characters and leaving in single-byte characters in their stead, is pretty simple. But how do you do the other way? I am considering putting all the deltas in a list. That'd be how many That's when the statement I quoted first began to puzzle me. Maybe you don't need to convert back as often because your ranges do count in characters. In any case, this seems to be non-standard behavior. 🤔 |
Perhaps we are talking about different things, since we probably have different implementations? (And to be clear to anyone else reading, all of my code discussion refers to Objective-C, not Swift.) NSRange is simply a data type; the numbers it references can refer to anything, including bytes or characters. A key function is

To perform the MMD highlighting, the C library uses a modified version of the C string such that all measurements are in characters. These can then be applied directly in methods like

My old system (MMD Composer version 1 and 2) used NSString to convert character ranges to byte ranges, and backwards. If I remember correctly, NSString can return the byte length of a given string quickly. But there was not a fast direct method to get the character measurement of a given byte. In fact, what I ended up doing was an approximation algorithm that guessed a character offset, converted that back to bytes, and checked the result. It then made adjustments until it had the right answer. This usually happened within a few iterations (I think I capped it at 10???). This was much faster than a more direct approach going the other way.

But I believe my current solution is much better, and much easier to use for the developer, leading to fewer errors along the way. And since everything is measured in characters, performance is faster by stripping out the interconversion.

And yes, the delta approach does get complicated and error-prone, and I don't use it when converting between character and byte counting. I use it when making a series of changes to the text all at once (for example, selecting an entire list and applying Bold, which results in applying bold separately to each item in the list.)

Hopefully this helps a bit?
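The guess-and-check idea from the old system can be sketched in plain C. This is an illustration under assumptions, not the Composer code: byte_offset_for_char is a hypothetical stand-in for the fast NSString direction (character offset to byte offset), "character" here means Unicode code point rather than UTF-16 unit, and the correction step is one I chose because it provably never steps back too far. Since the stand-in walks the string itself, the sketch shows the logic only, not the speed-up.

```c
#include <assert.h>
#include <stddef.h>

/* "Fast" direction (hypothetical stand-in): byte offset at which code point
 * number n_chars starts. Stops at the end of the NUL-terminated string. */
static size_t byte_offset_for_char(const char *s, size_t n_chars) {
    const unsigned char *p = (const unsigned char *)s;
    size_t bytes = 0;
    while (n_chars > 0 && p[bytes] != 0) {
        ++bytes;                                     /* lead byte */
        while ((p[bytes] & 0xC0) == 0x80) ++bytes;   /* continuation bytes */
        --n_chars;
    }
    return bytes;
}

/* "Slow" direction, built only from the fast one: index of the code point
 * that contains byte_off. total_chars is the number of code points in s;
 * byte_off must not exceed the byte length of s. */
static size_t char_offset_for_byte(const char *s, size_t total_chars,
                                   size_t byte_off) {
    assert(s != NULL);
    /* First guess: pretend the text is ASCII, where both offsets agree. */
    size_t guess = byte_off < total_chars ? byte_off : total_chars;
    for (;;) {
        size_t got = byte_offset_for_char(s, guess);
        if (got <= byte_off) {
            return guess;
        }
        /* Overshot by (got - byte_off) bytes. A code point is at most
         * 4 bytes long, so going back ceil(overshoot / 4) characters can
         * never move the guess below the answer. */
        guess -= (got - byte_off + 3) / 4;
    }
}
```

In text without multibyte characters the first guess is already correct, which matches the observation above. The comment describes a cap of about 10 iterations; this sketch has no cap because each correction moves the guess closer without passing the answer.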
Sure,
You see, the family emoji, which is a combination of 4 single emoji, has a
"Directly" sounds like the C string values can be used, well, directly :) without converting them back to what the text system reports and how NSString is counted. The problem with my understanding of your tips is: when I pass Edit: I checked MultiMarkdown Composer's output. The stats report a length of 11 characters for "👨👩👦👦", like I'd expect from |
Like you said, (BTW -- I'm going from memory since I wrote most of the code related to So my initial feeling would be that But then, since Apple uses UTF-16 based encoding inside NSTextStorage, One could argue as to whether the count should be 1, 4, 7, or 11. But My thoughts:
So, to your points/questions:
I think we are on the same page, but hopefully this helps explain my Fletcher |
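For readers who want to check the numbers in this exchange, here is a small C sketch that counts the family emoji in three units. The type and function names are made up for the illustration. As I read the thread, 7 is the number of code points (four person emoji joined by three zero-width joiners) and 11 is the number of UTF-16 units that NSString reports; the sketch computes those two plus the UTF-8 byte count. The count of 1 would need Unicode grapheme cluster segmentation, which is not attempted here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t utf8_bytes;    /* what a UTF-8 C string parser counts */
    size_t code_points;   /* Unicode scalar values */
    size_t utf16_units;   /* what NSString's length counts */
} text_counts;

/* Count a NUL-terminated, valid UTF-8 string in all three units. */
static text_counts count_text(const char *s) {
    text_counts c = {0, 0, 0};
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p) {
        c.utf8_bytes++;
        if ((*p & 0xC0) == 0x80) {
            continue;                    /* continuation byte */
        }
        c.code_points++;
        /* Lead bytes from 0xF0 start 4-byte sequences, i.e. code points
         * above U+FFFF, which UTF-16 stores as a surrogate pair. */
        c.utf16_units += (*p >= 0xF0) ? 2 : 1;
    }
    return c;
}

/* Man, ZWJ, woman, ZWJ, boy, ZWJ, boy: rendered as a single glyph. */
static const char FAMILY[] =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6";
```

Calling count_text(FAMILY) gives 25 UTF-8 bytes, 7 code points, and 11 UTF-16 units, so a highlighter has to pick one unit and convert at the boundary to the text system.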
We're experimenting with using the library for live syntax highlighting in macOS/iOS apps. A basic setup is indeed functional, but there are some issues I'd like to point out and discuss.
First, @fletcher do you think the library is fit at all to make this feasible?
The CommonMark library's tokenizer didn't do well with only portions of a file (e.g. paragraphs), but MMD6 so far seems to cope with it okay.
So here are the 2 coding roadbumps that seem most irritating.
String Encoding
MMD6 operates on UTF-8 encoded strings. Swift strings are backed by UTF-16, and NSString-based NSRanges don't work well with e.g. emoji. The count is always off and MMD will end up highlighting a non-fitting portion if you don't fix the string location and length.

But you can also obtain a String.UTF8View to calculate indices inside the string. Boils down to this, if anyone is interested:

This fixes passing in proper indices, but you'll get back UTF-8 based tokens and have to reverse the calculation:
This takes an awful lot of time. According to the Time Profiling Instrument, the index offset calculations take 33% of the processing time when you type. The lag is noticeable.
Even though reusing the engine and manipulating the underlying DString could help, too, the UTF-8/UTF-16 location dance is a waste at the moment. Not sure what we can do about that.
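One way to avoid recomputing offsets for every token, sketched here in C as an idea rather than a tested fix: walk the UTF-8 buffer once per parse and build a table from byte offset to UTF-16 offset, then convert each token with two array lookups. All names below are hypothetical and nothing here is libMultiMarkdown API; the token's start and length in bytes are passed in as plain numbers. The table costs one size_t per byte of text, which is the trade-off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t location, length; } utf16_range;  /* NSRange-like */

/* Build a table with byte_len + 1 entries: table[b] is the UTF-16 offset of
 * the code point that contains byte b, and table[byte_len] is the total
 * UTF-16 length. Assumes valid UTF-8. The caller frees the result.
 * Returns NULL if the allocation fails. */
static size_t *build_utf16_offset_table(const char *s, size_t byte_len) {
    size_t *table = malloc((byte_len + 1) * sizeof *table);
    if (table == NULL) {
        return NULL;
    }
    size_t units = 0;       /* UTF-16 units before the current code point */
    size_t next_units = 0;  /* UTF-16 units up to and including it */
    for (size_t b = 0; b < byte_len; ++b) {
        unsigned char ch = (unsigned char)s[b];
        if ((ch & 0xC0) != 0x80) {          /* lead byte: new code point */
            units = next_units;
            next_units = units + (ch >= 0xF0 ? 2 : 1);
        }
        table[b] = units;
    }
    table[byte_len] = next_units;
    return table;
}

/* Convert a token's byte range into a UTF-16 location and length. The range
 * must lie within the buffer the table was built from. */
static utf16_range token_to_utf16(const size_t *table, size_t start,
                                  size_t len) {
    assert(table != NULL);
    utf16_range r = { table[start], table[start + len] - table[start] };
    return r;
}
```

For the six bytes of "a😀b", a token covering the emoji (byte start 1, length 4) maps to location 1 and length 2, and the trailing "b" (byte start 5, length 1) maps to location 3 and length 1.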
Types Inside the Framework
To work with the library, especially in Swift, is rather painful, e.g.:
All enum values are exposed as global constants, and the enums don't group these together. I wrote a Ruby code generator script that parses libMultiMarkdown.h for enum definitions and makes Objective-C NS_ENUMs from them:

It's a wrapper, not a redefinition.

That way, you can create a TokenType instance and write:

... with all usual type-safe benefits. And you can write extensions for TokenType that take a range and do the highlighting itself, so the NSTextStorage code file only contains tokenType.applyHighlighting(storage: self, range: tokenRange) or similar instead of the large switch-case statement.

I can put together the Ruby scripts into a self-explanatory CLI script and open a PR if that fits the vision.