-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
highlighter: Change the parsing approach to significantly improve performance #3127
base: master
Are you sure you want to change the base?
Conversation
There is a problem maybe related maybe not. Create a file with 1MB of characters on one line and micro will freeze or slow but will work. It will start to work normally when there is no any characters from that big line is displayed on the screen |
This is caused by https://github.com/zyedidia/micro/blob/master/internal/util/unicode.go#L69 resp. |
Yeah but I am almost sure not fully. Why? I tried one time assume that my file will be ascii only so no unicode and I deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked) and performance became better but not much |
Yes, you're right. Due to the randomness it finds a lot of "regions" within this long line and starts slicing them one after an other. The problem is, that this slicing process is based on the decoding of the characters too, to identify the correct position of the code points within the byte slice. Currently this isn't that optimized so far, but I've no cool idea how this can be done better. |
a988899
to
e9effe8
Compare
Randomness? |
Didn't you use something like the following to create the file?
That was what I used and due to that I will receive some of |
Yeah but I used |
I believe the problem with very long lines is that micro stores line data in memory simply as an array of raw bytes, not an array of unicode runes. I.e. it doesn't decode the line into runes beforehand. So whenever micro needs to know anything about a visual representation of any part of a line, it decodes it on the fly. (So for instance, if this is a very long line and we are closer to the end of it, micro has to decode almost the entire line, and it has to do that every time when redrawing the screen and so on. So things become extremely slow.) Heck, micro doesn't even know how many characters are there in a line. It has to call If we change it to decode and store lines as slices of runes beforehand, it should solve these problems, and generally improve performance. I really don't get why it wasn't done this way from the beginning. |
Yep, that's an good idea to solve this here as well. Maybe I'll take care of this in the next few days. If someone would like to improve this earlier then give it a go and we will merge it afterwards into this PR. 😉 |
I see one possible reason: Rune is int32. It would be like using UTF32 for all characters even if our document contains only ASCII |
Will be great if you take care of it. But why do it in this PR? It is not really a highlighter issue. It is an issue of efficient representation of a line in the basic data structures (so it is more fundamental than the highlighter issues, and affects us even when syntax highlighting is not used at all). Initial rough idea of where we might start:
|
Yes, it costs memory. Isn't it worth it? |
Idk. I can only say if I would a developer that writes a new app I would prefer to make something on bytes because they are cheaper (in size) just because it is the most obvious solution. If we have 1MB file we would store 4MB in memory it isn't so bad for most hardware but I am not sure I would want it. Just as a user I would prefer to have different modes which isn't so easy for developer... |
Yeah it is a separate issue
Storing data in bytes and storing the same data in the ints? I can understand replacement but addition? Why? |
As I said, this is just an initial rough idea. I hope we won't need to keep both representations. OTOH, for example, if we need to keep |
I give both of you right, that the non-decoded bytes are and thus their repetitive decoding is a common performance degradation independent from the highlighting but the feature suffering the most from it is the highlighting...at least from end user perspective. The impact and result is easier to "feel"/measure with the syntax highlighting and due to the common usage of byte streams this interface this tightly coupled into the highlighting process. A rebase is then most likely. 😉 Anway, I will find a solution for that lazy developer reason. 😄 |
I don't think so. Try it: create a file Just as an example, how many times do we call
I don't think the highlighting is special here. Replacing byte streams with rune streams, although conceptually simple, is gonna be a huge change, reworking so many parts of code throughout micro (look at all the places where
Nothing's wrong with that. |
Ok, you convinced me once again. The branch is highly floating and the result (line handling) far away from a productive state. The start is done and now needs a lot of refactoring, fixing and adaptations at a lot of different locations. |
I still use it and didn't encounter any problems so I think @dmaluka is right and highlight should be separate pull request (and issue?) |
I was looking at I was looking at this file when testing a bit: c-comment.c.txt I think only the line where a multi-line comment is started is highlighted as text in a comment if there is text in it highlighted using regions when not in a comment. The other text in the line is also not highlighted as text in a comment if the region would end in it when not in a comment. The lines between the first and last line in a character literal are not highlighted as text with an error. The other text in the line where a character literal ends is highlighted as text in a literal if the literal ends in another line. There are similar bugs but I do not know if they are actually the same so I have not written about them. |
@niten94 I can also see this bug (and don't see it without this PR). |
Yes, this one wouldn't be mixed up with the other idea/improvement optimizing and storing the line data.
Yes, it's a regression which will be fixed in the next few days. Sorry for the inconveniences. AFAIK it sounds again like a logic issue within the nested region detection, which now uses all indices found in a line.
I will consider both. |
e9effe8
to
103ed6d
Compare
e9f848d
to
3f08ae8
Compare
3f08ae8
to
4ab825b
Compare
pkg/highlight/highlighter.go
Outdated
@@ -44,6 +44,7 @@ type highlightStorage struct { | |||
end int | |||
group Group | |||
region *region | |||
childs []*highlightStorage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
children?
pkg/highlight/highlighter.go
Outdated
h.storage = append(h.storage, highlightStorage{start, end, group, region}) | ||
} | ||
|
||
func (h *Highlighter) storePatterns(start int, lineNum int, line []byte, curRegion *region) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
highlightPatterns()
and highlightRegions()
actually seem to be more suitable names (since they don't just store ranges but actually highlight them, i.e. determine their group
values)?
pkg/highlight/highlighter.go
Outdated
h.highlightRegions(fullHighlights, start, lineNum, line, curRegion, h.Def.rules.regions, false) | ||
|
||
for _, e := range h.storage { | ||
if e.start <= e.end && e.end <= len(fullHighlights) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it was added out of paranoia to prevent crossing the boundary, but in even then it would hide issues in the previous highlighter code, which should be found.
I removed it, opened all my test files and there was no crash. So seems to be OK for now.
Thanks for spotting. 👍
endMatches := findAllIndex(r.end, r.skip, line) | ||
samePattern := false | ||
startLoop: | ||
for startIdx := 0; startIdx < len(startMatches); startIdx++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add this:
startMatch := startMatches[startIdx]
endMatch := endMatches[endIdx]
and then replace startMatches[startIdx]
with startMatch
and endMatches[endIdx]
with endMatch
throughout this loop body, the code will be much easier to review, I suppose.
pkg/highlight/highlighter.go
Outdated
samePattern := false | ||
startLoop: | ||
for startIdx := 0; startIdx < len(startMatches); startIdx++ { | ||
if startMatches[startIdx][0] < lineLen { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what cases is this condition false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I can't give a plausible reason why I added this check. 🤔
Since a false result would mean the regex parser did something totally wrong it's worth to get informed about it.
Thanks for spotting. 👍
pkg/highlight/highlighter.go
Outdated
@@ -186,7 +186,9 @@ func (h *Highlighter) highlightRegions(fullHighlights []Group, start int, lineNu | |||
h.lastEnd = start + endMatches[endIdx][1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this remaining check in storePatterns()
:
if curRegion == nil || curRegion.group == curRegion.limitGroup || p.group == curRegion.limitGroup {
Looks like it doesn't make sense? storePatterns()
is about patterns inside a region, i.e. between the region's limitGroup
-matching patterns, so storePatterns()
itself shouldn't have anything to do with limitGroup
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, looks like historical takeover. This makes the whole if condition superfluous.
I removed it locally and opened all my test files again. There was no obvious fault.
pkg/highlight/highlighter.go
Outdated
if startMatches[startIdx][0] < lineLen { | ||
for endIdx := 0; endIdx < len(endMatches); endIdx++ { | ||
if startMatches[startIdx][0] == endMatches[endIdx][0] { | ||
// start and end are the same (pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "the same"? r.start
and r.end
may be different patterns, right?
...Every single line in this if startMatches[startIdx][0] == endMatches[endIdx][0]
branch, including the comments, looks completely confusing to me. Could you explain what is it actually doing? What is the meaning of samePattern
, why are we comparing lengths of startMatches
and endMatches
, why are we comparing curRegion
and r
, and so on.
(Maybe other parts of storeRegions()
are no less confusing, but I haven't gotten to them yet.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "the same"?
r.start
andr.end
may be different patterns, right?
Right, they may be different, but are often the same as given as example in the commit message "highlighter: Rework of matching approach to speed up the processing" of the first big rework commit.
Often we've to deal with regions defined by the usual string markers/patterns ("I'm a string"
), in which the start
is the same as the end
of this region...it's both time "
.
Having the same start
and end
marker/pattern makes it very hard to identify where something starts and where it ends, especially when the region goes across different lines (where a next region of the same type could start too):
e.g.:
Here starts a "I'm a multi-line string,
which ends here" and "that's a second
one here".
Look at the nested '
, which can be again a further (multi-line) region with the same start
and end
marker of '
in the same syntax definition...as we've a lot of them. This creates overlapping regions.
This is the reason why we've to do some kind of "smart guessing". These guesses are, at least from my perspective commented (yes, I'm biased, since it's my code 😜) and even prepared with log.Println()
to help understand, why we've to check lengths for overlapping regions.
Every region of the same type can be identified by highlightRegions()
for a overlap situation. Since different types of regions are handled independent of each other it needs a bit more complex rules to find out if such regions overlap, which is were storeRange()
comes in.
I fully agree, that the code isn't really trivial or easy to understand, but I wasn't able to find a better approach so far.
The most expensive path in the highlighting process is the regex parsing step which was invoked for every line twice due to the split approach. Since this can be done in one step the highlighting effort can be reduced by around 50 percent!
The old approach was to slide over the line by (empty) region over region one after another. Step down into a possible available nested part and do the same. The new approach iterates over all available region definitions and shall find all occurrences of the same region at the same time (`findAllIndex()`) and not just the first (`findIndex()`). This will allow marking a region already in the moment it isn't the first region at the line. After an region has been identified all nested parts are considered the same way. Simple example (no nesting): 1. region start=" end=" 2. region start=' end=' 3. region start=< end=> The "first", "second" and "third" are highlighted in one round. This 'highlights after ""' while <highlights last> and "highlights first".
This is just my interpretation of `limitGroup` as it might be intended a long time ago, but this interpretation could be wrong, since it is not really documented how it should be.
This is necessary in the moment sibling regions overlap each other.
Co-authored-by: Dmitry Maluka <[email protected]>
4ab825b
to
fb2ed1b
Compare
The performance of the current parsing approach can't be improved without changing the whole highlighter code. Due to this the change isn't without any risk, but it's definitely worth the try. Please see the following list, which has been done with the same host and
micro -profile
. The test has been stopped after complete highlight within 80x24 or aborted due to known "endless" recursion (DNF). Afterwards thetop1
has been printed withpprof
.<script>
)My available test files created the same or even more complex highlighting (e.g. pattern highlight within regions in HTMLs) results.
Most probably the logic isn't in a perfect shape yet, but definitely feasible as proof of concept thought.
Please help to test and improving it with a review.
It took a lot of days to get this far and would be a shame when we didn't get this upstream in any form. 😉
Fixes #2839
Fixes #3115
Fixes #3487
Closes #3242