-
Notifications
You must be signed in to change notification settings - Fork 7.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stb_texedit.h: support for utf-8 #188
Comments
The easy way to avoid doing conversions everywhere is to use wide characters everywhere in your code!!!! But yeah, I would take such a patch. |
Also, I would take a patch that had an optional more-stateful approach (#ifdef'd or runtime condition), whether that be automatically caching under the hood or adding to the API to be more explicit (or both, I guess). |
Thanks. Yeah I hate having to convert. I believe Ascii / UTF-8 make more sense for tools (lots of identifiers, filenames, little amount of localized text) though. I'll look into doing the first patch, if anything it'll make more accustomed with the code toward a stateful patch. |
Closing this for now so you don't get the noise. |
i have a mostly working utf8 prototype. now that i've tried it out utf8 seems very unsuited because it's so indeterministic. a lot of caching/additonal data structures would be needed to make it fast so i will probably end up never using it myself. the basic idea of this implementation is to have a anyways, i had the following problems:
stb changes: kritzikratzi@e12945a utf8 implementation: https://github.com/kritzikratzi/ofxMightyUI/blob/820fd69ce85186e9a9b45497bbace0abda177876/src/MuiTextArea.cpp |
I said I will take such a patch, and I did mean that, so I'm not saying no to a patch along these lines (I understand this is a proof of concept), but (especially to @ocornut ) I do think it's seriously worth considering whether it would be better to just use UTF-32 everywhere in the text editor (which I kinda jokingly said earlier in the thread). Maybe you'd have to do conversions back to utf8 in and out of it, but that seems like it might be more straightforward. Maybe we could provide automatic wrappers to do that for you, even. In my imgui, I actually have a flag on text editors which is whether it's editing a "live" version of the string that you pass in, or if you're editing an imgui-internal copy that only updates the app's copy when you press enter. It would be plausible to make it only ever edit an internal copy (and if you ask for a 'live' version it just always copies in/out on every update), and than that internal copy could just be utf32. But it would introduce a dependency on malloc to store the copy that the current system avoids, which is why I have that stuff in my imgui, not in textedit. |
I used utf8-cpp to implement naïve/brute-force implementations of the
This is perhaps the major difference In our approaches. I treat the indexes coming from STB_TEXTEDIT_CHARTYPE STB_TEXTEDIT_GETCHAR(STB_TEXTEDIT_STRING * string, int i) {
auto pos = string->c_str();
utf8::unchecked::advance(pos, i);
return utf8::unchecked::peek_next(pos);
}
static int STB_TEXTEDIT_KEYTOTEXT(STB_TEXTEDIT_CHARTYPE key) {
return (key != 0 or utf8::internal::is_code_point_valid(key)) ? key : -1;
}
static bool STB_TEXTEDIT_INSERTCHARS(STB_TEXTEDIT_STRING * string, int index, const STB_TEXTEDIT_CHARTYPE * text, int n) {
auto pos = string->cbegin();
utf8::unchecked::advance(pos, index);
while (n--) {
uint8_t utf8[4] = { };
auto stop = utf8::unchecked::append(*text++, utf8);
pos = string->insert(pos, utf8, stop);
}
string->codepoints += n;
return true;
}
static bool STB_TEXTEDIT_DELETECHARS(STB_TEXTEDIT_STRING * string, int index, int n) {
auto start = string->cbegin();
utf8::unchecked::advance(start, index);
auto stop = start;
utf8::unchecked::advance(stop, n);
string->erase(start, stop);
string->codepoints -= n;
return true;
}
static float STB_TEXTEDIT_GETWIDTH(STB_TEXTEDIT_STRING * string, int n, int i) {
auto start = string->c_str();
utf8::unchecked::advance(start, n + i);
auto nbytes = utf8::internal::sequence_length(start);
if (nbytes == 1 and *start == '\n')
return STB_TEXTEDIT_GETWIDTH_NEWLINE;
return nvgTextBounds(string->nvg, 0, 0, start, std::next(start, nbytes), nullptr);
}
static void STB_TEXTEDIT_LAYOUTROW(StbTexteditRow * result, STB_TEXTEDIT_STRING * string, int index)
{
NVGtextRow row;
auto start = string->cbegin();
utf8::unchecked::advance(start, index);
if (nvgTextBreakLines(string->nvg, &*start, &*string->cend(), string->w, &row, 1) == false) {
result->x0 = result->x1 = string->x;
result->ymin = result->ymax = string->y;
result->baseline_y_delta = string->leading;
result->num_chars = 0;
return;
}
float bounds[4]; // xmin, ymin, xmax, ymax
nvgTextBounds(string->nvg, string->x, string->y, row.start, row.end, bounds);
result->x0 = bounds[0];
result->x1 = bounds[2];
result->ymin = bounds[1];
result->ymax = bounds[3];
result->baseline_y_delta = string->leading;
result->num_chars = utf8::unchecked::distance(row.start, row.next);
} No changes to |
It looks like we could modify stb_textedit.h to enable the user to keep an UTF-8 underlying representation while avoiding random lookup. If we consider all indices and num_chars values as byte-indices (and not character-indices), we'd only need two extra functions:
Default to i-1
For UTF-8 the user would need to backtrack in the stream looking for the first < 0x80 byte. that would work and be efficient enough. However it means that user's own handling of malformed UTF-8 (for which there are no standard convention for, AFAIK), to be compatible with rewinding would have to do the reverse operation. Editing malformed UTF-8 is a super edge-case that is reasonable to avoid or catch earlier, and wouldn't affect people not using UTF-8.
Default to i+1
Name would be a nice symmetry to the previous function. It could also be turned into a STB_TEXTEDIT_GETNUMINDICESFORCHAR() / STB_TEXTEDIT_GETBYTECOUNTFORCHAR() defaulting to 1.
With this scheme a typical loop such as
Would become
There's probably a few other things to solve and clarify but that's the gist of it.
Do you think you would take such a patch?
(
This is merely me dumping some ideas, as I'm not yet sure I went to undergo this modification. It has been niggling be that my text input widget has to do back and forth UTF-8 - wchar conversions. As I'm trying to handle large of text reasonably in an imgui context and not doing multiple pass on the data, the code is already quite complex and would benefit from only dealing with a single UTF-8 buffer. However for interactive performances with large amount of text, I may just as well have to rewrite something anyway because stb_textedit is not designed for large text.
So I can either:
a) Add this UTF-8 support to stb_textedit, it would simplify my code a lot (primary focus), make it a little faster, and generally may be useful to have that support in stb_textedit. The cons is that the code in stb_textedit.h will look a little heavier.
b) Rewrite something custom, more stateful to handle interacting with large text. More effort. I don't absolutely need the perf but it'd be nice. Nobody else will benefit from the improvement. I'd prefer to avoid this path.
)
The text was updated successfully, but these errors were encountered: