Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stb_texedit.h: support for utf-8 #188

Closed
ocornut opened this issue Oct 18, 2015 · 7 comments
Closed

stb_texedit.h: support for utf-8 #188

ocornut opened this issue Oct 18, 2015 · 7 comments

Comments

@ocornut
Copy link
Contributor

ocornut commented Oct 18, 2015

It looks like we could modify stb_textedit.h to enable the user to keep an UTF-8 underlying representation while avoiding random lookup. If we consider all indices and num_chars values as byte-indices (and not character-indices), we'd only need two extra functions:

  • STB_TEXTEDIT_GETPREVIOUSCHARINDEX(obj, int idx)

Default to i-1
For UTF-8 the user would need to backtrack in the stream looking for the first < 0x80 byte. that would work and be efficient enough. However it means that user's own handling of malformed UTF-8 (for which there are no standard convention for, AFAIK), to be compatible with rewinding would have to do the reverse operation. Editing malformed UTF-8 is a super edge-case that is reasonable to avoid or catch earlier, and wouldn't affect people not using UTF-8.

  • STB_TEXTEDIT_GETNEXTCHARINDEX(obj, int char_idx);

Default to i+1
Name would be a nice symmetry to the previous function. It could also be turned into a STB_TEXTEDIT_GETNUMINDICESFORCHAR() / STB_TEXTEDIT_GETBYTECOUNTFORCHAR() defaulting to 1.

  • A common pattern used by stb_textedit.h would be to call STB_TEXTEDIT_GETWIDTH() or STB_TEXTEDIT_GETCHAR() with one of those functions, so we could offer a way for the user to do both at once possibly, but we don't have to. It may just add unnecessary complexity to offer those.

With this scheme a typical loop such as

   for (i=0; first+i < n; ++i)
      find->x += STB_TEXTEDIT_GETWIDTH(str, first, i);

Would become

   for (i=0; first+i < n; i = STB_TEXTEDIT_GETNEXTCHARINDEX(i))
      find->x += STB_TEXTEDIT_GETWIDTH(str, first, i);

There's probably a few other things to solve and clarify but that's the gist of it.
Do you think you would take such a patch?

(
This is merely me dumping some ideas, as I'm not yet sure I went to undergo this modification. It has been niggling be that my text input widget has to do back and forth UTF-8 - wchar conversions. As I'm trying to handle large of text reasonably in an imgui context and not doing multiple pass on the data, the code is already quite complex and would benefit from only dealing with a single UTF-8 buffer. However for interactive performances with large amount of text, I may just as well have to rewrite something anyway because stb_textedit is not designed for large text.

So I can either:
a) Add this UTF-8 support to stb_textedit, it would simplify my code a lot (primary focus), make it a little faster, and generally may be useful to have that support in stb_textedit. The cons is that the code in stb_textedit.h will look a little heavier.
b) Rewrite something custom, more stateful to handle interacting with large text. More effort. I don't absolutely need the perf but it'd be nice. Nobody else will benefit from the improvement. I'd prefer to avoid this path.
)

@nothings
Copy link
Owner

The easy way to avoid doing conversions everywhere is to use wide characters everywhere in your code!!!!

But yeah, I would take such a patch.

@nothings
Copy link
Owner

Also, I would take a patch that had an optional more-stateful approach (#ifdef'd or runtime condition), whether that be automatically caching under the hood or adding to the API to be more explicit (or both, I guess).

@ocornut
Copy link
Contributor Author

ocornut commented Oct 18, 2015

Thanks. Yeah I hate having to convert. I believe Ascii / UTF-8 make more sense for tools (lots of identifiers, filenames, little amount of localized text) though. I'll look into doing the first patch, if anything it'll make more accustomed with the code toward a stateful patch.

@ocornut
Copy link
Contributor Author

ocornut commented Oct 19, 2015

Closing this for now so you don't get the noise.

@kritzikratzi
Copy link
Contributor

i have a mostly working utf8 prototype.

now that i've tried it out utf8 seems very unsuited because it's so indeterministic. a lot of caching/additonal data structures would be needed to make it fast so i will probably end up never using it myself.
so maybe i'm posting this mostly as a warning to others :)

the basic idea of this implementation is to have a unicode_idx_to_utf8_idx(int unicode_idx) that maps from stb_textarea's view to the utf8 string. this adds another layer of loops which costs quite some speed.

anyways, i had the following problems:

  • paste didn't work, because the cursor moves strlen(char*) and not utf8strlen(char*). for this i introduced an optional: #define STB_TEXTEDIT_CHARARR_LEN(arr,len) len
  • stb_textedit_key is not usable for multi-byte sequences. i solved that outside stb_textedit (control char? -> use stb_textedit_key(...), for normal text input i copied the default case from stb_textedit_key and fixed it up for multibyte sequences). this is not ideal, but i'm not sure how to improve the situation.
  • i believe there still some funny problems with undo/redo (not fixed)

stb changes: kritzikratzi@e12945a

utf8 implementation: https://github.com/kritzikratzi/ofxMightyUI/blob/820fd69ce85186e9a9b45497bbace0abda177876/src/MuiTextArea.cpp

@nothings
Copy link
Owner

I said I will take such a patch, and I did mean that, so I'm not saying no to a patch along these lines (I understand this is a proof of concept), but (especially to @ocornut ) I do think it's seriously worth considering whether it would be better to just use UTF-32 everywhere in the text editor (which I kinda jokingly said earlier in the thread). Maybe you'd have to do conversions back to utf8 in and out of it, but that seems like it might be more straightforward. Maybe we could provide automatic wrappers to do that for you, even.

In my imgui, I actually have a flag on text editors which is whether it's editing a "live" version of the string that you pass in, or if you're editing an imgui-internal copy that only updates the app's copy when you press enter. It would be plausible to make it only ever edit an internal copy (and if you ask for a 'live' version it just always copies in/out on every update), and than that internal copy could just be utf32. But it would introduce a dependency on malloc to store the copy that the current system avoids, which is why I have that stuff in my imgui, not in textedit.

@bwoods
Copy link

bwoods commented Oct 19, 2017

a lot of caching/additonal data structures would be needed to make it fast so i will probably end up never using it myself.

I used utf8-cpp to implement naïve/brute-force implementations of the std_textedit.h callbacks. Due to the non-random-access nature of utf8, many steps are n² in the length of the line being displayed. Nevertheless, testing on the complete text of Alice's Adventures in Wonderland, utf8 handling barely shows up when profiling.

If we consider all indices and num_chars values as byte-indices (and not character-indices)

This is perhaps the major difference In our approaches. I treat the indexes coming from std_textedit.h as code-point indexes; just as you would if editing utf32.

STB_TEXTEDIT_CHARTYPE STB_TEXTEDIT_GETCHAR(STB_TEXTEDIT_STRING * string, int i) {
	auto pos = string->c_str();
	utf8::unchecked::advance(pos, i);
	return utf8::unchecked::peek_next(pos);
}

static int STB_TEXTEDIT_KEYTOTEXT(STB_TEXTEDIT_CHARTYPE key) {
	return (key != 0 or utf8::internal::is_code_point_valid(key)) ? key : -1;
}

static bool STB_TEXTEDIT_INSERTCHARS(STB_TEXTEDIT_STRING * string, int index, const STB_TEXTEDIT_CHARTYPE * text, int n) {
	auto pos = string->cbegin();
	utf8::unchecked::advance(pos, index);

	while (n--) {
		uint8_t utf8[4] = { };
		auto stop = utf8::unchecked::append(*text++, utf8);
		pos = string->insert(pos, utf8, stop);
	}

	string->codepoints += n;
	return true;
}

static bool STB_TEXTEDIT_DELETECHARS(STB_TEXTEDIT_STRING * string, int index, int n) {
	auto start = string->cbegin();
	utf8::unchecked::advance(start, index);
	auto stop = start;
	utf8::unchecked::advance(stop, n);

	string->erase(start, stop);
	string->codepoints -= n;

	return true;
}

static float STB_TEXTEDIT_GETWIDTH(STB_TEXTEDIT_STRING * string, int n, int i) {
	auto start = string->c_str();
	utf8::unchecked::advance(start, n + i);

	auto nbytes = utf8::internal::sequence_length(start);
	if (nbytes == 1 and *start == '\n')
		return STB_TEXTEDIT_GETWIDTH_NEWLINE;

	return nvgTextBounds(string->nvg, 0, 0, start, std::next(start, nbytes), nullptr);
}

static void STB_TEXTEDIT_LAYOUTROW(StbTexteditRow * result, STB_TEXTEDIT_STRING * string, int index)
{
	NVGtextRow row;
	auto start = string->cbegin();
	utf8::unchecked::advance(start, index);

	if (nvgTextBreakLines(string->nvg, &*start, &*string->cend(), string->w, &row, 1) == false) {
		result->x0 = result->x1 = string->x;
		result->ymin = result->ymax = string->y;
		result->baseline_y_delta = string->leading;
		result->num_chars = 0;
		return;
	}

	float bounds[4]; // xmin, ymin, xmax, ymax
	nvgTextBounds(string->nvg, string->x, string->y, row.start, row.end, bounds);

	result->x0 = bounds[0];
	result->x1 = bounds[2];
	result->ymin = bounds[1];
	result->ymax = bounds[3];
	result->baseline_y_delta = string->leading;
	result->num_chars = utf8::unchecked::distance(row.start, row.next);
}

No changes to stb_textedit.h were required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants