[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713
[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713yigithankarabulut wants to merge 1 commit into
Conversation
Signed-off-by: Yigithan Karabulut <yigithannkarabulutt@gmail.com>
The committers listed above are authorized under a signed CLA. |
| truncatedStr := stringVal[:limit] | ||
| for !utf8.ValidString(truncatedStr) { | ||
| limit-- | ||
| if limit == 0 { | ||
| value.SetStr("") | ||
| return true | ||
| } | ||
| truncatedStr = stringVal[:limit] | ||
| } | ||
| value.SetStr(truncatedStr) |
There was a problem hiding this comment.
Since utf8.ValidString requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:
| truncatedStr := stringVal[:limit] | |
| for !utf8.ValidString(truncatedStr) { | |
| limit-- | |
| if limit == 0 { | |
| value.SetStr("") | |
| return true | |
| } | |
| truncatedStr = stringVal[:limit] | |
| } | |
| value.SetStr(truncatedStr) | |
| for limit > 0 && !utf8.RuneStart(stringVal[limit]) { | |
| limit-- | |
| } | |
| value.SetStr(stringVal[:limit]) |
(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)
There was a problem hiding this comment.
Thanks for your comment. The way you wrote is faster and simpler, so it can be preferred, of course, if we assume that the string is generally UTF-8 compatible. In the other way, if we consider that a UTF-8 character is a maximum of 4 bytes, we can also validate the string by going back at most 3 times. The choice is yours.
There was a problem hiding this comment.
I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
#### Description The `truncate_all` function cuts strings at arbitrary byte positions, which could break UTF-8 multi-byte characters and produce invalid utf-8 output. This PR makes it back up to a valid character boundary when the limit falls mid-character. If the string isn't valid UTF-8 to begin with, it just cuts at the byte level like before. Based on the approach suggested by @jade-guiton-dd in [#36713.](#36713 (comment)) #### Link to tracking issue Fixes #36017, #36713 #### Testing - Added `Test_truncateAll_UTF8` with test cases covering rune boundary truncation also the behaviour with grapheme cluster bytes - All existing tests continue to pass #### Documentation Updated README to note that the limit is in bytes and UTF-8 boundaries are respected. --------- Co-authored-by: Edmo Vamerlatti Costa <11836452+edmocosta@users.noreply.github.com>
Description
Truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last UTF-8 character started.
Link to tracking issue
Fixes #36017
Testing
Two UTF-8 characters were added to the end of the string and the limit was adjusted to match them and the truncation process was tested.
Documentation
Updated pkg/ottl/ottfuncs/README.md with description.