[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function by yigithankarabulut · Pull Request #36713 · open-telemetry/opentelemetry-collector-contrib

yigithankarabulut · 2024-12-07T16:45:23Z

Description

Truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last UTF-8 character started.

Link to tracking issue

Fixes #36017

Testing

Two UTF-8 characters were added to the end of the string and the limit was adjusted to match them and the truncation process was tested.

Documentation

Updated pkg/ottl/ottfuncs/README.md with description.

Signed-off-by: Yigithan Karabulut <yigithannkarabulutt@gmail.com>

linux-foundation-easycla · 2024-12-07T16:45:27Z

✅login: yigithankarabulut / (7db6139)

The committers listed above are authorized under a signed CLA.

jade-guiton-dd · 2024-12-10T14:03:54Z

+				truncatedStr := stringVal[:limit]
+				for !utf8.ValidString(truncatedStr) {
+					limit--
+					if limit == 0 {
+						value.SetStr("")
+						return true
+					}
+					truncatedStr = stringVal[:limit]
+				}
+				value.SetStr(truncatedStr)


Since utf8.ValidString requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:

Suggested change

truncatedStr := stringVal[:limit]

for !utf8.ValidString(truncatedStr) {

limit--

if limit == 0 {

value.SetStr("")

return true

}

truncatedStr = stringVal[:limit]

}

value.SetStr(truncatedStr)

for limit > 0 && !utf8.RuneStart(stringVal[limit]) {

limit--

}

value.SetStr(stringVal[:limit])

(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)

Thanks for your comment. The way you wrote is faster and simpler, so it can be preferred, of course, if we assume that the string is generally UTF-8 compatible. In the other way, if we consider that a UTF-8 character is a maximum of 4 bytes, we can also validate the string by going back at most 3 times. The choice is yours.

I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?

github-actions · 2024-12-27T05:21:01Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2025-01-10T05:21:59Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@jade-guiton-dd

#### Description The `truncate_all` function cuts strings at arbitrary byte positions, which could break UTF-8 multi-byte characters and produce invalid utf-8 output. This PR makes it back up to a valid character boundary when the limit falls mid-character. If the string isn't valid UTF-8 to begin with, it just cuts at the byte level like before. Based on the approach suggested by @jade-guiton-dd in [#36713.](#36713 (comment)) #### Link to tracking issue Fixes #36017, #36713 #### Testing - Added `Test_truncateAll_UTF8` with test cases covering rune boundary truncation also the behaviour with grapheme cluster bytes - All existing tests continue to pass #### Documentation Updated README to note that the limit is in bytes and UTF-8 boundaries are respected. --------- Co-authored-by: Edmo Vamerlatti Costa <11836452+edmocosta@users.noreply.github.com>

added utf8 support to truncate_all function

7db6139

Signed-off-by: Yigithan Karabulut <yigithannkarabulutt@gmail.com>

yigithankarabulut requested review from a team, TylerHelmuth, bogdandrutu and evan-bradley as code owners December 7, 2024 16:45

github-actions Bot assigned mx-psi Dec 7, 2024

github-actions Bot added the pkg/ottl label Dec 7, 2024

github-actions Bot requested a review from kentquirk December 7, 2024 16:45

jade-guiton-dd reviewed Dec 10, 2024

View reviewed changes

github-actions Bot added the Stale label Dec 27, 2024

github-actions Bot closed this Jan 10, 2025

AsishRaju mentioned this pull request Dec 18, 2025

[pkg/ottl/ottlfuncs] Add UTF-8 support to truncate_all function #45055

Merged

dloucasfx mentioned this pull request Mar 24, 2026

[chore] Changelog for v0.148.0 signalfx/splunk-otel-collector#7357

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713

[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713
yigithankarabulut wants to merge 1 commit into
open-telemetry:mainfrom
yigithankarabulut:truncate-supports-utf-8

yigithankarabulut commented Dec 7, 2024

Uh oh!

linux-foundation-easycla Bot commented Dec 7, 2024 •

edited

Loading

Uh oh!

jade-guiton-dd Dec 10, 2024 •

edited

Loading

Uh oh!

yigithankarabulut Dec 11, 2024

Uh oh!

jade-guiton-dd Dec 12, 2024

Uh oh!

github-actions Bot commented Dec 27, 2024

Uh oh!

github-actions Bot commented Jan 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yigithankarabulut commented Dec 7, 2024

Description

Link to tracking issue

Testing

Documentation

Uh oh!

linux-foundation-easycla Bot commented Dec 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jade-guiton-dd Dec 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yigithankarabulut Dec 11, 2024

Choose a reason for hiding this comment

Uh oh!

jade-guiton-dd Dec 12, 2024

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Dec 27, 2024

Uh oh!

github-actions Bot commented Jan 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

linux-foundation-easycla Bot commented Dec 7, 2024 •

edited

Loading

jade-guiton-dd Dec 10, 2024 •

edited

Loading