Skip to content

[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713

Closed
yigithankarabulut wants to merge 1 commit into
open-telemetry:mainfrom
yigithankarabulut:truncate-supports-utf-8
Closed

[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function#36713
yigithankarabulut wants to merge 1 commit into
open-telemetry:mainfrom
yigithankarabulut:truncate-supports-utf-8

Conversation

@yigithankarabulut

Copy link
Copy Markdown

Description

Truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last UTF-8 character started.

Link to tracking issue

Fixes #36017

Testing

Two UTF-8 characters were added to the end of the string and the limit was adjusted to match them and the truncation process was tested.

Documentation

Updated pkg/ottl/ottfuncs/README.md with description.

Signed-off-by: Yigithan Karabulut <yigithannkarabulutt@gmail.com>
@linux-foundation-easycla

linux-foundation-easycla Bot commented Dec 7, 2024

Copy link
Copy Markdown

CLA Signed

  • ✅login: yigithankarabulut / (7db6139)

The committers listed above are authorized under a signed CLA.

Comment on lines +51 to +60
truncatedStr := stringVal[:limit]
for !utf8.ValidString(truncatedStr) {
limit--
if limit == 0 {
value.SetStr("")
return true
}
truncatedStr = stringVal[:limit]
}
value.SetStr(truncatedStr)

@jade-guiton-dd jade-guiton-dd Dec 10, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since utf8.ValidString requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:

Suggested change
truncatedStr := stringVal[:limit]
for !utf8.ValidString(truncatedStr) {
limit--
if limit == 0 {
value.SetStr("")
return true
}
truncatedStr = stringVal[:limit]
}
value.SetStr(truncatedStr)
for limit > 0 && !utf8.RuneStart(stringVal[limit]) {
limit--
}
value.SetStr(stringVal[:limit])

(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. The way you wrote is faster and simpler, so it can be preferred, of course, if we assume that the string is generally UTF-8 compatible. In the other way, if we consider that a UTF-8 character is a maximum of 4 bytes, we can also validate the string by going back at most 3 times. The choice is yours.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?

@github-actions

Copy link
Copy Markdown
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Dec 27, 2024
@github-actions

Copy link
Copy Markdown
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions Bot closed this Jan 10, 2025
TylerHelmuth pushed a commit that referenced this pull request Mar 11, 2026
#### Description

The `truncate_all` function cuts strings at arbitrary byte positions,
which could break UTF-8 multi-byte characters and produce invalid utf-8
output.

This PR makes it back up to a valid character boundary when the limit
falls mid-character. If the string isn't valid UTF-8 to begin with, it
just cuts at the byte level like before.

Based on the approach suggested by @jade-guiton-dd in
[#36713.](#36713 (comment))

#### Link to tracking issue

Fixes #36017, #36713

#### Testing

- Added `Test_truncateAll_UTF8` with test cases covering rune boundary
truncation also the behaviour with grapheme cluster bytes
- All existing tests continue to pass

#### Documentation

Updated README to note that the limit is in bytes and UTF-8 boundaries
are respected.

---------

Co-authored-by: Edmo Vamerlatti Costa <11836452+edmocosta@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[pkg/ottl] truncate_all function corrupts UTF-8 encoding

3 participants