Skip to content

[pkg/ottl/ottlfuncs] Add UTF-8 support to truncate_all function#45055

Merged
TylerHelmuth merged 9 commits into
open-telemetry:mainfrom
AsishRaju:asish/fix-truncate-all-utf8
Mar 11, 2026
Merged

[pkg/ottl/ottlfuncs] Add UTF-8 support to truncate_all function#45055
TylerHelmuth merged 9 commits into
open-telemetry:mainfrom
AsishRaju:asish/fix-truncate-all-utf8

Conversation

@AsishRaju

Copy link
Copy Markdown
Contributor

Description

The truncate_all function cuts strings at arbitrary byte positions, which could break UTF-8 multi-byte characters and produce invalid utf-8 output.

This PR makes it back up to a valid character boundary when the limit falls mid-character. If the string isn't valid UTF-8 to begin with, it just cuts at the byte level like before.

Based on the approach suggested by @jade-guiton-dd in #36713.

Link to tracking issue

Fixes #36017, #36713

Testing

  • Added Test_truncateAll_UTF8 with test cases covering rune boundary truncation also the behaviour with grapheme cluster bytes
  • All existing tests continue to pass

Documentation

Updated README to note that the limit is in bytes and UTF-8 boundaries are respected.

@github-actions

Copy link
Copy Markdown
Contributor

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

@edmocosta

edmocosta commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

Hi @AsishRaju, thank you for working on this. Do we have any benchmark comparing the existing code vs the new one? If this solution adds too much overhead, we might need an extra optional argument to enable it, so we don't introduce a performance regression.

@AsishRaju

AsishRaju commented Jan 9, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review @edmocosta , I did run some benchmark comparing the existing code vs the new one,

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl/ottlfuncs
cpu: Apple M4 Pro

> go test -bench=BenchmarkTruncateAll -benchmem -count=10 ./ottlfuncs/...

TL;DR: Short strings have minimal overhead (~10ns) while long strings have ~300-500ns w/ UTF-8 validation. No additional memory allocations.

With no UTF-8 handling

Test Name min ns/op max ns/op avg ns/op B/op allocs/op
ASCII_long_1KB 45.42 46.13 45.75 48 3
ASCII_medium_100B 45.12 45.90 45.57 48 3
ASCII_no_truncation 33.96 35.31 34.42 32 2
ASCII_short_10B 44.60 49.47 45.53 48 3
UTF8_2byte_medium 45.78 46.70 46.24 48 3
UTF8_2byte_short 45.86 47.43 46.54 48 3
UTF8_3byte_long_900B 45.78 46.60 46.16 48 3
UTF8_3byte_medium_90B 46.02 48.82 46.87 48 3
UTF8_3byte_short_9B 45.87 46.90 46.23 48 3
UTF8_4byte_long_1KB 45.84 46.87 46.37 48 3
UTF8_4byte_medium_100B 45.57 46.67 46.18 48 3
UTF8_4byte_short_20B 45.61 46.78 46.23 48 3
UTF8_grapheme_long_450B 45.84 46.71 46.27 48 3
UTF8_grapheme_short 45.74 47.15 46.29 48 3
UTF8_mixed_long_700B 45.97 47.01 46.23 48 3
UTF8_mixed_short 45.62 46.57 46.11 48 3
invalid_UTF8_long 46.15 47.23 46.64 48 3
invalid_UTF8_short 46.04 47.67 46.55 48 3
limit_one 46.09 46.81 46.37 48 3
limit_zero 45.86 47.20 46.36 48 3

With UTF-8 handling

Test Name min-ns/op max-ns/op avg-ns/op B/op allocs/op
ASCII_long_1KB 108.70 112.10 110.35 48 3
ASCII_medium_100B 52.04 53.81 52.62 48 3
ASCII_no_truncation 34.66 35.87 35.38 32 2
ASCII_short_10B 49.26 50.85 49.87 48 3
UTF8_2byte_medium 140.10 144.30 141.76 48 3
UTF8_2byte_short 56.76 59.06 57.44 48 3
UTF8_3byte_long_900B 436.30 448.20 442.20 48 3
UTF8_3byte_medium_90B 82.05 83.50 82.69 48 3
UTF8_3byte_short_9B 53.04 54.77 53.92 48 3
UTF8_4byte_long_1KB 370.70 376.20 373.96 48 3
UTF8_4byte_medium_100B 81.60 82.94 82.24 48 3
UTF8_4byte_short_20B 54.65 55.93 55.35 48 3
UTF8_grapheme_long_450B 344.90 353.60 347.67 48 3
UTF8_grapheme_short 56.65 58.34 57.27 48 3
UTF8_mixed_long_700B 489.00 506.10 496.13 48 3
UTF8_mixed_short 58.35 60.57 59.23 48 3
invalid_UTF8_long 50.06 51.11 50.62 48 3
invalid_UTF8_short 50.23 51.29 50.56 48 3
limit_one 56.62 58.02 57.33 48 3
limit_zero 56.36 57.82 57.14 48 3

Old vs New

Test Name Old avg New avg Diff Diff % B/op allocs/op
ASCII_long_1KB 45.75 110.35 +64.60 +141.2% 48 3
ASCII_medium_100B 45.57 52.62 +7.05 +15.5% 48 3
ASCII_no_truncation 34.42 35.38 +0.96 +2.8% 32 2
ASCII_short_10B 45.53 49.87 +4.33 +9.5% 48 3
UTF8_2byte_medium 46.24 141.76 +95.52 +206.6% 48 3
UTF8_2byte_short 46.54 57.44 +10.90 +23.4% 48 3
UTF8_3byte_long_900B 46.16 442.20 +396.04 +858.1% 48 3
UTF8_3byte_medium_90B 46.87 82.69 +35.81 +76.4% 48 3
UTF8_3byte_short_9B 46.23 53.92 +7.68 +16.6% 48 3
UTF8_4byte_long_1KB 46.37 373.96 +327.59 +706.5% 48 3
UTF8_4byte_medium_100B 46.18 82.24 +36.06 +78.1% 48 3
UTF8_4byte_short_20B 46.23 55.35 +9.12 +19.7% 48 3
UTF8_grapheme_long_450B 46.27 347.67 +301.40 +651.5% 48 3
UTF8_grapheme_short 46.29 57.27 +10.98 +23.7% 48 3
UTF8_mixed_long_700B 46.23 496.13 +449.90 +973.2% 48 3
UTF8_mixed_short 46.11 59.23 +13.11 +28.4% 48 3
invalid_UTF8_long 46.64 50.62 +3.98 +8.5% 48 3
invalid_UTF8_short 46.55 50.56 +4.01 +8.6% 48 3
limit_one 46.37 57.33 +10.96 +23.6% 48 3
limit_zero 46.36 57.14 +10.78 +23.3% 48 3

I feel the overhead is acceptable for correctness, valid UTF-8 output will preventing downstream issues with systems that expect valid UTF-8 strings. But happy to know your views on this, if it helps putting under a flag is also fine.

@edmocosta

Copy link
Copy Markdown
Contributor

Thank you @AsishRaju for the benchmarks and great summary!

I feel the overhead is acceptable for correctness, valid UTF-8 output will preventing downstream issues with systems that expect valid UTF-8 strings. But happy to know your views on this, if it helps putting under a flag is also fine.

Correctness is indeed important, but we also need to take into consideration users with high throughput where a considerably small overhead per transaction might have a big impact. We also don't know how big the strings people are manipulating are, so the overhead might be something from ~20% for typical cases (1.2 - 1.3x slower), to 7-11x times slower when the value is a long and complex UTF-8 string.

Based on the benchmarks, I think I'd go with the extra optional argument to enable it, being false the default. WDYT @bogdan-st @TylerHelmuth @evan-bradley?

@AsishRaju AsishRaju force-pushed the asish/fix-truncate-all-utf8 branch from e93e698 to 71f5a6f Compare January 15, 2026 17:41
@AsishRaju

AsishRaju commented Jan 28, 2026

Copy link
Copy Markdown
Contributor Author

Bumping this up @edmocosta, now added optional flag to make it backward compatible

Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated
Comment thread pkg/ottl/ottlfuncs/README.md Outdated
Comment thread pkg/ottl/ottlfuncs/README.md Outdated
Comment thread pkg/ottl/ottlfuncs/README.md Outdated
Comment thread .chloggen/ottl-truncate-all-utf8.yaml Outdated
Comment thread .chloggen/ottl-truncate-all-utf8.yaml Outdated
Comment thread pkg/ottl/ottlfuncs/func_truncate_all_test.go
@AsishRaju

Copy link
Copy Markdown
Contributor Author

@edmocosta applied your suggestions, made this change categorize as breaking as the default behaviour is changed but your call on that

@AsishRaju

Copy link
Copy Markdown
Contributor Author

@edmocosta i think the CI has passed

@edmocosta edmocosta left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @AsishRaju!

@evan-bradley @TylerHelmuth @bogdandrutu could you please take a look and let me know if you agree on making safe UTF8 truncation the default behavior?

@TylerHelmuth TylerHelmuth merged commit 4d0fb00 into open-telemetry:main Mar 11, 2026
191 checks passed
@otelbot

otelbot Bot commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

Thank you for your contribution @AsishRaju! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[pkg/ottl] truncate_all function corrupts UTF-8 encoding

6 participants