[pkg/ottl/ottlfuncs] Add UTF-8 support to truncate_all function by AsishRaju · Pull Request #45055 · open-telemetry/opentelemetry-collector-contrib

AsishRaju · 2025-12-18T19:53:13Z

Description

The truncate_all function cuts strings at arbitrary byte positions, which can split multi-byte UTF-8 characters and produce invalid UTF-8 output.

This PR makes it back up to a valid character boundary when the limit falls mid-character. If the string isn't valid UTF-8 to begin with, it just cuts at the byte level like before.
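For illustration only, here is a minimal Go sketch of the behaviour described above. It is not the code merged in this PR: `truncateUTF8` is a hypothetical helper name, and validating the whole string up front with `utf8.ValidString` is an assumption about how the "not valid UTF-8 to begin with" case might be detected.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 is a hypothetical helper (not the PR's implementation).
// It returns at most limit bytes of s. If s is valid UTF-8 and the cut
// would land inside a multi-byte character, it backs up to the start of
// that character. Invalid input is cut at the byte level.
func truncateUTF8(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if len(s) <= limit {
		return s // nothing to truncate
	}
	if !utf8.ValidString(s) {
		return s[:limit] // byte-level cut, as before
	}
	// s[limit] is the first byte that would be dropped. If it is a
	// continuation byte, the cut is mid-character, so step back.
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit--
	}
	return s[:limit]
}

func main() {
	fmt.Printf("%q\n", truncateUTF8("héllo", 2))  // "h"  (é is 2 bytes)
	fmt.Printf("%q\n", truncateUTF8("héllo", 3))  // "hé"
	fmt.Printf("%q\n", truncateUTF8("日本語", 4)) // "日" (each rune is 3 bytes)
}
```

Note that the result can be shorter than the limit, but never longer, so the limit remains an upper bound in bytes.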

Based on the approach suggested by @jade-guiton-dd in #36713.

Link to tracking issue

Fixes #36017, #36713

Testing

Added Test_truncateAll_UTF8 with test cases covering rune-boundary truncation and the behaviour with grapheme cluster bytes
All existing tests continue to pass
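As a rough illustration of what rune-boundary truncation implies for grapheme clusters (this is not the PR's test code, and `cutAtRuneBoundary` is a hypothetical stand-in for its logic): the output stays valid UTF-8, but a cluster made of several code points can still be split between them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutAtRuneBoundary is a hypothetical stand-in, not the PR's code.
// It cuts s to at most limit bytes without splitting a rune.
func cutAtRuneBoundary(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if len(s) <= limit {
		return s
	}
	if !utf8.ValidString(s) {
		return s[:limit]
	}
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit--
	}
	return s[:limit]
}

func main() {
	cases := []struct {
		name  string
		in    string
		limit int
	}{
		// "e" followed by U+0301 (combining acute accent): 1 + 2 bytes.
		{"combining accent", "e\u0301", 2},
		// man + ZWJ + woman + ZWJ + girl: 4 + 3 + 4 + 3 + 4 bytes.
		{"ZWJ emoji sequence", "\U0001F468\u200D\U0001F469\u200D\U0001F467", 10},
	}
	for _, c := range cases {
		out := cutAtRuneBoundary(c.in, c.limit)
		fmt.Printf("%s: %d bytes, %d runes, valid=%t\n",
			c.name, len(out), utf8.RuneCountInString(out), utf8.ValidString(out))
	}
	// combining accent: 1 bytes, 1 runes, valid=true
	// ZWJ emoji sequence: 7 bytes, 2 runes, valid=true
}
```

In both cases the result is valid UTF-8, yet the visible character is changed: the accent is dropped, and the emoji sequence is left ending in a dangling zero-width joiner.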

Documentation

Updated README to note that the limit is in bytes and UTF-8 boundaries are respected.

github-actions · 2025-12-18T19:53:26Z

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

Please review our Contributing Guidelines.
Don't forget to sign the Contributor License Agreement (CLA) if you haven't already.

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

edmocosta · 2026-01-08T17:00:32Z

Hi @AsishRaju, thank you for working on this. Do we have any benchmarks comparing the existing code with the new one? If this solution adds too much overhead, we might need an extra optional argument to enable it, so we don't introduce a performance regression.

AsishRaju · 2026-01-09T04:24:23Z

Thanks for the review, @edmocosta. I ran some benchmarks comparing the existing code with the new one:

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl/ottlfuncs
cpu: Apple M4 Pro

> go test -bench=BenchmarkTruncateAll -benchmem -count=10 ./ottlfuncs/...

TL;DR: Short strings have minimal overhead (~10 ns), while long strings take ~300-500 ns with UTF-8 validation. No additional memory allocations.

With no UTF-8 handling

Test Name	min ns/op	max ns/op	avg ns/op	B/op	allocs/op
ASCII_long_1KB	45.42	46.13	45.75	48	3
ASCII_medium_100B	45.12	45.90	45.57	48	3
ASCII_no_truncation	33.96	35.31	34.42	32	2
ASCII_short_10B	44.60	49.47	45.53	48	3
UTF8_2byte_medium	45.78	46.70	46.24	48	3
UTF8_2byte_short	45.86	47.43	46.54	48	3
UTF8_3byte_long_900B	45.78	46.60	46.16	48	3
UTF8_3byte_medium_90B	46.02	48.82	46.87	48	3
UTF8_3byte_short_9B	45.87	46.90	46.23	48	3
UTF8_4byte_long_1KB	45.84	46.87	46.37	48	3
UTF8_4byte_medium_100B	45.57	46.67	46.18	48	3
UTF8_4byte_short_20B	45.61	46.78	46.23	48	3
UTF8_grapheme_long_450B	45.84	46.71	46.27	48	3
UTF8_grapheme_short	45.74	47.15	46.29	48	3
UTF8_mixed_long_700B	45.97	47.01	46.23	48	3
UTF8_mixed_short	45.62	46.57	46.11	48	3
invalid_UTF8_long	46.15	47.23	46.64	48	3
invalid_UTF8_short	46.04	47.67	46.55	48	3
limit_one	46.09	46.81	46.37	48	3
limit_zero	45.86	47.20	46.36	48	3

With UTF-8 handling

Test Name	min ns/op	max ns/op	avg ns/op	B/op	allocs/op
ASCII_long_1KB	108.70	112.10	110.35	48	3
ASCII_medium_100B	52.04	53.81	52.62	48	3
ASCII_no_truncation	34.66	35.87	35.38	32	2
ASCII_short_10B	49.26	50.85	49.87	48	3
UTF8_2byte_medium	140.10	144.30	141.76	48	3
UTF8_2byte_short	56.76	59.06	57.44	48	3
UTF8_3byte_long_900B	436.30	448.20	442.20	48	3
UTF8_3byte_medium_90B	82.05	83.50	82.69	48	3
UTF8_3byte_short_9B	53.04	54.77	53.92	48	3
UTF8_4byte_long_1KB	370.70	376.20	373.96	48	3
UTF8_4byte_medium_100B	81.60	82.94	82.24	48	3
UTF8_4byte_short_20B	54.65	55.93	55.35	48	3
UTF8_grapheme_long_450B	344.90	353.60	347.67	48	3
UTF8_grapheme_short	56.65	58.34	57.27	48	3
UTF8_mixed_long_700B	489.00	506.10	496.13	48	3
UTF8_mixed_short	58.35	60.57	59.23	48	3
invalid_UTF8_long	50.06	51.11	50.62	48	3
invalid_UTF8_short	50.23	51.29	50.56	48	3
limit_one	56.62	58.02	57.33	48	3
limit_zero	56.36	57.82	57.14	48	3

Old vs New

Test Name	Old avg (ns/op)	New avg (ns/op)	Diff (ns/op)	Diff %	B/op	allocs/op
ASCII_long_1KB	45.75	110.35	+64.60	+141.2%	48	3
ASCII_medium_100B	45.57	52.62	+7.05	+15.5%	48	3
ASCII_no_truncation	34.42	35.38	+0.96	+2.8%	32	2
ASCII_short_10B	45.53	49.87	+4.33	+9.5%	48	3
UTF8_2byte_medium	46.24	141.76	+95.52	+206.6%	48	3
UTF8_2byte_short	46.54	57.44	+10.90	+23.4%	48	3
UTF8_3byte_long_900B	46.16	442.20	+396.04	+858.1%	48	3
UTF8_3byte_medium_90B	46.87	82.69	+35.81	+76.4%	48	3
UTF8_3byte_short_9B	46.23	53.92	+7.68	+16.6%	48	3
UTF8_4byte_long_1KB	46.37	373.96	+327.59	+706.5%	48	3
UTF8_4byte_medium_100B	46.18	82.24	+36.06	+78.1%	48	3
UTF8_4byte_short_20B	46.23	55.35	+9.12	+19.7%	48	3
UTF8_grapheme_long_450B	46.27	347.67	+301.40	+651.5%	48	3
UTF8_grapheme_short	46.29	57.27	+10.98	+23.7%	48	3
UTF8_mixed_long_700B	46.23	496.13	+449.90	+973.2%	48	3
UTF8_mixed_short	46.11	59.23	+13.11	+28.4%	48	3
invalid_UTF8_long	46.64	50.62	+3.98	+8.5%	48	3
invalid_UTF8_short	46.55	50.56	+4.01	+8.6%	48	3
limit_one	46.37	57.33	+10.96	+23.6%	48	3
limit_zero	46.36	57.14	+10.78	+23.3%	48	3

I feel the overhead is acceptable for correctness: valid UTF-8 output will prevent downstream issues with systems that expect valid UTF-8 strings. But I'm happy to hear your views on this; if it helps, putting it behind a flag is also fine.

edmocosta · 2026-01-09T18:09:36Z

Thank you @AsishRaju for the benchmarks and great summary!

> I feel the overhead is acceptable for correctness: valid UTF-8 output will prevent downstream issues with systems that expect valid UTF-8 strings. But I'm happy to hear your views on this; if it helps, putting it behind a flag is also fine.

Correctness is indeed important, but we also need to consider users with high throughput, where even a fairly small overhead per transaction can have a big impact. We also don't know how large the strings people manipulate are, so the overhead could range from ~20% for typical cases (1.2-1.3x slower) to 7-11x slower when the value is a long and complex UTF-8 string.
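To relate the percentages in the "Old vs New" table to the slowdown factors quoted here, a quick sketch (the `slowdown` helper is hypothetical; the averages are copied from the table above):

```go
package main

import "fmt"

// slowdown returns how many times slower the new code is, and the
// percentage increase, given average ns/op before and after.
func slowdown(oldNs, newNs float64) (factor, pct float64) {
	factor = newNs / oldNs
	pct = (newNs - oldNs) / oldNs * 100
	return factor, pct
}

func main() {
	// Averages (ns/op) taken from the "Old vs New" table in this thread.
	rows := []struct {
		name         string
		oldNs, newNs float64
	}{
		{"ASCII_short_10B", 45.53, 49.87},
		{"UTF8_mixed_short", 46.11, 59.23},
		{"UTF8_grapheme_long_450B", 46.27, 347.67},
		{"UTF8_mixed_long_700B", 46.23, 496.13},
	}
	for _, r := range rows {
		f, p := slowdown(r.oldNs, r.newNs)
		fmt.Printf("%-24s %5.2fx  %+7.1f%%\n", r.name, f, p)
	}
}
```

A factor of 1.2x corresponds to +20%, and the long multi-byte rows land in the 7-11x range. Results recomputed from the rounded averages can differ from the table's Diff % in the last digit.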

Based on the benchmarks, I think I'd go with the extra optional argument to enable it, with false as the default. WDYT @bogdan-st @TylerHelmuth @evan-bradley?

AsishRaju · 2026-01-28T17:51:49Z

Bumping this up, @edmocosta. I've now added an optional flag to keep it backward compatible.

AsishRaju · 2026-02-20T20:34:51Z

@edmocosta I applied your suggestions and categorized this change as breaking, since the default behaviour has changed, but that's your call.

AsishRaju · 2026-02-25T01:06:29Z

@edmocosta I think the CI has passed.

edmocosta

LGTM, thank you @AsishRaju!

@evan-bradley @TylerHelmuth @bogdandrutu could you please take a look and let me know if you agree on making safe UTF-8 truncation the default behavior?

otelbot · 2026-03-11T18:05:53Z

Thank you for your contribution @AsishRaju! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

AsishRaju requested review from a team, TylerHelmuth, bogdandrutu, edmocosta and evan-bradley as code owners December 18, 2025 19:53

github-actions Bot assigned MovieStoreGuy Dec 18, 2025

github-actions Bot added the first-time contributor PRs made by new contributors label Dec 18, 2025

github-actions Bot added the pkg/ottl label Dec 18, 2025

atoulme added the waiting-for-code-owners label Jan 1, 2026

AsishRaju added 3 commits January 15, 2026 09:41

added fix and tests

fec7103

updated changelog and readme

b54e2ba

update change_type

71f5a6f

AsishRaju force-pushed the asish/fix-truncate-all-utf8 branch from e93e698 to 71f5a6f Compare January 15, 2026 17:41

AsishRaju and others added 2 commits January 28, 2026 09:16

Merge branch 'main' into asish/fix-truncate-all-utf8

9cdc76c

added optional flag

aa4b77c

alexcams reviewed Feb 3, 2026

Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated

Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go

Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated

remove redundant check

d85e04c

edmocosta reviewed Feb 11, 2026

Comment thread pkg/ottl/ottlfuncs/func_truncate_all.go Outdated

edmocosta reviewed Feb 19, 2026

apply suggested change

521f959

edmocosta and others added 2 commits February 24, 2026 18:03

Merge branch 'main' into asish/fix-truncate-all-utf8

bb37f2c

Merge branch 'main' into asish/fix-truncate-all-utf8

ae0ba52

edmocosta approved these changes Mar 2, 2026

TylerHelmuth approved these changes Mar 11, 2026

TylerHelmuth merged commit 4d0fb00 into open-telemetry:main Mar 11, 2026
191 checks passed

edmocosta mentioned this pull request May 22, 2026

[pkg/ottl]: Add UTF-8 safe slicing support to Substring function #48590

Merged
