[pkg/ottl]: Add UTF-8 safe slicing support to Substring function #48590
Pull request overview
Adds an optional utf8_safe parameter to the OTTL Substring converter to allow rune-based (UTF-8 boundary safe) slicing while preserving existing byte-slicing behavior by default.
Changes:
- Extend `Substring` to accept `utf8_safe` (optional bool, default `false`) and implement rune-based slicing when enabled.
- Add unit tests for UTF-8 safe behavior and an e2e test covering the reported reproduction scenario.
- Update OTTL function documentation and add a changelog entry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/ottl/ottlfuncs/README.md | Documents the new optional utf8_safe argument and clarifies byte vs rune semantics. |
| pkg/ottl/ottlfuncs/func_substring.go | Implements UTF-8 safe rune slicing path and supporting helpers. |
| pkg/ottl/ottlfuncs/func_substring_test.go | Adds unit coverage for default byte behavior and utf8_safe rune behavior (including error cases). |
| pkg/ottl/e2e/e2e_test.go | Adds e2e assertions for Substring(..., true) behavior and range error handling. |
| .chloggen/feat_ottl-substring-utf-8.yaml | Adds release note entry for the new optional parameter. |
edmocosta left a comment
Thanks @Nagato-Yuzuru, can we apply a solution here similar to the one used for truncate_all (#45055)? The reasoning, concerns, and assumptions for that function are also valid for this one, so I think the code should be almost the same.
@edmocosta Thanks for the review. Happy to make changes, but I'd like to confirm the direction first. In truncate_all, the length is still counted in bytes; utf8Safe only controls whether a multi-byte character gets cut in half. The flag we proposed in the issue #48436 (also utf8Safe) instead switches between counting length in runes vs. bytes, at least that's how I understood it. If that's the case, reusing So I want to make sure what "similar truncate_all solution" refers to:
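The two semantics contrasted in this comment can be sketched in Go as follows. This is an illustration under assumptions, not the actual `truncate_all` or `Substring` code: `truncateBytesUTF8Safe` and `truncateRunes` are hypothetical names. The first counts the limit in bytes and only backs off to avoid splitting a character; the second counts the limit in runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytesUTF8Safe keeps at most limit bytes of s, backing off to the
// nearest rune boundary so no multi-byte character is cut (hypothetical helper).
func truncateBytesUTF8Safe(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if limit >= len(s) {
		return s
	}
	// Step back while the cut point sits on a UTF-8 continuation byte.
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit--
	}
	return s[:limit]
}

// truncateRunes keeps at most limit runes of s (hypothetical helper).
func truncateRunes(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	n := 0
	for i := range s { // i is the byte offset where each rune starts
		if n == limit {
			return s[:i]
		}
		n++
	}
	return s
}

func main() {
	s := "日本語" // 3 runes, 9 bytes

	fmt.Println(truncateBytesUTF8Safe(s, 4)) // 日 (4 bytes would split the second character)
	fmt.Println(truncateRunes(s, 2))         // 日本
}
```

With the byte-counted variant the result never exceeds the requested byte budget, which matters when the limit exists to bound storage or payload size; the rune-counted variant can return up to four bytes per requested unit.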
Force-pushed from 07c6bdb to 022c6ff.
If rune-based counting was the issue's goal, I probably misunderstood it, and maybe @TylerHelmuth did as well? WDYT @TylerHelmuth @evan-bradley? Thanks!

Yes, we should stick with bytes.

Got it. It really is no different from
edmocosta left a comment
Thanks for working on this @Nagato-Yuzuru!
Co-authored-by: Edmo Vamerlatti Costa <11836452+edmocosta@users.noreply.github.com>
Description
Adds an optional `utf8_safe` parameter to `Substring`. With `true`, `start` and `length` count runes and slicing lands on rune boundaries. Defaults to `false`. Existing byte behavior doesn't change.
Link to tracking issue
Fixes #48436
Testing
Unit tests for the rune path (CJK, emoji, out-of-range) plus an e2e case using the exact repro from the issue.
Documentation
Updated the `Substring` entry in `pkg/ottl/ottlfuncs/README.md`.