Skip to content

[pkg/ottl]: Add UTF-8 safe slicing support to Substring function#48590

Merged
edmocosta merged 5 commits into
open-telemetry:mainfrom
Nagato-Yuzuru:feat/ottl-substring-utf-8
Jun 4, 2026
Merged

[pkg/ottl]: Add UTF-8 safe slicing support to Substring function#48590
edmocosta merged 5 commits into
open-telemetry:mainfrom
Nagato-Yuzuru:feat/ottl-substring-utf-8

Conversation

@Nagato-Yuzuru

Copy link
Copy Markdown
Contributor

Description

Adds an optional utf8_safe parameter to Substring. With true, start and length count runes and slicing lands on rune boundaries. Defaults to false. Existing byte behavior doesn't change.

Link to tracking issue

Fixes #48436

Testing

Unit tests for the rune path (CJK, emoji, out-of-range) plus an e2e case using the exact repro from the issue.

Documentation

Updated the Substring entry in pkg/ottl/ottlfuncs/README.md.

Copilot AI review requested due to automatic review settings May 22, 2026 09:31
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 22, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: Nagato-Yuzuru / name: Nagato Yuzuru (bb1650a)

@github-actions github-actions Bot added the first-time contributor PRs made by new contributors label May 22, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

  • Read our Contributing Guidelines.
  • Sign the CLA if you haven't already.
  • First-time contributors should have at most one PR not marked as draft until their first PR is merged.
  • If your change isn't one of our priority components, reviews may take more time.
  • Give reviewers at least a few days before pinging them for feedback.
  • If you need help or struggle to move your PR forward, raise the topic on #otel-collector-dev or a Collector SIG meeting.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional utf8_safe parameter to the OTTL Substring converter to allow rune-based (UTF-8 boundary safe) slicing while preserving existing byte-slicing behavior by default.

Changes:

  • Extend Substring to accept utf8_safe (optional bool, default false) and implement rune-based slicing when enabled.
  • Add unit tests for UTF-8 safe behavior and an e2e test covering the reported reproduction scenario.
  • Update OTTL function documentation and add a changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/ottl/ottlfuncs/README.md Documents the new optional utf8_safe argument and clarifies byte vs rune semantics.
pkg/ottl/ottlfuncs/func_substring.go Implements UTF-8 safe rune slicing path and supporting helpers.
pkg/ottl/ottlfuncs/func_substring_test.go Adds unit coverage for default byte behavior and utf8_safe rune behavior (including error cases).
pkg/ottl/e2e/e2e_test.go Adds e2e assertions for Substring(..., true) behavior and range error handling.
.chloggen/feat_ottl-substring-utf-8.yaml Adds release note entry for the new optional parameter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/ottl/ottlfuncs/func_substring.go Outdated

@edmocosta edmocosta left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Nagato-Yuzuru, can we apply here a similar truncate_all solution #45055? The reasoning, concerns, and assumptions for that function are also valid for this function, so I think the code should be almost the same.

@Nagato-Yuzuru

Copy link
Copy Markdown
Contributor Author

@edmocosta Thanks for the review. Happy to make changes, but I'd like to confirm the direction first.

In truncate_all, the length is still counted in bytes; utf8Safe only controls whether a multi-byte character gets cut in half. The flag we proposed in the issue #48436 (also utf8Safe) instead switches between counting length in runes vs. bytes, at least that's how I understood it.

If that's the case, reusing truncate_all's RuneStart check doesn't quite fit here: RuneStart only snaps a byte offset back to a boundary, it can't express a rune count.

So I want to make sure what "similar truncate_all solution" refers to:

  1. using a RuneStart-style check to avoid splitting a character (which, based on the earlier discussion, is really a separate concern from what the issue asks for), or
  2. matching the function's shape, e.g. computing the cursors for both modes and returning from a single exit point.

@Nagato-Yuzuru Nagato-Yuzuru force-pushed the feat/ottl-substring-utf-8 branch from 07c6bdb to 022c6ff Compare May 25, 2026 03:25
@edmocosta

Copy link
Copy Markdown
Contributor

So I want to make sure what "similar truncate_all solution" refers to:

Substring should to not cut multi-byte character in half, like the truncate_all's Utf8Safe does, and it should continue to be byte-based instead of rune-based. For Substring, we'd need it to check both upper and lower bounds, so if start is not utf8.RuneStart, advance the offset (byte) until it's, or it reaches the end of the string. If the end isn't utf8.RuneStart, back up like truncate_all does (documenting it might produce shorter strings than requested).

  1. using a RuneStart-style check to avoid splitting a character (which, based on the earlier discussion, is really a separate concern from what the issue asks for).

If using rune-based was the issue's goal, I probably misunderstood it, and maybe @TylerHelmuth did it as well?
I wouldn't introduce a different concept/behavior for this function, especially using the same argument name as truncate_all. In addition to that, other string manipulation functions are byte-base (e.g Len), meaning this one would be breaking the consistency.

WDYT @TylerHelmuth @evan-bradley?

Thanks!

@TylerHelmuth

Copy link
Copy Markdown
Member

Yes we should stick with bytes

@Nagato-Yuzuru

Copy link
Copy Markdown
Contributor Author

Get it. It really is no different from truncate_all. I'll submit the new changes shortly.

@Nagato-Yuzuru Nagato-Yuzuru requested a review from edmocosta May 28, 2026 17:48

@edmocosta edmocosta left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @Nagato-Yuzuru!

Comment thread pkg/ottl/ottlfuncs/README.md Outdated
Comment thread pkg/ottl/ottlfuncs/func_substring.go Outdated
Comment thread pkg/ottl/ottlfuncs/func_substring_test.go
Nagato-Yuzuru and others added 2 commits June 4, 2026 01:11
Co-authored-by: Edmo Vamerlatti Costa <11836452+edmocosta@users.noreply.github.com>
@edmocosta edmocosta merged commit 14b0683 into open-telemetry:main Jun 4, 2026
160 checks passed
@otelbot

otelbot Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Thank you for your contribution @Nagato-Yuzuru! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

first-time contributor PRs made by new contributors pkg/ottl

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[pkg/ottl] The Substring function corrupts multibyte UTF-8 strings (byte based slicing)

5 participants