
Define restrictions for attribute and label keys #504

Closed
arminru opened this issue Mar 6, 2020 · 39 comments
Labels
  • area:api - Cross language API specification issue
  • release:after-ga - Not required before GA release, and not going to work on before GA
  • spec:baggage - Related to the specification/baggage directory
  • spec:metrics - Related to the specification/metrics directory
  • spec:resource - Related to the specification/resource directory
  • spec:trace - Related to the specification/trace directory
Comments

@arminru
Member

arminru commented Mar 6, 2020

The spec should define restrictions for the keys used in span/link/event/resource attributes as well as metric labels.
This gives exporters, protocols and back ends a reliable definition of which keys they must expect to receive and process. From a user perspective, it assures users that the keys they define can be processed by every compliant exporter and back end.
Without any restriction, it is highly likely that SDKs, exporters and back ends will introduce their own arbitrary limits or make assumptions about the data they receive. This could result in broken instrumentations when users define keys that are valid in one setup and then switch to another that uses different limits or makes different assumptions.

I would propose defining the following restrictions:

  • a clearly defined set of allowed characters
  • case sensitivity
  • a maximum length

For metric names, we have already defined the first two of these, although I would like the character set to be specified more clearly. A length limit is currently missing.

external systems. Metric names conform to the following syntax:
1. They are non-empty strings
2. They are case-insensitive
3. The first character must be non-numeric, non-space, non-punctuation
4. Subsequent characters must belong to the alphanumeric characters, '_', '.', and '-'.
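
Purely as an illustration of the quoted syntax (not spec text), a validity check could look roughly like the sketch below; treating rule 3 as "the first character is an ASCII letter" is my own reading, given that rule 4 restricts later characters to alphanumerics, '_', '.' and '-':

```python
import re

# Hypothetical checker for the quoted metric name rules; restricting the
# first character to ASCII letters is an assumption, not spec wording.
_METRIC_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.\-]*")

def is_valid_metric_name(name: str) -> bool:
    # Rule 1 (non-empty) and rules 3/4 (character set); rule 2
    # (case-insensitivity) is about comparison, not validation.
    return _METRIC_NAME_RE.fullmatch(name) is not None
```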

OTel-Java, for example, rejects metric names and resource labels that are longer than 255 chars or contain non-printable characters. I doubt that this is aligned across the other language implementations.

We will also have to decide on how we want to handle violations of those restrictions:

  • reject the key silently
  • replace the key with a default to make it easier to discover the error
  • try to "fix" the name by removing/replacing illegal characters and cropping it to the maximum length (a rough sketch of this option follows below)
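
A rough, non-normative sketch of the third option, assuming a 255-character maximum and the letter/digit/'_'/'.'/'-' character set discussed above (both are placeholders, not agreed-upon values):

```python
import re

MAX_KEY_LENGTH = 255  # placeholder limit for illustration only

def fix_key(key: str) -> str:
    # Replace anything outside the assumed allowed set, then crop.
    fixed = re.sub(r"[^A-Za-z0-9_.\-]", "_", key)
    # An empty key becomes the literal "empty" so the error stays discoverable.
    return (fixed or "empty")[:MAX_KEY_LENGTH]
```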

related: open-telemetry/opentelemetry-collector#7095

@jmacd
Contributor

jmacd commented Mar 9, 2020

I support a size limit of 255 bytes.

May we settle on the ANSI X3.4-1986 (i.e., "US-ASCII") character set for encoding?

Can we say the same for Span and Metric names as for attribute/resource/label keys? (😢 I was hoping to put emojis in my metric names.)

As for handling violations, I am in favor of specifying that the SDK MUST NOT reject the key, silently or otherwise. SDKs may take any of the other options but MUST yield a valid name somehow. For the default SDK, I'm in favor of trying to fix the key. For empty keys and null keys, I'm happy with "empty" and "null", but the file and line or class and method name would offer more help in locating the violation.

@jmacd
Contributor

jmacd commented Mar 9, 2020

(ISO 646)

@arminru
Member Author

arminru commented Mar 10, 2020

A limit of 255 sounds reasonable to me as well.

Restricting it to a subset of ASCII (non-control, non-space, ...) also seems most reasonable to me. This allows for maximum compatibility, since it should be consumable by all possible back ends. I can't think of use cases that would yield other sensible names (although emojis could of course be funny).

As for handling violations, I'd like to hear some more opinions from others here as well.

@Oberon00
Member

Oberon00 commented Mar 17, 2020

A concern that came up in the spec SIG meeting was which strings this should apply to:

  • Attribute and label keys?
  • Span and event names?
  • Metric label values?
  • Other attribute values?

@SergeyKanzhelev
Member

I want to make sure we are thinking about globalization. I have seen many customers who want to use their native language to name spans or custom attributes. If there is no technical reason to limit the character set, I would argue the product will be more inclusive if we allow Unicode characters in names and especially in values. Operators using the collected data may be better off with names in their native language.

Size limits are very reasonable and needed. Size limits for string values are a good way to address security and malicious-attack concerns.

@jmacd
Contributor

jmacd commented Mar 18, 2020

Choosing a non-7-bit character encoding will support internationalization, but it introduces a lot of complications for SDKs to be "Correct". For example, it becomes incorrect to truncate strings on an arbitrary byte offset. It also requires, sometimes and in some languages, the programmer to have explicit knowledge of the data they pass in--for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it? This seems to either introduce correctness challenges or implementation challenges.

I find myself quietly wishing for Unicode and UTF8 here, but it has a lot to do with working in Go long enough to forget these headaches in C++. Just think--we can put emojis in our span names! (sorry not sorry)

@pauldraper

I don't think you'll find something that doesn't need sanitization.

Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.

Most telemetry libraries currently do client-side sanitization. I don't think you're going to get away from that without being draconian.


for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it?

Do the same thing you would do if the user passes NaN as the metric value. Blow up, or ignore it, or do whatever is best for the language's paradigm, and document that decision.

@arminru
Member Author

arminru commented Mar 30, 2020

Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.

@pauldraper I would like to define rather lenient restrictions covering what can be considered common, and leave further special restrictions (like not allowing slashes, hyphens or underscores) up to the exporters, which should replace strings according to the system they export to. This way, users can rely on the data they provide being handled by (at least almost) all systems, since exporters will be able to transform it into something valid. Having a minimal supported (and thus maximal recommended) length would also help a lot.

@arminru
Member Author

arminru commented Mar 30, 2020

Regarding character set and encoding:

ASCII would be the easiest and most foolproof solution but I agree with @SergeyKanzhelev that we should take internationalization into account. It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷‍♂).

@jmacd I don't think we need to define an encoding in the API. This is up to the exporter and wire protocol. We should only define the allowed character set (e.g., printable Unicode characters) and limit keys to a certain number of characters (not bytes) at which they are truncated. If a system only understands ASCII, then the exporter for that system should replace other characters accordingly. WDYT?

@tsloughter
Member

Has there been any decision?

I would vote for UTF8 for everything.

@tsloughter
Member

And if using UTF-8 then the easiest is to limit on bytes, I think? But require that truncation happen on grapheme-cluster boundaries and not code points. Meaning if a span has 👩‍👩‍👦‍👦 in the name, it will either appear as the whole family or be removed entirely, not possibly be truncated to 👩👩👦.

May also have to define normalization as NFC so that comparisons are the same.
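
Just to illustrate what grapheme-cluster-aware truncation involves (and why it needs Unicode-aware library support), here is a sketch using the third-party Python regex package, whose \X pattern matches one grapheme cluster; this is not a proposal for normative text:

```python
import regex  # third-party package; the stdlib `re` module has no \X

def truncate_to_grapheme_boundary(value: str, max_bytes: int) -> str:
    """Keep whole grapheme clusters while staying within a UTF-8 byte budget."""
    kept, used = [], 0
    for cluster in regex.findall(r"\X", value):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)
```

With this approach, a 👩‍👩‍👦‍👦 in a span name is either kept whole or dropped, never split into its component code points.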

@jmacd
Contributor

jmacd commented Apr 14, 2020

A useful resource popped up on HN today: http://utf8everywhere.org/

@Oberon00
Member

People (incl myself 😉) are putting 😄-reactions under @tsloughter's comment, but these are all real problems. Unicode is complex.

require that the truncation happen on grapheme cluster

I think this requires (relatively) large Unicode databases and library support that not everyone has. I guess we can agree that we should at least only have valid UTF-8 after truncation (so no mid-codepoint truncation), because that is easy to implement in all languages (even without library support if need be).

If truncation happens at all, that's already an error condition, so I wouldn't put too much effort into trying to make the error as small as possible.
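
A minimal sketch of that "valid UTF-8 after truncation" behaviour (assuming the input is a valid Unicode string to begin with):

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without cutting mid-codepoint."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing may split the last multi-byte sequence; errors="ignore"
    # simply drops that partial tail, leaving valid UTF-8.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```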

May also have to define normalization as NFC so that comparisons are the same.

Unicode normalization has the same problem of being hard to implement. I think normalization is a nice-to-have, but it should not be a MUST requirement. There should be a recommendation for which normal form to use if you do normalize, though, and a recommendation for users to normalize their input up front. Where this is IMHO most important is metric labels (keys and especially values), because this influences client-side aggregation.
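
For what it's worth, in languages where the Unicode database ships with the standard library this recommendation is a one-liner; a Python sketch:

```python
import unicodedata

def normalize_label(value: str) -> str:
    # NFC as the recommended (not required) normal form, so that visually
    # identical label values aggregate into the same series.
    return unicodedata.normalize("NFC", value)

assert normalize_label("e\u0301") == "\u00e9"  # 'e' + combining acute == 'é'
```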

@Oberon00
Member

And there might be cases where you don't want to normalize, e.g. when you implement metrics for some Unicode normalization microservice (just pulling this example out of thin air here) and use the un-normalized input words as label values.

@jmacd
Contributor

jmacd commented Apr 28, 2020

I would be comfortable with a specification saying that everything is encoded in UTF-8, and that when truncating, an SDK is permitted to use byte-level truncation, leaving an invalid code point. Consumers of this data should learn to deal with truncated content and mark it as such when presenting it to the user.

@tsloughter
Member

I have a couple of issues with that. First, truncating doesn't necessarily leave an invalid code point; it could leave a valid code point but a grapheme that is different from the one the user used.

But also, this entails not only that the consumer library be able to properly handle UTF-8, but also that it can handle a mixture of valid and invalid UTF-8 and be able to truncate the invalid part.

@tsloughter
Member

tsloughter commented Apr 28, 2020

Well, http://utf8everywhere.org/ seems to say byte-level truncation is the way to go (says code-unit truncation but in the case of UTF8 that is a byte, so same thing). And I won't argue with that as the author knows much much more about UTF8 than I do :)

@tsloughter
Member

Aside from truncation, we'll need a recommendation on what to do when exporting to something that only supports ASCII. Like, what should the Prometheus exporter do for UTF-8 metric names that go beyond ASCII characters?

Maybe make the requirement/suggestion that such characters are replaced by _?

@Oberon00
Member

Oberon00 commented Apr 29, 2020

So that 流量 becomes __? That does not sound very helpful. We could use URL encoding to at least avoid introducing collisions, but I guess there is only so much we can do for ASCII-only systems.

@tsloughter
Member

Ah yeah, good point about conflicts. I was thinking specifically about Prometheus, which doesn't simply limit to ASCII but to [a-zA-Z_:][a-zA-Z0-9_:]*, so URL encoding wouldn't be an option.

I guess in the case of a system that allows all of ASCII we could suggest URL encoding, and for something like Prometheus we could have our own form of URL encoding? As in, instead of % use :, which appears to be allowed in Prometheus strings.
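
A sketch of that idea, purely illustrative and not an agreed-upon scheme: escape every character outside [a-zA-Z0-9_] as ':' plus the hex of its UTF-8 bytes, so distinct inputs cannot collide the way a blanket '_' replacement would, while staying inside the Prometheus grammar [a-zA-Z_:][a-zA-Z0-9_:]*:

```python
import re

_KEEP = re.compile(r"[a-zA-Z0-9_]")  # ':' is reserved as the escape marker

def _escape(text: str) -> str:
    # e.g. '流' -> ':e6:b5:81'
    return "".join(f":{b:02x}" for b in text.encode("utf-8"))

def to_prometheus_name(name: str) -> str:
    out = []
    for i, ch in enumerate(name):
        keep = _KEEP.fullmatch(ch) and not (i == 0 and ch.isdigit())
        out.append(ch if keep else _escape(ch))
    return "".join(out) or "_"  # the grammar does not allow an empty name
```

With that, 流量 becomes :e6:b5:81:e9:87:8f instead of __, and a backend-aware consumer could even reverse the mapping.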

@pauldraper

pauldraper commented May 20, 2020

It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷‍♂).

You're going to have guidelines for permitted characters, and those will include EMOJIS?

Like, it's not safe to use a question mark, but it is safe to use a unicycle?


I'm confused by the encoding thing. E.g. https://github.com/open-telemetry/opentelemetry-python has attributes/labels as character strings. AFAIK they aren't ever encoded into bytes (except by whatever Tracer implementation is being used, which is a free variable as far as the specification is concerned). Should they be byte strings instead?

@tsloughter
Member

Aren't python strings now always utf8? Is there a difference with "byte strings"?

@pauldraper

pauldraper commented May 21, 2020

Aren't python strings now always utf8?

No, they are character strings. (If you mean the internal in-memory representation....I think CPython actually uses byte arrays of various encodings, but that's never exposed to the user.)

You can convert a character string (character sequence) to a byte string (byte sequence) by choosing an encoding and calling str.encode(encoding). AFAIK the Python OTel API is all character strings.

As, I imagine, many languages are: Ruby, Rust, Go. (Java, JS and C++ don't, because they've never really gotten a proper Unicode character string....just byte or double-byte arrays.)
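
A tiny example of the distinction, assuming UTF-8 is the encoding chosen at the boundary:

```python
s = "流量"              # character string: a sequence of code points
b = s.encode("utf-8")   # byte string: b'\xe6\xb5\x81\xe9\x87\x8f'
assert isinstance(b, bytes) and b.decode("utf-8") == s
```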

@carlosalberto carlosalberto modified the milestones: v0.5, v0.6 Jun 9, 2020
@arminru arminru added the area:api Cross language API specification issue label Jun 9, 2020
@tsloughter
Member

Agree with @Oberon00 on case sensitivity.

And to do so we need to define the normalization each library must use? NFC or NFD is probably fine? I've seen references to NFKC being preferred for identifiers for security reasons, but that must refer to things like username identifiers, so there's no need to go that route for metric names, attributes or labels.

@yurishkuro
Member

Just came here from an internal discussion and was surprised this is (a) not resolved, and (b) classified after-GA.

@AndrewAXue you removed the after-ga label, what was the justification for that?

@yurishkuro
Member

For reference, the W3C Baggage spec (https://w3c.github.io/baggage/) only allows US-ASCII for baggage keys. Granted, baggage has a narrower scope than trace attributes.

@tsloughter
Member

+1 to wanting this before GA.

We could follow W3C Baggage and say keys are ASCII and values are UTF-8. Though I'm still not sure what is supposed to happen when exporting to Prometheus, which supports even fewer characters in a key.

@yurishkuro
Member

what is supposed to happen when exporting to Prometheus which supports even fewer characters in a key.

Substitution, specific to the exporter. I would rather do that than bubble up those restrictions to all of OTEL - there are many backends with more relaxed string limits for metric labels.

@tsloughter
Member

Right, I meant what that substitution would be. But, true, as long as the solution of substitution is accepted, it is a question that can be answered separately from, and later than, defining what is allowed in OTEL attribute keys/values.

@jmacd
Contributor

jmacd commented Sep 9, 2022

This has to be addressed; at the very least we have to specify what an SDK does with invalid UTF-8 for metric names, attribute keys, attribute values, span names, etc.

@pauldraper

what an SDK does with invalid UTF-8

I really continue to be befuddled by what the byte encoding has to do with anything.

Are attributes/tags bytes or text?

@jmacd
Contributor

jmacd commented Sep 9, 2022

See the discussion here: #3421

@pauldraper

Now I'm confused about whether we're talking about OTLP or the API/SDK.

@austinlparker
Member

Closed in favor of #3950 as a more specific instance of the needed clarification. If this is still relevant, please open new issues.

@arminru arminru closed this as not planned on Apr 25, 2024