
Define restrictions for attribute and label keys #504

Closed
arminru opened this issue Mar 6, 2020 · 39 comments
Labels
  • area:api - Cross language API specification issue
  • release:after-ga - Not required before GA release, and not going to work on before GA
  • spec:baggage - Related to the specification/baggage directory
  • spec:metrics - Related to the specification/metrics directory
  • spec:resource - Related to the specification/resource directory
  • spec:trace - Related to the specification/trace directory
Comments

@arminru
Member

arminru commented Mar 6, 2020

The spec should define restrictions for the keys used in span/link/event/resource attributes as well as metric labels.
This gives exporters, protocols and back ends a reliable definition of which keys they must expect to receive and process. From a user perspective, it assures users that the keys they define can be processed by every compliant exporter and back end.
Without any restriction, it is highly likely that SDKs, exporters and back ends will introduce their own arbitrary limits or make assumptions about the data they receive. This could result in broken instrumentations when users define keys that are valid in one setup and then switch to another that uses different limits or makes different assumptions.

I would propose defining the following restrictions:

  • a clearly defined set of allowed characters
  • case sensitivity
  • a maximum length

For metric names, we have already defined the first two of these, although I would like the character set to be specified more clearly. A length limit is currently missing.

external systems. Metric names conform to the following syntax:
1. They are non-empty strings
2. They are case-insensitive
3. The first character must be non-numeric, non-space, non-punctuation
4. Subsequent characters must belong to the alphanumeric characters, '_', '.', and '-'.
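
Purely as an illustration of the quoted syntax (not spec text), a validity check could look roughly like the sketch below; treating rule 3 as "the first character is an ASCII letter" is my own reading, given that rule 4 restricts later characters to alphanumerics, '_', '.' and '-':

```python
import re

# Hypothetical checker for the quoted metric name rules; restricting the
# first character to ASCII letters is an assumption, not spec wording.
_METRIC_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.\-]*")

def is_valid_metric_name(name: str) -> bool:
    # Rule 1 (non-empty) and rules 3/4 (character set); rule 2
    # (case-insensitivity) is about comparison, not validation.
    return _METRIC_NAME_RE.fullmatch(name) is not None
```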

OTel-Java, for example, rejects metric names and resource labels that are longer than 255 chars or contain non-printable characters. I doubt that this is aligned across the other language implementations.

We will also have to decide on how we want to handle violations of those restrictions:

  • reject the key silently
  • replace the key with a default to make it easier to discover the error
  • try to "fix" the name by removing/replacing illegal characters and cropping it to the maximum length (a rough sketch of this option follows below)
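
A rough, non-normative sketch of the third option, assuming a 255-character maximum and the letter/digit/'_'/'.'/'-' character set discussed above (both are placeholders, not agreed-upon values):

```python
import re

MAX_KEY_LENGTH = 255  # placeholder limit for illustration only

def fix_key(key: str) -> str:
    # Replace anything outside the assumed allowed set, then crop.
    fixed = re.sub(r"[^A-Za-z0-9_.\-]", "_", key)
    # An empty key becomes the literal "empty" so the error stays discoverable.
    return (fixed or "empty")[:MAX_KEY_LENGTH]
```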

related: open-telemetry/opentelemetry-collector#7095

@jmacd
Contributor

jmacd commented Mar 9, 2020

I support a size limit of 255 bytes.

May we settle on the ANSI X3.4-1986 (i.e., "US-ASCII") character set for encoding?

Can we say the same for Span and Metric names as for attribute/resource/label keys? (😢 I was hoping to put emojis in my metric names.)

As for handling violations, I am in favor of specifying that the SDK MUST NOT reject the key, silently or otherwise. SDKs may take any of the other options but MUST yield a valid name somehow. For the default SDK, I'm in favor of trying to fix the key. For empty keys and null keys, I'm happy with "empty" and "null", but the file and line or class and method name would offer more help in locating the violation.

@jmacd
Contributor

jmacd commented Mar 9, 2020

(ISO 646)

@arminru
Member Author

arminru commented Mar 10, 2020

A limit of 255 sounds reasonable to me as well.

Restricting it to a subset of ASCII (non-control, non-space, ...) also seems most reasonable to me. This allows for maximum compatibility, since it should be consumable by all possible back ends. I can't think of use cases that would yield other sensible names (although emojis could of course be funny).

As for handling violations, I'd like to hear some more opinions from others here as well.

@Oberon00
Member

Oberon00 commented Mar 17, 2020

A concern that came up in the spec SIG meeting was which strings this should apply to:

  • Attribute and label keys?
  • Span and event names?
  • Metric label values?
  • Other attribute values?

@SergeyKanzhelev
Member

I want to make sure we are thinking about globalization. I have seen many customers who want to use their native language to name spans or custom attributes. If there is no technical reason to limit the character set, I would argue the product will be more inclusive if we allow Unicode characters in names and especially in values. Operators using the collected data may be better off with names in their native language.

Size limits are very reasonable and needed. Size limits for string values are a good way to address security and malicious-attack concerns.

@jmacd
Contributor

jmacd commented Mar 18, 2020

Choosing a non-7-bit character encoding will support internationalization, but it introduces a lot of complications for SDKs to be "Correct". For example, it becomes incorrect to truncate strings on an arbitrary byte offset. It also requires, sometimes and in some languages, the programmer to have explicit knowledge of the data they pass in--for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it? This seems to either introduce correctness challenges or implementation challenges.

I find myself quietly wishing for Unicode and UTF8 here, but it has a lot to do with working in Go long enough to forget these headaches in C++. Just think--we can put emojis in our span names! (sorry not sorry)

@pauldraper

I don't think you'll find something that doesn't need sanitization.

Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.

Most telemetry libraries currently do client-side sanitization. I don't think you're going to get away from that without being draconian.


for example in C++, if we require strings to use a UTF8 encoding and the user supplies a string that is not UTF8-encoded, are we required to sanitize it?

Do the same thing you would do if the user passes NaN as the metric value. Blow up, or ignore it, or do whatever is best for the language's paradigm, and document that decision.

@arminru
Member Author

arminru commented Mar 30, 2020

Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters.

@pauldraper I would like to define rather lenient restrictions covering what can be considered common, and leave further special restrictions (like not allowing slashes, hyphens or underscores) up to the exporters, which should replace strings according to the system they export to. This way, users can rely on the data they provide being handled by (at least almost) all systems, since exporters will be able to transform it into something valid. Having a minimal supported (and thus maximal recommended) length would also help a lot.

@arminru
Member Author

arminru commented Mar 30, 2020

Regarding character set and encoding:

ASCII would be the easiest and most foolproof solution but I agree with @SergeyKanzhelev that we should take internationalization into account. It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷‍♂).

@jmacd I don't think we need to define an encoding in the API. This is up to the exporter and wire protocol. We should only define the allowed character set (e.g., printable Unicode characters) and limit keys to a certain number of characters (not bytes) at which they are truncated. If a system only understands ASCII, then the exporter for that system should replace other characters accordingly. WDYT?

@tsloughter
Member

Has there been any decision?

I would vote for UTF8 for everything.

@tsloughter
Member

And if using UTF-8 then the easiest is to limit on bytes, I think? But require that truncation happen on grapheme-cluster boundaries and not code points. Meaning if a span has 👩‍👩‍👦‍👦 in the name, it will either appear as the whole family or be removed entirely, not possibly be truncated to 👩👩👦.

May also have to define normalization as NFC so that comparisons are the same.
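
Just to illustrate what grapheme-cluster-aware truncation involves (and why it needs Unicode-aware library support), here is a sketch using the third-party Python regex package, whose \X pattern matches one grapheme cluster; this is not a proposal for normative text:

```python
import regex  # third-party package; the stdlib `re` module has no \X

def truncate_to_grapheme_boundary(value: str, max_bytes: int) -> str:
    """Keep whole grapheme clusters while staying within a UTF-8 byte budget."""
    kept, used = [], 0
    for cluster in regex.findall(r"\X", value):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)
```

With this approach, a 👩‍👩‍👦‍👦 in a span name is either kept whole or dropped, never split into its component code points.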

@jmacd
Contributor

jmacd commented Apr 14, 2020

A useful resource popped up on HN today: http://utf8everywhere.org/

@Oberon00
Member

People (incl myself 😉) are putting 😄-reactions under @tsloughter's comment, but these are all real problems. Unicode is complex.

require that the truncation happen on grapheme cluster

I think this requires (relatively) large Unicode databases and library support that not everyone has. I guess we can agree that we should at least only have valid UTF-8 after truncation (so no mid-codepoint truncation), because that is easy to implement in all languages (even without library support if need be).

If truncation happens at all, that's already an error condition, so I wouldn't put too much effort into trying to make the error as small as possible.
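
A minimal sketch of that "valid UTF-8 after truncation" behaviour (assuming the input is a valid Unicode string to begin with):

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without cutting mid-codepoint."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing may split the last multi-byte sequence; errors="ignore"
    # simply drops that partial tail, leaving valid UTF-8.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```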

May also have to define normalization as NFC so that comparisons are the same.

Unicode normalization has the same problem of being hard to implement. I think normalization is a nice-to-have, but it should not be a MUST requirement. There should be a recommendation for which normal form to use if you do normalize, though, and a recommendation for users to normalize their input up front. Where this is IMHO most important is metric labels (keys and especially values), because this influences client-side aggregation.
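
For what it's worth, in languages where the Unicode database ships with the standard library this recommendation is a one-liner; a Python sketch:

```python
import unicodedata

def normalize_label(value: str) -> str:
    # NFC as the recommended (not required) normal form, so that visually
    # identical label values aggregate into the same series.
    return unicodedata.normalize("NFC", value)

assert normalize_label("e\u0301") == "\u00e9"  # 'e' + combining acute == 'é'
```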

@Oberon00
Member

And there might be cases where you don't want to normalize, e.g. when you implement metrics for some Unicode normalization microservice (just pulling this example out of thin air here) and use the un-normalized input words as label values.

@jmacd
Contributor

jmacd commented Apr 28, 2020

I would be comfortable with a specification saying that everything is encoded in UTF-8, and that when truncating, an SDK is permitted to use byte-level truncation, leaving an invalid code point. Consumers of this data should learn to deal with truncated content and mark it as such when presenting it to the user.

@tsloughter
Member

I have a couple of issues with that. First, truncating doesn't necessarily leave an invalid code point; it could leave a valid code point but a grapheme that is different from the one the user used.

But also, this entails not only that the consumer library be able to properly handle UTF-8, but also that it can handle a mixture of valid and invalid UTF-8 and be able to truncate the invalid part.

@tsloughter
Member

tsloughter commented Apr 28, 2020

Well, http://utf8everywhere.org/ seems to say byte-level truncation is the way to go (says code-unit truncation but in the case of UTF8 that is a byte, so same thing). And I won't argue with that as the author knows much much more about UTF8 than I do :)

@tsloughter
Member

Aside from truncation, we'll need a recommendation on what to do when exporting to something that only supports ASCII. Like, what should the Prometheus exporter do for UTF-8 metric names that go beyond ASCII characters?

Maybe make the requirement/suggestion that such characters are replaced by _?

@Oberon00
Member

Oberon00 commented Apr 29, 2020

So that 流量 becomes __? That does not sound very helpful. We could use URL encoding to at least avoid introducing collisions, but I guess there is only so much we can do for ASCII-only systems.

@tsloughter
Member

Ah yeah, good point about conflicts. I was thinking specifically about Prometheus, which doesn't simply limit to ASCII but to [a-zA-Z_:][a-zA-Z0-9_:]*, so URL encoding wouldn't be an option.

I guess in the case of a system that allows all of ASCII we could suggest URL encoding, and for something like Prometheus we could have our own form of URL encoding? As in, instead of % use :, which appears to be allowed in Prometheus strings.
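
A sketch of that idea, purely illustrative and not an agreed-upon scheme: escape every character outside [a-zA-Z0-9_] as ':' plus the hex of its UTF-8 bytes, so distinct inputs cannot collide the way a blanket '_' replacement would, while staying inside the Prometheus grammar [a-zA-Z_:][a-zA-Z0-9_:]*:

```python
import re

_KEEP = re.compile(r"[a-zA-Z0-9_]")  # ':' is reserved as the escape marker

def _escape(text: str) -> str:
    # e.g. '流' -> ':e6:b5:81'
    return "".join(f":{b:02x}" for b in text.encode("utf-8"))

def to_prometheus_name(name: str) -> str:
    out = []
    for i, ch in enumerate(name):
        keep = _KEEP.fullmatch(ch) and not (i == 0 and ch.isdigit())
        out.append(ch if keep else _escape(ch))
    return "".join(out) or "_"  # the grammar does not allow an empty name
```

With that, 流量 becomes :e6:b5:81:e9:87:8f instead of __, and a backend-aware consumer could even reverse the mapping.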

@pauldraper

pauldraper commented May 20, 2020

It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷‍♂).

You're going to have guidelines for permitted characters, and those will include EMOJIS?

Like, it's not safe to use a question mark, but it is safe to use a unicycle?


I'm confused by the encoding thing. E.g. https://github.com/open-telemetry/opentelemetry-python has attributes/labels as character strings. AFAIK they aren't ever encoded into bytes (except by whatever Tracer implementation is being used, which is a free variable as far as the specification is concerned). Should they be byte strings instead?

@tsloughter
Member

Aren't python strings now always utf8? Is there a difference with "byte strings"?

@pauldraper

pauldraper commented May 21, 2020

Aren't python strings now always utf8?

No, they are character strings. (If you mean the internal in-memory representation....I think CPython actually uses byte arrays of various encodings, but that's never exposed to the user.)

You can convert a character string (character sequence) to a byte string (byte sequence) by choosing an encoding and calling str.encode(encoding). AFAIK the Python OTel API is all character strings.

As, I imagine, many languages are: Ruby, Rust, Go. (Java, JS and C++ don't, because they've never really gotten a proper Unicode character string....just byte or double-byte arrays.)
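
A tiny example of the distinction, assuming UTF-8 is the encoding chosen at the boundary:

```python
s = "流量"              # character string: a sequence of code points
b = s.encode("utf-8")   # byte string: b'\xe6\xb5\x81\xe9\x87\x8f'
assert isinstance(b, bytes) and b.decode("utf-8") == s
```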

@carlosalberto carlosalberto modified the milestones: v0.5, v0.6 Jun 9, 2020
@arminru arminru added the area:api Cross language API specification issue label Jun 9, 2020
@tsloughter
Member

Agree with @Oberon00 on case sensitivity.

And to do so we need to define the normalization each library must use? NFC or NFD is probably fine? I've seen references to NFKC being preferred for identifiers for security reasons, but that must refer to things like username identifiers, so there's no need to go that route for metric names, attributes or labels.

@yurishkuro
Member

Just came here from an internal discussion and was surprised this is (a) not resolved, and (b) classified after-GA.

@AndrewAXue you removed the after-ga label, what was the justification for that?

@yurishkuro
Member

For reference, the W3C Baggage spec (https://w3c.github.io/baggage/) only allows US-ASCII for baggage keys. Granted, baggage has a narrower scope than trace attributes.

@tsloughter
Member

+1 to wanting this before GA.

We could follow W3C Baggage and say keys are ASCII and values are UTF-8. Though I'm still not sure what is supposed to happen when exporting to Prometheus, which supports even fewer characters in a key.

@yurishkuro
Member

what is supposed to happen when exporting to Prometheus which supports even fewer characters in a key.

Substitution, specific to the exporter. I would rather do that than bubble up those restrictions to all of OTEL - there are many backends with more relaxed string limits for metric labels.

@tsloughter
Member

Right, I meant what that substitution would be. But, true, as long as the solution of substitution is accepted, it is a question that can be answered separately from, and later than, defining what is allowed in OTEL attribute keys/values.

@jmacd
Contributor

jmacd commented Sep 9, 2022

This has to be addressed; at the very least we have to specify what an SDK does with invalid UTF-8 for metric names, attribute keys, attribute values, span names, etc.

@pauldraper

what an SDK does with invalid UTF-8

I really continue to be befuddled by what the byte encoding has to do with anything.

Are attributes/tags bytes or text?

@jmacd
Contributor

jmacd commented Sep 9, 2022

See the discussion here: #3421

@pauldraper

Now I'm confused about whether we're talking about OTLP or the API/SDK.

@austinlparker
Member

Closed in favor of #3950 as a more specific instance of the needed clarification. If this is still relevant, please open new issues.

@arminru arminru closed this as not planned on Apr 25, 2024