Define restrictions for attribute and label keys #504
Comments
I support a size limit of 255 bytes. May we settle on the ANSI X3.4-1986 (i.e., "US-ASCII", the US national variant of ISO 646) character set for encoding? Can we say the same for Span and Metric names as for attribute/resource/label keys? (😢 I was hoping to put emojis in my metric names.) As for handling violations, I am in favor of specifying that the SDK MUST NOT reject the key, silently or otherwise. SDKs may take any of the other options but MUST yield a valid name somehow. For the default SDK, I'm in favor of trying to fix the key. For empty keys and null keys, I'm happy with "empty" and "null", but the file and line or class and method name would offer more help in locating the violation.
A limit of 255 sounds reasonable to me as well. Restricting it to a subset of ASCII (non-control, non-space, ...) is also most reasonable, I think. This allows for maximum compatibility, as this should be consumable by all possible back ends. I can't think of use cases that would call for other kinds of names (although emojis could of course be funny). As for handling violations, I'd like to hear some more opinions from others on here as well.
A concern that came up in the spec SIG meeting was which strings this should apply to:
I want to make sure we are thinking about globalization. I saw many customers who want to use their native language to name spans or custom attributes. If there are no technical reasons to limit the character set, I would argue the product will be more inclusive if we allow Unicode characters in names, and especially in values. Operators using the collected data may be better off with names in their native language. Size limits are very reasonable and needed. Size limits for string values are a good way to address security and malicious-attack concerns.
Choosing a non-7-bit character encoding will support internationalization, but it introduces a lot of complications for SDKs to be "correct". For example, it becomes incorrect to truncate strings at an arbitrary byte offset. It also requires, sometimes and in some languages, the programmer to have explicit knowledge of the data they pass in--for example in C++, if we require strings to use a UTF-8 encoding and the user supplies a string that is not UTF-8-encoded, are we required to sanitize it? This seems to introduce either correctness challenges or implementation challenges. I find myself quietly wishing for Unicode and UTF-8 here, but it has a lot to do with working in Go long enough to forget these headaches in C++. Just think--we can put emojis in our span names! (sorry not sorry)
I don't think you'll find something that doesn't need sanitation. Some system is not going to support slashes, and some system is not going to support hyphens, and some other system won't like quote characters. Most telemetry libraries currently do client-side sanitation. I don't think you're going to get away from that without being draconian.
Do the same thing you would do if the user passes NaN for the metric value. Blow up, or ignore it, or whatever is best for the language's paradigm, and document that decision.
@pauldraper I would like to add more lenient restrictions for what can be considered common, leaving further special restrictions (like not allowing slashes, hyphens, or underscores) up to the exporters, which should replace strings according to the system they export to. This way, users can rely on the data they provide being handled by (at least almost) all systems, since exporters will be able to transform it into something valid. Having a minimal supported (and thus maximal recommended) length would also help a lot.
Regarding character set and encoding: ASCII would be the easiest and most foolproof solution, but I agree with @SergeyKanzhelev that we should take internationalization into account. It would feel odd to build a "cloud-native" solution in 2020 that doesn't allow international characters (or emojis 🤷‍♂️). @jmacd I don't think we need to define an encoding in the API. This is up to the exporter and wire protocol. We should only define the allowed character set (e.g., printable Unicode characters) and limit keys to a certain number of characters (not bytes), at which they are truncated. If a system only understands ASCII, then the exporter for that system should replace other characters accordingly. WDYT?
Has there been any decision? I would vote for UTF8 for everything. |
And if using UTF-8, then the easiest is to limit on bytes, I think? But require that truncation happen on grapheme-cluster boundaries, not code points. Meaning if a span has 👩‍👩‍👦‍👦 in the name, it will either appear as the whole family or be removed entirely, not possibly be truncated to 👩‍👩‍👦. We may also have to define normalization as NFC so that comparisons behave the same.
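The grapheme-cluster problem above is easy to demonstrate in Python. This is a minimal sketch: the four-person family emoji is a single grapheme built from four code points joined by U+200D (ZERO WIDTH JOINER), so slicing the string by code point can silently turn it into a different, smaller family.

```python
# The "family: woman, woman, boy, boy" emoji is one grapheme cluster
# made of 4 emoji code points joined by 3 ZERO WIDTH JOINERs = 7 code points.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"  # 👩‍👩‍👦‍👦

print(len(family))  # 7 code points, although it renders as a single glyph

# Naive truncation to 5 code points yields *valid* Unicode, but a
# different grapheme: a three-person family instead of four.
truncated = family[:5]
print(truncated == "\U0001F469\u200D\U0001F469\u200D\U0001F466")  # True
```

This is why the comment distinguishes code-point truncation (cheap, but can change what the user wrote) from grapheme-cluster truncation (correct, but needs Unicode segmentation data).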
A useful resource popped up on HN today: http://utf8everywhere.org/ |
People (incl. myself 😉) are putting 😄 reactions under @tsloughter's comment, but these are all real problems. Unicode is complex.
I think this requires (relatively) large Unicode databases and library support that not everyone has. I guess we can agree that we should at least only have valid UTF-8 after truncation (so no mid-code-point truncation), because that is easy to implement in all languages (even without library support, if need be). If truncation happens at all, that's already an error condition, so I wouldn't put too much effort into trying to make the error as small as possible.
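Truncation that never cuts mid-code-point is indeed cheap to implement. A minimal Python sketch (the function name `truncate_utf8` is my own, not from any spec):

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes,
    without ever cutting in the middle of a code point."""
    data = s.encode("utf-8")[:max_bytes]
    # errors="ignore" silently drops a trailing partial byte sequence,
    # so the result is always valid UTF-8.
    return data.decode("utf-8", errors="ignore")

# "ï" is 2 bytes in UTF-8; a 3-byte budget only fits "na" plus half of "ï",
# and the partial sequence is dropped rather than emitted as garbage.
print(truncate_utf8("naïve", 3))  # "na"
```

The same encode-slice-decode idea works in any language with a lenient UTF-8 decoder; without one, it is a small loop that backs up over trailing continuation bytes.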
Unicode normalization has the same problem of being hard to implement. I think normalization is a nice-to-have, but should not be a MUST requirement. There should be a recommendation for which normal form to use if you normalize, though, and a recommendation for users to already normalize their input. Where this is IMHO most important is metric labels (keys and especially values), because this influences client-side aggregation.
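The aggregation concern above comes down to strings that look identical but compare unequal. A small illustration using Python's stdlib `unicodedata` (the label value `"café"` is just an example, not from the spec):

```python
import unicodedata

# "é" can be one code point (precomposed) or "e" plus a combining accent.
composed = "caf\u00E9"     # café, U+00E9
decomposed = "cafe\u0301"  # café, "e" + U+0301 COMBINING ACUTE ACCENT

# Visually identical, but unequal -- so as metric label values they
# would be aggregated into two separate time series.
print(composed == decomposed)  # False

# NFC normalization makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```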
And there might be cases where you don't want to normalize, e.g., when you implement metrics for some Unicode normalization microservice (just pulling this example out of thin air here) and use the un-normalized input words as label values.
I would be comfortable with a specification saying that everything is encoded in utf8, and that when truncating an SDK is permitted to use byte-level truncation, leaving an invalid code point. Consumers of this data should learn to deal with truncated content, mark it as such when presenting to the user. |
I have a couple of issues with that. First, truncating doesn't necessarily leave an invalid code point; it could leave a valid code point but a grapheme that is different from the one the user used. But also, this entails not only that the consumer library be able to properly handle UTF-8, but also that it can handle a mixture of valid and invalid UTF-8 and truncate the invalid part.
Well, http://utf8everywhere.org/ seems to say byte-level truncation is the way to go (it says code-unit truncation, but in the case of UTF-8 a code unit is a byte, so same thing). And I won't argue with that, as the author knows much, much more about UTF-8 than I do :)
Aside from truncation, we'll need a recommendation on what to do when exporting to something that only supports ASCII. Like, what should the Prometheus exporter do for UTF-8 metric names that go beyond the ASCII characters? Maybe make the requirement/suggestion that such characters are replaced by
So that
Ah yeah, good point about conflicts. I was thinking specifically about Prometheus, which doesn't simply limit to ASCII but to
I guess in the case of a system that allows all of ASCII we could suggest URL encoding, and for something like Prometheus we could have our own form of URL encoding? As in, instead of
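A lossy substitution exporter of the kind discussed above might look like the following sketch. This is hypothetical, not the actual Prometheus exporter; it maps arbitrary names onto the character set Prometheus accepts for metric names (letters, digits, `_`, `:`, with no leading digit), replacing everything else with `_`. As the conflict discussion notes, this mapping is not reversible; a URL-encoding-like scheme would be, at the cost of readability.

```python
import re

def sanitize_prometheus_name(name: str) -> str:
    """Hypothetical sketch: force a name into Prometheus's
    metric-name charset by replacing disallowed characters with '_'.
    Lossy: distinct inputs can collide after sanitization."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # Metric names may not start with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print(sanitize_prometheus_name("http.server.durée"))  # http_server_dur_e
print(sanitize_prometheus_name("2xx.count"))          # _2xx_count
```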
You're going to have guidelines for permitted characters, and those will include EMOJIS? Like, it's not safe to use a question mark, but it is safe to use a unicycle? I'm confused by the encoding thing. E.g., https://github.com/open-telemetry/opentelemetry-python has attributes/labels as character strings. AFAIK they aren't ever encoded into bytes (except by whatever Tracer implementation is being used, which is a free variable as far as the specification is concerned). Should they be byte strings instead??
Aren't Python strings now always UTF-8? Is there a difference from "byte strings"?
No, they are character strings. (If you mean the internal in-memory representation....I think CPython actually uses byte arrays of various encodings, but that's never exposed to the user.) You can convert a character string (character sequence) to a byte string (byte sequence) by choosing an encoding and calling str.encode(encoding). AFAIK the Python OTel API is all character strings. As, I imagine, many languages are: Ruby, Rust, Go. (Java, JS and C++ don't, because they've never really gotten a proper Unicode character string....just byte or double-byte arrays.) |
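The character-string/byte-string distinction described above is easy to see directly in Python; an encoding only enters the picture at the `str.encode` call:

```python
# A character string is a sequence of code points, with no encoding attached.
name = "span-名前"
print(type(name))  # <class 'str'>

# Encoding produces a byte string; only here is UTF-8 involved.
encoded = name.encode("utf-8")
print(type(encoded))  # <class 'bytes'>

# 7 code points, but 11 bytes (each CJK character takes 3 bytes in UTF-8).
print(len(name), len(encoded))  # 7 11

# Round-trip back to a character string:
assert encoded.decode("utf-8") == name
```

This is why "the API takes UTF-8 strings" is an odd requirement for such languages: the API layer never sees bytes at all, and the encoding question only arises in the exporter or wire protocol.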
Agree with @Oberon00 on case sensitivity. And to do so, we need to define the normalization each library must use? NFC or NFD is probably fine? I've seen references to NFKC being preferred for identifiers for security reasons, but that must refer to username-like identifiers? So no need to go that route for metric names, attributes or labels.
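For context on the NFC-vs-NFKC distinction raised above: NFKC additionally folds "compatibility" characters, which can merge visually distinct identifiers (useful against spoofing, but lossy). A small stdlib illustration; the example word is mine, not from the thread:

```python
import unicodedata

# U+FB03 is the single code point for the "ffi" ligature.
ligature = "o\ufb03ce"  # "oﬃce"

# NFC leaves compatibility characters alone...
print(unicodedata.normalize("NFC", ligature) == ligature)  # True

# ...while NFKC folds the ligature into plain "ffi".
print(unicodedata.normalize("NFKC", ligature))  # office
```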
Just came here from an internal discussion and was surprised this is (a) not resolved, and (b) classified as after-GA. @AndrewAXue, you removed the after-ga label; what was the justification for that?
For reference, the W3C Baggage spec (https://w3c.github.io/baggage/) only allows US-ASCII for baggage keys. Granted, baggage has a narrower scope than trace attributes.
In contrast, OpenMetrics requires labels to support UTF-8: https://github.com/OpenObservability/OpenMetrics/blob/master/specification/OpenMetrics.md#strings |
+1 to wanting this before GA. We could follow W3C baggage and say keys are ASCII and values are UTF-8. Though I'm still not sure what is supposed to happen when exporting to Prometheus which supports even fewer characters in a key. |
Substitution, specific to the exporter. I would rather do that than bubble up those restrictions to all of OTEL - there are many backends with more relaxed string limits for metric labels. |
Right, I meant what would that substitution be. But, true, as long as the substitution approach is accepted, that is a question that can be answered separately from, and later than, defining what is allowed in OTel attribute keys/values.
This has to be addressed, at the very least we have to specify what an SDK does with invalid UTF-8 for metric names, attribute keys, attribute values, span names etc. |
I really continue to be befuddled by what the byte encoding has to do with anything. Are attributes/tags bytes or text? |
See the discussion here: #3421 |
Now I'm confused if we're talking about OTLP or API/SDK. |
Closed in favor of #3950 as a more specific instance of the needed clarification. If this is still relevant, please open new issues. |
The spec should define restrictions for the keys used in span/link/event/resource attributes as well as metric labels.
This allows exporters, protocols and back ends to have a reliable definition on which keys they must expect to receive and process. Furthermore, from a user perspective, this assures users that the keys they define can be processed by every compliant exporter and back end.
Without any restriction, it is highly likely that SDKs, exporters and back ends introduce their own arbitrary limits or make assumptions on the data they receive. This could result in broken instrumentations when users define keys that are valid in one setup and then switch to another which uses different limits or has different assumptions.
I would propose to define the following restrictions:
For metric names we have already defined the first two of these, although I would like to have the character set specified more clearly. A length limit is currently missing.
opentelemetry-specification/specification/api-metrics-user.md, lines 61 to 66 in ac75cfe
OTel-Java, for example, rejects metric names and resource labels that are longer than 255 characters or contain non-printable characters. I doubt that this is aligned across the other language implementations.
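A check like the OTel-Java one described above can be sketched in a few lines. This is a hypothetical illustration (the function name and the repair-instead-of-reject policy are my own, following the "MUST yield a valid name somehow" position from the comments), not code from any SDK:

```python
def validate_key(key: str, max_len: int = 255) -> str:
    """Hypothetical sketch: enforce a 255-character limit and a
    printable-characters-only rule, repairing rather than rejecting."""
    if not key:
        return "empty"  # placeholder for empty keys, per the discussion above
    # Replace non-printable characters instead of rejecting the key outright.
    repaired = "".join(c if c.isprintable() else "_" for c in key)
    return repaired[:max_len]

print(validate_key("http.method"))  # http.method (unchanged)
print(validate_key("bad\nkey"))     # bad_key
print(validate_key(""))             # empty
```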
We will also have to decide on how we want to handle violations of those restrictions:
related: open-telemetry/opentelemetry-collector#7095