From ad306bd6a40869c9fc60668de1661aaae74214dd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Laurent=20Qu=C3=A9rel?= Date: Wed, 7 Feb 2024 08:36:22 -0800 Subject: [PATCH] Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap (#243) [Easy to read version with images: [https://github.com/lquerel/oteps/blob/app-telemetry-schema-vision-roadmap/text/0243-app-telemetry-schema-vision-roadmap.md](https://github.com/lquerel/oteps/blob/app-telemetry-schema-vision-roadmap/text/0243-app-telemetry-schema-vision-roadmap.md)] Unlike the traditional data ecosystem (OLTP and OLAP), the world of telemetry generally does not rely on the concept of a schema. Instrumentation is deeply embedded in the code of applications and libraries, making it difficult to discover all the possible telemetry signals an application can emit. This gap prevents or limits the development of CI/CD tools for checking, reporting, documenting, and generating artifacts from telemetry signals specific to an application. This document presents a long-term vision aimed at enabling the OpenTelemetry project to address this issue and extend its impact to a broader ecosystem. It proposes extending the initiatives of Telemetry Schema and Semantic Conventions to include concepts of Application Telemetry Schema and Resolved Telemetry Schema. A series of OTEPs and Tools will be proposed in this overarching document to detail each aspect of this vision. Similar (but proprietary) initiative from Facebook: [Positional Paper: Schema-First Application Telemetry](https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry/) **EDIT 1: The OTel Weaver project (a PoC implementation of this OTEP and some of the others mentioned in the roadmap) is now available [here](https://github.com/f5/otel-weaver).** **EDIT 2: The Slack channel [#otel-weaver](https://cloud-native.slack.com/archives/C0697EXNTL3) is dedicated to this OTEP and the associated OTel Weaver project.** --------- Co-authored-by: Yuri Shkuro Co-authored-by: Reiley Yang Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- ...243-app-telemetry-schema-vision-roadmap.md | 386 ++++++++++++++++++ .../img/0243-otel-weaver-component-schema.svg | 1 + text/img/0243-otel-weaver-concepts.svg | 1 + text/img/0243-otel-weaver-dev-strategies.svg | 1 + text/img/0243-otel-weaver-hierarchy.svg | 1 + text/img/0243-otel-weaver-overview.svg | 1 + text/img/0243-otel-weaver-resolved-schema.svg | 1 + ...tel-weaver-responsibilities-properties.svg | 1 + text/img/0243-otel-weaver-use-cases.svg | 1 + 9 files changed, 394 insertions(+) create mode 100644 text/0243-app-telemetry-schema-vision-roadmap.md create mode 100644 text/img/0243-otel-weaver-component-schema.svg create mode 100644 text/img/0243-otel-weaver-concepts.svg create mode 100644 text/img/0243-otel-weaver-dev-strategies.svg create mode 100644 text/img/0243-otel-weaver-hierarchy.svg create mode 100644 text/img/0243-otel-weaver-overview.svg create mode 100644 text/img/0243-otel-weaver-resolved-schema.svg create mode 100644 text/img/0243-otel-weaver-responsibilities-properties.svg create mode 100644 text/img/0243-otel-weaver-use-cases.svg diff --git a/text/0243-app-telemetry-schema-vision-roadmap.md b/text/0243-app-telemetry-schema-vision-roadmap.md new file mode 100644 index 00000000000..51fc746227f --- /dev/null +++ b/text/0243-app-telemetry-schema-vision-roadmap.md @@ -0,0 +1,386 @@ +# Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap + +---- +**Author**: Laurent Querel, F5 Inc. + +**Keywords**: Schema-First Approach, Telemetry Schema, Semantic Convention, +Discoverability, Interoperability, Type-Safe Client SDKs, Client SDKs Generation, +CI/CD Integration, Data Governance, Data Privacy. + +**Related OTEPs**: [OTEP0152](https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md), [OTEP0202](https://github.com/open-telemetry/oteps/blob/main/text/0202-events-and-logs-api.md). + +---- +_Unlike the traditional data ecosystem (OLTP and OLAP), the world of telemetry +generally does not rely on the concept of a schema. Instrumentation is deeply +embedded in the code of applications and libraries, making it difficult to +discover all the possible telemetry signals an application can emit. This gap +prevents or limits the development of CI/CD tools for checking, reporting, +documenting, and generating artifacts from telemetry signals specific to an +application. This document presents a long-term vision aimed at enabling the +OpenTelemetry project to address this issue and extend its impact to a broader +ecosystem. It proposes extending the initiatives of Telemetry Schema and +Semantic Conventions to include logical concepts of Component Telemetry Schema +and Resolved Telemetry Schema. A series of OTEPs and Tools will be proposed in +this overarching document to detail each aspect of this vision._ + +## Current State Overview + +Traditionally, the instrumentation of applications is deeply integrated into the +source code of the applications and their components. The current stack of +OpenTelemetry follows this approach and offers a unified mechanism that allows +this ecosystem to report telemetry data jointly via a generic client interface +and a common protocol through an SDK. Moreover, OpenTelemetry's semantic +conventions establish a vendor-agnostic and standardized vocabulary to ensure +more consistent instrumentation over time. This standardization facilitates the +analysis and utilization of these metadata data across the entire telemetry +pipeline. + +But this approach is not without challenges: + +* **Discoverability and Interoperability**: It is difficult to discover a priori + and automatically what an application as a whole specifically generates in terms + of telemetry and associated metadata. This makes it difficult to integrate with + enterprise data catalogs, compliance procedures, or automated privacy + enforcement in _CI/CD pipelines_. +* **User experience**: Although very flexible, generic clients do not offer the + same benefits as a strongly typed dedicated API. A type-safe API is more + ergonomic, more robust, and more easily maintainable. Modern IDEs are capable + of providing smart autocompletion and contextual documentation based on the API. + Compilers can automatically detect errors in case of refactoring or evolution + of the telemetry schema. +* **Extensibility**: Adding metadata to the basic signal specification is + essential for enabling use cases like data security, privacy enforcement, + metadata-driven data transformation, and knowledge graph enrichment. Currently, + there's no standard way to add metadata separate from an application's or + library's development cycle. These metadata definitions should be distinct from + the signal definitions and could be specified by teams other than development + teams. +* **Performance overheads**: A significant downside of generic telemetry + instrumentation is the various overheads it generally introduces due to inherent + layers of abstraction. For example, the collection of attributes is typically + represented as a list of key/value pairs or as hashmaps, resulting in memory + overhead. A simple struct or a set of well-typed function arguments will be more + efficient and less error-prone for representing this list of attributes. In the + same way, it is possible to use a dictionary encoding for values whose domain is + specified in the form of an enumeration in the schema definition. + +Databases and RPC systems (e.g., Protobuf & gRPC) have already addressed some of +these issues with a schema-first approach. There is nothing to prevent adopting +a similar approach in the context of telemetry. **This document discusses how to +apply a schema-first approach in the OpenTelemetry project and its implications +for the existing Telemetry Schema and Semantic Conventions.** + +## Possible/Desired Future State + +The following diagram provides a conceptual overview of the relationships +between the various components, processes, and artifacts of what could be a +typical schema-driven end-to-end telemetry system in the future. + +![Application Telemetry Schema Overview](./img/0243-otel-weaver-overview.svg) + +The Application Telemetry Schema concept is divided into two key logical parts: the +Component Telemetry Schema and the Resolved Telemetry Schema, as shown in the +previous diagram (more details on these two concepts in the proposal section). +These concepts are central to unlocking a variety of use cases. + +Examples of use cases include: + +* Automatic generation of Telemetry Client SDKs from telemetry schemas, + improving user experience and performance. +* CI/CD pipelines using telemetry schemas to: + * Check compatibility between different schema versions. + * Ensure security and privacy compliance. + * Integrate with enterprise data catalog systems. + * And more. +* Telemetry backends capable of: + * Automatically updating database schemas or dashboards. + * Triggering schema-driven transformations or processing in stream processors. + * And more. + +This recent [paper](https://arxiv.org/pdf/2311.07509.pdf#:~:text=The%20results%20of%20the%20benchmark%20provide%20evidence%20that%20supports%20our,LLM%20without%20a%20Knowledge%20Graph) +from [data.world](https://data.world/home/), along with +the [MetricFlow framework](https://docs.getdbt.com/docs/build/about-metricflow) +which underpins the [dbt Semantic Layer](https://www.getdbt.com/product/semantic-layer), +highlights the significance of adopting a schema-first approach in data +modeling, especially for Generative AI-based question answering systems. Tools +like Observability Query Assistants ( +e.g. [Elastic AI Assistant](https://www.elastic.co/fr/blog/introducing-elastic-ai-assistant) +and [Honeycomb Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant?utm_source=newswire&utm_medium=link&utm_campaign=query_assistant)) +are likely to become increasingly prevalent and efficient in the near future, +thanks to the adoption of a schema-first approach. + +> **Note: The names and formats of these concepts are still under discussion. A +> detailed analysis of pros and cons will be covered later in the document. The +> final decision will be deferred to future dedicated OTEPs.** + +Another problem this proposal aims to address is the inherent complexity of the +ecosystem where OpenTelemetry is utilized but not fully addressed by existing +solutions. OpenTelemetry has been adopted by enterprises of all sizes. While +offering the possibility to inherit standardized semantic conventions is +beneficial, it often proves insufficient due to the need for customizations in +diverse contexts, such as overriding some properties (e.g., changing the +requirement level from recommended to required). Additionally, the presence of +vendor-specific attributes and metrics in the existing official OpenTelemetry +semantic convention registry does not align with the goal of offering a catalog +of attributes, metrics, and signals that are vendor-agnostic. **These issues are +indicative of a lack of standardized mechanisms for extending, customizing, and +developing an ecosystem of schemas and semantic convention registries.** + +In response to these problems, a hierarchy of telemetry schemas can be defined, +ranging from the most general to one that is specifically refined for an +application. Each child schema inherits the properties of the parent schema and +can, if necessary, override these properties locally. Additionally, any +telemetry schema can import one or several semantic convention registries, +allowing for the definition of OpenTelemetry, vendor-specific, and +enterprise-level registries. These two enhancements make OpenTelemetry more +customizable, extensible, and ecosystem-friendly. + +The following section will elaborate on the concepts of the Component Telemetry +Schema, Resolved Telemetry Schema, Semantic Convention Registries, and their +relationship with both the existing OpenTelemetry Schema v1.1 and the +OpenTelemetry Semantic Conventions. + +## Proposal + +### Overview + +Conceptually, this proposal is based on three main concepts: **Component Telemetry +Schema**, **Semantic Convention Registry**, and **Resolved Telemetry Schema**. +The relationships between these entities are described in the following diagram. + +![Telemetry Schema Concepts](./img/0243-otel-weaver-concepts.svg) + +The Component Telemetry Schemas are created by the OpenTelemetry SIG members, +application, or library authors. A Component Telemetry Schema may import any +number of Semantic Convention +Registries as needed. During the schema resolution process, a Resolved Telemetry +Schema is created from a Component Telemetry Schema. This **Resolved Telemetry +Schema is self-contained** and has no external references. Optionally, a Component +Telemetry Schema may extend an existing Telemetry Schema, whether component or +resolved (this aspect is still under discussion). Typically, the official +OpenTelemetry Telemetry Schema is inherited by a Component Telemetry Schema to +include the standard OpenTelemetry Semantic Convention registry. In complex +cases, large enterprises might create their own intermediary telemetry schemas +for custom definitions. + +Each signal definition defined in the Component Telemetry Schema, where possible, +reuses the existing syntax and semantics defined by the semantic conventions. +Each signal definition is also identified by a unique name (or ID), making +schemas addressable, easy to traverse, validate, and diff. + +The key design principles to be followed in the definition of the Resolved +Telemetry Schema are: + +* **Self-contained**: No external references are allowed. This artifact contains + everything required to determine what an application or a library produces in + terms of telemetry. +* **Easy to exchange**: This artifact must be easily accessible to actors + via a URI. This artifact must be small and avoid the repetition of + definitions. +* **Easy to parse**: A widespread and well-defined format should be preferred. + JSON is an example of such a format. +* **Easy to interpret**: The internal structure of this artifact must be + straightforward to avoid any misinterpretation and must be efficient. +* **Platform- and Language-agnostic**: This artifact must be independent of any + platform architectures and programming languages. + +The following diagram describes two main use cases for the Resolved Telemetry +Schema. The key points to remember are: 1) both use cases result in a Resolved +Telemetry Schema, 2) Resolved Telemetry Schemas serve as the mechanism for +distributing Telemetry Schemas throughout the entire ecosystem, and 3) Resolved +Telemetry Schemas would replace/augment existing SchemaURL. + +![Use cases](./img/0243-otel-weaver-use-cases.svg) + +Note: The relationship between Telemetry Schema v1.1 +([OTEP 0152](https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md)) +and the Component and Resolved Telemetry Schema concepts is still being +discussed. This will be clarified in future OTEPs (refer to the last section). + +The following diagram illustrates a possible instance of a complex hierarchy of +schemas and semantic convention registries. It involves several vendor and +enterprise artifacts, in addition to the standardized OpenTelemetry artifacts. +The schema resolution process will produce a self-contained Resolved Telemetry +Schema that can be easily consumed by various tools and applications, such as a +Client SDK generator, compatibility checker, compliance checker, data catalog +feeder, and more. + +![Example of Telemetry Schema Hierarchy](./img/0243-otel-weaver-hierarchy.svg) + +This hierarchy of telemetry schemas helps large organizations in +collaborating on the Component Telemetry Schema. It enables different +aspects of a Component Telemetry Schema to be managed by various teams. + +For all the elements that make up the Component Telemetry Schema, a +general mechanism of annotation or tagging will be integrated in order to +attach additional traits, characteristics, or constraints, allowing vendors +and companies to extend the definition of concepts defined by OpenTelemetry. +This annotation mechanism will be included as part of the Component Telemetry +Schema definition. + +Annotations and Tags can also be employed to modify schemas for +diverse audiences. For example, the public version of a schema can exclude all +signals or other metadata labeled as private. Similarly, elements can be +designated as exclusively available for beta testers. These annotations can +also identify attributes as PII (Personally Identifiable Information), and +privacy policy enforcement can be implemented at various levels (e.g., in the +generated client SDK or in a proxy). + +For each important component, the following diagram defines the responsibilities +and key design properties that will be considered in future OTEPs defining the +Component and Resolved Telemetry Schemas. + +![Telemetry Schema - Levels of responsibilities & Key design properties](./img/0243-otel-weaver-responsibilities-properties.svg) + +This design enables the definition of semantic conventions in a distributed +manner. OpenTelemetry, vendors, and enterprises can define their own semantic +conventions in different registries simplifying the existing process. + +> Although there is no direct lineage between these systems, a similar approach +> was designed and deployed by Facebook to address the same type of problem but in +> a proprietary context (refer to +> this [positional paper](https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry/) +> for more information). + +### Development Strategies + +Two development strategies coexist and must be supported. The first strategy, a +monorepo type (single source of truth), has complete control over the +applications, their dependencies, and the associated telemetry schemas. The +second strategy is more heterogeneous, composed of multiple repositories, where +the build process of external dependencies is out of the control of the entity +owning the applications or services. + +In the first model, each build process can independently apply telemetry schema +policies in parallel, knowing that the policies are shared and the entire +environment is controlled by the same entity. + +In the second model, the application or service environment does not control the +telemetry schema policies of external dependencies. There is a need for a method +to retrieve telemetry schemas from these dependencies. The mechanism for +distributing these schemas is still being discussed (refer to the Open Questions +section). Ultimately, this will enable the CI/CD pipeline of the application or +service to apply its own local policies to these schemas from the dependencies. + +![Development Strategies to Support](./img/0243-otel-weaver-dev-strategies.svg) + +## Open Questions + +During the review of this document, several questions were raised, and some +remain unresolved. We've decided to postpone the answers to these open questions +to future OTEPs. This approach seems more practical as the resolution of these +questions is not critical at this stage and will likely become clearer as we +define and implement future OTEPs. + +### Should we use a single Telemetry Schema or the combination of Component and Resolved Telemetry Schema? + +The debate between adopting a single Telemetry Schema or separate Component and +Resolved Telemetry Schemas remains unresolved. + +Advocates of a single-schema approach see it as a simplification in terms of +schema definition, implementation, and even cognitive overhead. They suggest +using the same schema but disallowing any external references for schemas +intended for publication and reuse by third parties. The schema resolution +process would then remove these references. + +Proponents of a two-schema approach believe that each format is intended for +different users and use cases (app/lib developers vs telemetry tool developers), +and therefore, having two distinct structures and formats would make it easier +to optimize each for its specific use case (in multiple dimensions). +Furthermore, the group of developers using the Component Telemetry Schema will +most likely be much larger than the group of developers who need to understand +the details of the Resolved Telemetry Schema. + +### What should be the schema(s) be named? + +The naming of the schema(s) was also discussed but without consensus. The current proposals are as follows: + +- Single schema approach: Retain the existing Telemetry Schema name, supporting +different formats depending on the use case (e.g., YAML for app and lib +developers, JSON + JSON schema for publication and tool consumption). +- Two schemas approach: + - Component Telemetry Schema alternative names: Telemetry Schema, Weaver Schema. + - Resolved Telemetry Schema alternative names: Compiled Telemetry Schema, Published Telemetry Schema, Weaver Schema. + +### What distribution mechanism should be used for retrieving dependency schemas? + +Currently, two main methods are being considered: + +1) Establishing a public, centralized repository for resolved schemas, where + library authors can register their resolved telemetry schemas as part of their + build process. +2) Embedding the Resolved Telemetry Schema directly within the library artifact + (such as binary libraries, jar files, crates, Go modules, etc.), ensuring + automatic collection during the application's build process. + +At present, the second option is preferred as it is fully decentralized and +eliminates the need for a global schema registry. A specific OTEP will be +developed to define this distribution mechanism. + +## Roadmap + +### OTEPs + +To facilitate the review process and progressively deliver value to the +OpenTelemetry community, a series of OTEPs and tools are suggested. + +* **Telemetry Schema(s) - Structure and Formats**: This OTEP will weigh the pros + and cons of a single schema versus a dual-schema approach. It aims to identify + the optimal solution and define the structures and formats for the two + concepts introduced in this OTEP: the Component and Resolved Telemetry Schema. + We plan several sub-OTEPs to progressively introduce and phase the concepts + discussed in this document: + * **Phase 1**: Introduce a registry section with a list of fully + resolved attributes, metrics, and other telemetry signals from the existing + semantic convention registry. Upon implementation, the OpenTelemetry project's + telemetry schema will include a registry of all standardized attributes, + metrics, and signals as defined by the semantic conventions. + * **Phase 2**: Add sections for resource, instrumentation library, + and schema sections to represent telemetry signals emitted by an application + (and its dependencies) or a library. Initially, only metrics, logs, and spans + definitions will be supported. + * **Phase 3**: Enable application or library authors to + override definitions inherited from a parent resolved schema. + * **Phase 4**: Implement support for events in the schema + section, pending the approval of events in OpenTelemetry. + * **Phase 5**: Introduce support for multivariate metrics in + the schema section, relevant only if the Client SDK Generator plans to support + multivariate metrics. +* **Dependency Management and Telemetry Schema Distribution**: This OTEP will +outline the method for collecting resolved telemetry schemas from dependencies. + +### Proof of Concept and Tools + +A proof of concept, OTel Weaver, is under development to test the feasibility of +the proposed approach. It will support the following commands: + +* `resolve registry`: Generates a Resolved Telemetry Schema from an +OpenTelemetry semantic convention registry. +* `resolve schema`: Creates a Resolved Telemetry Schema from a Component +Telemetry Schema. +* `search registry`: Offers search functionality within an OpenTelemetry +semantic convention registry. +* `search schema`: Provides search capabilities within a Component or Resolved +Telemetry Schema. +* `gen-client sdk`: Generates a Client SDK from a Component or Resolved +Telemetry Schema. +* `gen-client api`: Produces a Client API from a Component or Resolved +Telemetry Schema. + +A Plugin System is planned to allow community contributions to the OTel Weaver +tool. Proposed plugins include: + +* An example of plugin for gathering resolved telemetry schemas from external +dependencies. +* An example of compatibility checker plugin to ensure successive versions of +the same Telemetry Schema follow specified compatibility rules. + +Additional plugins and tools are anticipated to be developed by the broader +community, leveraging the OTel Weaver Plugin System and the standardized format +of the Resolved Telemetry Schema. + +## Links + +- [Positional Paper: Schema-First Application Telemetry](https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry/) +- [A benchmark to understand the role of knowledge graphs on Large Language Model's accuracy for question answering on enterprise sql databases](https://arxiv.org/pdf/2311.07509.pdf#:~:text=The%20results%20of%20the%20benchmark%20provide%20evidence%20that%20supports%20our,LLM%20without%20a%20Knowledge%20Graph) +- [MetricFlow framework](https://docs.getdbt.com/docs/build/about-metricflow) diff --git a/text/img/0243-otel-weaver-component-schema.svg b/text/img/0243-otel-weaver-component-schema.svg new file mode 100644 index 00000000000..589be2e8c4e --- /dev/null +++ b/text/img/0243-otel-weaver-component-schema.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-concepts.svg b/text/img/0243-otel-weaver-concepts.svg new file mode 100644 index 00000000000..40594b2712a --- /dev/null +++ b/text/img/0243-otel-weaver-concepts.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-dev-strategies.svg b/text/img/0243-otel-weaver-dev-strategies.svg new file mode 100644 index 00000000000..fd1d893a362 --- /dev/null +++ b/text/img/0243-otel-weaver-dev-strategies.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-hierarchy.svg b/text/img/0243-otel-weaver-hierarchy.svg new file mode 100644 index 00000000000..b7ea11580a0 --- /dev/null +++ b/text/img/0243-otel-weaver-hierarchy.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-overview.svg b/text/img/0243-otel-weaver-overview.svg new file mode 100644 index 00000000000..33a17a5ec69 --- /dev/null +++ b/text/img/0243-otel-weaver-overview.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-resolved-schema.svg b/text/img/0243-otel-weaver-resolved-schema.svg new file mode 100644 index 00000000000..1424b913b8a --- /dev/null +++ b/text/img/0243-otel-weaver-resolved-schema.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-responsibilities-properties.svg b/text/img/0243-otel-weaver-responsibilities-properties.svg new file mode 100644 index 00000000000..273c342c3b8 --- /dev/null +++ b/text/img/0243-otel-weaver-responsibilities-properties.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/text/img/0243-otel-weaver-use-cases.svg b/text/img/0243-otel-weaver-use-cases.svg new file mode 100644 index 00000000000..77597df176e --- /dev/null +++ b/text/img/0243-otel-weaver-use-cases.svg @@ -0,0 +1 @@ + \ No newline at end of file