Blog for AI Agent Observability #6390
base: main
Conversation
Force-pushed from 07f18f7 to 213c221.
Force-pushed from e9d1487 to 7d42ebf.
Quick pass through, looks fairly good so far
> features.
> - Risk of version lock-in if the framework’s OpenTelemetry dependencies lag
>   behind upstream updates.
> - Less flexibility for advanced users who prefer custom instrumentation.
Is this actually true? I've not heard of baking in instrumentation making it difficult for users who also wish to augment autoinstrumentation with custom instrumentation, nor those who wish to just turn off autoinstrumentation and do it all themselves.
For baked-in, the customer will probably always need to wait for a full release of the agent framework to get customized instrumentation; with OTel it is relatively easier, since only the instrumentation package needs an update and the customer can pick that up independently. Does this sound reasonable?
Thanks @cartermp and @TaoChenOSU, all comments are now addressed.
> behind upstream updates.
> - Less flexibility for advanced users who prefer custom instrumentation.
> - Some best practices to follow if you consider this approach:
>   - Provide a configuration setting that lets users easily enable or disable
Is it just users that need to be able to easily enable/disable the telemetry?
Isn't it other instrumentation packages, also, that need to be able to suppress the instrumentation (in order to prevent duplicative instrumentation)?
Do we want to recommend a uniform approach to this setting?
For example, it looks like one approach used is to have a per-instrumentation/per-library key in the OTel context (e.g. "suppress_instrumentation_${library_name}") and to check for the presence of that in the context to avoid instrumentation. (This way, calling libraries that generate duplicate information can set the key).
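The per-library suppression-key pattern described in that comment could be sketched like this. This is a minimal illustration using a plain `contextvars.ContextVar`; real instrumentations would store the flag in the OpenTelemetry Context instead, and the key name `suppress_instrumentation_mylib` is hypothetical:

```python
import contextvars

# Hypothetical per-library suppression flag. In a real instrumentation this
# key would live in the OpenTelemetry Context so that *other* instrumentation
# packages can set it too, not just end users.
_suppress = contextvars.ContextVar("suppress_instrumentation_mylib", default=False)

spans_started = []  # stands in for real span creation, so the effect is visible


def instrumented_call(name, work):
    if _suppress.get():
        # A caller already instrumented this operation: stay silent to
        # avoid emitting duplicate telemetry.
        return work()
    spans_started.append(name)      # "start a span" for this call
    token = _suppress.set(True)     # silence any nested instrumented calls
    try:
        return work()
    finally:
        _suppress.reset(token)


# The outer call is instrumented; the nested one sees the key and is skipped.
result = instrumented_call("outer", lambda: instrumented_call("inner", lambda: 42))
```

Because the flag is context-scoped, a calling library that would otherwise produce duplicate spans can set the same key before invoking the instrumented code.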
@michaelsafyan this is the baked-in option. If you take a look at CrewAI at https://docs.crewai.com/telemetry, you will see it has a configuration parameter named OTEL_SDK_DISABLED.
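`OTEL_SDK_DISABLED` mirrors the standard OpenTelemetry environment variable of the same name. A framework adopting that convention might gate its baked-in telemetry roughly like this (a sketch; `setup_telemetry` is a hypothetical framework hook, not a real CrewAI API):

```python
import os


def telemetry_enabled() -> bool:
    # OTEL_SDK_DISABLED=true is the standard OpenTelemetry switch for
    # turning the SDK off entirely; treat any other value as "enabled".
    return os.environ.get("OTEL_SDK_DISABLED", "false").strip().lower() != "true"


def setup_telemetry():
    # Hypothetical framework hook: only wire up tracers/exporters when
    # the user has not opted out.
    if not telemetry_enabled():
        return None                    # skip all telemetry setup
    return "telemetry-initialized"     # placeholder for real SDK wiring


os.environ["OTEL_SDK_DISABLED"] = "true"
disabled_result = setup_telemetry()    # user opted out

os.environ["OTEL_SDK_DISABLED"] = "false"
enabled_result = setup_telemetry()     # default path
```

Using the standard variable name (rather than a framework-specific one) lets operators disable telemetry the same way across every OpenTelemetry-aware component.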
> - Pros
>   - You can take on the maintenance overhead of keeping the instrumentation for
>     telemetry up-to-date.
>   - Simplifies adoption for users unfamiliar with OpenTelemetry configuration.
I'm not sure that this is actually true given the zero-configuration mechanisms in OTel for setup.
Customers still need to install OTel instrumentation packages for different LLM providers, and sometimes they may need to configure an OTel Collector to integrate with third-party observability platforms.
With the baked-in option, the agent framework may ship a built-in UI for observability that does not require any configuration. Hope this helps.
> - Adds bloat to the framework for users who do not need observability
>   features.
> - Risk of version lock-in if the framework’s OpenTelemetry dependencies lag
>   behind upstream updates.
A related con/risk to call out:
- You may not get feedback/review from OTel contributors familiar with current Semantic Conventions
- Your instrumentation may lag with respect to best practices/conventions (not just the version of the OTel library dependencies).
Done
> [OpenTelemetry registry](/ecosystem/registry/) if you choose this path.
> - As a developer of an agent application, you may want to choose an agent
>   framework with baked-in instrumentation if you prefer…
>   - Minimal dependencies on external packages in your agent app code
Is the number of dependencies in the long run truly that different?
Yes, I think so. With OTel, we need to install instrumentation packages and an OTel Collector, but with baked-in there is no need to install those dependencies manually.
> - As a developer of an agent application, you may want to choose an agent
>   framework with baked-in instrumentation if you prefer…
>   - Minimal dependencies on external packages in your agent app code
>   - Out-of-the-box observability without manual setup.
Isn't it still likely that manual setup may still be required? For example, even if the library includes instrumentation, it is not going to wire up OpenTelemetry to the appropriate backends.
Here the manual setup is mainly for the agent framework itself, like CrewAI or others. After the agent framework is installed, there is no need to set up observability manually, as it should be built into CrewAI or the agent framework itself. Hope this helps.
> - Minimal dependencies on external packages in your agent app code
> - Out-of-the-box observability without manual setup.
>
> #### Option 2: Instrumentation via OpenTelemetry contrib
I think Option 2 actually should be split up into:
"Option 2: External instrumentation"
- "Option 2a: External instrumentation in your own repository/package"
- "Option 2b: External instrumentation in an Open Telemetry-owned repository/package"
(Or maybe you just make "2a" and "2b" into 2 and 3).
Done
> As a developer of an agent framework, here are some pros and cons of this
> baked-in instrumentation:
>
> - Pros
Some additional benefits:
- More likely to leverage best practices around Semantic Conventions
- More likely to leverage best practices around zero-code instrumentation
Done
> - Allows users to mix and match contrib libraries for their specific needs
>   (e.g., cloud providers, LLM vendors).
> - Cons
>   - Users must manually install and configure contrib libraries, increasing
This is untrue. With zero-code setup, the instrumentation libraries can be auto-discovered.
See:
It is actually easier to leverage this with this approach, because `opentelemetry-bootstrap -a` will auto-install instrumentation libraries from this repo, but it won't auto-install other instrumentation packages.
removed
> - Users must manually install and configure contrib libraries, increasing
>   setup complexity.
> - Risk of fragmentation if users rely on incompatible or outdated contrib
>   packages.
Can you explain? How would they leverage "incompatible" ones? Won't that result in an error when attempting to install the dependencies?
Yes, but sometimes the install can succeed and there may still be runtime inconsistencies, incompatibilities, and maintenance issues when different users or frameworks depend on different versions of contributed (contrib) packages.
> setup complexity.
> - Risk of fragmentation if users rely on incompatible or outdated contrib
>   packages.
> - Less control over telemetry quality and coverage compared to baked-in
True, though:
- The quality is likely to be higher given the higher bar held in that repo for telemetry quality.
- There is still the ability to contribute to it in order to improve it.
I think the "less control" is more relevant in relation to velocity rather than in relation to quality.
good point, updated to below
Development velocity slows down when there are too many PRs in the OpenTelemetry review queue.
Thanks all for the comments, really appreciated!
Signed-off-by: Guangya Liu <[email protected]>
Co-authored-by: Sujay Solomon <[email protected]>
Here's a first pass at some copy edits to bring this in line with our style guide. I'll try to finish up tomorrow.
Also, there is a typo in the agent-agent-framework.png
file. "Framwork" should be "Framework". Thanks!
> cSpell:ignore: genai Guangya PydanticAI Sujay
> ---
>
> ## 2025: The Year of AI Agents
Suggested change:

```diff
- ## 2025: The Year of AI Agents
+ ## 2025: The year of AI agents
```
> ## 2025: The Year of AI Agents
>
> AI Agents are becoming the next big leap in artificial intelligence in 2025.
> From autonomous workflows to intelligent decision-making, AI Agents will power
Suggested change:

```diff
- From autonomous workflows to intelligent decision-making, AI Agents will power
+ From autonomous workflows to intelligent decision making, AI Agents will power
```
> AI Agents are becoming the next big leap in artificial intelligence in 2025.
> From autonomous workflows to intelligent decision-making, AI Agents will power
> numerous applications across industries. However, with this evolution comes the
> critical need for AI Agent Observability - especially when scaling these agents
Suggested change:

```diff
- critical need for AI Agent Observability - especially when scaling these agents
+ critical need for AI agent observability, especially when scaling these agents
```
> critical need for AI Agent Observability - especially when scaling these agents
> to meet enterprise needs. Without proper monitoring, tracing, and logging
> mechanisms, diagnosing issues, improving efficiency, and ensuring reliability in
> AI Agent-driven applications will be challenging.
Suggested change:

```diff
- AI Agent-driven applications will be challenging.
+ AI agent-driven applications will be challenging.
```
> mechanisms, diagnosing issues, improving efficiency, and ensuring reliability in
> AI Agent-driven applications will be challenging.
>
> ### What is an AI Agent
Suggested change:

```diff
- ### What is an AI Agent
+ ### What is an AI agent?
```
> It is crucial to distinguish between **AI Agent Application** and **AI Agent
> Frameworks**:
Suggested change:

```diff
- It is crucial to distinguish between **AI Agent Application** and **AI Agent
- Frameworks**:
+ It is crucial to distinguish between **AI agent applications** and **AI agent
+ frameworks**:
```
> 
>
> - **AI Agent application** refer to individual AI-driven entities that perform
Suggested change:

```diff
- - **AI Agent application** refer to individual AI-driven entities that perform
+ - **AI agent applications** refer to individual AI-driven entities that perform
```
> - **AI Agent Framework** provide the necessary infrastructure to develop,
>   manage, and deploy AI Agents often in a more streamlined way than building an
>   agent from scratch. Examples include
Suggested change:

```diff
- - **AI Agent Framework** provide the necessary infrastructure to develop,
-   manage, and deploy AI Agents often in a more streamlined way than building an
-   agent from scratch. Examples include
+ - **AI agent frameworks** provide the necessary infrastructure to develop,
+   manage, and deploy AI agents often in a more streamlined way than building an
+   agent from scratch. Examples include the following:
```
> [LangGraph](https://www.langchain.com/langgraph),
> [PydanticAI](https://ai.pydantic.dev/) and more.
>
> ### Establishing a Standardized Semantic Convention
Suggested change:

```diff
- ### Establishing a Standardized Semantic Convention
+ ### Establishing a standardized semantic convention
```
> Today, the
> [GenAI observability project](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md)
> within OpenTelemetry is actively working on defining semantic conventions to
> standardize AI Agent observability. This effort is primarily driven by:
Suggested change:

```diff
- standardize AI Agent observability. This effort is primarily driven by:
+ standardize AI agent observability. This effort is primarily driven by:
```
Fixed #6389
@lmolkova @solsu01 ^^