Skip to content

Definition for a basic end-user app crash event#3448

Open
bidetofevil wants to merge 9 commits intoopen-telemetry:mainfrom
bidetofevil:basic-crash-event
Open

Definition for a basic end-user app crash event#3448
bidetofevil wants to merge 9 commits intoopen-telemetry:mainfrom
bidetofevil:basic-crash-event

Conversation

@bidetofevil
Copy link
Copy Markdown
Contributor

@bidetofevil bidetofevil commented Feb 17, 2026

Changes

Add a new event in the app namespace for a crash of an end-user facing app. Further specification of what a crash is defined as are provided in the event definition. There are implementations abound (async: Embrace native crash impl, synchronous: OTel Android JVM crash impl), but none strictly follow the convention as stated.

Merge requirement checklist

  • [x ] CONTRIBUTING.md guidelines followed.
  • [x ] Change log entry added, according to the guidelines in When to add a changelog entry.
    • If your PR does not need a change log, start the PR title with [chore]
  • [x ] Links to the prototypes or existing instrumentations (when adding or changing conventions)

Copy link
Copy Markdown
Contributor

@LikeTheSalad LikeTheSalad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bidetofevil, I think it's a great start! I just have one comment regarding app.crash.id, which, to me, is the only attribute that's not too clear what it's for. Apart from that, I think it's good to go.

Comment thread model/app/events.yaml Outdated
by those attributes caused the crash. When used, these attributes SHOULD be
usable for debugging without additional data not specified in the event
(e.g. stacktrace already deobfuscated). If that is not the case, additional
information like `app.build.id` SHOULD be provided to make the data useful for
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
information like `app.build.id` SHOULD be provided to make the data useful for
information like `app.build_id` SHOULD be provided to make the data useful for

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

Comment thread model/app/events.yaml Outdated
requirement_level:
conditionally_required: Recommended if `app.async` is set, opt_in otherwise.
- ref: app.crash.id
requirement_level: required
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this attribute is required? Based on its description, it seems like it could be used to reference it in other telemetry data. Is that for context purposes, similar to what the trace id value does?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should probably be more "opt_in". It's meant to be a canonical ID for that crashed instance used for deduping and cross-referencing, like if you were getting this data from a source outside the app and don't want to report this more than once. But if the instrumentation is just capturing this live and creating an event for it, it's not strictly necessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I agree it shouldn't be required, so thanks for updating it. I also think it could be very useful for cases like grouping crashes, as mentioned by @thompson-tomo in another comment, but I want to make sure I fully understand it first. The description summarised reads that:

  • It's a unique identifier for a crash.
  • It could come from outside of the instrumentation.
  • It "COULD be meaningful" and referenced elsewhere.
  • It can facilitate deduplication.

That info seems too generic for me to get a grasp of its usage. For example, regarding the "uniqueness" of it, does it mean that it must be globally unique or unique for the same or similar type of crash occurrences? If it's the latter, then it sounds like it could be used to group occurrences of the same crash?

Regarding "It COULD come from a source external to the instrumentation", can you mention an example of an external source?

What do you mean when you say that its value "COULD be meaningful"? For example, meaningful to group similar crashes as mentioned above? Or is it mostly meaningful to avoid exporting the same crash more than once? (in reference to the "facilitate deduplication" point)? Or both?

Comment thread model/app/events.yaml Outdated
Comment on lines +30 to +31
- ref: app.async
requirement_level: opt_in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ref: app.async
requirement_level: opt_in

I think it would be cleaner & simpler if we removed this attribute. I would then focus app.crash on the sync use case including removing the unnecessary attributes. We could then introduce a process.crash to handle the async use case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The async use case is very common, like with native crashes on Android. I think any app crash event needs to include this use case, as it compries a large swath of them.

And the reason it's in an attribute of its own is because while it's most prominent for crashes, other types of async-report telemetry exists, so we should define a framework for it outside of crashes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with the use case of async capturing hence the proposal of using app.* events for sync use case which we would eventually associate with an app entity and potentially the process. Then for async we use process.* events which we would eventually associate with a procees entity. This way the async events have access to a reproducible identity etc and we reduce the need for the conditionally attributes.

This gives you a framework of:

  • app.* events to describe the inner workings of a process providing user access to an app by including app specific attributes
  • process.* events to describe something related to a process running within a system
  • system.* events to describe the something related to the system

This continues what we already have for metrics between system & process.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the async crash may not have to do with the process at all, just a different mode of information-getting and telemetry recording. Making this an attribute that cuts across all events make more sense to me than carving out a special case in another namespace for which other async events have to duplicate as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that and i understand that it is a different mode of information-getting and telemetry recording which is why I am thinking about how this telemetry can be associated with entities.

This also reduces the need for so many conditional attributes based on the use case by having dedicated events

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, native is indeed overloaded, if you talk to a react native developer who develops the code in ts or js then for them native is java iteself

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @LikeTheSalad and @atulgpt - just getting back to this.

"Native", in this case, means anything (e.g. C++, Rust, etc.) compiled into a shared library file that is "native" to the device architecture that can be called through JNI (Java-Native Interface) and developed using the Android NDK (Native Development Kit). This is why the word native is used (on Android): it's literally what we call it.

But this is just the most common example: this is a boolean flag because it doesn't talk about the specific contents of the crash (e.g. C++ stacktraces), but rather the manner from which the data for the crash is obtained.

JVM crashes are the exception, where you can catch a crash before it happens (i.e. live), and record it synchronously with the same OTel SDK and app instance that is initialized with the CURRENT state of the app when the crash is taking place. Whereas async is the more common case, where you obtain a tombstone of information AFTER the app instance where the crash has taken place died, so you can't just use the reporting OTel SDK instance to provide the details of the app during the crash because it will be different, as the reporting is happening some time after.

Here's an example:

App version 1.1 produces a JVM crash, which is cause with the UncaughtExceptionHandler and written to an OTel event before the app crashes. In the OTLP, it will report that app.version of the crash is 1.1 because that's what the Resources say.

But say the same app version produced a C++ crash. The process dies, and the tombstone is written asynchronously to disk before the OTel event can be recorded. Before the app starts up next time, it is upgrade to 1.2, so when it starts up again, app.version of the new OTel instance will be 1.2.

Having this flag will allow us to distinguish between these two types of crashes when they are processed in the backend, not only for crashes, but for other types of telemetry as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough explanation, @bidetofevil. After going back through the thread, I think the main concern raised earlier is not about whether async crashes are a valid use case, but about whether this is the right design for them.

Your latest comment, combined with the rest of the thread, helped me get a better idea of the overall proposal and the reason for the questions and suggestions on the async flag. The issue is that this proposal seems to combine two different use cases into a single event. The question for me is whether app.crash should carry a boolean flag plus 5 conditional attributes that become recommended when it is enabled. That starts to feel like two schemas inside one event, which is harder to reason about.

It's also hard to evaluate confidently at this stage because we haven't implemented this flow in OTel Android yet. So far, we've focused on JVM stacktraces, which are still the most common type of crash Android developers tend to deal with. From my perspective, that makes this feel a bit early to standardize, especially as part of a "basic" crash event that doesn't otherwise need these attributes.

The more I look at it, the more it seems like the async case may need its own event. That would keep app.crash (or even something like app.crash.sync) simple and focused, while giving the async use case a dedicated schema that can evolve independently.

Let me know if you have concerns with the approach of having a separate event for each use case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed at the Android SIG meeting, I'll move the definition of app.async to another PR. I've modified the wording of the description to remove references to the attribute, but kept in the description the possibly of the event being logged async and included the opt_in attributes that are relevant in that case.

Let me know if this look OK @LikeTheSalad. We can have a discussion about the async stuff in a different PR :-)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Hanson!

Comment thread model/app/registry.yaml
Further, any related attributes not specified in the signal MUST be taken from the Resource (e.g.
if `service.version` is specified but `service.name` is not, it is assume it will be in the Resource).
For telemetry that omits this attribute, it SHOULD be assumed that it was NOT recorded asynchronously.
- id: app.crash.id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

Suggested change
- id: app.crash.id
- id: exception.occurance.id

That way it can be better reused for things such as Grouping nested exceptions into a group using a shared attribute.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few reasons I want this in the app namespace:

  • I want to restrict this to end-user facing apps, which the namespaces does. Crash has a special meaning in this space and I want to keep the semantics
  • Exceptions aren't the only causes of crashes -- at least not ones that we can know the details of.
  • There are additional things that will be put on this. When part 2 of this PR comes out with additional details about the crash, it will become more apparent

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 7, 2026

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions Bot added the Stale label Mar 7, 2026
@bidetofevil
Copy link
Copy Markdown
Contributor Author

So who can approve this or give me guidance on what is blocking it from approval? @open-telemetry/semconv-client-approvers?

@bidetofevil bidetofevil removed the Stale label Mar 10, 2026
@LikeTheSalad
Copy link
Copy Markdown
Contributor

So who can approve this or give me guidance on what is blocking it from approval? @open-telemetry/semconv-client-approvers?

There are a couple of pending questions, at least from my side:

Can you take a look at those, @bidetofevil ?

@bidetofevil
Copy link
Copy Markdown
Contributor Author

Oh shoot. Sorry I missed these @LikeTheSalad. I'll take a look.

@github-actions
Copy link
Copy Markdown

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions Bot added the Stale label Mar 26, 2026
@github-actions github-actions Bot closed this Apr 3, 2026
@bidetofevil bidetofevil reopened this Apr 7, 2026
@github-actions github-actions Bot removed the Stale label Apr 8, 2026
@bidetofevil bidetofevil requested review from a team as code owners April 14, 2026 16:10
@bidetofevil bidetofevil requested review from a team and tigrannajaryan as code owners April 14, 2026 16:10
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Apr 14, 2026

CLA Signed

@bidetofevil
Copy link
Copy Markdown
Contributor Author

I eff the commit history. Will fix this.

@bidetofevil bidetofevil reopened this Apr 14, 2026
@bidetofevil
Copy link
Copy Markdown
Contributor Author

bidetofevil commented Apr 14, 2026

Reopen with proper rebasing and updates per discussion

Copy link
Copy Markdown
Contributor

@LikeTheSalad LikeTheSalad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@LikeTheSalad
Copy link
Copy Markdown
Contributor

@bidetofevil, I think you need to regenerate the markdown files to sort the CI check error.

@bidetofevil
Copy link
Copy Markdown
Contributor Author

Ugh, I thought I checked already. Will try again.

@bidetofevil
Copy link
Copy Markdown
Contributor Author

Removed service.name as an optional attribute for now given that the markdown generated refers to a markdown file that doesn't exist. Having a different service name is really truly an async use case that needs to be called out, though, so removing it for now isn't a bad idea anyway.

@lmolkova lmolkova moved this from Awaiting codeowners approval to Needs More Approval in Semantic Conventions Triage Apr 27, 2026
Comment thread model/app/events.yaml
Comment on lines +45 to +46
- ref: service.version
requirement_level: opt_in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be removed given we removed service.name via ff0fe1c & we already have app.build_id?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build IDs aren't often used to track different version. There's also no sem-ver so you can only tell if one is different than the other, whereas version gives you a lot richer information. This is almost always preferred when folks are tagging telemetry from mobile apps as a point of code differentiation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but that meaning/understanding is only possible when you know what it corresponds to. In the case here we already removed service.name hence we don't know what it is for.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of service.name was because it was relatively low value and there's an unrelated issue that prevented the MD from being generated property (it contains text with a broken reference link when ported to this event). But the idea that there could be a different service.version still exists even if we don't explicitly say it was from an async recording of the event. In fact, it is still an important attribute - to know a crash came in a version that is different than the one producing the event.

Copy link
Copy Markdown
Member

@fractalwrench fractalwrench left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LikeTheSalad
Copy link
Copy Markdown
Contributor

@lmolkova is there anyone specifically that we should ask for approval?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:app enhancement New feature or request

Projects

Status: Needs More Approval

Development

Successfully merging this pull request may close these issues.

7 participants