Definition for a basic end-user app crash event#3448
Definition for a basic end-user app crash event#3448bidetofevil wants to merge 9 commits intoopen-telemetry:mainfrom
Conversation
LikeTheSalad
left a comment
There was a problem hiding this comment.
Thanks @bidetofevil, I think it's a great start! I just have one comment regarding app.crash.id, which, to me, is the only attribute that's not too clear what it's for. Apart from that, I think it's good to go.
| by those attributes caused the crash. When used, these attributes SHOULD be | ||
| usable for debugging without additional data not specified in the event | ||
| (e.g. stacktrace already deobfuscated). If that is not the case, additional | ||
| information like `app.build.id` SHOULD be provided to make the data useful for |
There was a problem hiding this comment.
| information like `app.build.id` SHOULD be provided to make the data useful for | |
| information like `app.build_id` SHOULD be provided to make the data useful for |
| requirement_level: | ||
| conditionally_required: Recommended if `app.async` is set, opt_in otherwise. | ||
| - ref: app.crash.id | ||
| requirement_level: required |
There was a problem hiding this comment.
I'm curious why this attribute is required? Based on its description, it seems like it could be used to reference it in other telemetry data. Is that for context purposes, similar to what the trace id value does?
There was a problem hiding this comment.
Yeah, this should probably be more "opt_in". It's meant to be a canonical ID for that crashed instance used for deduping and cross-referencing, like if you were getting this data from a source outside the app and don't want to report this more than once. But if the instrumentation is just capturing this live and creating an event for it, it's not strictly necessary.
There was a problem hiding this comment.
Got it. I agree it shouldn't be required, so thanks for updating it. I also think it could be very useful for cases like grouping crashes, as mentioned by @thompson-tomo in another comment, but I want to make sure I fully understand it first. The description summarised reads that:
- It's a unique identifier for a crash.
- It could come from outside of the instrumentation.
- It "COULD be meaningful" and referenced elsewhere.
- It can facilitate deduplication.
That info seems too generic for me to get a grasp of its usage. For example, regarding the "uniqueness" of it, does it mean that it must be globally unique or unique for the same or similar type of crash occurrences? If it's the latter, then it sounds like it could be used to group occurrences of the same crash?
Regarding "It COULD come from a source external to the instrumentation", can you mention an example of an external source?
What do you mean when you say that its value "COULD be meaningful"? For example, meaningful to group similar crashes as mentioned above? Or is it mostly meaningful to avoid exporting the same crash more than once? (in reference to the "facilitate deduplication" point)? Or both?
| - ref: app.async | ||
| requirement_level: opt_in |
There was a problem hiding this comment.
| - ref: app.async | |
| requirement_level: opt_in |
I think it would be cleaner & simpler if we removed this attribute. I would then focus app.crash on the sync use case including removing the unnecessary attributes. We could then introduce a process.crash to handle the async use case.
There was a problem hiding this comment.
The async use case is very common, like with native crashes on Android. I think any app crash event needs to include this use case, as it compries a large swath of them.
And the reason it's in an attribute of its own is because while it's most prominent for crashes, other types of async-report telemetry exists, so we should define a framework for it outside of crashes.
There was a problem hiding this comment.
I have no problem with the use case of async capturing hence the proposal of using app.* events for sync use case which we would eventually associate with an app entity and potentially the process. Then for async we use process.* events which we would eventually associate with a procees entity. This way the async events have access to a reproducible identity etc and we reduce the need for the conditionally attributes.
This gives you a framework of:
- app.* events to describe the inner workings of a process providing user access to an app by including app specific attributes
- process.* events to describe something related to a process running within a system
- system.* events to describe the something related to the system
This continues what we already have for metrics between system & process.
There was a problem hiding this comment.
So the async crash may not have to do with the process at all, just a different mode of information-getting and telemetry recording. Making this an attribute that cuts across all events make more sense to me than carving out a special case in another namespace for which other async events have to duplicate as well.
There was a problem hiding this comment.
I get that and i understand that it is a different mode of information-getting and telemetry recording which is why I am thinking about how this telemetry can be associated with entities.
This also reduces the need for so many conditional attributes based on the use case by having dedicated events
There was a problem hiding this comment.
Hi, native is indeed overloaded, if you talk to a react native developer who develops the code in ts or js then for them native is java iteself
There was a problem hiding this comment.
Sorry @LikeTheSalad and @atulgpt - just getting back to this.
"Native", in this case, means anything (e.g. C++, Rust, etc.) compiled into a shared library file that is "native" to the device architecture that can be called through JNI (Java-Native Interface) and developed using the Android NDK (Native Development Kit). This is why the word native is used (on Android): it's literally what we call it.
But this is just the most common example: this is a boolean flag because it doesn't talk about the specific contents of the crash (e.g. C++ stacktraces), but rather the manner from which the data for the crash is obtained.
JVM crashes are the exception, where you can catch a crash before it happens (i.e. live), and record it synchronously with the same OTel SDK and app instance that is initialized with the CURRENT state of the app when the crash is taking place. Whereas async is the more common case, where you obtain a tombstone of information AFTER the app instance where the crash has taken place died, so you can't just use the reporting OTel SDK instance to provide the details of the app during the crash because it will be different, as the reporting is happening some time after.
Here's an example:
App version 1.1 produces a JVM crash, which is cause with the UncaughtExceptionHandler and written to an OTel event before the app crashes. In the OTLP, it will report that app.version of the crash is 1.1 because that's what the Resources say.
But say the same app version produced a C++ crash. The process dies, and the tombstone is written asynchronously to disk before the OTel event can be recorded. Before the app starts up next time, it is upgrade to 1.2, so when it starts up again, app.version of the new OTel instance will be 1.2.
Having this flag will allow us to distinguish between these two types of crashes when they are processed in the backend, not only for crashes, but for other types of telemetry as well.
There was a problem hiding this comment.
Thanks for the thorough explanation, @bidetofevil. After going back through the thread, I think the main concern raised earlier is not about whether async crashes are a valid use case, but about whether this is the right design for them.
Your latest comment, combined with the rest of the thread, helped me get a better idea of the overall proposal and the reason for the questions and suggestions on the async flag. The issue is that this proposal seems to combine two different use cases into a single event. The question for me is whether app.crash should carry a boolean flag plus 5 conditional attributes that become recommended when it is enabled. That starts to feel like two schemas inside one event, which is harder to reason about.
It's also hard to evaluate confidently at this stage because we haven't implemented this flow in OTel Android yet. So far, we've focused on JVM stacktraces, which are still the most common type of crash Android developers tend to deal with. From my perspective, that makes this feel a bit early to standardize, especially as part of a "basic" crash event that doesn't otherwise need these attributes.
The more I look at it, the more it seems like the async case may need its own event. That would keep app.crash (or even something like app.crash.sync) simple and focused, while giving the async use case a dedicated schema that can evolve independently.
Let me know if you have concerns with the approach of having a separate event for each use case.
There was a problem hiding this comment.
As discussed at the Android SIG meeting, I'll move the definition of app.async to another PR. I've modified the wording of the description to remove references to the attribute, but kept in the description the possibly of the event being logged async and included the opt_in attributes that are relevant in that case.
Let me know if this look OK @LikeTheSalad. We can have a discussion about the async stuff in a different PR :-)
| Further, any related attributes not specified in the signal MUST be taken from the Resource (e.g. | ||
| if `service.version` is specified but `service.name` is not, it is assume it will be in the Resource). | ||
| For telemetry that omits this attribute, it SHOULD be assumed that it was NOT recorded asynchronously. | ||
| - id: app.crash.id |
There was a problem hiding this comment.
What about
| - id: app.crash.id | |
| - id: exception.occurance.id |
That way it can be better reused for things such as Grouping nested exceptions into a group using a shared attribute.
There was a problem hiding this comment.
A few reasons I want this in the app namespace:
- I want to restrict this to end-user facing apps, which the namespaces does. Crash has a special meaning in this space and I want to keep the semantics
- Exceptions aren't the only causes of crashes -- at least not ones that we can know the details of.
- There are additional things that will be put on this. When part 2 of this PR comes out with additional details about the crash, it will become more apparent
|
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days. |
|
So who can approve this or give me guidance on what is blocking it from approval? @open-telemetry/semconv-client-approvers? |
There are a couple of pending questions, at least from my side:
Can you take a look at those, @bidetofevil ? |
|
Oh shoot. Sorry I missed these @LikeTheSalad. I'll take a look. |
|
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days. |
|
I eff the commit history. Will fix this. |
cdb06b4 to
f4fdde1
Compare
|
Reopen with proper rebasing and updates per discussion |
|
@bidetofevil, I think you need to regenerate the markdown files to sort the CI check error. |
|
Ugh, I thought I checked already. Will try again. |
|
Removed |
| - ref: service.version | ||
| requirement_level: opt_in |
There was a problem hiding this comment.
Should this also be removed given we removed service.name via ff0fe1c & we already have app.build_id?
There was a problem hiding this comment.
Build IDs aren't often used to track different version. There's also no sem-ver so you can only tell if one is different than the other, whereas version gives you a lot richer information. This is almost always preferred when folks are tagging telemetry from mobile apps as a point of code differentiation.
There was a problem hiding this comment.
Yes but that meaning/understanding is only possible when you know what it corresponds to. In the case here we already removed service.name hence we don't know what it is for.
There was a problem hiding this comment.
The removal of service.name was because it was relatively low value and there's an unrelated issue that prevented the MD from being generated property (it contains text with a broken reference link when ported to this event). But the idea that there could be a different service.version still exists even if we don't explicitly say it was from an async recording of the event. In fact, it is still an important attribute - to know a crash came in a version that is different than the one producing the event.
|
@lmolkova is there anyone specifically that we should ask for approval? |
Changes
Add a new event in the
appnamespace for a crash of an end-user facing app. Further specification of what a crash is defined as are provided in the event definition. There are implementations abound (async: Embrace native crash impl, synchronous: OTel Android JVM crash impl), but none strictly follow the convention as stated.Merge requirement checklist
[chore]