[inst-xhr/fetch] Network events incorrectly dropped #5314

chancancode · 2025-01-10T07:09:08Z

What happened?

Steps to Reproduce

1. Set up the fetch instrumentation
2. Set up a service worker (such as with conveniences like msw)
3. Make a fetch() request that is fulfilled by the service worker
4. Observe that (in Chrome, at least) this request resulted in a ResourceTiming entry similar to this:

{
  "name": "http://localhost:9876/api/status.json",
  "entryType": "resource",
  "startTime": 2171.4000000953674,
  "duration": 2.0999999046325684,
  "initiatorType": "fetch",
  "deliveryType": "cache",
  "nextHopProtocol": "",
  "renderBlockingStatus": "non-blocking",
  "workerStart": 2171.600000143051,
  "redirectStart": 0,
  "redirectEnd": 0,
  "fetchStart": 2171.7000000476837,
  "domainLookupStart": 2171.7000000476837,
  "domainLookupEnd": 2171.7000000476837,
  "connectStart": 2171.7000000476837,
  "secureConnectionStart": 0,
  "connectEnd": 2171.7000000476837,
  "requestStart": 2171.600000143051,
  "responseStart": 2173.4000000953674,
  "firstInterimResponseStart": 2173.4000000953674,
  "responseEnd": 2173.5,
  "transferSize": 0,
  "encodedBodySize": 11,
  "decodedBodySize": 11,
  "responseStatus": 200,
  "serverTiming": []
}

Expected Result

The instrumentation produces a span with the usual network events. At present, that would be these 8 events:

fetchStart
domainLookupStart
domainLookupEnd
connectStart
connectEnd
requestStart
responseStart
responseEnd

(I would advocate that workerStart, if present, should be included as well, but that's a different issue)

Actual Result

The instrumentation produces a span with these 7 events:

fetchStart
domainLookupStart
domainLookupEnd
connectStart
connectEnd
responseStart
responseEnd

Note that requestStart is missing.

Additional Details

This is a direct result of #4486, which is intended to be a fix for #4478:

> When browser navigation or resource timing APIs don't set a certain timing value or report it as 0 (e.g. connectionStart, responseStart, ...), the spans generated by instrumentations like xhr/fetch/resource report a value that is lower than fetchStart (timeOrigin), which can result in wrong metrics by backend.

The wording implies/assumes that:

1. When certain timing values are unavailable (0), the instrumentation code has a bug that causes nonsensical events to be created
2. fetchStart can be safely assumed to be the "time origin" of the resource timings

(1) is definitely a bug worth fixing, but (2) turned out to be incorrect. As shown in the example above, when a worker is involved, requestStart === workerStart < fetchStart (in my version of Chrome at least, but IMO any real-world occurrence is sufficient to demonstrate the problem with the current code). Thus, the requestStart network event ends up getting dropped even though it is perfectly valid.

I found this issue with a service worker, but I think the assumption is fundamentally flawed here, and there may be other cases where this results in the same bug.

If we want to check against a "time origin" value, startTime would be the correct one to check.
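To make the comparison concrete, here is a minimal sketch of the reference-time check described above, run against the values from the example entry. `TimingEntry`, `NETWORK_EVENTS`, and `keptEvents` are hypothetical names for illustration, not the instrumentation's actual API, and the sketch assumes an event is kept only when its timestamp is at or after the reference field.

```typescript
// Hypothetical sketch: not the actual instrumentation code.
// Timing fields copied from the example ResourceTiming entry above.
type TimingEntry = Record<string, number>;

const entry: TimingEntry = {
  startTime: 2171.4000000953674,
  workerStart: 2171.600000143051,
  fetchStart: 2171.7000000476837,
  domainLookupStart: 2171.7000000476837,
  domainLookupEnd: 2171.7000000476837,
  connectStart: 2171.7000000476837,
  connectEnd: 2171.7000000476837,
  requestStart: 2171.600000143051,
  responseStart: 2173.4000000953674,
  responseEnd: 2173.5,
};

// The 8 network events the span is expected to carry.
const NETWORK_EVENTS: string[] = [
  'fetchStart',
  'domainLookupStart',
  'domainLookupEnd',
  'connectStart',
  'connectEnd',
  'requestStart',
  'responseStart',
  'responseEnd',
];

// Assumed rule: keep an event only if its timestamp is >= the reference field.
function keptEvents(e: TimingEntry, referenceKey: string): string[] {
  const reference = e[referenceKey];
  return NETWORK_EVENTS.filter((name) => {
    const value = e[name];
    return value !== undefined && reference !== undefined && value >= reference;
  });
}

// With fetchStart as the reference, requestStart (2171.6...) falls below
// fetchStart (2171.7...) and is filtered out.
const withFetchStart = keptEvents(entry, 'fetchStart');

// With startTime as the reference, every event passes the check.
const withStartTime = keptEvents(entry, 'startTime');
```

In this sketch, `withFetchStart` holds the 7 events from the actual result and `withStartTime` holds all 8 expected events.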

However, we could probably just check for timingValue === 0 for the purpose of skipping events. 0 is defined in the resource timing spec as a special default value when the information is not available, so it's not a random coincidence.

@dyladan pointed out 0 could be a valid timing for some of these values, so perhaps that's why we didn't go with that approach. However, I'm not sure if that's the case for our special use case here: these values are high-res performance timestamps relative to performance.timeOrigin, which is typically the navigation event for the initial page load. It doesn't seem possible for a fetch/xhr event to happen at exactly the same instant that the page began to load, though admittedly I haven't confirmed that with the spec.
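A minimal sketch of the simpler zero check, assuming that 0 always means "not available" for resource timing entries. `isUnavailable` and `eventsToRecord` are hypothetical helpers, not part of the instrumentation packages.

```typescript
// Hypothetical sketch: treat a 0 timing value as "not available" and skip it.
type TimingEntry = Record<string, number>;

function isUnavailable(timingValue: number): boolean {
  return timingValue === 0;
}

// Returns the names (in the given order) whose timing value is present and non-zero.
function eventsToRecord(e: TimingEntry, names: string[]): string[] {
  return names.filter((name) => {
    const value = e[name];
    return value !== undefined && !isUnavailable(value);
  });
}

// Subset of the example entry above: redirectStart and secureConnectionStart are 0.
const sample: TimingEntry = {
  redirectStart: 0,
  secureConnectionStart: 0,
  fetchStart: 2171.7000000476837,
  requestStart: 2171.600000143051,
};

const recorded = eventsToRecord(sample, [
  'redirectStart',
  'fetchStart',
  'secureConnectionStart',
  'requestStart',
]);
// recorded is ['fetchStart', 'requestStart']
```

Unlike a comparison against fetchStart, this check keeps requestStart even though it is earlier than fetchStart.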

OpenTelemetry Setup Code

No response

package.json

No response

Relevant log output

No response

Operating System and Version

No response

Runtime and Version

No response

@johnbley

Background:

1. For historical reasons, the perf/resource timing spec uses 0 as a special value to denote when some timing information is either not applicable (e.g. no redirect occurred) or not available (when producing an opaque resource timing object for CORS resources).

2. However, in some limited cases, 0 can also be a legitimate value for these timing events. Note that this is rare in real life: these fields are high-res performance timestamps relative to the performance time origin, which is typically the navigation event for the initial page load. For a _resource_ timing to have a 0 timestamp, it would have to be initiated simultaneously with the page load, and it is unclear if this can actually happen IRL.

   Reportedly (open-telemetry#1769), at one point this was observed in some browsers during automated tests, where things happen very fast and the browser artificially suppresses the timing resolution. It was unclear if the report was about the navigation timing entry or resource timing entries.

   It is also unclear if these utilities are intended for anything other than the internal fetch/XHR instrumentation, but they are public API, so if someone uses these functions on the initial page navigation event, then it is possible for the function to see legitimate 0-value inputs.

3. When creating span events, we do not use the timeOrigin-relative perf timestamps as-is. Rather, we convert them back to UNIX epoch timestamps. When this conversion is applied to those 0 timestamps, it creates nonsensical/misleading events that are quite difficult for downstream consumers to distinguish.

It would be nice if the W3C specs had left the N/A values as `undefined`, but that's not the world we live in, so we have to work with what we've got.

History:

1. Initially, the code ignored 0-value timestamps.
2. open-telemetry#1769 identified cases of valid 0-value timestamps and removed the check.
3. This caused the other category of bugs, where we created the nonsensical events (open-telemetry#2457, open-telemetry#3848, open-telemetry#4478).
4. open-telemetry#3897 added a special case for `secureConnectionStart` by way of tapping into auxiliary metadata.
5. That approach cannot be generalized for the other cases, so open-telemetry#4486 added some rather convoluted checks.
6. As part of refactoring the tests to use service workers, a new bug, open-telemetry#5314, was identified.

Presumably, the patch in open-telemetry#4486 was written that way (as opposed to just checking for 0) to avoid breaking open-telemetry#1769, but I suspect it ends up breaking some of those use cases anyway.

Options:

1. Keep the patch from open-telemetry#4486 largely as-is but change the reference time from `fetchStart` to `startTime`.
2. Revert to the very original intent of the code and drop everything with 0-values.
3. This commit attempts a compromise position: when `startTime === 0` (which probably means we are called with the initial navigation event), 0-values are plausibly valid, so we include them by default; in any other case (`startTime > 0`), legitimate 0-values should not be possible, so we drop them by default.

Part of the issue here is that it's unclear how these utilities are being used, since they are public APIs. In core, these functions are only called by the fetch/xhr instrumentation with resource timing entries, where legitimate 0-values seem extremely unlikely.

In my opinion, the rarity of legitimate 0-value timing values in the real world (recall that it doesn't just mean something happened very fast, but something happened very fast _at the same instant that the page loaded_) should yield to the conflicting interest (avoiding nonsensical events that are difficult to process downstream) here, especially when it appears that the only observed cases are from automated testing.

Personally I would be in favor of the stronger position (2), but attempted to strike a balance here to keep things useful in the other cases.

In open-telemetry#2457, @johnbley argued:

> I think that for this piece of tech, our telemetry should report
> whatever the browser says with minimal processing, and let
> downstream/backend/easier-to-modify components deal with
> normalization/truncation/processing if desired (e.g., throwing
> away absurd times like "that page took 6 years or -5 hours to
> load", or declaring that "no redirects happened so the redirect
> timing is not 0 but non-existent"). Does that not work in your
> world for some reason?

I could see this perspective if "minimal processing" meant that we send the 0-values to the backends as 0s. It's still unfortunate, but at least it would be easy to write a query/filter to process them away. However, given that we are actively normalizing the 0-values into the absolute epoch timestamp of `performance.timeOrigin`, I think this becomes actively hostile to backend/downstream processing, as it turns a relatively clear signal (value=0) into something that would require stateful machinery to reverse engineer (infer the timeOrigin from a trace), or imperfect heuristics (ignore things that seem to take way too long). It also takes up bytes/bandwidth to transmit in the vast majority of cases where it would be actively unhelpful.

Instead, I'd argue it creates way less harm to drop the 0-values. When a 0-value is expected but not sent, it may skew the distribution of some aggregate metrics slightly (e.g. making "average time for SSL handshake" appear large), but probably nothing would break. If we are expecting backends to do work to normalize data anyway, then IMO it would also be way easier for them to notice the missing items and insert them back than the other way around.

Fixes open-telemetry#3199
Fixes open-telemetry#5314

See also open-telemetry#1769, open-telemetry#2457, open-telemetry#3848, open-telemetry#3879, open-telemetry#4478, open-telemetry#4486
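Option 3 from the commit message can be sketched roughly as follows. This illustrates the rule as described in the text and is not the code in the pull request; `shouldIncludeTiming`, `toEpochMillis`, and their parameters are hypothetical names. The second helper assumes the epoch conversion is a plain addition of `performance.timeOrigin`, which is why a 0 input ends up looking like a real timestamp.

```typescript
// Hypothetical sketch of the compromise rule (option 3).
// entryStartTime: the entry's startTime; timingValue: the field being considered.
function shouldIncludeTiming(entryStartTime: number, timingValue: number): boolean {
  if (timingValue !== 0) {
    return true; // non-zero values are kept
  }
  // A 0 is only plausible when the entry itself starts at the time origin,
  // e.g. the initial navigation entry.
  return entryStartTime === 0;
}

// Converts a timeOrigin-relative timestamp to epoch milliseconds (assumed to be
// a simple sum). A 0 input maps to timeOrigin itself, so an unavailable value
// turns into a plausible-looking but misleading event time.
function toEpochMillis(timeOrigin: number, relative: number): number {
  return timeOrigin + relative;
}
```

For the example entry in this issue (`startTime` of about 2171.4), this rule would skip `secureConnectionStart` (0) and keep `requestStart`.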

chancancode added the bug (Something isn't working) and triage labels on Jan 10, 2025

pichlermarc added the priority:p2 (Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect), pkg:instrumentation-fetch, and pkg:instrumentation-xml-http-request labels and removed the triage label on Jan 10, 2025

chancancode linked a pull request Jan 11, 2025 that will close this issue

fix(inst-fetch,inst-xhr) Ignore network events with zero-timing #5332

Open

8 tasks

chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Jan 13, 2025

Temporary fix for open-telemetry#5314

7213c2b

Without this, the next set of test fails intermittently

chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Jan 14, 2025

Temporary fix for open-telemetry#5314

90c890e

Without this, the next set of test fails intermittently

chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Jan 14, 2025

Temporary fix for open-telemetry#5314

b2379d0

Without this, the next set of test fails intermittently

chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Jan 14, 2025

Temporary fix for open-telemetry#5314

3a07a5f

Without this, the next set of test fails intermittently
