Timestamps with decimals cause metric forwarding to get stuck #620
This change normalizes metric timestamps to prevent those inadvertently created with floating point numbers from breaking the processing of the metric. It also wraps the processing in a try/catch to make sure invalid metrics are correctly discarded so metric processing doesn't get stuck.
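A minimal sketch of the kind of normalization being described, assuming the exporter receives OpenTelemetry `HrTime` tuples (`[seconds, nanoseconds]`); the names and structure here are illustrative, not the actual patch:

```ts
type HrTime = [number, number];

// Round away floating point artifacts (e.g. a nanosecond field of
// 534000118.00000006) so later date conversion does not throw
// "RangeError: Invalid time value".
function normalizeHrTime([seconds, nanos]: HrTime): HrTime {
  return [Math.round(seconds), Math.round(nanos)];
}

// Discard points that still fail to transform instead of letting the
// exception abort the whole batch.
function safeTransform<T, R>(points: T[], transform: (p: T) => R): R[] {
  const out: R[] = [];
  for (const point of points) {
    try {
      out.push(transform(point));
    } catch (err) {
      console.warn('Dropping invalid metric point', err);
    }
  }
  return out;
}
```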
@sethrwebster do you have a repro for making the OpenTelemetry SDK generate non-integer numbers? This sounds like a bug upstream; I think consumers should safely be able to expect integers. I was wondering if it's possibly related to open-telemetry/opentelemetry-js#4014, but I don't think so, given that is converting from HrTime.
Could you explain a bit more: what is processing these metrics? A stack trace would be helpful.
Looks like it should be an integer coming from the SDK https://github.com/open-telemetry/opentelemetry-js/blob/1a8652aa5466510d2df2a232a0c8aa78857619c4/packages/opentelemetry-core/src/common/time.ts#L30-L37, so I'm inclined to say it's a bug upstream.
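For what it's worth, the shape of the bad value in the report can be produced by plain JavaScript arithmetic; the snippet below is only an illustration of the floating point behavior, not a claim about the SDK's actual code path:

```ts
// Converting fractional milliseconds to nanoseconds with plain
// multiplication is not exact in binary floating point.
const fractionalMillis = 534.000118;   // sub-millisecond precision
const nanos = fractionalMillis * 1e6;

console.log(nanos);                    // 534000118.00000006
console.log(Number.isInteger(nanos));  // false
```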
Hi, thanks for taking a look at this. I think you are correct that this is a bug upstream. But because the underlying HrTime type allows invalid data, it seemed useful to prevent the issue here rather than rely solely on finding all the places HrTime objects are created and making sure they are created properly (though clients supplying invalid data should also be fixed). In our case, we are using metric objects such as Histograms to store the metric values. We don't supply any timestamps, so those are being generated somewhere in the …
I believe they're only created with the code I linked above, which is using …
I'll admit I'm at a bit of a loss on how to proceed. Our code generating the metrics doesn't create the timestamps, and all the places I've looked in the …
Any thoughts on how to resolve this issue would be greatly appreciated.
How often are you seeing the SDK produce these non-integers? If you could repro it and open an issue in the upstream open-telemetry/opentelemetry-js repo, that would be awesome. That stack trace is useful; I agree it's probably coming from `@google-cloud/precise-date`:

```
$ node
Welcome to Node.js v18.13.0.
Type ".help" for more information.
> const {PreciseDate} = require('@google-cloud/precise-date');
undefined
> let d = new PreciseDate([1695096840, 534000118.00000006]);
undefined
> d
PreciseDate Invalid Date { _micros: 0, _nanos: 6e-8 }
> d.toISOString()
Uncaught RangeError: Invalid time value
    at PreciseDate.toISOString (<anonymous>)
    at PreciseDate.toISOString (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/node_modules/@google-cloud/precise-date/build/src/index.js:297:22)
>
```
That definitely seems like a bug outside of the specific issue, but I'm confused how that's happening, because we don't buffer/retry failed exports. What do you mean by pending metrics? Afaik the exception would bubble out of this function (packages/opentelemetry-cloud-monitoring-exporter/src/monitoring.ts, line 156 at 666a6d4), causing the promise to reject and do this (packages/opentelemetry-cloud-monitoring-exporter/src/monitoring.ts, line 117 at 666a6d4).
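In other words, roughly this flow (a hedged sketch, not the repo's exact code; `ExportResult` and `ExportResultCode` are the real types from `@opentelemetry/core`, while the function name is made up): a throw during transformation rejects the send promise, which is reported back as a failed export rather than buffered for retry.

```ts
import { ExportResult, ExportResultCode } from '@opentelemetry/core';

// Illustrative only: how an exception thrown while transforming/sending
// metrics surfaces as a failed ExportResult instead of a retried batch.
async function sendBatch(
  transformAndSend: () => Promise<void>,
  resultCallback: (result: ExportResult) => void
): Promise<void> {
  try {
    await transformAndSend();
    resultCallback({ code: ExportResultCode.SUCCESS });
  } catch (err) {
    // The batch is dropped here; nothing is queued for a later attempt.
    resultCallback({ code: ExportResultCode.FAILED, error: err as Error });
  }
}
```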
I think what's actually happening is that the bad timestamp is the start timestamp from your original bug report. OTel will keep reporting this same start timestamp every export cycle (since the metrics are CUMULATIVE), which seems to be the actual reason you keep hitting the issue repeatedly. So I don't think the try/catch you added would help.
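A small illustration of that behavior (a conceptual sketch, not SDK output; the values are made up apart from the invalid start timestamp quoted in the report):

```ts
// For CUMULATIVE temporality the SDK keeps the original start time for the
// life of the process, so an invalid start timestamp is re-emitted on every
// collection cycle.
const startTime: [number, number] = [1695096840, 534000118.00000006]; // captured once, invalid

// export cycle 1 -> { startTime, endTime: t1, value: 10 }
// export cycle 2 -> { startTime, endTime: t2, value: 17 }   // same bad startTime
// export cycle 3 -> { startTime, endTime: t3, value: 31 }   // ...and again
```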
One other thing I noticed: the …
That makes sense that the problem is the same start time being used on each call. Looking at …
One other weird thing: I noticed the timestamps in your example are beyond millisecond precision. Are you sure about the dependency versions you mentioned above?
I was able to repro the problem with …
The setup restarts the node script after 10,000 attempts to get a new …
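The repro details above are truncated, but a check along these lines is one way such a loop could detect the problem on each fresh process (an assumed sketch, not the reporter's actual script; `hrTime` is the real helper exported by `@opentelemetry/core`):

```ts
import { hrTime } from '@opentelemetry/core';

// Capture the HrTime the SDK would use for this process and fail loudly if
// either component is not an integer; an outer loop (not shown) restarts the
// script until that happens.
const [seconds, nanos] = hrTime();
if (!Number.isInteger(seconds) || !Number.isInteger(nanos)) {
  console.error('non-integer HrTime captured:', seconds, nanos);
  process.exit(1);
}
```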
Hi, I think you are exactly correct. We are on a recent version of … If it is useful to you, I can remove the extra …
No worries, glad we figured it out.
I think we determined that invalid metrics won't block future metrics, as there is no retry happening. In that case, I'd rather the exceptions go up the stack. If that's OK with you, can you close this issue/PR out?
What version of OpenTelemetry are you using?
What version of Node are you using?
v16.14.2
What did you do?
Very occasionally, metrics created with OpenTelemetry end up with a decimal part to their timestamps. This is presumably due to numeric precision issues causing integer math to create a floating point number. Attempting to process metrics with floating point timestamps causes a RangeError exception to be thrown. The exception isn't caught until somewhere well up the stack. A side effect is that the list of pending metrics is not cleared out, resulting in this error being hit every time the system tries to forward the list of pending metrics. The following is an example of a metric seen when this problem was encountered:
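(The original example is not preserved in this capture of the thread; the sketch below is illustrative only, reusing the invalid start timestamp quoted elsewhere in the discussion, with a made-up metric name and value.)

```ts
// Shape of a problematic data point (illustrative):
const dataPoint = {
  descriptor: { name: 'example.request.duration' },   // hypothetical metric
  startTime: [1695096840, 534000118.00000006],        // non-integer nanoseconds
  endTime: [1695096900, 123456789],
  value: 42,
};
```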
The function that transforms the metrics should:
- normalize timestamps so values inadvertently created with floating point numbers don't break processing, and
- catch errors while transforming individual metrics so invalid metrics are discarded instead of blocking the rest of the batch.
I'll submit a PR shortly with the above fixes.