
Timestamps with decimals cause metric forwarding to get stuck #620

Closed
sethrwebster opened this issue Sep 19, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@sethrwebster (Contributor) commented Sep 19, 2023

What version of OpenTelemetry are you using?

"@opentelemetry/api": "1.4.1",
"@opentelemetry/core": "1.15.2",
"@opentelemetry/resources": "1.15.2",
"@opentelemetry/sdk-metrics": "1.15.2",

What version of Node are you using?

v16.14.2

What did you do?

Very occasionally, metrics created with OpenTelemetry end up with a decimal part in their timestamps, presumably because numeric precision issues cause what should be integer math to produce a floating-point value. Attempting to process metrics with floating-point timestamps causes a RangeError to be thrown, and the exception isn't caught until somewhere well up the stack. A side effect is that the list of pending metrics is not cleared out, so this error is hit every time the system tries to forward the pending metrics. The following is an example of a metric seen when this problem was encountered:

    "attributes":
    {
    },
    "startTime":
    [
        1695096840,
        534000118.00000006
    ],
    "endTime":
    [
        1695096846,
        89824450
    ],
    "value": 22

The function that transforms the metrics should:

  1. Round timestamps to prevent inadvertent decimal values from creating problems
  2. Catch any exceptions to prevent them from bubbling up the stack and bypassing clean-up code. There is already a precedent in the function for discarding invalid metrics; this would extend it to discard any metric for which an exception is thrown while it is being processed.

I'll submit a PR shortly with the above fixes.
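For illustration, here is a minimal sketch of the two guards proposed above (the helper names normalizeHrTime and transformSafely are hypothetical, not the exporter's actual functions):

    import {HrTime} from '@opentelemetry/api';

    // Guard 1: round both fields so floating-point residue never reaches the exporter.
    function normalizeHrTime(time: HrTime): HrTime {
      return [Math.round(time[0]), Math.round(time[1])];
    }

    // Guard 2: transform each metric inside a try/catch and discard any metric
    // that throws, instead of letting the exception bubble up past clean-up code.
    function transformSafely<T, R>(metrics: T[], transform: (metric: T) => R): R[] {
      const results: R[] = [];
      for (const metric of metrics) {
        try {
          results.push(transform(metric));
        } catch (err) {
          console.warn('Discarding metric that failed to transform', err);
        }
      }
      return results;
    }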

sethrwebster added the bug label Sep 19, 2023
sethrwebster added a commit to mablhq/opentelemetry-operations-js that referenced this issue Sep 19, 2023
This change normalizes metric timestamps to prevent those inadvertently created
with floating point numbers from breaking the processing of the metric. It also
wraps the processing with a try/catch to make sure invalid metrics are correctly
discarded so metric processing doesn't get stuck.
@aabmass (Contributor) commented Sep 20, 2023

@sethrwebster do you have a repro for making the OpenTelemetry SDK generate non-integer numbers? This sounds like an upstream bug; I think consumers should be able to safely expect integers, since process.hrtime.bigint() exists.

I was wondering if it's possibly related to open-telemetry/opentelemetry-js#4014 but I don't think so given that is converting from HrTime.

> Attempting to process metrics with floating point timestamps causes a RangeError exception to be thrown. The exception isn't caught until somewhere well up the stack.

Could you explain a bit more about what is processing these metrics? A stack trace would be helpful.

@aabmass (Contributor) commented Sep 20, 2023

@sethrwebster (Contributor, Author)

Hi, thanks for taking a look at this. I think you are correct that this is a bug upstream. But because the underlying HrTime type allows invalid data, it seemed useful to guard against the issue here rather than rely solely on finding all the places HrTime objects are created and making sure they are constructed properly (though clients supplying invalid data should also be fixed). Unfortunately, HrTime is defined as a type and not a class, so I don't know of any way to enforce constraints (like the values being integers) when it is created.

In our case, we are using metric objects such as Histograms to store the metric values. We don't supply any timestamps, so those are being generated somewhere in the opentelemetry-js library.
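To illustrate the point about HrTime being a plain type alias (a [seconds, nanoseconds] tuple) rather than a class: nothing at compile time or run time rejects a fractional nanosecond field, so a value like the one from the original report type-checks fine.

    import {HrTime} from '@opentelemetry/api';

    // Compiles and runs without complaint even though the nanosecond
    // field is not an integer.
    const bad: HrTime = [1695096840, 534000118.00000006];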

@aabmass (Contributor) commented Sep 21, 2023

I believe they're only created with the code I linked above, which uses Math.round() and Math.trunc() to ensure integers.

@sethrwebster (Contributor, Author)

I'll admit I'm at a bit of a loss on how to proceed. Our code generating the metrics doesn't create the timestamps, and all the places I've looked in the opentelemetry-js library where metrics are recorded appear to be using the millisToHrTime function properly. I'm not familiar with this codebase, though, so I'm sure I've missed code paths. Regardless, we are definitely seeing examples of metrics that contain a non-integer nanosecond value, and it is causing the metrics processing to get stuck. The stack trace isn't very informative: the error is thrown while exporting a set of previously generated metrics, so it doesn't indicate where or by whom the metrics were created:

Invalid time value
{"stack":"Error: PeriodicExportingMetricReader: metrics export failed (error RangeError: Invalid time value)
    at PeriodicExportingMetricReader._runOnce (/app/node_modules/@opentelemetry/sdk-metrics/src/export/PeriodicExportingMetricReader.ts:91:13)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:538:9)
    at processTimers (node:internal/timers:512:7)
    at Timeout._onTimeout (/app/node_modules/@opentelemetry/sdk-metrics/src/export/PeriodicExportingMetricReader.ts:101:9)","message":"PeriodicExportingMetricReader: metrics export failed (error RangeError: Invalid time value)","name":"Error"}

Any thoughts on how to resolve this issue would be greatly appreciated.

@aabmass (Contributor) commented Sep 22, 2023

How often are you seeing the SDK produce these non-integers? If you could repro it and open an issue in the upstream open-telemetry/opentelemetry-js repo, that would be awesome.

That stack trace is useful. I agree it's probably coming from new PreciseDate(...).toISOString(), since I can repro that:

$ node
Welcome to Node.js v18.13.0.
Type ".help" for more information.
> const {PreciseDate} = require('@google-cloud/precise-date');
undefined
> let d = new PreciseDate([1695096840, 534000118.00000006]);
undefined
> d
PreciseDate Invalid Date { _micros: 0, _nanos: 6e-8 }
> d.toISOString()
Uncaught RangeError: Invalid time value
    at PreciseDate.toISOString (<anonymous>)
    at PreciseDate.toISOString (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/node_modules/@google-cloud/precise-date/build/src/index.js:297:22)
>

> A side effect is that the list of pending metrics is not cleared out, resulting in this error being hit every time the system tries to forward the list of pending metrics.

That definitely seems like a bug outside of this specific issue, but I'm confused how it's happening because we don't buffer/retry failed exports. What do you mean by pending metrics? AFAIK the exception would bubble out of this function:

...createTimeSeries(metric, resource, this._metricPrefix)

causing the promise to reject and do this:

resultCallback({code: ExportResultCode.FAILED, error: err});

@aabmass (Contributor) commented Sep 22, 2023

I think what's actually happening is that the bad timestamp is the start timestamp, from your original bug report:

    "startTime":
    [
        1695096840,
        534000118.00000006
    ],

OTel will keep reporting this same start timestamp every export cycle (as the metrics are CUMULATIVE), which seems to be the actual cause of you seeing the issue repeatedly. So I don't think the try/catch you added would help.
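A small illustration of why the failure recurs (a sketch, not the exporter's actual code): with cumulative temporality the same startTime is converted again on every export cycle, so the same RangeError is raised each time rather than once.

    import {PreciseDate} from '@google-cloud/precise-date';

    // The bad startTime from the original report; cumulative metrics carry it
    // forward unchanged on every export cycle.
    const startTime: [number, number] = [1695096840, 534000118.00000006];

    for (let cycle = 1; cycle <= 3; cycle++) {
      try {
        console.log(new PreciseDate(startTime).toISOString());
      } catch (err) {
        // Prints "Invalid time value" on every cycle, not just the first.
        console.error(`export cycle ${cycle}:`, (err as Error).message);
      }
    }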

@aabmass (Contributor) commented Sep 22, 2023

> places I've looked in the opentelemetry-js library where metrics are recorded appear to be properly using the millisToHrTime function. I'm not familiar with this codebase, so I'm sure I've missed code paths though.

One other thing I noticed: the hrTime() function calls addHrTimes(), which is doing some arithmetic. I wonder if that's where the floating-point error is introduced.
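For illustration only, here is a simplified sketch (not the actual @opentelemetry/core implementation, which applies rounding) of how converting fractional-millisecond inputs such as performance.timeOrigin and performance.now() to HrTime and then adding them can leave floating-point residue in the nanosecond field:

    type HrTime = [number, number];

    // Hypothetical, simplified conversion: note there is no rounding here.
    function millisToHrTimeSketch(millis: number): HrTime {
      const seconds = Math.trunc(millis / 1000);
      const nanos = (millis - seconds * 1000) * 1e6;
      return [seconds, nanos];
    }

    // Hypothetical, simplified addition with nanosecond carry.
    function addHrTimesSketch(a: HrTime, b: HrTime): HrTime {
      let seconds = a[0] + b[0];
      let nanos = a[1] + b[1];
      if (nanos >= 1e9) {
        nanos -= 1e9;
        seconds += 1;
      }
      return [seconds, nanos];
    }

    // Inputs shaped like performance.timeOrigin and performance.now() values;
    // the nanosecond field of the sum ends up with a fractional part.
    const origin = millisToHrTimeSketch(1695674335501.697);
    const now = millisToHrTimeSketch(250.0495640039444);
    console.log(addHrTimesSketch(origin, now));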

@sethrwebster (Contributor, Author)

That makes sense that the problem is the same start time being used on each call. Looking at addHrTimes, it is only doing addition and subtraction. Is it possible to end up with non-integer values when doing addition or subtraction starting with integers in JavaScript? I didn't think so, but I could be wrong.

@aabmass (Contributor) commented Sep 25, 2023

One other weird thing: I noticed the timestamps in your example are beyond millisecond precision ([1695096846, 89824450]), while the OpenTelemetry metrics SDK seems to always give millisecond-precision timestamps like [ 1695671283, 798000000 ] when I tested it out. I tracked this change down to open-telemetry/opentelemetry-js#3514, which was released in SDK version 1.9.1 (the changelog is wrong if you look at the code at the release tags).

Are you sure about the dependency versions you mentioned above?

"@opentelemetry/api": "1.4.1",
"@opentelemetry/core": "1.15.2",
"@opentelemetry/resources": "1.15.2",
"@opentelemetry/sdk-metrics": "1.15.2",

@aabmass (Contributor) commented Sep 25, 2023

I was able to repro the problem with @opentelemetry/[email protected]; here's a branch: https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/compare/main...aabmass:opentelemetry-operations-js:hrtime-repro?expand=1. cd into that directory, run npm install, then run the repro.sh script and possibly wait ~30s for it to encounter the issue.

[ 1695674335, 751246217 ] 249.5499540567398
[ 1695674335, 751262637 ] 249.56618404388428
[ 1695674335, 751285377 ] 249.5889140367508
[ 1695674335, 751309937 ] 249.6135540008545
[ 1695674335, 751332797 ] 249.63635408878326
[ 1695674335, 751354687 ] 249.65818405151367
[ 1695674335, 751377217 ] 249.68073403835297
[ 1695674335, 751397727 ] 249.70124399662018
[ 1695674335, 751419097 ] 249.72259402275085
[ 1695674335, 751438417 ] 249.74190402030945
[ 1695674335, 751460227 ] 249.76378405094147
[ 1695674335, 751483757 ] 249.78736400604248
[ 1695674335, 751506607 ] 249.8101440668106
[ 1695674335, 751529427 ] 249.83304405212402
[ 1695674335, 751551937 ] 249.8555040359497
[ 1695674335, 751573487 ] 249.87701404094696
[ 1695674335, 751595077 ] 249.89872407913208
[ 1695674335, 751617277 ] 249.92093408107758
[ 1695674335, 751638907 ] 249.9424340724945
[ 1695674335, 751661077 ] 249.9645940065384
[ 1695674335, 751700917.0000001 ] 250.00472402572632
/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:5
    throw new RangeError(
    ^

RangeError: Got non-int hrTime: 1695674335,751700917.0000001, timeorigin=1695674335501.697, now=250.0495640039444
    at check (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:5:11)
    at Object.<anonymous> (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:16:3)
    at Module._compile (node:internal/modules/cjs/loader:1218:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1272:10)
    at Module.load (node:internal/modules/cjs/loader:1081:32)
    at Module._load (node:internal/modules/cjs/loader:922:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
    at node:internal/main/run_main_module:23:47

Node.js v18.13.0

The setup restarts the node script after 10,000 attempts in order to get a new performance.timeOrigin, which seems necessary to repro the bug. The other really bizarre thing is that it always happens right after performance.now() passes 250, but not every time. If you add some logging to the actual hrTime() function you may be able to figure out the problem. But it would be better to upgrade your dependencies if possible, because it seems you're on an older version, and newer releases have obsoleted the issue.
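For reference, here is a hypothetical reconstruction of what the repro's check() in index.js presumably does, inferred only from the error message above (not the actual repro code):

    import {hrTime} from '@opentelemetry/core';
    import {performance} from 'perf_hooks';

    // Fail loudly whenever hrTime() returns a non-integer field.
    function check(): void {
      const time = hrTime();
      if (!Number.isInteger(time[0]) || !Number.isInteger(time[1])) {
        throw new RangeError(
          `Got non-int hrTime: ${time}, timeorigin=${performance.timeOrigin}, now=${performance.now()}`
        );
      }
    }

    // The repro script calls check() repeatedly and restarts the process every
    // 10,000 attempts so a fresh performance.timeOrigin is picked up.
    for (let i = 0; i < 10_000; i++) {
      check();
    }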

@sethrwebster (Contributor, Author) commented Sep 26, 2023

Hi, I think you are exactly correct. We are on a recent version of opentelemetry-operations-js, but an older version of opentelemetry-js. I think I accidentally grabbed the versions out of the opentelemetry-operations-js package.json instead of ours when filling out the bug report because I had too many files open, which confused the matter further. I apologize for wasting your time on this, but I really appreciate the help in getting to the bottom of the issue.

If it is useful to you, I can remove the extra Math.round from my PR but leave the try/catch as a guard against any future errors that show up, or we can just close the PR if you'd rather let exceptions go up the stack and be dealt with by the caller. Regardless, this issue can be closed, as upgrading to a recent version of opentelemetry-js appears to be the answer.

@aabmass (Contributor) commented Sep 26, 2023

No worries, glad we figured it out.

> Wraps a try/catch around the metric processing to make sure any other errors don't cause the invalid metric to be left in place, blocking all future metric processing.

I think we determined that invalid metrics won't block future metrics as there is no retry happening. In that case, I'd rather the exceptions go up the stack. If that's OK with you, can you close this issue/PR out?

sethrwebster closed this as not planned Sep 27, 2023