influxdb receiver dropping metrics contain 'count' #30433

Closed
fiona-huang-tyro opened this issue Jan 11, 2024 · 8 comments

@fiona-huang-tyro

Describe the bug

We found that if the influxdb receiver receives metrics that contain 'count', it ignores all of the metrics and drops everything. No debug log clearly states that using count in the metrics caused the failure; the only message shown is "skipping unrecognized histogram field". We cannot find any documentation of the keywords used by the influxdb receiver.
It would be good if you could:

  • provide detailed documentation on the keywords we should avoid in influx metrics;
  • update the error message to reflect the exact error;
  • skip only the keyword metric and convert the remaining values rather than dropping everything. For example, in the example below, create the cpu_load_short_value metric and drop cpu_load_short_count.

Steps to reproduce

Send an InfluxDB metric to port 8086:
curl -i -XPOST 'http://localhost:8086/write' --data-binary 'cpu_load_short,host=ps15 value=10i,count=5i 1704933600251384987'

What did you expect to see?

2024-01-11T11:43:16.688+1100	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-01-11T11:43:16.688+1100	info	ResourceMetrics #0
Resource SchemaURL:
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope
Metric #0
Descriptor:
     -> Name: cpu_load_short_value
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> host: Str(ps15)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-01-11 00:40:00.251384987 +0000 UTC
Value: 10
Metric #1
Descriptor:
     -> Name: cpu_load_short_count
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> host: Str(ps15)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-01-11 00:40:00.251384987 +0000 UTC
Value: 5

What did you see instead?
2024-01-11T11:45:30.047+1100 debug [email protected]/logger.go:22 skipping unrecognized histogram field {"kind": "receiver", "name": "influxdb", "data_type": "metrics", "field": "value", "value": 10}

What version did you use?
Version: otel/opentelemetry-collector-contrib:0.89.0

What config did you use?

Otel Config

receivers:
  influxdb:
    endpoint: 0.0.0.0:8086

# Exports data to the console
exporters:
  debug:
    verbosity: detailed
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    metrics:
      receivers: [influxdb]
      processors: []
      exporters: [debug]
@fiona-huang-tyro fiona-huang-tyro added the bug label Jan 11, 2024
@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector Jan 11, 2024
Contributor

Pinging code owners for receiver/influxdb: @jacobmarble. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

Hello @fiona-huang-tyro, sorry for the delayed response. It looks like the error is being logged in a dependency when the receiver adds a data point.

From the location of the log message, it appears to be unrelated to the count field, or having count in the name. Is there a reason it seemed like having count in the metric name was the cause of data being dropped?

The log message shows that the metric that the collector is receiving includes an attribute key value, but the Influx library converting metrics to otel format can't handle that. I don't have enough context at this point to know if this is expected or a bug.

@oldNoakes

hey @crobert-1 - will follow up on this one with you (same team).

The behaviour we are seeing is that using certain words in the field for the metrics causes those metrics to not be sent. The clearest example I can provide is as follows:

I set up the otel config exactly as @fiona-huang-tyro did above:

receivers:
  influxdb:
    endpoint: 0.0.0.0:8086

# Exports data to the console
exporters:
  debug:
    verbosity: detailed
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    metrics:
      receivers: [influxdb]
      processors: []
      exporters: [debug]

If I send the following metric, it works as expected:

curl -i -XPOST 'http://otel:8086/write' --data-binary 'cpu_load_short,tag=tag1 measurement=1i 1709252309000000000'
HTTP/1.1 204 No Content
Date: Fri, 01 Mar 2024 00:41:41 GMT

as I can see in the detailed logs from the otel-collector container that the metric has been converted as expected:

docker_otel[11260]: 2024-03-01T00:41:41.450Z        info        ResourceMetrics #0
docker_otel[11260]: Resource SchemaURL:
docker_otel[11260]: ScopeMetrics #0
docker_otel[11260]: ScopeMetrics SchemaURL:
docker_otel[11260]: InstrumentationScope
docker_otel[11260]: Metric #0
docker_otel[11260]: Descriptor:
docker_otel[11260]:      -> Name: cpu_load_short_measurement
docker_otel[11260]:      -> Description:
docker_otel[11260]:      -> Unit:
docker_otel[11260]:      -> DataType: Gauge
docker_otel[11260]: NumberDataPoints #0
docker_otel[11260]: Data point attributes:
docker_otel[11260]:      -> tag: Str(tag1)
docker_otel[11260]: StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
docker_otel[11260]: Timestamp: 2024-03-01 00:18:29 +0000 UTC
docker_otel[11260]: Value: 1
docker_otel[11260]:         {"kind": "exporter", "data_type": "metrics", "name": "debug"}

but, if I change the metric to have a field called count, it fails and there are no logs:

curl -i -XPOST 'http://otel:8086/write' --data-binary 'cpu_load_short,tag=tag1 count=1i 1709252309000000000'
HTTP/1.1 400 Bad Request
Date: Fri, 01 Mar 2024 00:44:09 GMT
Content-Length: 29
Content-Type: text/plain; charset=utf-8

failed to append to the batch

Similarly for a field called sum:

curl -i -XPOST 'http://otel:8086/write' --data-binary 'cpu_load_short,tag=tag1 sum=1i 1709252309000000000'
HTTP/1.1 400 Bad Request
Date: Fri, 01 Mar 2024 00:46:47 GMT
Content-Length: 29
Content-Type: text/plain; charset=utf-8

failed to append to the batch

Finally, if we have either of these 2 fields in a metric with multiple fields, we lose all the metrics:

curl -i -XPOST 'http://otel:9086/write' --data-binary 'cpu_load_short,tag=tag1 measurement=1i,count=1i 1709252309000000000'
HTTP/1.1 400 Bad Request
Date: Fri, 01 Mar 2024 00:49:12 GMT
Content-Length: 29
Content-Type: text/plain; charset=utf-8

failed to append to the batch

What we are trying to understand is:

  1. Is there a list of reserved keywords that we need to avoid using for the field value in the metrics our teams are generating?
  2. The "failed to append to the batch" error seems to be thrown here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/influxdbreceiver/receiver.go#L186. I am not sure why it is thrown and will take a look as well, but an error message that states why these field names cause the failure would be much clearer.
  3. In the final example (with 2 fields, one of which is invalid), I would expect that the bad field should not stop the other metric from getting created and sent

Will try and dig into what is going on under the covers but any help appreciated

@crobert-1
Member

crobert-1 commented Mar 4, 2024

Hi @oldNoakes, thanks for posting the config and simple repro, it was super helpful!

I was able to reproduce. For context, the receiver uses an InfluxDB package to automatically detect the schema of the data it receives (source). For the data point you're sending via curl that works as expected (no sum or count field), it doesn't detect a known schema, so it goes through the fields and values one by one, setting each metric's type to gauge by default. This is a successful operation, as you've seen.

However, when including a field with the name sum or count, this is detected as a histogram type (source). For histograms, the expected type of count is a float64, so the failure occurs when passing in an int instead (source).
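
The behaviour described in the two paragraphs above can be illustrated with a small Python model. This is only a sketch based on the explanation and logs in this thread, not the receiver's or the InfluxDB package's actual code: the function name `convert_point` and the returned structure are hypothetical, and Python's `float` stands in for Go's `float64`.

```python
from typing import Dict, Tuple, Union

FieldValue = Union[int, float]

# Field names that, per the explanation above, trigger histogram detection.
HISTOGRAM_KEYS = ("count", "sum")


def convert_point(measurement: str, fields: Dict[str, FieldValue]) -> Tuple[str, dict]:
    """Toy model of the schema detection described in this thread."""
    if any(key in HISTOGRAM_KEYS for key in fields):
        # Histogram path: only count/sum are used, and both must be floats.
        accepted = {}
        for key, value in fields.items():
            if key not in HISTOGRAM_KEYS:
                # Mirrors the "skipping unrecognized histogram field" debug log.
                continue
            if not isinstance(value, float):
                raise TypeError(
                    f"histogram field {key!r} must be a float, got {type(value).__name__}"
                )
            accepted[key] = value
        return "histogram", {measurement: accepted}
    # Default path: each field becomes its own gauge named <measurement>_<field>.
    return "gauge", {f"{measurement}_{key}": value for key, value in fields.items()}
```

In this model, `convert_point("cpu_load_short", {"measurement": 1})` takes the gauge path, while adding an integer `count` field raises an error for the whole point, which matches the "failed to append to the batch" responses reported earlier in the thread.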

If you're intentionally sending histograms, please update the type of the values being sent with these keys to be what's expected. Otherwise, please rename the field so it's not automatically detected as a histogram.
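
Both options can be applied on the sending side when building the line protocol payload. The following is a minimal sketch under stated assumptions: the helper names and the `count_total`/`sum_total` replacement names are hypothetical, tag and field keys are assumed to need no escaping, and whether the receiver then accepts the float-typed histogram point was not verified in this thread.

```python
from typing import Dict, Union

FieldValue = Union[int, float]

# Hypothetical replacement names for the fields detected as histogram fields.
RENAMES = {"count": "count_total", "sum": "sum_total"}


def format_field(key: str, value: FieldValue, histogram: bool) -> str:
    """Format one line protocol field; integers carry the 'i' suffix."""
    if key in RENAMES:
        if histogram:
            # Intentional histogram: send count/sum as floats (no 'i' suffix).
            return f"{key}={float(value)}"
        # Not a histogram: rename so the field is not auto-detected.
        key = RENAMES[key]
    if isinstance(value, int):
        return f"{key}={value}i"
    return f"{key}={value}"


def build_line(measurement: str, tags: Dict[str, str],
               fields: Dict[str, FieldValue], timestamp_ns: int,
               histogram: bool = False) -> str:
    """Build a single line protocol line (no escaping of special characters)."""
    tag_part = "".join(f",{k}={v}" for k, v in tags.items())
    field_part = ",".join(format_field(k, v, histogram) for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

For example, `build_line("cpu_load_short", {"tag": "tag1"}, {"measurement": 1, "count": 1}, 1709252309000000000)` produces a line whose fields are `measurement=1i,count_total=1i`, so neither field carries one of the detected names.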

Summary
I believe this is functioning as expected, but I agree this should be more clear from the error message, and potentially be in the README as well.

@crobert-1 crobert-1 added the documentation label and removed the bug and needs triage labels Mar 4, 2024
@oldNoakes

@crobert-1 - awesome explanation - thank you so much for that - we are actively attempting to move internal teams off of the legacy custom solution that has this issue and over to the open-telemetry instrumentation libraries. I can use this as another stick to get them moving.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 20, 2024
@crobert-1 crobert-1 removed the Stale label May 20, 2024
@github-actions github-actions bot added the Stale label Jul 22, 2024
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 20, 2024