Basic prometheus metrics #745

Closed
jmprusi opened this issue Jun 5, 2018 · 23 comments

@jmprusi
Contributor

jmprusi commented Jun 5, 2018

APIcast ships with prometheus support, but only exposes the nginx_metric_errors_total metric.

I would like to propose some basic metrics to be added to the APIcast base policy:

Counters

  • Request:

    • Total
    • Request 2xx
    • Request 4xx
    • Request 5xx
  • Connections:

    • Read
    • Write
    • Wait
    • Open
  • Nginx Error Log

  • Free Dictionary Space

  • Threescale (fetching config):

    • Update
    • Reload

Histogram

  • Request latency

Some of them were already added to apicast-cloud-hosted: https://github.com/3scale/apicast-cloud-hosted/pull/5/files#diff-047b1780b0ffeb4eba7b3d05beb76d5e

What do you think? Any other metrics to add?

@mikz
Contributor

mikz commented Jun 7, 2018

APIcast policy does several things:

  • extracts credentials from the request and possibly terminates the request with an error
  • maps mapping rules to metrics
  • calls 3scale for authorization
  • sends the request to upstream
  • uses round robin load balancer

I think the APIcast policy metrics should be focused on those operations (and maybe on some others I missed). IMO that does not include the nginx error log, free shdict space (unless it monitors the shdicts APIcast uses), or nginx connections.

If we want those metrics, then they should be in some other policy (possibly active by default).

Error log and shdict space monitoring are very important metrics that should be available somehow.

One metric we could add is how many requests (and with what status) were terminated by a policy rather than answered by the upstream.

@andrewdavidmackenzie
Member

Reading between the lines, it sounds like there's a slight difference in where the metrics are implemented.

Joaquim lists a set of metrics and suggests implementing them in the APIcast policy.

Michal lists a set that are related to apicast policy operations and would make sense to do there.

But it's not clear how to handle the ones that are not related to the APIcast policy.

BTW: I think I saw people asking for "#BytesTransferred" also in sme-apis.

@mikz
Contributor

mikz commented Jun 7, 2018

@andrewdavidmackenzie metrics is a phase implemented by policies. Each policy can expose metrics about its operation, and in the end they are merged together. So the APIcast policy would expose metrics about itself and other policies would expose metrics about their functionality. Of course there could be policies that just expose metrics.
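
A rough way to picture that merge, as a minimal Python sketch rather than APIcast's actual Lua policy API. The class names and the `metrics()`, `collect()` and `render()` helpers are hypothetical; only the metric names and sample values come from output posted in this thread.

```python
from typing import Dict, List


class Policy:
    """Hypothetical base class: a policy may expose metrics about its own work."""

    def metrics(self) -> Dict[str, float]:
        return {}


class ApicastPolicy(Policy):
    def metrics(self) -> Dict[str, float]:
        # Metrics about 3scale-specific operations only.
        return {'backend_response{status="2xx"}': 13.0}


class NginxMetricsPolicy(Policy):
    """A policy that exists only to expose gateway-wide metrics."""

    def metrics(self) -> Dict[str, float]:
        return {'nginx_http_connections{state="active"}': 2.0}


def collect(chain: List[Policy]) -> Dict[str, float]:
    """Run the metrics phase of every policy in the chain and merge the results."""
    merged: Dict[str, float] = {}
    for policy in chain:
        merged.update(policy.metrics())
    return merged


def render(samples: Dict[str, float]) -> str:
    """Render the merged samples, one exposition line per sample, sorted by name."""
    return "\n".join(f"{name} {value:g}" for name, value in sorted(samples.items()))


print(render(collect([ApicastPolicy(), NginxMetricsPolicy()])))
```

A policy that exposes nothing, like the bare `Policy` here, simply contributes no lines to the merged output.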

@andrewdavidmackenzie
Member

OK, cool.

My main point was that it sounded to me like Joaquim was asking for some metrics that are not related to any specific policy, but the underlying NGINX?
(Free dictionary space etc....)

@mikz
Contributor

mikz commented Jun 7, 2018

Yes. And we can expose them in some other policy or make every policy responsible for monitoring its own free space. But we will need some global non-APIcast/3scale metrics anyway, so it is probably better to put them in some nginx metrics policy.

@andrewdavidmackenzie
Member

👍

@jmprusi
Contributor Author

jmprusi commented Jun 12, 2018

@mikz Yes, it makes sense to have specific apicast policy metrics (related mostly to threescale, and operation mode) and then have another policy for basic metrics.

@davidor
Contributor

davidor commented Jul 26, 2018

Ping @3scale/product
Can you provide your input for this feature and decide whether it should be part of the next release?
Thanks.

@mikz
Contributor

mikz commented Jul 31, 2018

@davidor I think this is not necessary for the next release, but it is possible that it will be done by the ostia team.

@andrewdavidmackenzie
Member

This was raised last week by Product as a last minute request.
If @MarkCheshire can get us a simple list, ASAP, then we said we'd consider it.
But I'd say it's at the bottom of the priority list and we shouldn't delay the release for this.

@mikz mikz added this to the 3.3 milestone Aug 22, 2018
@MarkCheshire
Contributor

I recommend the base set of metrics to start:

3scale-auth status codes: Total, 2xx, 4xx, 5xx
Upstream status codes: Total, 2xx, 4xx, 5xx
Request time:

  • must: full end-to-end latency
  • must: upstream latency
  • optional: breakdown of latency in APIcast pre- and post-request

Connections (per Joaquim): Read, Write, Wait, Open

@davidor davidor self-assigned this Aug 28, 2018
@davidor davidor mentioned this issue Aug 29, 2018
@davidor
Contributor

davidor commented Aug 31, 2018

This is what is going to be included in 3.3: #860
I'll keep the issue open so we can discuss what to include in future versions.

@davidor davidor removed this from the 3.3 milestone Aug 31, 2018
@andrewdavidmackenzie
Member

or close and have a new enhancement issue to discuss what to add?

(It's nice to see issues get closed.....)

@gnunn1

gnunn1 commented Sep 5, 2018

Are prometheus metrics available in the current master version of apicast? I'm curling the /metrics endpoint where you would normally find prometheus metrics and not seeing anything:

sh-4.2$ curl -i http://127.0.0.1:8080/metrics
HTTP/1.1 404 Not Found
Server: openresty/1.13.6.2
Date: Wed, 05 Sep 2018 19:41:18 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive

sh-4.2$ curl -i http://127.0.0.1:8090/metrics
HTTP/1.1 404 Not Found
Server: openresty/1.13.6.2
Date: Wed, 05 Sep 2018 19:41:23 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive

Could not resolve GET /metrics - nil

Is there an environment variable that needs to be enabled?

@mikz
Contributor

mikz commented Sep 6, 2018

@gnunn1 Metrics are exposed on port 9421.

$ curl localhost:9421/metrics
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 1
nginx_http_connections{state="active"} 1
nginx_http_connections{state="handled"} 1
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 1
nginx_http_connections{state="waiting"} 0
nginx_http_connections{state="writing"} 1
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="batched_reports"} 1048576
openresty_shdict_capacity{dict="batched_reports_locks"} 1048576
openresty_shdict_capacity{dict="cached_auths"} 1048576
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="limiter"} 1048576
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10412032
openresty_shdict_free_space{dict="batched_reports"} 1032192
openresty_shdict_free_space{dict="batched_reports_locks"} 1032192
openresty_shdict_free_space{dict="cached_auths"} 1032192
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="limiter"} 1032192
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528
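
The capacity and free-space gauges above are easier to read as a used fraction per dictionary. The following is a minimal Python sketch of that calculation; the `shdict_usage` helper and its regular expression are illustrative assumptions, not part of APIcast, and the sample lines are copied from the output above.

```python
import re
from typing import Dict

# Matches lines such as: openresty_shdict_capacity{dict="init"} 16384
SAMPLE_RE = re.compile(r'^(\w+)\{dict="([^"]+)"\}\s+(\d+)$')


def shdict_usage(text: str) -> Dict[str, float]:
    """Return the used fraction (0.0 to 1.0) of each shared dictionary."""
    capacity: Dict[str, int] = {}
    free: Dict[str, int] = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips comments and metrics without a dict label
        name, dict_name, value = match.group(1), match.group(2), int(match.group(3))
        if name == "openresty_shdict_capacity":
            capacity[dict_name] = value
        elif name == "openresty_shdict_free_space":
            free[dict_name] = value
    return {d: 1 - free[d] / capacity[d] for d in capacity if d in free}


SAMPLE = """\
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="api_keys"} 10412032
"""

usage = shdict_usage(SAMPLE)
for name, used in sorted(usage.items()):
    print(f"{name}: {used:.1%} used")
```

With these sample values the `init` dictionary comes out at 75% used (4096 of 16384 bytes free), while `api_keys` is under 1% used.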

@gnunn1

gnunn1 commented Sep 6, 2018

Thanks @mikz, that works fine. Are 4xx and 5xx response codes supposed to be captured in the prometheus metrics like the 2xx response codes? If I execute a request in postman that generates a 404 response, i.e. requesting a REST entity that doesn't exist or where a mapping rule hasn't been set in 3scale, I don't see the 4xx response status codes being returned with backend_response{status="4xx"}

I do have APICAST_RESPONSE_CODES set to true and do see the 4xx response codes in 3scale analytics.

Here's the output of metrics after making a few 404 requests:

sh-4.2$ curl localhost:9421/metrics
# HELP backend_response Response status codes from 3scale's backend
# TYPE backend_response counter
backend_response{status="2xx"} 13
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 180
nginx_http_connections{state="active"} 2
nginx_http_connections{state="handled"} 180
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 195
nginx_http_connections{state="waiting"} 1
nginx_http_connections{state="writing"} 1
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="batched_reports"} 1048576
openresty_shdict_capacity{dict="batched_reports_locks"} 1048576
openresty_shdict_capacity{dict="cached_auths"} 1048576
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="limiter"} 1048576
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10407936
openresty_shdict_free_space{dict="batched_reports"} 1032192
openresty_shdict_free_space{dict="batched_reports_locks"} 1032192
openresty_shdict_free_space{dict="cached_auths"} 1032192
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="limiter"} 1032192
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528

@davidor
Contributor

davidor commented Sep 6, 2018

@gnunn1 , I'm using the version in the master branch and it works for me. I made a request with a valid user_key and another with an invalid one and this is what I get:

# HELP backend_response Response status codes from 3scale's backend
# TYPE backend_response counter
backend_response{status="2xx"} 1
backend_response{status="4xx"} 1

Keep in mind that the backend_response counter only shows status codes received from the 3scale backend. The backend_response counter does not show the status codes received by whoever is calling APIcast. I think this is what is causing confusion here.

When a request does not match any mapping rules, APIcast does not need to contact the 3scale backend because mapping rules are stored in the APIcast configuration. APIcast only needs to call backend to validate credentials (user_key, app_key, etc.) and to report metrics.
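
The distinction can be pictured with a small Python simulation. This is a simplified sketch under stated assumptions, not APIcast's implementation: the mapping rule, the keys, the order paths and the `handle` function are hypothetical, and authorization caching and report batching are ignored.

```python
from collections import Counter
from typing import Optional

# Hypothetical in-memory stand-ins for the gateway configuration and 3scale.
MAPPING_RULES = ("/orders",)
VALID_USER_KEYS = {"valid-key"}
EXISTING_ORDERS = {"/orders/3"}

backend_response = Counter()  # statuses returned by the 3scale backend
upstream_status = Counter()   # statuses returned by the upstream API


def status_class(status: int) -> str:
    return f"{status // 100}xx"


def handle(path: str, user_key: Optional[str]) -> int:
    """Return the status the client sees, updating the counters along the way."""
    # 1. Mapping rules live in the local configuration: no backend call is needed.
    if not any(path.startswith(rule) for rule in MAPPING_RULES):
        return 404
    # 2. Credentials are validated by calling the 3scale backend.
    auth_status = 200 if user_key in VALID_USER_KEYS else 403
    backend_response[status_class(auth_status)] += 1
    if auth_status != 200:
        return auth_status
    # 3. Only authorized requests reach the upstream.
    status = 200 if path in EXISTING_ORDERS else 404
    upstream_status[str(status)] += 1
    return status
```

In this model an unmatched path returns 404 to the client without touching `backend_response`, while an authorized request for a missing resource increments `backend_response{status="2xx"}` even though the client receives a 404 from the upstream.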

@gnunn1

gnunn1 commented Sep 6, 2018

@davidor The URL I am using to hit the service is:

http://apicast-rhoar.apps.ocplab.com/orders/3

This returns a 200 since order 3 is an available item. However if I change 3 to 5 as follows:

http://apicast-rhoar.apps.ocplab.com/orders/5

The backend service returns a 404 since order 5 doesn't exist. Interestingly, the Prometheus metrics increment the 2xx response as a result of this, despite postman showing that a 404 is returned. Is this a case where a 404 is considered "successful" since in REST calls it can be a valid response? It doesn't feel intuitive if this is the case, though, and maybe it deserves its own category?

Validating this with curl against apicast:

curl -i -H "user-key:xxxx" http://apicast-rhoar.apps.ocplab.com/orders/5
HTTP/1.1 404 
Server: openresty/1.13.6.2
Date: Thu, 06 Sep 2018 22:29:59 GMT
Content-Type: application/json
Content-Length: 35
X-Application-Context: gateway:kubernetes
Set-Cookie: 49d08fc35ccdc462e0e3e881ac73eef6=8b49e0c49ad92d0f89a37ec8b6bca4a4; path=/; HttpOnly
Cache-control: private

404 - Requested order doesn't exist

And then directly against the backend:

curl -i -H "user-key:xxxx" http://gateway-rhoar.apps.ocplab.com/orders/5
HTTP/1.1 404 
X-Application-Context: gateway:kubernetes
Date: Thu, 06 Sep 2018 22:31:56 GMT
Content-Type: application/json
Content-Length: 35
Set-Cookie: 565785e74d5ae03867cd44d8c8709949=e9cb74129f1f17d3a9280501f4c6d9cf; path=/; HttpOnly
Cache-control: private

404 - Requested order doesn't exist

If I change the user-key to an invalid entry then the 4xx is incremented in response to 403 forbidden as per your findings.

With regards to your explanation about why the Not Matching mapping rules scenario doesn't increment the counter, I'm curious why a bad user-key increments the 4xx counter on a 403 since presumably apicast never calls the backend in this scenario either?

@mikz
Contributor

mikz commented Sep 7, 2018

@gnunn1 the metric backend_response is the 3scale backend response, not your upstream response. You can see it increment on a 403 when you use a wrong user key, for example.

But it is a good point to rename the metric, as backend_response does not indicate it is 3scale specific.

@andrewdavidmackenzie
Member

"backend" is an internal term we try to avoid using "externally" (customer visible).

(3scale) "Service Management API" is the official term of the API that is used and returns that response.

Either a generic "3scale" or "authrep request" or something is needed to clarify this.

@davidor
Contributor

davidor commented Oct 16, 2018

We've added several metrics in different PRs. All of them are linked in this issue.
You can check the full list of metrics exported in this document: https://github.com/3scale/apicast/blob/master/doc/prometheus-metrics.md

@davidor davidor closed this as completed Oct 16, 2018
@gnunn1

gnunn1 commented Oct 16, 2018

@davidor I installed apicast from master and can see the response times. However, if I am reading them correctly, they are global to the gateway rather than service or mapping/metric specific. Are there any plans to make these more granular so we could build more specific dashboards in something like grafana?

# HELP total_response_time_seconds Time needed to sent a response to the client (in seconds).
# TYPE total_response_time_seconds histogram
total_response_time_seconds_bucket{le="00.005"} 5
total_response_time_seconds_bucket{le="00.010"} 5
total_response_time_seconds_bucket{le="00.020"} 5
total_response_time_seconds_bucket{le="00.030"} 5
total_response_time_seconds_bucket{le="00.050"} 5
total_response_time_seconds_bucket{le="00.075"} 5
total_response_time_seconds_bucket{le="00.100"} 5
total_response_time_seconds_bucket{le="00.200"} 5
total_response_time_seconds_bucket{le="00.300"} 6
total_response_time_seconds_bucket{le="00.400"} 6
total_response_time_seconds_bucket{le="00.500"} 6
total_response_time_seconds_bucket{le="00.750"} 6
total_response_time_seconds_bucket{le="01.000"} 6
total_response_time_seconds_bucket{le="01.500"} 6
total_response_time_seconds_bucket{le="02.000"} 6
total_response_time_seconds_bucket{le="03.000"} 6
total_response_time_seconds_bucket{le="04.000"} 6
total_response_time_seconds_bucket{le="05.000"} 6
total_response_time_seconds_bucket{le="10.000"} 6
total_response_time_seconds_bucket{le="+Inf"} 6
total_response_time_seconds_count 6
total_response_time_seconds_sum 0.3
# HELP upstream_response_time_seconds Response times from upstream servers
# TYPE upstream_response_time_seconds histogram
upstream_response_time_seconds_bucket{le="00.005"} 5
upstream_response_time_seconds_bucket{le="00.010"} 5
upstream_response_time_seconds_bucket{le="00.020"} 5
upstream_response_time_seconds_bucket{le="00.030"} 5
upstream_response_time_seconds_bucket{le="00.050"} 5
upstream_response_time_seconds_bucket{le="00.075"} 5
upstream_response_time_seconds_bucket{le="00.100"} 6
upstream_response_time_seconds_bucket{le="00.200"} 6
upstream_response_time_seconds_bucket{le="00.300"} 6
upstream_response_time_seconds_bucket{le="00.400"} 6
upstream_response_time_seconds_bucket{le="00.500"} 6
upstream_response_time_seconds_bucket{le="00.750"} 6
upstream_response_time_seconds_bucket{le="01.000"} 6
upstream_response_time_seconds_bucket{le="01.500"} 6
upstream_response_time_seconds_bucket{le="02.000"} 6
upstream_response_time_seconds_bucket{le="03.000"} 6
upstream_response_time_seconds_bucket{le="04.000"} 6
upstream_response_time_seconds_bucket{le="05.000"} 6
upstream_response_time_seconds_bucket{le="10.000"} 6
upstream_response_time_seconds_bucket{le="+Inf"} 6
upstream_response_time_seconds_count 6
upstream_response_time_seconds_sum 0.1
# HELP upstream_status HTTP status from upstream servers
# TYPE upstream_status counter
upstream_status{status="200"} 6
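
For readers less familiar with the format: Prometheus histogram buckets are cumulative, so each `le` line counts every observation at or below that bound. A minimal Python sketch of turning them back into per-bucket counts follows; the `per_bucket` helper is illustrative, and the input is a subset of the `total_response_time_seconds` buckets shown above.

```python
from typing import Dict, List, Tuple


def per_bucket(cumulative: List[Tuple[str, int]]) -> Dict[str, int]:
    """Convert cumulative `le` bucket counts into counts per bucket."""
    counts: Dict[str, int] = {}
    previous = 0
    for upper_bound, total in cumulative:
        if total > previous:
            counts[upper_bound] = total - previous
        previous = total
    return counts


# A subset of the total_response_time_seconds buckets from the output above.
total_buckets = [("00.005", 5), ("00.200", 5), ("00.300", 6), ("+Inf", 6)]
print(per_bucket(total_buckets))   # {'00.005': 5, '00.300': 1}

# Mean latency is _sum / _count: 0.3 s over 6 requests.
print(round(0.3 / 6, 3))           # 0.05
```

Read this way, five of the six requests completed within 5 ms and one took between 200 and 300 ms, for a mean of about 50 ms across the whole gateway.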

@davidor
Contributor

davidor commented Oct 17, 2018

@gnunn1 Including services, upstreams, metrics, etc. in the Prometheus labels is something we'll evaluate in the future. Prometheus might not be the right tool to store that kind of information.

According to the Prometheus guidelines it is not recommended to use labels for dimensions that can have a large number of values, and in some deployments, the number of services, upstreams, and 3scale metrics can be very high.
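
To make the cardinality concern concrete, here is a back-of-the-envelope Python sketch. The bucket count matches the histograms posted earlier in this thread (20 `le` buckets each); the figures of 500 services and 10 metrics per service are hypothetical, chosen only to illustrate the growth.

```python
BUCKETS = 20                        # `le` buckets per histogram in the output in this thread
SERIES_PER_HISTOGRAM = BUCKETS + 2  # plus the _count and _sum series


def histogram_series(histograms: int, services: int = 1, metrics_per_service: int = 1) -> int:
    """Time series created when each histogram is split by service and 3scale metric."""
    return histograms * SERIES_PER_HISTOGRAM * services * metrics_per_service


# Two gateway-wide histograms, as exposed today.
print(histogram_series(2))                                        # 44

# The same two histograms labelled by service and by 3scale metric (hypothetical sizes).
print(histogram_series(2, services=500, metrics_per_service=10))  # 220000
```

Every additional label multiplies the series count by its number of distinct values, which is why high-cardinality dimensions are discouraged as labels.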
