Summary counter breaks under high load #189

Closed
DifferentialOrange opened this issue Feb 9, 2021 · 4 comments · Fixed by #241 or #407

@DifferentialOrange
Member

DifferentialOrange commented Feb 9, 2021

I tried to monitor the event loop with the following code:

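local fiber = require('fiber')
local clock = require('clock')

-- Measure how long each yield takes and record it in the summary collector.
local function monitor(collector)
    local time_before
    while true do
        time_before = clock.monotonic()
        fiber.yield()
        collector:observe(clock.monotonic() - time_before)
    end
end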

local function init()
    local collector = require('metrics').summary('tnt_fiber_event_loop', 'event loop time',
        { [0.5] = 0.01, [0.9] = 0.01, [0.99] = 0.01, })

    fiber.create(monitor, collector)
end

After half a minute I got:

{
   "label_pairs":{
      "alias":"tnt_router"
   },
   "timestamp":1612882820817429,
   "metric_name":"tnt_fiber_event_loop_count",
   "value":7766500
},
{
   "label_pairs":{
      "alias":"tnt_router"
   },
   "timestamp":1612882820817429,
   "metric_name":"tnt_fiber_event_loop_sum",
   "value":7.6474393837925
},
{
   "label_pairs":{
      "quantile":0.5,
      "alias":"tnt_router"
   },
   "timestamp":1612882820821171,
   "metric_name":"tnt_fiber_event_loop",
   "value":8.4099883679301e-07
},
{
   "label_pairs":{
      "quantile":0.9,
      "alias":"tnt_router"
   },
   "timestamp":1612882820873518,
   "metric_name":"tnt_fiber_event_loop",
   "value":"inf"
},
{
   "label_pairs":{
      "quantile":0.99,
      "alias":"tnt_router"
   },
   "timestamp":1612882820893734,
   "metric_name":"tnt_fiber_event_loop",
   "value":"inf"
}
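
A quick way to spot the broken quantiles in such output is to scan the observations for non-finite values. The following is a minimal sketch in plain Lua. The helper name `find_non_finite` is hypothetical, and the sample table is a trimmed copy of the values reported above (timestamps and the alias label are omitted).

```lua
-- Hypothetical helper: returns the observations whose value is not a
-- finite number (NaN, +inf, -inf, or not convertible to a number).
local function find_non_finite(observations)
    local bad = {}
    for _, obs in ipairs(observations) do
        local v = tonumber(obs.value)
        -- v ~= v holds only for NaN; +/-math.huge are the infinities.
        if v == nil or v ~= v or v == math.huge or v == -math.huge then
            table.insert(bad, obs)
        end
    end
    return bad
end

-- Values copied from the report above.
local sample = {
    { metric_name = 'tnt_fiber_event_loop_count', value = 7766500 },
    { metric_name = 'tnt_fiber_event_loop_sum', value = 7.6474393837925 },
    { metric_name = 'tnt_fiber_event_loop',
      label_pairs = { quantile = 0.5 }, value = 8.4099883679301e-07 },
    { metric_name = 'tnt_fiber_event_loop',
      label_pairs = { quantile = 0.9 }, value = math.huge },
    { metric_name = 'tnt_fiber_event_loop',
      label_pairs = { quantile = 0.99 }, value = math.huge },
}

local bad = find_non_finite(sample)
for _, obs in ipairs(bad) do
    print(obs.metric_name, (obs.label_pairs or {}).quantile, obs.value)
end
```

Inside Tarantool the same helper could presumably be applied to the table returned by `metrics.collect()`, since its entries carry the same `metric_name`, `label_pairs` and `value` fields as the output above.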

Also, the observation count in the collector wasn't equal to the one in the internal structure:

help: event loop time
    observations:
      '': 7997000
    name: tnt_fiber_event_loop_count
    label_pairs:
      '': &6 []
  registry: *1
  objectives:
    0.5: 0.01
    0.9: 0.01
    0.99: 0.01
  sum_collector:
    registry: *1
    help: event loop time
    observations:
      '': 8.0931331436805
    name: tnt_fiber_event_loop_sum
    label_pairs:
      '': *6
  help: event loop time
  observations:
    '':
      b: 'cdata<double [?]>: 0x4113c3e0'
      compress_cnt: 500
      __max_samples: 500
      b_len: 0
      stream:
        f: 'function: 0x4113c100'
        l_len: 141219
        l_cap: 194581
        l: 'cdata<struct 913 [?]>: 0x460de020'
        n: 8008162
      sorted: true
  name: tnt_fiber_event_loop
  label_pairs:
    '': []
@yngvar-antonsson yngvar-antonsson added the bug Something isn't working label Feb 9, 2021
@yngvar-antonsson yngvar-antonsson self-assigned this Feb 9, 2021
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days

@yngvar-antonsson
Collaborator

It seems the problem no longer reproduces after #241.

@yngvar-antonsson yngvar-antonsson linked a pull request May 26, 2021 that will close this issue
@DifferentialOrange
Member Author

DifferentialOrange commented May 6, 2022

Unfortunately, it's still relevant.

See https://github.com/tarantool/grafana-dashboard/tree/DifferentialOrange/crud-report for a setup that reproduces it: start a cluster with docker-compose up and watch the tnt_crud_stats{operation="select",alias="tnt_router",status="ok",name="customers",quantile="0.99"} or tnt_crud_stats{operation="insert",alias="tnt_router",status="ok",name="customers",quantile="0.99"} metric abruptly turn into -Inf.

Setup:

local DEFAULT_QUANTILES = {
    [0.99] = 1e-2,
}

local DEFAULT_AGE_PARAMS = {
    age_buckets_count = 2,
    max_age_time = 60,
}

If changed to

local DEFAULT_QUANTILES = {
    [0.99] = 1e-3,
}

everything seems fine.
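
For a standalone summary collector, the same workaround would look roughly like the sketch below. It assumes the `metrics.summary(name, help, objectives, params)` signature from the tarantool/metrics API reference. The metric name `example_latency` and the observed value are hypothetical, and the module is loaded through `pcall` so the snippet also runs outside Tarantool.

```lua
-- Objectives map a quantile to its tolerated error; 1e-3 is the value
-- that avoided the problem in the setup above.
local objectives = { [0.99] = 1e-3 }

-- Sliding-window parameters, copied from the setup above.
local age_params = { age_buckets_count = 2, max_age_time = 60 }

-- The metrics module is only available inside Tarantool, so load it defensively.
local ok, metrics = pcall(require, 'metrics')
if ok then
    -- 'example_latency' is a hypothetical metric name.
    local collector = metrics.summary('example_latency',
        'example latency, seconds', objectives, age_params)
    collector:observe(0.0005)
end
```

This only sidesteps the symptom by tightening the tolerated error. It does not address the underlying bug in the quantile computation.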

DifferentialOrange added a commit to tarantool/crud that referenced this issue May 6, 2022
Make metrics quantile collector tolerated error [1] configurable. Change
metrics quantile collector default tolerated error from 1e-2 to 1e-3.

The motivation of this patch is a tarantool/metrics bug [2]. Sometimes
quantile values turn to `-Inf` under high load when observations are
small. It was reproduced in process of developing Grafana dashboard
panels for CRUD stats.

Quantile tolerated error could be changed with crud.cfg:

  crud.cfg{stats_quantile_tolerated_error = 1e-4}

1. https://www.tarantool.io/ru/doc/latest/book/monitoring/api_reference/#summary
2. tarantool/metrics#189
3. https://github.com/tarantool/grafana-dashboard/tree/DifferentialOrange/crud-report

Revert "stats: make quantile tolerated error configurable"

This reverts commit 32c6f5eabecc907ef570b66e15029dc9b4d6debf.
DifferentialOrange added a commit to tarantool/crud that referenced this issue May 6, 2022
Make metrics quantile collector tolerated error [1] configurable. Change
metrics quantile collector default tolerated error from 1e-2 to 1e-3.

The motivation of this patch is a tarantool/metrics bug [2]. Sometimes
quantile values turn to `-Inf` under high load when observations are
small. It was reproduced in process of developing Grafana dashboard
panels for CRUD stats [3].

Quantile tolerated error could be changed with crud.cfg:

  crud.cfg{stats_quantile_tolerated_error = 1e-4}

1. https://www.tarantool.io/ru/doc/latest/book/monitoring/api_reference/#summary
2. tarantool/metrics#189
3. https://github.com/tarantool/grafana-dashboard/tree/DifferentialOrange/crud-report
@filonenko-mikhail
Contributor

Repro:

local fiber = require('fiber')
local clock = require('clock')
local log = require('log')

local function monitor(collector)
    local time_before
    while true do
        time_before = clock.monotonic()
        fiber.yield()
        collector:observe(clock.monotonic() - time_before)
    end
end

local function init()
    local collector = require('metrics').summary('tnt_fiber_event_loop', 'event loop time',
        { [0.5] = 0.01, [0.9] = 0.01, [0.99] = 0.01, })

    fiber.create(function() monitor(collector) end)
end

init()

require('console').start()
os.exit(1)

And manually, in the console:

metrics = require('metrics')
metrics.collect()
