
prometheus support #1507

Open
fujita opened this issue Nov 22, 2017 · 32 comments
@fujita
Member

fujita commented Nov 22, 2017

Is anyone interested in this?

@CoreTex

CoreTex commented Nov 23, 2017

Yes, but more interesting would be Zabbix.

@benerudolph

Maybe there is a common way to handle both Prometheus and other monitoring systems generically?

@nstgt
Contributor

nstgt commented Nov 25, 2017

These metrics would be helpful for us:

  • route counts: adj-in, accepted, and local RIB (for route server mode)
  • per-peer BGP packet counters
  • buffering status
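The metrics listed above map naturally onto the Prometheus text exposition format, which is plain text and can be rendered with the standard library alone. A minimal sketch in Go; the metric names, labels, and values are hypothetical stand-ins, not taken from any gobgp API:

```go
// Render hypothetical gobgp metrics in the Prometheus text exposition format.
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// formatMetric renders one sample, e.g. bgp_routes{peer="10.0.0.1"} 42.
// Labels are sorted so the output is deterministic.
func formatMetric(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

// writeMetrics renders all samples; in a real exporter an http.HandlerFunc
// registered at /metrics would call this with the ResponseWriter. The
// values here are hard-coded stand-ins for numbers read from the daemon.
func writeMetrics(w io.Writer) {
	fmt.Fprintln(w, formatMetric("bgp_routes", map[string]string{"peer": "10.0.0.1", "rib": "adj_in"}, 42))
	fmt.Fprintln(w, formatMetric("bgp_messages_received_total", map[string]string{"peer": "10.0.0.1"}, 1234))
	fmt.Fprintln(w, formatMetric("bgp_up", nil, 1))
}

func main() {
	writeMetrics(os.Stdout)
}
```

Whether this runs as a goroutine inside gobgpd or as a separate exporter daemon, the scrape endpoint looks the same to Prometheus.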

@kilo-bit

kilo-bit commented Dec 8, 2017

Prometheus support would be greatly appreciated.

@imcom
Contributor

imcom commented Dec 26, 2017

+1 for Prometheus

@tonidas

tonidas commented Jan 4, 2018

+1 for Prometheus!

@xiaopeng163
Contributor

+1 for Prometheus!

@greenpau
Contributor

greenpau commented May 2, 2018

Is anyone interested in this?

@fujita, I am interested 😃 There are at least two ways to do that.

  1. A separate project, gobgp_exporter, which queries the gobgp API.
  2. A separate goroutine in gobgp.

Any preference?

@greenpau
Contributor

greenpau commented May 3, 2018

@fujita , I looked at the gobgp code and I personally like option 1. Could you create a separate project, say gobgp_exporter, for the exporter under osrg? I could put up a PR for the exporter code pretty fast. The client API for gobgp is pretty straightforward, e.g. the showRibInfo() function.

@fujita
Member Author

fujita commented Jun 2, 2018

Sorry about the delay, I've just started to investigate implementation options.

@greenpau why can't we pick the best of options 1 and 2? Having a separate project (exporter) forces users to manage another daemon. Can't we run a goroutine that executes the gRPC API and handles HTTP requests from Prometheus?

I've just created a branch that only exports the received count for each peer:
fujita@090cca2

@greenpau
Contributor

greenpau commented Jun 2, 2018

@fujita, it depends on the circumstances.

  1. Sync vs. async polling... imagine you want to monitor a routing table containing thousands of routes. Do you want the polling to be sync or async?
  2. Containerizing daemons. Once you containerize gobgp with Prometheus built in, the different goroutines compete for the same CPU, because a container gets one CPU.
  3. Reloading configs. How do you reload the Prometheus config?
  4. etc.

Prometheus exporters are designed on a per-app basis. Compare node_exporter and blackbox_exporter and you will get the gist.

@fujita
Member Author

fujita commented Jun 3, 2018

  1. Sync vs. async polling.

I don't understand. I don't see how this is related to the exporter and exposing models.

  2. Containerizing daemons.

1 CPU for a container? Why? You can configure whatever you like.

  3. Reloading configs. How do you reload the Prometheus config?

I had a quick look at the official exporters (mysql, consul). Looks like they don't support config reloading.

@greenpau
Contributor

greenpau commented Jun 4, 2018

@fujita , see below.

1 cpu for a container? Why? You can configure what you like.

Not in containers. With docker for example, you can configure cpusets. However, that setting just allows you to switch between CPUs, it does not allow running on multiple CPUs concurrently.

I don't understand. I don't see how this is related to the exporter and exposing models.

Polling = scraping

Sync scraping means you make a call to the exporter's URL, and data collection starts then. Once the data is collected, it is returned to your scraper.

Async scraping means that data collection runs independently of the scrape calls. In that mode, the scraper gets the data from a data-collection buffer.

Why does that matter? Consider an exporter with hundreds of metrics where data collection takes 1-3s. That's the time your GoBGP server routine competes with the collector routines for CPU time. It obviously does not matter when you collect the status of BGP peers, because that probably takes 0.00001s.

Looks like they don't support the config reloading.

Right. Say you want to change the configuration of the exporter: would you restart GoBGP and drop the neighbor relationships with your peers?

@fujita
Member Author

fujita commented Jun 4, 2018

Not in containers. With docker for example, you can configure cpusets. However, that setting just allows you to switch between CPUs, it does not allow running on multiple CPUs concurrently.

Hmm, really? Inside a single container, can you try running two processes, each executing an endless loop? Then look at the output of the top command on the host OS.

@greenpau
Contributor

greenpau commented Jun 4, 2018

Hmm, really?

Your GoBGP and Prometheus goroutines are not separate processes. They are goroutines; hence they will be on the same CPU.

@greenpau
Contributor

greenpau commented Jun 4, 2018

can you try to run two processes; each execute an endless loop?

@fujita , that would work :-) However, those would NOT be parent and child processes like in a container. They would be two separate parents. In a container, you have a single parent (akin to init) and a bunch of child processes.

@greenpau
Contributor

greenpau commented Jun 4, 2018

By default, processes can run on any CPU. However, processes can be instructed to run on a predetermined selection of CPUs, by changing the affinity of the process. Child processes inherit the CPU affinities of their parents.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html

@fujita , I refreshed my memory on the subject. Here is a snippet to run the test you proposed.

As the sample runs below demonstrate, the amount of real and CPU time
consumed when running the program will depend on intra-core caching
effects and whether the processes are using the same CPU.

@fujita
Member Author

fujita commented Jun 4, 2018

can you try running two processes, each executing an endless loop?

@fujita, that would work :-) However, those would NOT be parent and child processes like in a container. They would be two separate parents. In a container, you have a single parent (akin to init) and a bunch of child processes.

Please run a process that creates two child processes executing an endless loop, and waits forever. Then see the output of top on the host.

@greenpau
Contributor

greenpau commented Jun 5, 2018

Please run a process that creates two child processes executing an endless loop, and waits forever. Then see the output of top on the host.

@fujita , yes, two processes running in a container can be on separate CPUs. My mistake.

The test script:

import os

def child():
    # the child just reports its PID and exits
    print('\nA new child ', os.getpid())
    os._exit(0)

def parent():
    while True:
        newpid = os.fork()
        if newpid == 0:
            # fork() returns 0 in the child process...
            child()
        else:
            # ...and the child's PID in the parent
            pids = (os.getpid(), newpid)
            print("parent: %d, child: %d\n" % pids)
        reply = input("q for quit / c for new fork")
        if reply == 'c':
            continue
        else:
            break

parent()

Running the test:

# python test.py
parent: 29, child: 30

q for quit / c for new fork('\nA new child ', 30)

The two processes are 1847 (parent) and 1848 (child):

$ ps -ef | grep python
root      1847  1817  0 20:05 pts/3    00:00:00 python test.py
root      1848  1847  0 20:05 pts/3    00:00:00 [python] <defunct>

CPU Sets:

# cat /proc/1847/status  | grep Cpus
Cpus_allowed:	f
Cpus_allowed_list:	0-3

# cat /proc/1848/status  | grep Cpus
Cpus_allowed:	f
Cpus_allowed_list:	0-3

The CPU each process last ran on (field 39 of /proc/<pid>/stat):

#  cat /proc/1848/stat | cut -d' ' -f39
1

#  cat /proc/1847/stat | cut -d' ' -f39
0

@fujita
Member Author

fujita commented Jun 5, 2018

Good, now you understand that goroutines DON'T compete for a single CPU inside a container. A container is just namespaces and processes from the perspective of the Linux kernel, so it can use as many CPUs as you have.

For me, the reason some official exporters don't support reloading is that it's not very useful. So I'll go with the current design. If someone needs reloading, it's pretty easy to implement their own gobgp exporter by using gobgp/exporter as a package; they just need a main function.

@greenpau
Contributor

greenpau commented Jun 5, 2018

now you understand goroutines DON'T compete for a single CPU inside a container.

@fujita , overconfidence 😆 they DO compete when there is only 1 CPU to compete over 😄 They also compete when you have more goroutines than CPUs. I get that an application, over time, might use all CPUs. How do you know that goroutines are being load-balanced across all CPUs?

@fujita
Member Author

fujita commented Jun 5, 2018

There is no difference between the exporter design and the current design in that respect.

@greenpau
Contributor

greenpau commented Jun 5, 2018

FYI, check this article for CPU and memory profiling in Go.

@jeffbean
Contributor

Tally allows you to define and emit metrics in different ways. It's an option that lets you configure multiple ways of collecting/emitting metrics. Just a suggestion.

@greenpau
Contributor

@fujita , I started with an external exporter here: https://github.com/greenpau/gobgp_exporter
Would love to get your feedback on the code. There are probably some issues with concurrency; I'm still getting up to speed with Go. I spent my weekend on this 😄

@fujita
Member Author

fujita commented Aug 23, 2018

@greenpau nice, but why did you modify the pb.go file?

@greenpau
Contributor

greenpau commented Aug 24, 2018

why did you modify the pb.go file?

@fujita , I guess the primary reason is that I want to control the timeout on the client.

Collect() will be called on each scrape. If the daemon does not respond, the scrape holds the session and does not time out right away. This results in hanging scrapes. It is not long, but it should time out within a second or two.
In my last commit, I added logic so that the GoBGP polling and the scraping happen asynchronously. There is also some concurrency logic.

Additionally (somewhat related, but not quite), since the changes to the API (hiding it behind /internal), the only way to get at some functionality (e.g. things related to ptype, any, etc.) is by adding extra functions the way I did with the extended client. Not ideal. I would prefer to have that code upstream, but I don't know whether you want it there.

@greenpau
Contributor

I think I am close to getting some of the things mentioned here done.

@fujita
Member Author

fujita commented Aug 25, 2018

Why not use context.WithTimeout?

@greenpau
Contributor

@fujita, because I don’t know how context works. I will take time to read up on it.

@greenpau
Contributor

Why not use context.WithTimeout?

@fujita , I was reading up on Go's context package. Say I implement the context with a timeout, and the timeout gets triggered. How do I close the connection? (Ideally, in case of a timeout, I would prefer not to keep half-closed connections.) How do I assess whether the connection is still alive?

@nemith

nemith commented Oct 9, 2020

@greenpau you do a select on ctx.Done() at some point in your connection handling. This is automatic with the net package, I believe.
