Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrote monitoring guide #221

Merged
merged 2 commits into from
Feb 24, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 139 additions & 15 deletions documentation/guide-monitoring.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -3,25 +3,149 @@ toc::[]

= Monitoring

TODO
Monitoring is a very comprehensive topic. For devon4j we will focus on selected core topics which are most important when developing production-ready applications.
On a high level view we strongly suggest to separate the application to be monitoring from the "monitoring tool" itself.
The monitoring system covers aspects like

- Collect monitoring information
- Aggregate, process and visualizate data, e.g. in dashboards
- Provide alarms
- ...

== JMX
We use JMX to provide monitoring information via MBeans.
In distributed systems it is crucial that a monitoring systems provides a central overview over all applications in your application landscape. So it is not feasible to provide a monitoring system as part of your application. Your application is responsible for providing information to the monitoring system.

=== MBeans
[source,java]
TODO MBean example
== How to provide monitoring information

=== Using JMX
TODO: explain +jvisualvm+ and +VisualVM-MBeans+ Plugin
give some hints and links on heap analysis and samplers
=== Java Management Extensions (JMX)

=== Integration with Nagios
JMX is the official java monitoring solution. It is part of the JDK. Your application may provide monitoring information or receive monitoring related commands via MBeans. There is a huge amount of information about JMX available. A good starting point might be link:https://en.wikipedia.org/wiki/Java_Management_Extensions:[Wikipedia].
Traditionally JMX uses RMI for communication. In many environments HTTP is preffered, so be careful on deciding if JMX is the right solution.
Alternativly your application could provide a monitoring API via REST.

== GC-Logging
TODO: Explain JVM options, minimum overhead, always turn on, link to gcvisualizer
Explain G1 collector that scales for large heaps
=== REST APIs

== Loadbalancing
TODO explain status URL and integration with LB
Since REST APIs are very popular it is quite natural to provide monitoring information via REST. If your application already uses REST and there is no JMX/RMI based monitoring solution in place, it is often a better approach to provide monitoring APIs via REST than introducing a new protocol with JMX.

=== Logging

Some aspects of monitoring are covered by "logging". A typical case might be the requirement to "monitor" the application your REST APIs. For that it is perfectly fine to log the perfomance of all (or some) request and ship this information to a logging system like link:www.graylog.org[Graylog]. So please carefully read the link:guide-logging.asciidoc[logging guide]. To allow efficient processing of those logs you should use JSON based logs.

== Which monitoring informaton to provide?

It is not possible to provide a final list of which monitoring information is exactly required for your application. But we give you a general advice what type of information your application should provide at a minimum.

=== General information

It is often useful if your application provides basic information about itself. This should cover

* The name of the application
* The version (or buildno., or commit-id, ...) of the application

This is espicially useful in complex environments, e.g. to very that deployments went correctly.

=== Metrics

Metric means providing key figures about your applications like

* performance key figures
** request duration
** queue lengths
** duration of database queries
** ..
* business related information
** number of successful / unsuccesful requests
** size of result sets
** worth of shopping baskets
** ..
* Technical key figures
** JVM heap usage
** cache sizes
** (database) pool usage
** ...
* ...

Remember that processing of this data should be done mainly in the monitoring system. You might have noticed that there are different types of metrics those that represent current values (like JVM heap usage, queue length, ...), others base on (timed) events like (duration of requests). Handling of different types of metrics might be different.
For handling events you may:

* Write log statements for each (or a sample of) event. These logs must then be shipped to your monitoring systems.
* Send data for the event via an API of your monitoring systems
* Provide a REST API (or JMX MBeans) with pre-aggregated key figures, which is periodically polled by your monitoring system. This solution is a bit inferior since the aggregation is part of your application and might not fit to the desired visualization in your monitoring systems.

For actual values you may:

* Write them perodically to your log. These logs must then be shipped to your monitoring systems.
* Send them peridocally via an API of your monitoring systems
* Provide a REST API (or JMX MBeans) which is periodically polled by your monitoring system.

[health]
=== Health (Watchdog)

For monitoring a complex application landscaper it is crucial to have a exact overview if all applications are up and running. So your application should offer an API for the monitoring systems which allows to easily check if the application is alive. Often this alive information is polled by the monitoring systems with a kind of watchdog.
The health check should include checks if the application is working "correctly". For that we suggest to check if all required neighbour systems and infrastructure components are usable:

* Check if your database can be queried (with a dummy query)
* Check if you can reach your messaging system
* Check if you can reach all your neighbour system, e.g. by querying their info-endpoint

You should be very careful to not cascade those requests, e.g. your system should only test their direct neighbours. This test should not lead to additional tests in these systems.

The healthcheck should a return a simple OK/NOK result for use in dashboards, and may addtionally include detail results for each check.


== Implementation with Spring Boot Actuator

To implement a monitoring API for your systems we suggest to use link:https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html[Spring Boot Actuator]. Actuator offers APIs which provide monitoring information including metrics via HTTP and JMX. It also contains a framework to implement xref:health[health checks].
Please consult the original documentation for information about how to use it.
Basically to use it by adding the following dependency to the `pom.xml` of your application core:

[source,xml]
----
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
</dependencies>
----

There will be several link:https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-endpoints[endpoints] with monitoring information available out-of-the-box.
We *strongly* advice to check carefully which information is required in your context, normally this is `ìnfo`, `health` and `metrics`. Be careful not to expose any security related information via this mechanismen (e.g. by exposing those endpoints externally).

To make the info-endpoint useful you need to provide information to actuactor. A goodway to achive this is by using the provided link:https://docs.spring.io/spring-boot/docs/current/reference/html/howto.html#howto-automatic-expansion[maven module].

For first steps it might be useful to deactive security for the actuator endpoints (this is *just for testing*, *never release it!*). This can be accomblished by implementing the following class:

[source, java]
----
@Configuration
@EnableWebSecurity
@Profile(SpringProfileConstants.NOT_JUNIT)
public class WebSecurityConfig extends BaseWebSecurityConfig {

@Override
public void configure(WebSecurity web) throws Exception {

super.configure(web);
web.ignoring().requestMatchers(EndpointRequest.toAnyEndpoint());
}

}
----


=== Devonfw additions

Devonfw includes the following additions for Spring boot actuator:

* link:link:TODO[Kafka Health Check] in devon4j-kafka (WIP)

== Integration of monitoring information

=== Loadbalancers

To loadbalance HTTP requests the loadbalancers needs to know which instances of the desired application are available and functioning. Often loadbalancers support reacting on the HTTP status code of an HTTP request to the service. The loadbalancer will periodically poll the service to find out if is available or not.
To configure this you may use the healthcheck of the service to find out if the instance is functioning correctly or not.

=== Docker

Docker supports a link:https://docs.docker.com/engine/reference/builder/#healthcheck[healtcheck]. You may but a simple local curl to your application here to find out if the service is healthy or not. But be careful often unhealthy containers are automatically restarted. If you use the xref:health[health information] of your application this may lead to undesired effects. Since the health checks rellies on querying all neighbour systems and infrastucure components, applications often become unhealthy because of 3rd system has problems. Restarting the application itself will not heal the problem and be inexpedient. So generally it is better you query the info endpoint of your application to just check if the service itself is up and running.