Monitoring Applications
Monitoring applications remains a critical part of the microservice world. The challenges associated with monitoring microservices are typically unique to your ecosystem, and failures can often be subtle - a small module's failure can go unnoticed for some time.
If we look at a more traditional monolithic application, installed as a single executable, library, or service - failures are typically more explicit, as its modules aren't meant to run as standalone services.
During development, monitoring often isn't given much consideration, since there are typically more pressing matters to attend to. Once deployed, though - especially as traffic to the application starts to increase - monitoring bottlenecks and the health of the system becomes necessary for a quick turnaround when something fails.
Outright system failures aren't the only reason to monitor your applications. Complex distributed systems, as well as plain monoliths, might operate in a degraded state that impacts performance, and these degraded states often lead to eventual failures. Monitoring the behavior of applications can alert operators to a degraded state before total failure occurs.
In this guide, we'll look into Prometheus and Grafana to monitor a Node.js application. We'll use a Node.js library to expose useful metrics, which Prometheus scrapes and Grafana then visualizes.
Prometheus - A Product With a DevOps Mindset
Prometheus is an open-source monitoring system and a member of the Cloud Native Computing Foundation. It was originally created as an in-house monitoring solution for SoundCloud, but is now maintained by a developer and user community.
Features of Prometheus
Some of the key features of Prometheus are:
- Prometheus collects the metrics from the server or device by pulling their metric endpoints over HTTP at a predefined time interval.
- A multi-dimensional time-series data model. In simpler terms - it keeps track of time-series data for different features/metrics (dimensions).
- It offers its own functional query language, known as PromQL (Prometheus Query Language). PromQL can be used for data selection and aggregation (a quick example follows after this list).
- Pushgateway - a metrics cache, developed for saving batch jobs' metrics whose short lifespan typically makes them unreliable or impossible to scrape at regular intervals over HTTP.
- A web UI to execute PromQL expressions and visualize the results in a table or graph over time.
- It also provides alerting features to send alerts to an Alertmanager on matching a defined rule and send notifications via email or other platforms.
- The community maintains many third-party exporters and integrations that help with pulling metrics from various systems.
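To get a feeling for PromQL, mentioned above, here's a small illustrative query - the metric name is hypothetical and only serves to show the syntax:

# Average per-second rate of HTTP requests over the last 5 minutes, grouped by route
sum(rate(http_requests_total[5m])) by (route)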
Architecture Diagram
[Prometheus architecture diagram - credit: Prometheus.io]
Introducing prom-client
Prometheus runs on its own server. To bridge your own application to the Prometheus server, you'll need to use a metrics exporter, and expose the metrics so that Prometheus can pull them via HTTP.
We'll be relying on the prom-client library to export metrics from our application. It supports data exports required to produce histograms, summaries, gauges and counters.
Installing prom-client
The easiest way to install the prom-client module is via npm:
$ npm install prom-client
Exposing Default Prometheus Metrics with prom-client
The Prometheus team has a set of recommended metrics to keep track of, which prom-client consequently includes as its default metrics, and which can be obtained from the client via collectDefaultMetrics().
These include, amongst others, the virtual memory size, number of open file descriptors, total CPU time spent, etc.:
const client = require('prom-client');
// Create a Registry to register the metrics
const register = new client.Registry();
client.collectDefaultMetrics({register});
We keep track of the collected metrics in a Registry - so when collecting the default metrics from the client, we pass in the Registry instance. You can also supply other customization options in the collectDefaultMetrics() call:
const client = require('prom-client');
// Create a Registry to register the metrics
const register = new client.Registry();
client.collectDefaultMetrics({
    app: 'node-application-monitoring-app',
    prefix: 'node_',
    timeout: 10000,
    gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
    register
});
Here, we've added the name of our app, a prefix for the metrics for easier navigation, a timeout parameter to specify when requests time out, as well as gcDurationBuckets, which defines how big the buckets should be for the garbage collection histogram.
Collecting any other metric follows the same pattern - we create it via the client and then register it into the registry. More on this later.
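As a quick, hedged sketch of that pattern - the counter name and label below are made up purely for illustration:

const client = require('prom-client');
const register = new client.Registry();

// A hypothetical counter, created via the client...
const pageViews = new client.Counter({
    name: 'page_views_total',
    help: 'Total number of page views',
    labelNames: ['page']
});

// ...and then registered into the registry
register.registerMetric(pageViews);

// Incremented wherever it makes sense in the application code
pageViews.inc({ page: '/home' });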
Once the metrics are situated in the register, we can return them from the register on an endpoint that Prometheus will scrape. Let's create an HTTP server, exposing a /metrics endpoint, which returns the metrics() from the register when hit:
const client = require('prom-client');
const express = require('express');
const app = express();
// Create a registry and pull default metrics
// ...
app.get('/metrics', async (req, res) => {
    res.setHeader('Content-Type', register.contentType);
    res.send(await register.metrics());
});
app.listen(8080, () => console.log('Server is running on http://localhost:8080, metrics are exposed on http://localhost:8080/metrics'));
We've used Express.js to expose an endpoint at port 8080, which, when hit with a GET request, returns the metrics from the registry. Since metrics() returns a Promise, we've used the async/await syntax to retrieve the results.
If you're unfamiliar with Express.js - read our Guide to Building a REST API with Node.js and Express.
Let's go ahead and send a curl request to this endpoint:
$ curl -X GET localhost:8080/metrics
# HELP node_process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE node_process_cpu_user_seconds_total counter
node_process_cpu_user_seconds_total 0.019943
# HELP node_process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE node_process_cpu_system_seconds_total counter
node_process_cpu_system_seconds_total 0.006524
# HELP node_process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE node_process_cpu_seconds_total counter
node_process_cpu_seconds_total 0.026467
# HELP node_process_start_time_seconds Start time of the process since unix epoch in seconds.
...
The output consists of a number of useful metrics, each explained through comments. Though, coming back to the statement from the introduction - in a lot of cases, your monitoring needs might be ecosystem-specific. Thankfully, you have full flexibility to expose your own custom metrics as well.
Exposing Custom Metrics with prom-client
Although exposing default metrics is a good starting point for understanding the framework as well as your application - at some point, we'll need to define custom metrics to keep a hawk's eye on a few request flows.
Let's create a metric that keeps track of the HTTP request durations. To simulate a heavy operation on a certain endpoint, we'll create a mock operation that takes 3-6 seconds to return a response. We'll visualize a Histogram of the response times and the distribution that they have. We'll also be taking the routes and their return codes into consideration.
To register and keep track of a metric such as this - we'll create a new Histogram and use the startTimer() method to start a timer. The return type of the startTimer() method is another function that you can invoke to observe (log) the recorded metrics and end the timer, passing in the labels you'd like to associate the histogram's metrics with.
You can manually observe() values, though it's easier and cleaner to invoke the returned method.
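For illustration, here's a minimal sketch of both approaches - the metric name is made up, and the "work" is just a placeholder:

const client = require('prom-client');

// Stand-in histogram, similar to the one we'll define below
const demoTimer = new client.Histogram({
    name: 'demo_duration_seconds',
    help: 'Demonstrates the two ways of recording a duration'
});

// Option 1: manually observe() a value (in seconds)
const start = Date.now();
// ... do some work ...
demoTimer.observe((Date.now() - start) / 1000);

// Option 2: let the function returned by startTimer() do the measuring
const end = demoTimer.startTimer();
// ... do some work ...
end();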
Let's first go ahead and create a custom Histogram for this:
// Create a custom histogram metric
const httpRequestTimer = new client.Histogram({
    name: 'http_request_duration_seconds',
    help: 'Duration of HTTP requests in seconds',
    labelNames: ['method', 'route', 'code'],
    buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10] // 0.1 to 10 seconds
});
// Register the histogram
register.registerMetric(httpRequestTimer);
Note: The buckets are simply the labels for our histogram and refer to the length of requests. They're cumulative upper bounds - if a request takes less than 0.1s to execute, it's counted in the 0.1 bucket, as well as in every larger one.
We'll refer to this instance every time we'd like to time some requests and log their distribution. Let's also define a delay handler, which delays the response and thus simulates a heavy operation:
// Mock slow endpoint, waiting between 3 and 6 seconds to return a response
const createDelayHandler = async (req, res) => {
    if ((Math.floor(Math.random() * 100)) === 0) {
        throw new Error('Internal Error');
    }

    // Generate a delay between 3 and 6 seconds, then convert it to milliseconds
    const delaySeconds = Math.random() * (6 - 3) + 3;
    await new Promise(resolve => setTimeout(resolve, delaySeconds * 1000));

    res.end('Slow url accessed!');
};
Finally, we can define our /metrics and /slow endpoints, one of which uses the delay handler to delay the responses. Each of these will be timed with our httpRequestTimer instance, and logged:
// Prometheus metrics route
app.get('/metrics', async (req, res) => {
    // Start the HTTP request timer, saving a reference to the returned method
    const end = httpRequestTimer.startTimer();
    // Save reference to the path so we can record it when ending the timer
    const route = req.route.path;

    res.setHeader('Content-Type', register.contentType);
    res.send(await register.metrics());

    // End timer and add labels
    end({ route, code: res.statusCode, method: req.method });
});

// Mock slow endpoint
app.get('/slow', async (req, res) => {
    const end = httpRequestTimer.startTimer();
    const route = req.route.path;
    await createDelayHandler(req, res);
    end({ route, code: res.statusCode, method: req.method });
});

// Start the Express server and listen to a port
app.listen(8080, () => {
    console.log('Server is running on http://localhost:8080, metrics are exposed on http://localhost:8080/metrics');
});
Now, every time we send a request to the /slow endpoint, or the /metrics endpoint - the request duration is logged and added to Prometheus' registry. Incidentally, we also expose these metrics on the /metrics endpoint. Let's send a GET request to /slow and then observe the /metrics again:
$ curl -X GET localhost:8080/slow
Slow url accessed!
$ curl -X GET localhost:8080/metrics
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.3",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.5",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.7",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="1",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="3",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="5",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="7",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="10",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="+Inf",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_sum{route="/metrics",code="200",method="GET"} 0.0042126
http_request_duration_seconds_count{route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.1",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.3",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.5",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.7",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="1",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="3",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="5",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="7",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="10",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="+Inf",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_sum{route="/slow",code="200",method="GET"} 5.0022148
http_request_duration_seconds_count{route="/slow",code="200",method="GET"} 1
The histogram has several buckets and keeps track of the route, code and method we've used to access an endpoint. It took 0.0042126 seconds to access /metrics, but a whopping 5.0022148 seconds to access the /slow endpoint. Now, even though this is a really small log, keeping track of just a single request to each of only two endpoints - it's not very easy on the eyes. Humans aren't great at digesting huge amounts of information like this - so it's best to refer to visualizations of this data instead.
To do this, we'll use Grafana to visualize the metrics that Prometheus scrapes from the /metrics endpoint. Grafana, much like Prometheus, runs on its own server, and an easy way to get both of them up alongside our Node.js application is through a Docker Compose cluster.
Docker Compose Cluster Setup
Let's begin by creating a docker-compose.yml file which we'll use to let Docker know how to start up and expose the respective ports for the Node.js server, the Prometheus server and the Grafana server. Since Prometheus and Grafana are available as Docker images, we can directly pull their images from Docker Hub:
version: '2.1'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:
  prometheus:
    image: prom/prometheus:v2.20.1
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    ports:
      - 9090:9090
    expose:
      - 9090
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:7.1.5
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    ports:
      - 3000:3000
    expose:
      - 3000
    networks:
      - monitoring
  node-application-monitoring-app:
    build:
      context: node-application-monitoring-app
    ports:
      - 8080:8080
    expose:
      - 8080
    networks:
      - monitoring
The Node application is being exposed on port 8080, Grafana is exposed on 3000 and Prometheus is exposed on 9090. Alternatively, you can clone our GitHub repository:
$ git clone https://github.com/StackAbuse/node-prometheus-grafana.git
You may also refer to the repository if you're unsure which configuration files are supposed to be situated in which directories.
All the Docker containers can be started at once using the docker-compose command. As a prerequisite, whether you want to host this cluster on a Windows, Mac or Linux machine, Docker Engine and Docker Compose need to be installed.
If you'd like to read more about Docker and Docker Compose, you can read our guide to Docker: A High Level Introduction or How Docker can Make your Life Easier as a Developer.
Once installed, you can run the following command in the project root directory:
$ docker-compose up -d
After executing this command, three applications will be running in the background - a Node.js server, Prometheus Web UI and server as well as Grafana UI.
Configuring Prometheus to Scrape Metrics
Prometheus scrapes the relevant endpoint at given time intervals. To know when to scrape, as well as where, we'll need to create a configuration file - prometheus.yml:
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: "node-application-monitoring-app"
    static_configs:
      - targets: ["docker.host:8080"]
Note: docker.host needs to be replaced with the actual hostname of the Node.js server configured in the docker-compose YAML file - with the Compose file above, that's the service name node-application-monitoring-app.
Here, we've scheduled Prometheus to scrape the metrics every 5 seconds. The commonly used setting is 15 seconds, so we've made it a bit more frequent. The job name is for our own convenience and to identify the app we're keeping tabs on. Finally, the /metrics endpoint of the target is what Prometheus will be peeking at.
Configure Data Source for Grafana
While we're configuring Prometheus - let's also create a data source for Grafana. As mentioned before, and as will be further elaborated - it accepts data from a data source and visualizes it. Of course, these data sources need to conform to some protocols and standards.
The datasources.yml file houses the configuration for all of Grafana's data sources. We just have one - our Prometheus server, exposed on port 9090:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    orgId: 1
    url: http://docker.prometheus.host:9090
    basicAuth: false
    isDefault: true
    editable: true
Note: docker.prometheus.host is to be replaced with the actual Prometheus hostname configured in the docker-compose YAML file - with the Compose file above, that's the service name prometheus.
Simulate Production-Grade Traffic
Finally, it'll be easiest to view the results if we generate some synthetic traffic on the application. You can simply reload the pages multiple times, or send many requests, but since this would be time-consuming to do by hand - you can use any of the various tools such as ApacheBench, Ali, API Bench, etc.
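For instance, assuming ApacheBench (ab) is installed, a short burst of traffic against the slow endpoint could look like this:

$ ab -n 100 -c 10 http://localhost:8080/slow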
Our Node.js app will use prom-client to log these, and the Prometheus server will scrape them from the /metrics endpoint. All that's left is to use Grafana to visualize them.
Grafana - An Easy-to-Setup Dashboard
Grafana is an analytics platform used to monitor and visualize all kinds of metrics. It allows you to add custom queries for its data sources, visualize, alert on and understand your metrics no matter where they are stored. You can create, explore, and share dashboards with your team and foster a data-driven culture.
Grafana collects data from various data sources and Prometheus is just one of them.
Grafana Monitoring Dashboards
A few dashboards are bundled out-of-the-box to provide an overview of what's going on. The NodeJS Application Dashboard collects the default metrics and visualizes them:
The High Level Application Metrics dashboard shows high-level metrics for the Node.js Application using default metrics such as the error rate, CPU usage, memory usage, etc:
The Request Flow Dashboard shows request flow metrics using the APIs that we have created in the Node.js application. Namely, here's where the Histogram we've created gets to shine:
Memory Usage Chart
Apart from the out-of-the-box dashboards, you can also create aggregations to calculate different metrics. For example, we can calculate the memory usage (in KB) over time via:
avg(node_nodejs_external_memory_bytes / 1024) by (route)
Request Per Second Histogram Chart
Or, we can plot a graph displaying requests per second (averaged over a 2-minute window), using the data from our own data collector:
sum(rate(http_request_duration_seconds_count[2m]))
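As one more example of what the custom histogram enables - this isn't part of the bundled dashboards, just a common PromQL pattern - you could chart an approximate 95th-percentile request duration per route:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))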
Conclusion
Prometheus and Grafana are powerful open-source tools for application monitoring. With an active community and many client libraries and integrations, a few lines of code bring up a pretty neat and clean insight into the system.