Node.js Application Monitoring with Prometheus and Grafana

Monitoring Applications

Monitoring applications remains a critical part of the microservice world. The challenges associated with monitoring microservices are typically unique to your ecosystem, and failures can often be subtle - a small module's failure can go unnoticed for some time.

If we look into a more traditional monolithic application, installed as a single executable library or service - failures are typically more explicit as its modules aren't meant to run as standalone services.

During development, monitoring often isn't given much consideration, since there are typically more pressing matters to attend to. Once the application is deployed, though, and especially once traffic starts to increase, monitoring bottlenecks and the health of the system becomes necessary for a quick turnaround when something fails.

System failures are the best case for monitoring your applications. Complex distributed systems, as well as generic monoliths, might operate in a degraded state that impacts performance. These degraded states often lead to eventual failures. Monitoring the behavior of applications can alert operators to the degraded state before total failure occurs.

In this guide, we'll look into Prometheus and Grafana to monitor a Node.js application. We'll be using a Node.js library to expose useful metrics, which Prometheus scrapes and stores, and which Grafana then queries for data visualization.

Prometheus - A Product With a DevOps Mindset

Prometheus is an open-source monitoring system and a member of the Cloud Native Computing Foundation. It was originally created as an in-house monitoring solution for SoundCloud, but is now maintained by a developer and user community.

Features of Prometheus

Some of the key features of Prometheus are:

  • Prometheus collects metrics from a server or device by scraping its metrics endpoint over HTTP at a predefined time interval.
  • A multi-dimensional time-series data model. In simpler terms - it keeps track of time-series data for different features/metrics (dimensions).
  • It offers its own functional query language, known as PromQL (Prometheus Query Language). PromQL can be used for data selection and aggregation.
  • Pushgateway - a metrics cache for batch jobs, whose short lifespan typically makes them unreliable or impossible to scrape at regular intervals over HTTP.
  • A web UI to execute PromQL expressions and visualize the results in a table or graph over time.
  • It also provides alerting: when a defined rule matches, Prometheus sends an alert to an Alertmanager, which then sends notifications via email or other platforms.
  • The community maintains many third-party exporters and integrations that help in pulling metrics.
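
To make the multi-dimensional data model a little more concrete, here is a minimal sketch in plain JavaScript (not Prometheus code) of how a metric name plus a set of labels identifies one time series. The seriesKey helper, the metric name and the label values are all hypothetical and used only for illustration:

```javascript
// Illustrative only: in Prometheus, a time series is identified by its
// metric name together with its full set of label key/value pairs.
// `seriesKey` is a hypothetical helper, not part of any library.
function seriesKey(name, labels = {}) {
  const pairs = Object.keys(labels)
    .sort() // label order does not matter for identity
    .map((key) => `${key}="${labels[key]}"`);
  return pairs.length ? `${name}{${pairs.join(',')}}` : name;
}

// Same metric name, different label values: two separate series.
const slowSeries = seriesKey('http_requests_total', { method: 'GET', route: '/slow' });
const metricsSeries = seriesKey('http_requests_total', { route: '/metrics', method: 'GET' });

console.log(slowSeries);    // http_requests_total{method="GET",route="/slow"}
console.log(metricsSeries); // http_requests_total{method="GET",route="/metrics"}
```

Each distinct combination of label values becomes its own series, which is why labels with many possible values should be added with care.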

Architecture Diagram

Credit: Prometheus.io

Introducing prom-client

Prometheus runs on its own server. To bridge your own application to the Prometheus server, you'll need to use a metrics exporter, and expose the metrics so that Prometheus can pull them via HTTP.

We'll be relying on the prom-client library to export metrics from our application. It supports data exports required to produce histograms, summaries, gauges and counters.

Installing prom-client

The easiest way to install the prom-client module is via npm:

$ npm install prom-client

Exposing Default Prometheus Metrics with prom-client

The Prometheus team recommends a set of metrics to keep track of, which prom-client includes as its default metrics. You can enable them via collectDefaultMetrics().

These are, amongst other metrics, the virtual memory size, number of open file descriptors, total CPU time spent, etc:

const client = require('prom-client');

// Create a Registry to register the metrics
const register = new client.Registry();
client.collectDefaultMetrics({register});

We keep track of the metrics collected in a Registry - so when collecting the default metrics from the client, we pass in the Registry instance. You can also supply other customization options in the collectDefaultMetrics() call:

const client = require('prom-client');

// Create a Registry to register the metrics
const register = new client.Registry();

client.collectDefaultMetrics({
    app: 'node-application-monitoring-app',
    prefix: 'node_',
    timeout: 10000,
    gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
    register
});

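Here, we've added the name of our app, a prefix for the metrics for ease of navigation, a timeout, and gcDurationBuckets, which defines the bucket boundaries (in seconds) of the garbage collection duration histogram. Note that, as far as we're aware, recent versions of prom-client (12 and later) no longer use timeout, and app isn't among the documented options, so both are likely ignored there. To attach the app name to every metric, default labels set on the registry are the usual route.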

Collecting any other metrics follows the same pattern - we'll collect them via the client and then register them into the registry. More on this later.

Once the metrics are registered, we can return them from the registry on an endpoint that Prometheus will scrape. Let's create an HTTP server exposing a /metrics endpoint, which returns the result of register.metrics() when hit:

const client = require('prom-client');
const express = require('express');
const app = express();

// Create a registry and pull default metrics
// ...

app.get('/metrics', async (req, res) => {
    res.setHeader('Content-Type', register.contentType);
    res.send(await register.metrics());
});

app.listen(8080, () => console.log('Server is running on http://localhost:8080, metrics are exposed on http://localhost:8080/metrics'));

We've used Express.js to expose an endpoint at port 8080, which when hit with a GET request returns the metrics from the registry. Since metrics() returns a Promise, we've used the async/await syntax to retrieve the results.

If you're unfamiliar with Express.js - read our Guide to Building a REST API with Node.js and Express.

Let's go ahead and send a curl request to this endpoint:

$ curl localhost:8080/metrics
# HELP node_process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE node_process_cpu_user_seconds_total counter
node_process_cpu_user_seconds_total 0.019943

# HELP node_process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE node_process_cpu_system_seconds_total counter
node_process_cpu_system_seconds_total 0.006524

# HELP node_process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE node_process_cpu_seconds_total counter
node_process_cpu_seconds_total 0.026467

# HELP node_process_start_time_seconds Start time of the process since unix epoch in seconds.
...

The output consists of a number of useful metrics, each explained through comments. Though, coming back to the statement from the introduction - in a lot of cases, your monitoring needs might be ecosystem-specific. Thankfully, you have full flexibility to expose your own custom metrics as well.

Exposing Custom Metrics with prom-client

Although exposing default metrics is a good starting point for understanding the framework as well as your application - at some point, we'll need to define custom metrics to keep a close eye on a few request flows.

Let's create a metric that keeps track of the HTTP request durations. To simulate a heavy operation on a certain endpoint, we'll create a mock operation that takes 3-6 seconds to return a response. We'll visualize a Histogram of the response times and the distribution that they have. We'll also be taking the routes and their return codes into consideration.

To register and keep track of a metric such as this - we'll create a new Histogram and use the startTimer() method to start a timer. The return type of the startTimer() method is another function that you can invoke to observe (log) the recorded metrics and end the timer, passing in the labels you'd like to associate the histogram's metrics with.

You can manually observe() values, though, it's easier and cleaner to invoke the returned method.
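
To illustrate the pattern of a timer that returns its own "stop" function, here is a simplified stand-in written in plain JavaScript. It is a sketch of the idea only, not prom-client's implementation, and the TinyHistogram class is hypothetical:

```javascript
// A simplified stand-in for a histogram with a startTimer() method.
// This mimics the pattern described above; it is not prom-client's code.
class TinyHistogram {
  constructor() {
    this.observations = []; // each entry: { labels, seconds }
  }

  // Record a value directly.
  observe(labels, seconds) {
    this.observations.push({ labels, seconds });
  }

  // Start a timer and return a function that stops it and records the result.
  startTimer() {
    const start = process.hrtime.bigint(); // monotonic clock, in nanoseconds
    return (labels = {}) => {
      const elapsedNs = process.hrtime.bigint() - start;
      const seconds = Number(elapsedNs) / 1e9;
      this.observe(labels, seconds);
      return seconds;
    };
  }
}

const histogram = new TinyHistogram();
const end = histogram.startTimer();
// ... the work being measured would happen here ...
const seconds = end({ method: 'GET', route: '/slow', code: 200 });
console.log(`Recorded ${seconds} seconds`, histogram.observations);
```

Because the start time is captured in a closure, the caller only has to hold on to the returned function and invoke it once the labels (such as the status code) are known.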

Let's first go ahead and create a custom Histogram for this:

// Create a custom histogram metric
const httpRequestTimer = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10] // 0.1 to 10 seconds
});

// Register the histogram
register.registerMetric(httpRequestTimer);

Note: The buckets are the upper bounds (in seconds) against which request durations are counted, and they're cumulative. If a request takes less than 0.1s to execute, it's counted in the 0.1 bucket and in every larger bucket as well.
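
The cumulative counting can be sketched in a few lines of plain JavaScript. The cumulativeBucketCounts helper and the two sample durations are hypothetical, and the sketch only approximates what a Prometheus client does internally:

```javascript
// Upper bounds taken from the histogram defined above.
const buckets = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10];

// For each upper bound, count how many observations are <= that bound.
// The final "+Inf" bucket always contains every observation.
function cumulativeBucketCounts(observations, upperBounds) {
  const counts = upperBounds.map((bound) => ({
    le: String(bound),
    count: observations.filter((value) => value <= bound).length,
  }));
  counts.push({ le: '+Inf', count: observations.length });
  return counts;
}

// Hypothetical durations in seconds: one fast request, one slow request.
const fast = cumulativeBucketCounts([0.05], buckets);
const slow = cumulativeBucketCounts([5.2], buckets);

console.log(fast);
console.log(slow);
```

The fast request is counted in every bucket, while the 5.2 second request first appears in the bucket with an upper bound of 7 and in all larger ones.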

We'll refer to this instance every time we'd like to time some requests and log their distribution. Let's also define a delay handler, which delays the response and thus simulates a heavy operation:

// Mock slow endpoint, waiting between 3 and 6 seconds to return a response
const createDelayHandler = async (req, res) => {
  // Fail roughly 1% of the time to simulate an internal error
  if ((Math.floor(Math.random() * 100)) === 0) {
    throw new Error('Internal Error')
  }
  // Generate a whole number between 3 and 6 (inclusive), then convert to milliseconds
  const delaySeconds = Math.floor(Math.random() * (6 - 3 + 1)) + 3
  await new Promise(resolve => setTimeout(resolve, delaySeconds * 1000))
  res.end('Slow url accessed!');
};

Finally, we can define our /metrics and /slow endpoints, one of which uses the delay handler to delay its responses. Each of these will be timed with our httpRequestTimer instance and recorded:

// Prometheus metrics route
app.get('/metrics', async (req, res) => {
  // Start the HTTP request timer, saving a reference to the returned method
  const end = httpRequestTimer.startTimer();
  // Save reference to the path so we can record it when ending the timer
  const route = req.route.path;
    
  res.setHeader('Content-Type', register.contentType);
  res.send(await register.metrics());

  // End timer and add labels
  end({ route, code: res.statusCode, method: req.method });
});

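// Slow route, timed with the same histogram
app.get('/slow', async (req, res) => {
  const end = httpRequestTimer.startTimer();
  const route = req.route.path;
  try {
    await createDelayHandler(req, res);
  } catch (err) {
    // The mock handler occasionally throws; respond with a 500 so the
    // request doesn't hang and the failure is still timed
    res.status(500).end(err.message);
  }
  end({ route, code: res.statusCode, method: req.method });
});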

// Start the Express server and listen to a port
app.listen(8080, () => {
  console.log('Server is running on http://localhost:8080, metrics are exposed on http://localhost:8080/metrics')
});

Now, every time we send a request to the /slow endpoint, or the /metrics endpoint - the request duration is recorded and added to the prom-client registry. Incidentally, we also expose these metrics on the /metrics endpoint. Let's send a GET request to /slow and then observe the /metrics again:

$ curl localhost:8080/slow
Slow url accessed!

$ curl localhost:8080/metrics
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.3",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.5",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.7",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="1",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="3",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="5",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="7",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="10",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="+Inf",route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_sum{route="/metrics",code="200",method="GET"} 0.0042126
http_request_duration_seconds_count{route="/metrics",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="0.1",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.3",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.5",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="0.7",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="1",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="3",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="5",route="/slow",code="200",method="GET"} 0
http_request_duration_seconds_bucket{le="7",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="10",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_bucket{le="+Inf",route="/slow",code="200",method="GET"} 1
http_request_duration_seconds_sum{route="/slow",code="200",method="GET"} 5.0022148
http_request_duration_seconds_count{route="/slow",code="200",method="GET"} 1

The histogram has several buckets and keeps track of the route, code and method we've used to access an endpoint. It took 0.0042126 seconds to access /metrics, but a whopping 5.0022148 to access the /slow endpoint. Now, even though this is a really small log, keeping track of a single request each to only two endpoints - it's not very easy on the eyes. Humans are not great at digesting a huge amount of info like this - so it's best to refer to visualizations of this data instead.
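
If you do want to inspect this output programmatically, the sample lines are simple enough to parse. The following is a rough sketch in plain JavaScript; parseSampleLine is a hypothetical helper that handles only simple lines like the ones above (no escaped characters inside label values, no timestamps, no special values such as +Inf):

```javascript
// Parse one sample line of the text output shown above, for example:
//   http_request_duration_seconds_count{route="/slow",code="200",method="GET"} 1
// Returns null for comment lines (# HELP / # TYPE) and blank lines.
function parseSampleLine(line) {
  const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/);
  if (!match) return null;
  const [, name, rawLabels = '', rawValue] = match;
  const labels = {};
  for (const pair of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[pair[1]] = pair[2];
  }
  return { name, labels, value: Number(rawValue) };
}

// A few lines copied from the output above.
const text = [
  '# TYPE http_request_duration_seconds histogram',
  'http_request_duration_seconds_sum{route="/slow",code="200",method="GET"} 5.0022148',
  'http_request_duration_seconds_count{route="/slow",code="200",method="GET"} 1',
].join('\n');

const parsed = text.split('\n').map(parseSampleLine).filter(Boolean);
const sum = parsed.find((sample) => sample.name.endsWith('_sum')).value;
const count = parsed.find((sample) => sample.name.endsWith('_count')).value;

// The average duration is the sum of all observations divided by their count.
console.log(`Average duration for /slow: ${sum / count} seconds`);
```

Dividing the _sum by the _count is the same arithmetic a dashboard would typically perform to chart average request duration.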

To do this, we'll use Grafana to consume the metrics from the /metrics endpoint and visualize them. Grafana, much like Prometheus, runs on its own server, and an easy way to get both of them up alongside our Node.js application is through a Docker Compose Cluster.

Docker Compose Cluster Setup

Let's begin by creating a docker-compose.yml file, which we'll use to let Docker know how to start up and expose the respective ports for the Node.js server, the Prometheus server and the Grafana server. Since Prometheus and Grafana are available as Docker images, we can directly pull their images from Docker Hub:

version: '2.1'
networks:
  monitoring:
    driver: bridge
volumes:
    prometheus_data: {}
    grafana_data: {}
services:
  prometheus:
    image: prom/prometheus:v2.20.1
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    ports:
      - 9090:9090
    expose:
      - 9090
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:7.1.5
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    ports:
      - 3000:3000
    expose:
      - 3000
    networks:
      - monitoring
  node-application-monitoring-app:
    build:
      context: node-application-monitoring-app
    ports:
      - 8080:8080
    expose:
      - 8080
    networks:
      - monitoring

The Node application is being exposed on port 8080, Grafana is exposed on 3000 and Prometheus is exposed on 9090. Alternatively, you can clone our GitHub repository:

$ git clone https://github.com/StackAbuse/node-prometheus-grafana.git

You may also refer to the repository if you're unsure which configuration files are supposed to be situated in which directories.

All the Docker containers can be started at once using the docker-compose command. As a prerequisite, whether you want to host this cluster on a Windows, Mac or Linux machine, Docker Engine and Docker Compose need to be installed.

Note: If you'd like to read more about Docker and Docker Compose, you can read our guide to Docker: A High Level Introduction or How Docker can Make your Life Easier as a Developer.

Once installed, you can run the following command in the project root directory:

$ docker-compose up -d

After executing this command, three applications will be running in the background - a Node.js server, Prometheus Web UI and server as well as Grafana UI.

Configuring Prometheus to Scrape Metrics

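Prometheus scrapes the relevant endpoint at given time intervals. To tell it when to scrape, as well as where, we'll need to create a configuration file - prometheus.yml - in the ./prometheus directory, which the Compose file mounts into the container at /etc/prometheus: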

global:
  scrape_interval: 5s
scrape_configs:
  - job_name: "node-application-monitoring-app"
    static_configs:
      - targets: ["docker.host:8080"]

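Note: docker.host needs to be replaced with the actual hostname of the Node.js server. Within a Docker Compose network, a container can be reached by its service name, which in the docker-compose YAML file above is node-application-monitoring-app.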

Here, we've scheduled it to scrape the metrics every 5 seconds. Prometheus' built-in default for scrape_interval is 1 minute, so we've made it considerably more frequent. The job name is for our own convenience and identifies the app we're keeping tabs on. Finally, since we haven't set a metrics_path, Prometheus scrapes the default /metrics path of the target.

Configure Data Source for Grafana

While we're configuring Prometheus - let's also create a data source for Grafana. Grafana queries data from a data source and visualizes it. Of course, these data sources need to conform to some protocols and standards.

The datasources.yml file houses the configuration about all of Grafana's data sources. We just have one - our Prometheus server, exposed on port 9090:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    orgId: 1
    url: http://docker.prometheus.host:9090
    basicAuth: false
    isDefault: true
    editable: true

Note: docker.prometheus.host is to be replaced with the actual Prometheus hostname. Within the Docker Compose network defined above, that is the service name, prometheus.

Simulate Production-Grade Traffic

Finally, it'll be easiest to view the results if we generate some synthetic traffic on the application. You can simply reload the pages multiple times, or send many requests, but since this would be time-consuming to do by hand - you can use any of the various tools such as ApacheBench, ali, API Bench, etc.
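
If you'd rather not install a separate tool, a small script using Node's built-in http module can generate traffic as well. This is a minimal sketch rather than a proper load-testing tool; the get and generateTraffic helpers are hypothetical, and the example URL assumes the app from this guide is listening on port 8080:

```javascript
const http = require('http');

// Send a single GET request and resolve with the HTTP status code.
function get(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      res.resume(); // discard the body so the socket is released
      res.on('end', () => resolve(res.statusCode));
    }).on('error', reject);
  });
}

// Send `total` requests in batches of `concurrency`, one batch after another,
// and count the results by status code. Failed requests are counted as "error".
// `send` can be swapped out, which is handy for trying the function offline.
async function generateTraffic(url, total, concurrency, send = get) {
  const counts = {};
  for (let sent = 0; sent < total; sent += concurrency) {
    const batchSize = Math.min(concurrency, total - sent);
    const batch = Array.from({ length: batchSize }, () =>
      send(url).catch(() => 'error')
    );
    for (const status of await Promise.all(batch)) {
      counts[status] = (counts[status] || 0) + 1;
    }
  }
  return counts;
}

// Example usage (assumes the server is running locally):
// generateTraffic('http://localhost:8080/slow', 50, 10).then(console.log);
```

Since /slow takes several seconds per request, a modest concurrency keeps the script's total running time reasonable while still producing enough data points for the dashboards.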

As this traffic arrives, our Node.js app uses prom-client to record the request metrics and expose them on /metrics, where the Prometheus server scrapes them. All that's left is to use Grafana to visualize them.

Grafana - An Easy-to-Setup Dashboard

Grafana is an analytics platform used to monitor and visualize all kinds of metrics. It allows you to add custom queries for its data sources, visualize, alert on and understand your metrics no matter where they are stored. You can create, explore, and share dashboards with your team and foster a data-driven culture.

Grafana can query data from various data sources, and Prometheus is just one of them.

Grafana Monitoring Dashboards

A few dashboards are bundled out-of-the-box to provide an overview of what's going on. The NodeJS Application Dashboard collects the default metrics and visualizes them:

The High Level Application Metrics dashboard shows high-level metrics for the Node.js Application using default metrics such as the error rate, CPU usage, memory usage, etc:

The Request Flow Dashboard shows request flow metrics using the APIs that we have created in the Node.js application. Namely, here's where the Histogram we've created gets to shine:

Memory Usage Chart

Instead of the out-of-the-box dashboards, you can also create aggregations to calculate different metrics. For example, we can chart the average external memory held by the Node.js process, converted from bytes to kibibytes, via:

avg(node_nodejs_external_memory_bytes / 1024) by (route)

Request Per Second Histogram Chart

Or, we can plot a graph displaying requests per second (averaged over a 2-minute window), using the data from our own histogram:

sum(rate(http_request_duration_seconds_count[2m]))
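
To get a feel for what rate() computes, here is a simplified sketch in plain JavaScript. It is not how Prometheus implements it: the real function also extrapolates to the edges of the time window, which this sketch skips. The simpleRate helper and the sample values are hypothetical:

```javascript
// Simplified per-second rate over a window of counter samples.
// Real PromQL rate() also extrapolates to the window boundaries; this sketch
// only accounts for counter resets and divides the increase by elapsed time.
function simpleRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the counter was reset (e.g. the process restarted),
    // so the new value counts as an increase from zero.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const elapsedSeconds = samples[samples.length - 1].time - samples[0].time;
  return elapsedSeconds > 0 ? increase / elapsedSeconds : 0;
}

// Hypothetical samples of http_request_duration_seconds_count, one every 5 seconds.
const samples = [
  { time: 0, value: 10 },
  { time: 5, value: 14 },
  { time: 10, value: 20 },
];

console.log(simpleRate(samples)); // (4 + 6) / 10 = 1 request per second
```

The sum() around rate() in the query then adds up the per-series rates, so all routes, methods and status codes are combined into a single requests-per-second line.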

Conclusion

Prometheus and Grafana are powerful open-source tools for application monitoring. With an active community and many client libraries and integrations, a few lines of code provide a neat and clean insight into the system.

Last Updated: June 29th, 2021
Arpendu Kumar Garai, Author

Full-Stack developer with deep knowledge in Java, Microservices, Cloud Computing, Big Data, MERN, Javascript, Golang, and its relative frameworks. Besides coding and programming, I am a big foodie, love cooking, and love to travel.

    © 2013-2021 Stack Abuse. All rights reserved.