Introduction to the ELK Stack

The products we build often rely on multiple web servers and/or multiple database servers. In such cases, we often don’t have centralized tools for analyzing and storing logs. Under such circumstances, identifying different types of events and correlating them with other types of events is an almost impossible mission.

A single exception, somewhere in the middle of the system, can be disastrous for both the end user and the development team. A user may end up looking at a blank page after submitting payment for your service, or heavy packet loss can occur within your network, stretching a simple 10-minute job into a 10-hour headache.

The logical thing would be to know exactly what happened and how to prevent it from happening again. However, this means pulling logs from individual machines, aggregating them to take care of the clock skew and joining them through some kind of transaction id before you can even start asking questions.

And all of this assumes that the logs you need to check fit in a machine's working memory and that cat + grep + xargs is sufficient for the analysis.

Nowadays, most IT infrastructures have moved to public clouds such as Amazon Web Services, Microsoft Azure, Digital Ocean, or Google Cloud.

Public cloud security tools and logging platforms both became necessary companions to these services. Even though they are separate tools, they were built to work together.

The ELK Stack was created for exactly such situations.

What is the ELK Stack?

ELK stands for:

  • ElasticSearch - Used for deep search and data analytics. It's an open source distributed NoSQL database, built in Java and based on Apache Lucene. Lucene takes care of on-disk storage, indexing, and document scanning, while ElasticSearch handles document updates, APIs, and the distribution of documents between ElasticSearch instances in the same cluster.
  • Logstash - Used for centralized logging, log enrichment, and parsing. It's an ETL (Extract, Transform, Load) tool that transforms logs and stores them in ElasticSearch.
  • Kibana - Used for powerful and beautiful data visualizations. It's a web-based tool for visualizing and analyzing data previously stored in ElasticSearch.

These three tools are necessary for analysis and visualization of events in the system.

Log Analysis

Performance isolation is hard to achieve, particularly when systems are heavily loaded. Every relevant event within your system should be logged. It is better to overdo it with logging than to lack information about a potentially problematic event.

A good "target" for logging at the application level is everything related to any user interaction with the system, be it a user authenticating to the system, a user request to some URL, an email sent to the user, etc. Since you log a wide variety of information, this log is not structured at all, but it should carry some basic information so that you can query it more easily.

The best advice regarding logging in distributed systems is to "stamp" every relevant event at its source and to propagate that stamp through the distributed system, whether the event touches more parts of the system or not. In the case of a request for a web page, for example, the load balancer or web server would apply the "stamp". The stamped event is then passed along until the end of its life. These "stamps" are often implemented as UUIDs.

This ensures that the sequence of the event is not lost and that you can group and tie related events together. For example, you can later link that page request with the database query needed to render the page.
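The "stamping" idea can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the helper names (new_transaction_id, log_event) and the extra fields (url, table) are hypothetical, and the "transaction-id" key is borrowed from the sample application log shown later in this article.

```python
import json
import uuid
from datetime import datetime, timezone


def new_transaction_id() -> str:
    """Create the "stamp" once, at the edge (e.g. load balancer or web server)."""
    return str(uuid.uuid4())


def log_event(action: str, transaction_id: str, **fields) -> str:
    """Serialize one event as a single JSON line that carries the shared stamp."""
    event = {
        "action": action,
        "time": datetime.now(timezone.utc).isoformat(),
        "transaction-id": transaction_id,
        **fields,  # hypothetical extra fields, e.g. url or table
    }
    return json.dumps(event)


# One page request touches two components; both log the same stamp.
txn = new_transaction_id()
web_line = log_event("page request", txn, url="/checkout")
db_line = log_event("db query", txn, table="orders")
print(web_line)
print(db_line)
```

Because both lines carry the same transaction id, they can be joined later no matter which machine wrote them.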

Logstash

When the event is logged in the log file, Logstash comes into play. Logstash ensures that your entries are transformed into one of the supported target formats that will later be submitted to the ElasticSearch server.

It collects data from different sources and then streams the data in the form of a data processing pipeline to the ElasticSearch instance.

It's safe to think of ElasticSearch as a database and Logstash as the streaming component that pushes logs or files into it.

Logstash processing takes place in three stages:

  • Input - The input, besides a file, can also be syslog, Redis, or Lumberjack (now logstash-forwarder). Redis is often used as a pub/sub broker in a centralized Logstash infrastructure, where Logstash producers send their messages, and one of the instances processes them further.
  • Transformation and Filtering - This stage follows a set of previously defined rules. Besides, for example, a date format transformation, information about the IP address (its city/country) can be added as well. The free MaxMind database is used for this purpose.
  • Output - The output can be transformed through plugins to almost any format. In our case, it will be ElasticSearch.
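To make the three stages concrete, here is a toy Python sketch of an input → filter → output pipeline. It is not how Logstash is implemented; the function names are hypothetical, and it assumes the "time" field uses a day:month:year layout such as "12:12:2012".

```python
import json
from datetime import datetime, timezone


def read_input(lines):
    """Input stage: parse each non-empty raw line into an event (a dict)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def transform(events):
    """Filter stage: normalize 'time' into an ISO 8601 '@timestamp' field."""
    for event in events:
        # Assumed layout: day:month:year, e.g. "12:12:2012".
        parsed = datetime.strptime(event["time"], "%d:%m:%Y")
        event["@timestamp"] = parsed.replace(tzinfo=timezone.utc).isoformat()
        yield event


def write_output(events):
    """Output stage: collect events; a real pipeline would ship them to a store."""
    return list(events)


raw = ['{"action": "action log", "user": "John", "time": "12:12:2012"}']
processed = write_output(transform(read_input(raw)))
print(processed[0]["@timestamp"])  # 2012-12-12T00:00:00+00:00
```

Each stage is a generator, so events flow through one at a time, much like a streaming pipeline.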

Let's have a look at a Logstash configuration file and an example of an application log:

logstash.conf:

input {
    file {
        path => ["/var/log/myapp/*.log"]
        # Assumes one JSON object per line, so that fields such as
        # 'time' and 'ip' are available to the filters below.
        codec => "json"
    }
}

filter {
    # Parse a field (e.g. 'time') as the event timestamp and
    # store it in @timestamp.
    # That value will later be used by Kibana.
    # The pattern below is an assumption; adjust it to your log format.
    date {
        match => [ "time", "dd:MM:yyyy" ]
    }

    # Adds geolocation data based on the IP address.
    geoip {
        source => "ip"
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
    }
    stdout { codec => rubydebug }
}

Application log:

{
    "action": "action log",
    "user": "John",
    "time": "12:12:2012",
    "ip": "192.168....",
    "transaction-id": "r5e1244-32432-1465-q346-6ahsms57081x4"
}

ElasticSearch

The moment Logstash finishes its job and forwards the logs, ElasticSearch can already process the data.

With the simple curl command curl -XGET 'http://192.168.(host ip):9200/_search' you can see "documents" in the database.

192.168.(host ip) is, in this case, the address of boot2docker VM, and 9200 is the default exposed port.

Since the result is in JSON format, your results are ready for further processing. curl -XGET 'http://192.168.(host ip):9200/_search?pretty' returns a more readable, pretty-printed version of the documents for quick inspection. Depending on the project, it's common practice to log the request/response payload, correlation ID, etc., so that you can track an error through Kibana's dashboard later.
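The same request can be issued from a script. The sketch below uses only Python's standard library; the helper names are hypothetical and "localhost" is a placeholder for your own host address. The search function needs a reachable ElasticSearch instance, so only the URL builder is exercised here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(host: str, query: str = "", port: int = 9200,
                     pretty: bool = True) -> str:
    """Build a _search URL, optionally with a q= query and the pretty flag."""
    parts = []
    if pretty:
        parts.append("pretty")
    if query:
        parts.append(urlencode({"q": query}))
    url = f"http://{host}:{port}/_search"
    return f"{url}?{'&'.join(parts)}" if parts else url


def search(host: str, query: str = "") -> dict:
    """Run the search and decode the JSON body (requires a running instance)."""
    with urlopen(build_search_url(host, query, pretty=False)) as response:
        return json.load(response)


# "localhost" is a placeholder; substitute the address of your own host.
print(build_search_url("localhost", "message"))
```

Decoding the body with json.load gives you a plain dictionary, which is what makes the results "ready for further processing".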

Kibana

Kibana is basically just a static UI (HTML + CSS + JS) client that sits on top of an ElasticSearch instance and displays its data the way you want, as reports and analytics. Besides letting you easily query ElasticSearch indexes, its main advantage is that any query, no matter how complicated, can be arranged in a clear "dashboard" that you can save and share with others.

Kibana comes with a very powerful set of visualization tools. For example, you can see how often a similar event is repeated over time, or aggregate events according to various criteria (how many requests came from a particular IP address in the last hour, whether the number of errors on a particular server has increased, or whether requests for a particular page are on the rise). It is also a great tool for detecting anomalies caused by changes in your system.
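Behind such a visualization there is typically an ElasticSearch aggregation query. As a rough sketch, a "requests per IP in the last hour" question could be expressed as the request body built below. The function name is hypothetical, and the field name "ip.keyword" is an assumption that depends on how your index is mapped.

```python
import json


def requests_per_ip_query(window: str = "1h", top_n: int = 10) -> dict:
    """Build a search body that counts events per IP within a time window."""
    return {
        "size": 0,  # only the aggregation is needed, not the documents
        "query": {"range": {"@timestamp": {"gte": f"now-{window}"}}},
        "aggs": {
            "requests_per_ip": {
                # "ip.keyword" is an assumed field name; check your mapping.
                "terms": {"field": "ip.keyword", "size": top_n}
            }
        },
    }


print(json.dumps(requests_per_ip_query(), indent=2))
```

A body like this would be sent to the _search endpoint; Kibana builds comparable queries for you when you configure a chart.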

Trying out the ELK Stack

To be able to try the ELK stack and try some of the commands we are about to cover, you'll need to install ELK and Docker, or boot2docker, locally.

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.

Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. This article will not cover Docker and boot2docker in detail - the official resources are more than excellent and it's very easy to set up this combination.

For the purpose of this test, we will use the sebp/elk Docker image:

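$ docker pull sebp/elk
$ docker run -p 5601:5601 -p 9200:9200 -p 5000:5000 -it --name elk sebp/elk
# In a second terminal, open a shell inside the running container:
$ docker exec -it elk /bin/bash
# The Logstash path may differ between image versions.
$ /logstash/bin/logstash -e 'input { stdin { } } output { elasticsearch { hosts => ["localhost"] } }'
# Now type a message, e.g.:
This is a test message.
# Then press CTRL-C to stop Logstash.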

From the host machine, we'll use curl:

$ curl -s -XGET 'http://192.168.(host ip):9200/_search?pretty&q=message'

The result will look like this:

{
    "took" : 3,
    "timed_out" : false,
    "_shards" : {
        "total" : 6,
        "successful" : 6,
        "failed" : 0
    },
    "hits" : {
        "total" : 1,
        "max_score" : 0.4790727,
        "hits" : [ {
            "_index" : "logstash-2018.08.26",
            "_type" : "logs",
            "_id" : "Z8kCtFYHOWWinjAN6pHUqE",
            "_score" : 0.5720636,
            "_source":{"message":"This is a test message.","@version":"1","@timestamp":"2018-08-26T09:41:17.634Z","host":"5136b631e113"}
        } ]
    }
}

As you can see, our search returned one log entry in JSON format.
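Since the response is plain JSON, pulling the interesting fields out of it takes very little code. The sketch below works on a trimmed copy of the response shown above; the function name extract_messages is hypothetical.

```python
import json

# Trimmed copy of the search response shown above.
response_body = """
{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": 1,
    "hits": [
      {
        "_index": "logstash-2018.08.26",
        "_source": {
          "message": "This is a test message.",
          "@timestamp": "2018-08-26T09:41:17.634Z",
          "host": "5136b631e113"
        }
      }
    ]
  }
}
"""


def extract_messages(body: str) -> list:
    """Return (timestamp, message) pairs for every hit in a search response."""
    result = json.loads(body)
    return [
        (hit["_source"]["@timestamp"], hit["_source"]["message"])
        for hit in result["hits"]["hits"]
    ]


for timestamp, message in extract_messages(response_body):
    print(timestamp, message)
```

The original event lives under "_source"; the surrounding fields are metadata added by ElasticSearch.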

ElasticSearch and Data Loss on Network Partitions

As with all other tools, the ELK Stack comes with its own set of problems. During the previous year and this year, numerous articles have been written about how ElasticSearch loses data due to network partitions, even if the partition lasts only a few seconds.

Those articles describe in more detail how ElasticSearch behaves during transitive and non-transitive network partitions, as well as when a single node is partitioned off.

What is interesting is that even micro-partitions can cause a lag of 90 seconds while the cluster elects a new leader, and everything inserted during that window can potentially be discarded. It is therefore a good idea to keep your logs in their original format and not rely 100% on ElasticSearch, especially for "mission-critical" logs.
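One simple way to benefit from keeping the original logs is to reconcile them against the index from time to time. The following is a minimal sketch of that idea, assuming every event carries a transaction id; the function name and the sample ids are hypothetical.

```python
def find_missing(local_ids, indexed_ids):
    """Return ids present in the original logs but absent from the index.

    The order of the original log is preserved so that missing entries
    can be re-shipped in sequence.
    """
    indexed = set(indexed_ids)
    return [tid for tid in local_ids if tid not in indexed]


# Hypothetical ids: one event never made it into the index.
local = ["txn-1", "txn-2", "txn-3"]
indexed = ["txn-1", "txn-3"]
print(find_missing(local, indexed))  # ['txn-2']
```

Any ids reported as missing can then be re-sent from the original log files.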

Conclusion

With this, we only scratched the surface and presented a very small set of possibilities of the ELK Stack.

The problems addressed in this article are the quick and simple analysis of logs and the visualization of events in the system, through a single interface that can serve as a basis for further instrumentation.

Happy logging!

About Vuk Skobalj