Introduction

David Landup

In a world of large-scale, data-driven architectures, data lies at the core of the system. We live in an increasingly data-driven world, and all of this data doesn't just lie around. It flows from endpoint to endpoint, processed both online and offline, and the amount of data in circulation keeps growing day by day.

Statista keeps track of (estimated) created data worldwide:

"The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 64.2 zettabytes in 2020. Over the next five years up to 2025, global data creation is projected to grow to more than 180 zettabytes."

Most people can't intuitively grasp just how enormous this number is. As business requirements become more complex and volatile, handling the transmission or streaming of data keeps you on your toes, and without a proper system to manage it all, you'll be left overwhelmed.

In addition to data processing (as if that's a small task), you often need systems that can exchange messages, stream events, and implement event-driven architectures to facilitate all the functionality required to handle this onslaught of data. This is where Apache Kafka comes into play.

The official slogan, which as of the time of writing can be found at kafka.apache.org, briefly explains what Kafka is:

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

However, a more descriptive image could be painted as:

Apache Kafka is to data-driven applications what the Central Nervous System is to the human body.

It has built-in stream processing, can connect to a wide variety of data sources, offers high scalability and availability, and, when configured appropriately, provides strong integrity guarantees - no message loss and guaranteed message ordering within a partition.

Apache Kafka is flexible - and it provides several APIs that can be used as standalone features, or facets of the library, for different tasks. These are, namely, the Producer API, Consumer API, Streams API, and Connector API.
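
To make this more tangible, here's a minimal sketch of the Producer API - the broker address (localhost:9092), the topic name (orders), and the payload are placeholder assumptions for the example, and we'll explore the Producer API properly in later chapters:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed local broker - adjust to your environment
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // try-with-resources flushes and closes the producer when done
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a single record (key, value) to a hypothetical "orders" topic
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"item\":\"book\",\"quantity\":1}"));
        }
    }
}
```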

One of the strongest aspects of Kafka is its fluidity. You can use certain parts of it to achieve specific design patterns and completely ignore other functionalities. You can even utilize Kafka in different contexts within the same project, depending on the microservice you're working on at the time. Kafka addresses many problems and use cases, and you get to choose which aspects of it to use.

Kafka has widespread support for many libraries and frameworks you might already be using daily. For instance, Kafka Streams works well with Spring Cloud Stream and the Spring Cloud Data Flow ecosystem, which we'll cover in Chapter 10. You can pair Spring WebFlux with Reactive Kafka for reactive, asynchronous applications, which we'll cover in Chapter 13. Kafka works great with Apache Spark - a unified engine for data processing, data engineering, data science, and machine learning - and we'll explore real-time data streaming with Spark and Kafka in Chapter 14. You can implement excellent monitoring solutions with Kafka, Prometheus, Grafana, Elasticsearch, Logstash, and Kibana, which we'll discuss in Chapter 15. Amazon's Managed Streaming for Apache Kafka (MSK) is covered in Chapter 17. Kafka also pairs well with DeepLearning4J, which brings TensorFlow/Keras-style deep learning to Java environments and allows you to serve deep learning models in production, in real-time! Kafka integrates seamlessly with many frameworks for various use cases - making it extremely versatile.

Applications of today are built on top of data, and what most companies bring to the table is a way to interpret that data and make actionable choices from it. If you want to learn how your users behave, how your applications are being used, trends in user-generated content, etc., you might very well build Machine Learning models to analyze this data and provide insights for you or your clients.

This learning can occur online (feeding data into a model in real-time) or offline (exporting the data and learning from it statically). In both cases, to learn from data, you must first obtain data, and Kafka plays an instrumental role in this process. Whether you use it to feed a Machine Learning system online or export the data for offline learning is up to you and your model. It's easy to imagine the wide range of possibilities this sort of data transfer opens up in the commercial realm, but it's worth noting that the same ideas can be applied far beyond economics and classical software engineering.

In the 21st century, we've seen a rise in biometric data - the average worker today can collect their biometric data easily through non-intrusive gadgets. From consumer-priced EEG helmets that track high-level signals emitted by your brain activity, to heart-rate monitors, wrist-based FitBits, and other similar performance and fitness trackers - you can actually extract quite a lot fairly easily and inexpensively. This field is still relatively young, and the viability of these products will only increase over time. Frameworks like Kafka can be used as the central nervous system of applications that monitor these devices continuously and alert the appropriate individuals to anomalies, such as an elevated heart-rate or stress patterns seen in an EEG.

Through an optimist's lens - in the future, devices integrated into our daily lives could alert doctors to anomalies in our heartbeat, keyboard typing patterns, and blood pressure levels. Our cars could know when we're sleepy, based on data from small, portable EEG headsets we could wear - slowing down safely and getting us off the road before an accident occurs. Our lifestyle could be tracked and analyzed between doctor visits - there's so much that happens in between them that doctors aren't aware of, so any additional data could help them help you. This field is called precision medicine, and at its core lies data collection and streaming, followed by processing.

Kafka can be used exactly for this. Naturally, biometric data is very private, and a completely valid objection and concern is who gets to see and process the data generated by your day-to-day life on such an intimate basis. Still, this isn't the only application of Kafka - just a small glimpse into the possibilities of the future!

Smart urban projects could also benefit from high-speed data throughput - imagine an automated city where autonomous cars drive on the roads, there's no need for traffic lights, and the actions of one agent can affect many other agents in real-time. Beyond mere reporting, second-to-second coordination depends on a high throughput of data between many agents.

Kafka can also be used as a message broker, instead of more commonplace brokers such as RabbitMQ and ActiveMQ, for website user activity tracking (which we'll be covering later on as well), real-time data processing, metric monitoring, etc. Given how generically Kafka can be applied, it would be too narrow to define it as just one thing, and the way one team uses Kafka might very well differ from how another team does.

Is Kafka a silver bullet?

No, Kafka's not a silver bullet, and it can't do everything. More importantly, it's not the best solution for some of the things it can do - such as real-time data processing and message brokering. Yes, it can replace RabbitMQ and ActiveMQ, but it's not always the wisest choice, as it's pretty heavyweight compared to them. Even though Kafka isn't the best real-time traffic system, it's still commonly used for that task - it's considered a near real-time system, not a hard real-time system. Kafka is meant for high-throughput systems and enterprise applications, and the configuration and operational overhead required to make it work makes it a poor fit for smaller applications and simpler contexts.

While it's used extensively to orchestrate data between many microservices, it wouldn't make sense to use it for just a couple.

The fact is - Kafka requires operational maintenance.

In the first chapter, we are going to take a look at some of the following concepts and topics, before getting our hands dirty with Kafka:

  • How Event-Streaming came into existence
  • Why do we need Event-Driven architectures?
  • When should we prefer using Kafka over other messaging systems?
  • When can Kafka be used outside of messaging contexts?
  • Kafka alternatives and use-cases
  • An overview of the applications we'll be building throughout this book

How Event-Streaming Came into Existence

Event-Streaming is a staple in the microservice world, and it's an important concept to grasp. The shift to event-driven microservices makes a lot of sense for certain applications if you consider the hurdles someone has to go through when trying to orchestrate multiple services to work together.

Here goes the story of the hypothetical Steve and Jane - two friends who thought of building an e-commerce website as a startup, where customers can browse items, select the ones they like, and book them to be delivered within a given time frame. When they designed their first application, it was pretty simple: a monolithic server with a charming, attractive UI, and a database to store and retrieve data:

It worked the way most applications worked at the time. There's a codebase sitting on a server - a client connects and sends requests to it, and the codebase performs operations, consults the data in a database, and returns responses. The codebase was indivisible - a programmatic expression of their shop - and whenever they wanted to add a new feature, it was as simple as adding another method or endpoint and adjusting the UI to let the user know of the new feature.

As they added new features, each on top of the other, and divided tasks between themselves, things got a bit more complex. They had logic covering categorization, order management, inventories, payment and purchasing, couriers, and notifications. This all lived in the same codebase, and making a change to any of these usually meant making changes to other systems as well.

The logic was tightly-coupled, and the modules had low cohesion.

Low cohesion, because each part of the codebase dealt with many unrelated concerns, and tightly-coupled because you couldn't really separate the parts from one another. If the notification logic misbehaved, and it was tightly-coupled with, say, the payment logic, Steve and Jane realized they were in for a ride. No software is without errors and bugs, and no matter how much they tested it, there's always a chance something will misbehave and negatively affect the workings of some other logical piece that intuitively shouldn't be affected by such misbehavior.

Why would the payment system suffer if notifications don't work? Not to mention the cascading effect it can have - if one piece of logic goes haywire, which affects another one, which affects another one... etc. - the entire application might suffer, if not come to a grinding halt, because of a simple issue that could have been contained. Additionally, when coordinating what they were doing, it wasn't easy to get messages across:

Steve: "Hey, can you add feature X to the, um, the part where we fulfill orders?"

Jane: "Sure! Could you fix that bug produced by, the uh, part after we send notifications?"

Fusing everything together, without clear lines, made it harder to convey meaning, so they started organizing their code in smaller modules in the same codebase for convenience.

Steve and Jane realized that they wanted loosely-coupled but highly-cohesive code, where you can "switch out" modules and systems, add them or remove them, without making other pieces suffer. Each system has to be self-contained (can work alone) and take care of a single aspect of the application. This was at the core of their idea:

The Notification System takes care of notifications, can send notifications to users assuming it has their information, and if it fails to do so, the fire is contained to the Notification System.

They ended up with an Item Categorization System, Order Management System, Inventory System, Payment/Booking System, Order Fulfillment System, Courier Booking System, Notification Management System, and many more.

Each of these had a simpler, smaller codebase, and each service took care of its own single concern (highly-cohesive) while staying loosely-coupled, so you could play around with them like Lego bricks. An important advantage they now had was that they could test each module separately, and if they made changes to the Notification Management System, they didn't need to re-test the other services - as long as the Notification Management System produces the same outputs, everything keeps working just fine.

With no side-effects other than the outputs, their codebase was cleaner, the logic decoupled, and they could scale up more easily, add new services, and communicate better.

What Steve and Jane did was - they invented the microservice architecture:

Microservices were a huge paradigm shift in how software engineering was done, and affected most large-scale applications in the world. Famously, Netflix migrated to a cloud-based microservice architecture, and later adopted Spring Boot as the standard framework for its Java services.

In 2016, Josh Evans, the Director of Operations Engineering at Netflix, gave a talk titled "Mastering Chaos - A Netflix Guide to Microservices" at QCon in San Francisco, where he outlined some of the systems built during Netflix's huge migration to a microservice architecture, with large-scale data transfer between the services:

Credit: InfoQ on YouTube

In the talk, which is a wonderful one, Josh drew parallels between highly complex systems such as the human body, and the microservice architecture.

Netflix has since been a major contributor to the landscape of tools used in Spring Boot applications, including Zuul, Feign, Hystrix, etc., which are all open-source, used at Netflix for their own services, and available for the public to use in their own applications. Microservices themselves introduced new problems that weren't as prevalent before, and Netflix tackled issues concerning Discovery Clients and Servers, Client-Side Load Balancing, Security Flow, Fault Tolerance, Server Resilience, and more.

We'll be using some of these tools in the book as well, as they're some of the most widely-used solutions to these common microservice problems.

Back to Steve and Jane. They managed to figure out how to separate all of these modules in a big push to this new architecture, figured out how to control the traffic between these services, maintain security, create a server to keep track of all the services and act as an intermediary between their communication, and went live with the website. Eureka, it works!

Note: Presumably because this list of tasks wasn't exactly easy, Netflix's Service Discovery Server is, in fact, called Eureka.

When a user purchases an item, the client UI makes a simple HTTP request to the server, containing the JSON representation of the item. The client then waits for the server to process the whole order, step-by-step, and finally return a response. Their store gains traction due to their attractive offer of choosing the time of delivery, so many new users visit the store and start ordering items.

Each user waits for the process to finish step-by-step, which eventually becomes a bottleneck. Steve and Jane gained a lot of benefits from switching to microservices, but a new issue arises (alongside the ones they resolved along the way, like load balancing, security flow, and service discovery, which previously weren't issues). Their users become frustrated by the long processing times - which, really, weren't longer than a few seconds - but we've become accustomed to near-instant responses, so waiting a few seconds for a page to load feels like an eternity.

Steve and Jane didn't have Kafka back in the day, so they couldn't offload this issue to an existing, fast service, which would process these streams of data and requests in fractions of seconds, as we'll see in the following chapters. They had to think of a new solution and spearhead another paradigm shift within the new field of microservices.

Their main issue was in the request-response module, and to scale up, they had to fix the blocking nature of that module.

Users are parallel and scattered, and their requests come in like confetti. It would be great if you could enter the room, collect all the confetti on the table (orders), and go back to the administration room to process them. Then, instead of putting each confetti piece into a different box (microservice) that kicks off an automatic process based on the color of the confetti, you put the confetti piece on display, and each box has a recognition system that reacts when it sees that piece on display.

Some microservices react in one way, some in a different way, and some don't react at all, waiting for other microservices to finish their job and react to their results. The arrival of this request is an event and the microservices each, in parallel, respond to this event.

This is known as the event-driven architecture, which is a key feature of microservice architectures, and this is the mechanism that allowed microservices to scale up:

You're getting a constant stream of requests to deal with and dumping them into the pipeline for the downstream microservices to take care of, instead of sequentially letting the requests go through a multiple-step system. Each event goes through the pipeline (white boxes in the diagram) and doesn't stop until it reaches the end and is terminated. Each relevant service along the way listens for this event when it's its turn. The services that should react to it do so - taking the data from the original event, performing certain operations, notifying other services if need be, and returning a result.

The aggregation of these results ends up back in the client's UI, and the user doesn't have to wait for each service to finish sequentially - the services process the request in parallel, asynchronously.
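
As a rough illustration of what this can look like in code, here's a hypothetical sketch using Spring Boot with the spring-kafka library (which we'll cover properly in later chapters) - the endpoint, topic, and class names are made up for the example:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class OrderController {
    private final KafkaTemplate<String, String> kafkaTemplate;

    OrderController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/orders")
    ResponseEntity<Void> placeOrder(@RequestBody String orderJson) {
        // Emit the event and respond immediately - no waiting on downstream services
        kafkaTemplate.send("orders", orderJson);
        return ResponseEntity.accepted().build();
    }
}

@Component
class NotificationService {
    // Reacts to the event in parallel with every other interested service
    @KafkaListener(topics = "orders", groupId = "notifications")
    void onOrderPlaced(String orderJson) {
        // ... send a confirmation to the user
    }
}
```

The controller's only job is to record that the event happened - the notification service (and any other listener) picks it up on its own time.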

Event-Driven vs Command-Driven

You could have made a different choice as well. Instead of having a request emit events that microservices listen to, you could have had it send commands to the relevant microservices.

This is analogous to someone holding up a piece of paper with news and having the crowd react on their own, each knowing what to do, compared to someone sending a letter of instructions (a command) to each person in the crowd. This is exactly the difference between the Event-Driven Architecture and Command-Driven Architecture:

Both Events and Commands are messages - just implemented differently - and both approaches can really be summed up as the Message-Driven Architecture. Thus, you'll often see Message Brokers and the term "message" used as a common abstraction of the concept, while the concrete implementation can either be enforced by the tool or left to you to decide.
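
In code, the difference often comes down to intent and naming rather than mechanics - here's a tiny, hypothetical sketch (the types are invented for illustration):

```java
// An event is a statement of fact about something that already happened.
// It's broadcast, and any number of interested services may react to it - or not.
record OrderPlaced(String orderId, String customerId) { }

// A command is an instruction addressed to one specific recipient,
// telling it exactly what to do next.
record SendOrderConfirmation(String orderId, String emailAddress) { }
```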

Once Steve and Jane implemented Events in their system and changed the architecture once again, they could start scaling up and serving more customers. Now it was a question of how to implement and develop event streaming efficiently, and how to streamline the interaction between events and microservices.

They were delighted to solve the issue and started documenting the architecture for their internal use - little did they know, they were spearheading a massive paradigm shift that would enable a new generation of data-driven applications in the cloud.

Why Event-Driven?

In the previous section, we took a look at some of the problems that arose with the synchronous request-response modules Steve and Jane originally built, and went through a process of making them asynchronous to remove the bottlenecks.

They removed them through event-streaming. Technically speaking, it's a practice, or design pattern, employed to capture real-time data or events from various event sources. These sources can be databases, IoT sensors, application/device logs, 3rd party applications, cloud services, or - most commonly - internal microservices that emit events in a stream.

These events are stored, processed, manipulated in real-time, and routed to different target destinations (or deleted), depending on what you'd like to achieve. Generally speaking, here's what the event streaming layer can look like:

The core features of any event streaming platform are:

  • Processing: Processing the stream of events continuously, as they get published
  • Publisher-Subscriber: Writing (publishing) and reading (subscribing to) data continuously
  • Data Storage: Retaining events durably and reliably for a given time interval

Generally speaking, the Publisher-Subscriber model, also known as the Pub-Sub and Publish-Subscribe model, is typically implemented in one of two ways - through queues and topics. In the former, a Publisher publishes data in the form of messages that are queued and transferred to recipients directly. In the latter, a Publisher publishes a message to a topic, which is a central hub for subscribers, from which they retrieve the messages:

So, isn't this just the difference between events and commands, outlined in the previous section? Well, the line is actually somewhat blurred, and it depends on who you ask. There's a surprising amount of discourse on the topic, but it generally boils down to this:

The Publish-Subscribe architecture is an Event-Driven Architecture, where event-sources are called publishers and the services that react to the events are called subscribers.

The terminology is sometimes used interchangeably, and sometimes, engineers make distinctions, but the core concept remains the same.

While the core concepts are the same - they are relatively loose, which is the source of the discourse and ambiguity, so different frameworks will sometimes use the same terminology in (slightly) different contexts. In the next section, we'll take a quick look at what most messaging systems today offer - some represent themselves as Message Queues, some as Publisher-Subscriber Frameworks, and some as Stream Processing Frameworks.

Event Streaming serves a wide range of use-cases. Some of them include:

  • Continuously capturing and monitoring logs or sensor data and raising anomaly alerts to secure your systems from attacks or misuse.
  • Processing real-time transactions and payments and notifying end-users of anomalies and attacks at lightning-fast speed.
  • Processing asynchronous messages to build a message-driven platform.
  • Feeding real-time data to online Machine Learning models, or exporting it for offline models.
  • Enabling communication between many loosely coupled services in a "plug and play" fashion, which enables rapid scaling and balancing.
  • Performing Extract, Transform, and Load (ETL) operations across data platforms.

In a nutshell, event streaming serves as a foundation for data-driven platforms, and choosing a durable, highly efficient, and reliable tool to handle event streaming is quite important. So let's look at various messaging systems available in the market and how Kafka stands out among all of them.

Why Kafka? Why Not Other Messaging Systems?

We've briefly touched upon the loose definition of event-driven systems in the previous section. We do know that event-streaming plays a critical role in the architecture of Steve and Jane's application, but we should also consider the nuances and flavors in which we can implement the concepts. Predominantly, there are three types of messaging frameworks available:

  • Message Queues - A traditional message queue architecture where the data is supposed to be processed in a particular order and within a fixed point-to-point system. Some popular tools and libraries include Apache's ActiveMQ, Amazon's SQS, RabbitMQ, and RocketMQ, as well as Apache Kafka - though, since it's much more heavyweight than the other tools, Kafka isn't as commonly used in this role.
  • Publisher/Subscriber - Derived from the Message Queue to solve some of its cons. While it can be based on message queues, people typically think of a topic-based pub-sub model. It provides higher scalability and a dynamic network topology, and suits distributed frameworks. This is where Google Cloud's Pub/Sub lies, alongside Amazon's SNS, and where people often use Apache Kafka, even though it can be used as a message broker/queue as well.
  • Stream Processing - These are libraries or utilities that developers integrate with their code to process high volumes of streamed data. This is also where Apache Kafka can be used extensively.

Traditional Message Queues work amazingly for certain applications, but don't apply equally well to other types. Let's take a step back to Steve and Jane's eCommerce store to see where a Message Queue would first serve as a great solution, but then start to become a bottleneck again:

Let's first take a look at what a Message Queue would imply. Consider that once a user has checked out an order and completed the payment, the order needs to be checked and processed, and the user finally notified. The orders are now forwarded to the message queue:

The microservice that takes care of the payment would act as the Producer and would publish the orders to a message queue. Various types of orders would be queued one after the other, and the processing would happen in that order. Different microservices that react to these orders, known as Consumers, would consume each order coming in, complete their tasks, and notify the user and the owner of the store. This allows us to scale much further than with a request-response architecture, but it again starts to fail at a much larger scale.

Note: For an eCommerce application to suffer at this scale would require it to be massive, and unless you're handling thousands of orders per hour, you won't really turn the queue from a solution back into a problem. However, this can become a problem for other application types that depend on much more data - especially biometric applications. Medicine now generates terabytes of data daily, and at that scale, the queue can no longer serve as the solution.

Let's entertain the idea that Steve and Jane's shop became so popular that they're dealing with an enormous number of orders. The queue is hosted on the server, and the server has limited memory. At one point, it'll run out, and the queue will become another bottleneck. The first solution is to distribute the load from a single queue onto multiple queues, which brings more computing power to bear. Think of a border crossing with only one traffic lane - congestion builds up easily. By employing just one more person on another lane, the waiting times are halved and double the number of vehicles can pass the border. Opening yet another lane slashes the waiting times to a third of what they originally were.

This raises the question of how we can distribute the queue paradigm - since, by design, the queue data structure respects the ordering of its elements. With queues, the order of processing is the same as the order of insertion, and bigger orders may take more time than smaller ones, slowing down the entire queue:

If we just broke up the queue into multiple queues, this wouldn't hold anymore, and the order of processing would start looking a lot more random. If we had guaranteed processing times for each order, there'd still be some structure - Queue 1, Order 1; Queue 2, Order 1; Queue N, Order 1, followed by Queue 1, Order 2; Queue 2, Order 2, etc. However, we don't really have any guarantee about the processing times.

By design, a message queue is not built to scale into a distributed cluster.

On the other hand, the ability to scale into a distributed cluster is a basic design requirement of Apache Kafka. All data (events) sent into Kafka pipelines requires a distribution strategy. Kafka is based on the Publisher-Subscriber architecture and utilizes Kafka topics to solve this issue. Every topic is split into partitions, and records are assigned to partitions by a partition key. The Producer writes data into these topics - records with the same key land in the same partition, while keyless records are spread across partitions - and the Consumers read from the partitions in parallel.

The Consumers can be parallelized too, by adding a consumer per partition, though this isn't necessary:

Each topic partition is a log of messages, and Kafka doesn't track which messages have been read by which Consumers. It's up to the Consumers to keep track of the messages they've handled (their offset), which takes a lot of computational overhead away from the brokers, allowing for a higher throughput of data.
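
As an illustration, here's a minimal Consumer API sketch - the broker address, topic, and group name are placeholder assumptions. Note how the position (offset) within each partition is exposed on every record, and it's the consumer's job to track it:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "order-processors");          // consumers in one group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The consumer, not the broker, tracks its position (offset) in each partition
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```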

According to Confluent's benchmarks, Kafka's throughput is about double that of Pulsar (another Apache project) and roughly 15 times that of RabbitMQ:

Credit: Confluent's Benchmarking

Additionally, Kafka provides the lowest latency among these at high-throughputs, but is outperformed by RabbitMQ when dealing with low message throughputs of about 30k messages per second. RabbitMQ doesn't scale as well, so at 200k messages per second, Kafka outperforms it significantly:

Credit: Confluent's Benchmarking

In Chapter 2, we'll dive into more detail regarding Kafka's architecture and key terminology. After installing and setting it up, we'll dedicate some time to understanding Kafka components - Topics, Partitions, Brokers, Clusters, Offset, and Replication.

For starters, what you've learned so far is enough to illustrate how and why Kafka addresses the main problems associated with trying to scale up Message Queues, through a highly-efficient and durable messaging system for our event-streaming application.

When to Use Kafka and When Not To?

In the previous section, we briefly went over the reasons Kafka is preferred over other messaging systems, and in which cases. Now, let's look into the various scenarios that can help us decide when to use it and when not to.

Kafka was first designed and developed at LinkedIn, where it was used as a message queue. Over time, it evolved into something more than a message queue - it proved to be a powerful tool for working with various data streams, and its usage revolves around that. So, let's look at its various features to understand where it shines.

  • Scalability: Kafka is highly scalable due to its distributed design, and can be scaled out without any lag or downtime. It is designed to handle enormous volumes of data, on the scale of terabytes. Hence, if you need to stream vast amounts of data in a highly scalable environment, Kafka proves to be one of the best choices.
  • Reliability: Kafka can replicate data, and supports consumption of data by multiple subscribers. On top of that, it rebalances consumers in case of failures or errors. This makes Kafka more reliable than the other messaging systems we compared in the earlier section.
  • Durability: It is highly durable, as it can persist messages for long periods (the retention period is configurable).
  • Performance: Kafka provides high throughput for publishing and subscribing to messages, and relies on sequential disk I/O, which offers great performance even when dealing with many terabytes of stored messages.

So, if you are struggling with any of these features with another solution (or if you're building one from scratch), Kafka could prove to be of great help. However, once we start using it and become accustomed to it, we often tend to solve every problem with Kafka. It's important to be able to take a step back and assess other options if necessary. This also applies to all other technologies - it just so happens that when one technology builds an entire ecosystem around it, it's easier to stick to it as you don't have to pursue other solutions. Kafka has proven to be an efficient data-streaming platform but can easily become overkill:

  • If you want to process a small number of messages per day, then simpler, more cost-effective systems like traditional message queues are a better fit. Kafka is designed to handle massive amounts of data, so it may not be the best choice. To be more precise, based on Confluent's benchmark, it processes up to around 600MB per second, which is roughly 2.1TB of data per hour. For those who want to check if that really is a metric ton - that's about one standard 2TB hard drive filled every hour. A 3.5-inch drive has a volume of roughly 389cm³ (23.7in³) and weighs somewhere around half a kilogram, so streaming a literal metric ton of full drives would take a few months of continuous operation. Take the calculation with a grain of salt, as it's mainly meant to satisfy the curiosity of some.
  • If you want to perform simple tasks in a queue, then you can use other platforms or mechanisms. Kafka would be overkill.
  • Kafka is often the best solution for various kinds of ETL (Extract, Transform, Load) jobs. It also has various APIs to support streaming, but it is not the best fit for hard real-time processing. Hence, Kafka should be avoided where hard real-time data transformation and processing is a primary requirement. Again, Kafka performs in near real-time, so if absolute precision is needed, you might want to switch to another tool. Hard real-time is difficult to achieve and is used in Time-Sensitive Networking. You want a no-latency, no-spike system for things that can go really wrong if latency or spikes are introduced. For instance, you don't want a small spike in latency to cause someone's artificial organ-helper to misbehave, or an invasive BCI (brain-computer interface) to respond slowly to brainwaves - and slowly here can mean milliseconds. As you might've imagined, these cases are uncommon and highly specialized - so you're good for the vast majority of cases.
  • It is not best suited to perform as a database. If there is a requirement for a database, you should, well, use a database. With the storage of redundant data, the cost of maintaining Kafka can increase rapidly.

The discussion on real-time and near real-time processing can go back and forth, as the terms are actually quite loose, and context matters a lot. Kai Waehner wrote a great article on the topic, titled "Apache Kafka is NOT Hard Real Time BUT Used Everywhere in Automotive and Industrial IoT", in which he states that:

"Kafka is real-time. But not for everybody’s definition of real-time."

So really, it depends on whether Kafka is real-time for you or not. A good rule of thumb: unless you're building a highly specialized system that hangs by a thread of millisecond accuracy, Kafka is real-time.

Most companies that use Kafka employ it in a wide range of ways. However, some of the most popular, narrower use-cases are:

  • Activity Tracking: This is the original use-case for which Kafka was first developed at LinkedIn. It was used to build an activity pipeline in the form of a set of real-time publish-subscribe feeds. Various kinds of site activities, such as viewing, searching, or other actions, are published to one topic per activity type. This information is then aggregated and processed for real-time analytics, monitoring, and reporting. This kind of data is generated in bulk, with a high volume of messages generated for each page a user views.

  • Messaging: Since Kafka has better throughput, replication, default partitioning, and high availability, it makes a great fit for message processing applications. Many companies today prefer Kafka over traditional Message Queues due to its ability to cover multiple use-cases.

  • Stream Processing: Today, real-time processing has become the need of the hour. Many IoT devices would be significantly less useful if they didn't possess real-time data processing capability. This is also required in finance and banking sectors to detect fraudulent transactions. Thus, Kafka introduced a lightweight yet powerful streaming library called Kafka Streams.

  • Logging/Monitoring System: Due to Kafka's data retention capability, many companies publish logs and store them for longer periods. It helps in further aggregation and processing to build large data pipelines for transformation. It is also used for real-time monitoring and alerting systems.

Due to its efficiency, Kafka is heavily used in the big-data space as a reliable means of ingesting and moving large amounts of data. According to StackShare, Kafka is used by more than 1200 companies worldwide, including Uber, Shopify, Spotify, Udemy, Slack, etc.

Apart from the various message queues already available on the market, including Kafka itself, a few other messaging systems are worth mentioning:

  • RabbitMQ: Uses the Advanced Message Queuing Protocol (AMQP) for messaging. It is a lightweight messaging system that takes the simpler approach of having clients consume queues.

  • Apache Pulsar: An open-source distributed messaging system managed by Apache itself. It was originally developed as a message queue system, blending properties of both Kafka and RabbitMQ. It has since been enhanced with event streaming features and uses Apache BookKeeper for data storage.

  • NATS: An incredibly fast, open-source messaging system built on a simple yet powerful core. It uses a text-based protocol, made for cloud-native systems with minimal configuration and maintenance, where you can literally just telnet into the server to send and receive messages.

Project Overview - What We'll Be Building

We've now been thoroughly introduced to Kafka at a high level. Before diving into a concrete project, we'll go through some of the key terminology and building blocks through illustrations and code. In this book, we'll mainly be working on Web-based projects. In the Java ecosystem, the most widely used and adopted framework for web development is the Spring Framework. We'll be using Spring Boot to spin up microservices and allow communication between them with Kafka. We'll also have the opportunity to use some of the tools Netflix developed to help facilitate microservice development.

We'll be building a few different projects, each of which will be based on Spring Boot and powered by Kafka.

  • In Chapters 2 and 3, we'll be installing Kafka and exploring its components.

  • In Chapters 4, 5, and 6, we'll be exploring the Producer API, Consumer API, and Streams API, building a foundation for applying Kafka "in the wild". In Chapter 7, we'll be working with the Connect API.

  • In Chapter 8, we'll be exploring the marriage between Kafka and Spring Boot, diving deeper into Spring Boot's Kafka Components in Chapter 9.

  • In Chapter 10, we'll tie Kafka to Spring's Cloud Stream module and see how we can use Spring Cloud Stream's Kafka Binder and Kafka Stream Binder for cloud applications.

  • In Chapter 11, we'll dive into Reactive Kafka and Reactive Spring.

  • In Chapter 12, we'll build an end-to-end ETL data pipeline project to stream data into and from SQL and NoSQL databases such as Apache Cassandra, performing transforms into Avro, using Spring Cloud Stream and Reactive Cassandra.

  • In Chapter 13, we'll perform reactive stream processing with Kafka and Spring WebFlux and explore Server-Sent Event Streaming by building two applications - one for pushing real-time stock prices to a UI and one for building a chatbot/messaging system.

  • In Chapter 14, we'll explore real-time data streaming using Apache Spark and Kafka, taking advantage of Spark Streaming and real-time aggregation.
