The Connect API - Building a Pipeline to Connect, Convert and Transform
Kafka Connect is a data pipeline framework that was introduced in Kafka 0.9.x. It provides a runtime that connects to external applications and data sources, converts data between formats, and applies transformations before the data is pushed to a Kafka topic or delivered to an external system. All of these components are pluggable and configurable, so a pipeline can be tailored to the use case at hand. In this chapter, we take a high-level look at each of these components, how they work, and how they can be combined to build a useful data pipeline.
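To give a flavour of how the transform stage plugs in, the Connect API exposes a Transformation interface that single message transforms implement. Below is a minimal sketch, with a hypothetical class named DropTombstones, of a transform that discards records whose value is null before they are passed on:

```java
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

/**
 * Hypothetical single message transform (SMT) that drops tombstone records,
 * i.e. records whose value is null, before they reach the topic or the sink.
 */
public class DropTombstones<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public void configure(Map<String, ?> configs) {
        // This sketch needs no configuration.
    }

    @Override
    public R apply(R record) {
        // Returning null tells the Connect runtime to discard the record.
        return record.value() == null ? null : record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}
```

A transform like this is attached to a connector through its configuration, alongside the converter that handles serialization, so the copying logic of the connector itself stays untouched.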
Kafka Connect Architecture
Kafka Connect is a distributed, fault-tolerant, and scalable service that is used as a data pipeline to reliably stream data between Kafka and other systems. At its core, Kafka Connect deals with three major components:
- Connector - It serves three main purposes. First, it copies data: users define a copying job at the connector level, and the connector breaks that job down into smaller tasks. Second, it provides data parallelism: because the connector decides how its job is split into tasks, the copying work can be divided at whatever granularity suits the data being received. Finally, by providing an API through which source and sink connectors are implemented and registered, it makes it much easier to integrate a variety of data systems (see the sketch after this list).
- Worker - It enables scaling the application. Connect can run either as a single standalone worker process, which acts as its own coordinator, or in a distributed (clustered) environment, where connectors and tasks are dynamically scheduled across the workers.
- Data - Kafka Connect focuses purely on copying data. For any further processing, there are plenty of stream-processing tools that can be integrated downstream or used as an ETL step. Keeping the scope to copying makes Kafka Connect simple from both a conceptual and an implementation perspective.
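To make the first point concrete, a connector implementation tells the runtime how its job should be split by returning a list of task configurations from taskConfigs(). The following is a minimal sketch under assumed names: a hypothetical TableSourceConnector that copies a comma-separated list of tables (the tables configuration key and the nested TableSourceTask are also assumptions made for the example):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/**
 * Hypothetical source connector: the connector only plans the work, and
 * taskConfigs() spreads the configured tables across at most maxTasks tasks.
 */
public class TableSourceConnector extends SourceConnector {

    private List<String> tables;

    @Override
    public void start(Map<String, String> props) {
        // "tables" is an assumed, connector-specific configuration key.
        tables = List.of(props.get("tables").split(","));
    }

    @Override
    public Class<? extends Task> taskClass() {
        return TableSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Round-robin the tables across at most maxTasks task configurations.
        int numGroups = Math.min(tables.size(), maxTasks);
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < numGroups; i++) {
            configs.add(new HashMap<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            configs.get(i % numGroups)
                   .merge("tables", tables.get(i), (a, b) -> a + "," + b);
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    /** Stub task: in a real connector this is where data is actually copied. */
    public static class TableSourceTask extends SourceTask {

        @Override
        public void start(Map<String, String> props) {
            // Open a connection to the tables assigned to this task.
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // A real task would read rows and wrap them as SourceRecords;
            // returning null means "no data available right now".
            return null;
        }

        @Override
        public void stop() { }

        @Override
        public String version() {
            return "0.1.0";
        }
    }
}
```

Each map returned from taskConfigs() becomes the configuration of one task instance; the workers then schedule those tasks, and in distributed mode they are spread across the cluster.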