Most applications across many domains work with enormous volumes of data, which raises two main challenges: how to collect that data, and how to analyze it. To overcome these challenges, you need a robust messaging system.
Messaging Systems and Their Types
A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system.
The two main types of messaging system are:
- Point to Point Messaging System
- Publish-Subscribe Messaging System
Point to Point Messaging System
In a point-to-point system, messages are persisted in a queue. One or more consumers can consume messages from the queue, but each message can be consumed by at most one consumer. Once a consumer reads a message, it disappears from the queue. Example: an order-processing system.
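The point-to-point semantics above can be sketched with a plain in-memory queue. This is a conceptual illustration only, not Kafka's API: once a consumer takes a message, it is gone from the queue, so each message reaches at most one consumer.

```python
from queue import Queue

# Point-to-point sketch: the queue holds order messages, and each
# message is delivered to exactly one consumer.
order_queue = Queue()
for order_id in ("order-1", "order-2", "order-3"):
    order_queue.put(order_id)

consumer_a = [order_queue.get()]                      # takes order-1
consumer_b = [order_queue.get(), order_queue.get()]   # takes order-2, order-3

# No message is seen by both consumers, and nothing remains afterwards.
assert not set(consumer_a) & set(consumer_b)
```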
Publish-Subscribe Messaging System
In a publish-subscribe system, messages are persisted in a topic. Unlike a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In a publish-subscribe system, message producers are called publishers and message consumers are called subscribers. Example: a TV broadcast service such as Dish TV.
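The contrast with point-to-point delivery can be sketched in memory as well (again a conceptual illustration, not Kafka's API): the topic retains every message, and each subscriber tracks its own read position, so all subscribers see all messages.

```python
# Publish-subscribe sketch: messages are persisted in the topic,
# and every subscriber receives each message.
topic = []  # messages persisted in the topic

def publish(message):
    topic.append(message)

class Subscriber:
    def __init__(self):
        self.offset = 0  # each subscriber keeps its own read position

    def poll(self):
        """Return all messages this subscriber has not yet seen."""
        messages = topic[self.offset:]
        self.offset = len(topic)
        return messages

sports_box = Subscriber()
movies_box = Subscriber()
publish("schedule-update")
publish("new-channel")

seen_by_sports = sports_box.poll()
seen_by_movies = movies_box.poll()
# Both subscribers received both messages, unlike point-to-point.
```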
History of Apache Kafka
Apache Kafka grew out of LinkedIn's need for low-latency ingestion of huge volumes of event data from its website into a lambda architecture that could process events in real time. No existing solution addressed this need, so Kafka was developed in 2010.
Batch-processing technologies did exist, but they exposed their deployment details to downstream users and were poorly suited to real-time processing. Kafka was made public in 2011.
Today, LinkedIn runs several clusters of Kafka brokers for different purposes in each data center, generally running off the open-source Apache Kafka trunk and putting out a new internal release a few times a year. As its Kafka usage continued to grow rapidly, LinkedIn had to solve some significant problems to make all of this work at scale, and in the years since Kafka was released as open source, LinkedIn's engineering team has developed an entire ecosystem around it.
Introduction to Kafka
Apache Kafka is a fast, scalable, fault-tolerant, publish-subscribe messaging system. It provides a platform for a new generation of distributed applications and supports many permanent or ad-hoc consumers. Apache Kafka is used to enable communication between producers and consumers through message-based topics. One of its best features is high availability: Kafka is resilient to node failures and supports automatic recovery, which makes it ideal for communication and integration between components of large-scale, real-world data systems.
There are numerous benefits of Apache Kafka such as:
- Tracking web activities by storing/sending the events for real-time processes.
- Alerting and reporting the operational metrics.
- Transforming data into the standard format.
- Continuously processing streaming data published to topics.
Kafka Use Cases
There are several use cases of Kafka that show why Apache Kafka is so widely used.
- Messaging
Kafka works well as a replacement for a more traditional message broker. Its better throughput, built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message-processing applications.
- Metrics
Kafka is a good fit for operational monitoring data: it can aggregate statistics from distributed applications to produce centralized feeds of operational data.
- Event Sourcing
Because Kafka supports very large amounts of stored log data, it is an excellent backend for event-sourcing applications.
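Event sourcing means deriving current state by replaying an append-only log of events, which is conceptually what a Kafka topic provides. A hedged sketch, with purely illustrative names (this is not a Kafka or event-sourcing library API):

```python
# Event-sourcing sketch: state is never stored directly; it is
# rebuilt by replaying the ordered, append-only event log.
events = []  # the durable, ordered event log

def append(event):
    events.append(event)

def replay(log):
    """Rebuild the current account balance by replaying every event."""
    balance = 0
    for kind, amount in log:
        balance += amount if kind == "deposit" else -amount
    return balance

append(("deposit", 100))
append(("withdraw", 30))
append(("deposit", 5))
balance = replay(events)  # state derived purely from the log
```

Because the log is the source of truth, replaying it from the beginning always reproduces the same state, which is why a durable retained log like Kafka's suits this pattern.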
Kafka achieves messaging using the following components:
- Kafka Topic – A topic is how Kafka stores and organizes messages across its system; essentially, a topic is a collection of messages. Topics can be replicated (copied) and partitioned (divided). Visualize them as logs in which Kafka stores messages. The ability to replicate and partition topics is one of the factors that enable Kafka’s fault tolerance and scalability.
- Kafka Producer – It publishes messages to a Kafka topic.
- Kafka Consumer – This component subscribes to one or more topics and reads and processes their messages.
- Kafka Broker – A broker manages the storage of messages in topics. A deployment with more than one broker is what we call a Kafka cluster.
- Kafka Zookeeper – Kafka uses ZooKeeper to provide brokers with metadata about the processes running in the system and to facilitate health checking and broker leader election.
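How producers, topics, and partitions interact can be illustrated with a toy in-memory model of keyed partitioning. Real Kafka producers hash keys with murmur2 to choose a partition; the byte-sum "hash" below is purely illustrative, but it shows the key property: messages with the same key land in the same partition, preserving per-key ordering.

```python
# Toy model of keyed partitioning (illustrative only, not Kafka's API).
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    """Append a keyed message to a deterministically chosen partition."""
    p = sum(key.encode()) % NUM_PARTITIONS  # toy stand-in for murmur2
    partitions[p].append((key, value))
    return p

p1 = produce("user-42", "login")
p2 = produce("user-42", "click")   # same key -> same partition as p1
p3 = produce("user-7", "login")
```

Within one partition, a consumer reads messages in the order they were produced, which is why keying related events together is enough to keep them ordered.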
Benefits of Apache Kafka
- Low Latency: Apache Kafka offers low latency, down to around 10 milliseconds, because it decouples messages from producers, letting a consumer read a message at any time.
- High Throughput: Thanks to its low latency, Kafka can handle large numbers of high-volume, high-velocity messages, supporting thousands of messages per second. Many companies, such as Uber, use Kafka to load high volumes of data.
- Fault tolerance: Kafka is resilient to node/machine failures within the cluster.
- Durability: Kafka’s replication feature persists messages on disk across the cluster, making data durable.
- Reduces the need for multiple integrations: All the data that a producer writes goes through Kafka, so only one integration with Kafka is needed; that single integration connects each producing and consuming system.
- Easily accessible: Because all data is stored in Kafka, it is easily accessible to anyone.
- Distributed System: Apache Kafka has a distributed architecture, which makes it scalable; partitioning and replication are the two capabilities underpinning this distribution.
- Real-Time handling: Apache Kafka can handle real-time data pipelines, which involve processors, analytics, storage, and so on.
- Batch approach: Kafka also supports batch-like use cases and, thanks to its data-persistence capability, can work like an ETL tool.
- Scalability: Kafka’s ability to handle large numbers of messages simultaneously makes it a scalable software product.
Ultimately, the performance of a Kafka system depends on the accuracy of data delivery: every message must reach its consumer correctly and on time, without excessive lag. If a message is not received within the expected time, alerts are raised.