Apache Kafka for beginners

What is Apache Kafka, and why should you care about it? This article is for those who have heard about Apache Kafka and want to know more: why it’s so widely used by modern companies, the basics of its architecture, the core components of event-driven architecture as they relate to Kafka, and some common business use cases that would benefit from Kafka.

Why is Apache Kafka so widely used by modern companies?

As companies grow, they end up with more systems, and passing information between those systems becomes a complex chore, often slowing the business down. Kafka lets companies decouple those systems to make the interactions less brittle and, therefore, more resilient.

You could think of it simply like this: Kafka solves data exchange problems that most companies will eventually face.

Kafka has developed a large following and ecosystem. While there are other tools that address the same problems, Kafka has become the market leader. Its value snowballs as the community around it grows. As you’ll see, the bulk of Kafka’s value is in the ease of connecting systems. More developers in the ecosystem results in more connectors being built. (More on connectors shortly.)

Apache Kafka architecture

At the highest level, the overall architecture for Kafka looks sort of like this:

Apache Kafka architecture diagram

To dive in a bit deeper and use some of the terms we’ll encounter, let’s zoom in to a more useful level of detail:

Apache Kafka architecture detailed

What are these parts, and how do they work together? Let’s cover these core components. After we’ve become familiar with these concepts, we’ll see how they would fit together in a real use case.

Core components of Apache Kafka

Events

While events are not pictured above, they are the most important piece of our system! You’ll see the terms “event,” “message,” and “record” used interchangeably in various places, including the Apache Kafka documentation. While the variations do technically mean different things, you can get by for now by thinking of them as being the same thing.

Let’s simply think of events as the bits of data going into and out of Kafka: a recorded fact about something that happened. An example event would be a user’s click of a button.
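To make this concrete, here’s a minimal sketch in plain Python (no Kafka client required) of what an event might look like. The field names here are illustrative, but the shape mirrors a Kafka record: a key, a value, and a timestamp.

```python
import json
import time

def make_event(key, payload):
    """Build an illustrative event: a key, a value, and a timestamp.
    Kafka records carry these same pieces (plus optional headers)."""
    return {
        "key": key,                    # e.g., the user's ID
        "value": json.dumps(payload),  # the recorded fact, serialized
        "timestamp_ms": int(time.time() * 1000),
    }

event = make_event("user-42", {"action": "button_click", "button": "signup"})
print(event["key"])  # → user-42
```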

Producers and consumers

Producers are the applications that “produce” or “publish” events. This just means there is some code that creates a message in a particular format and sends it over the network to Kafka to be written to disk. The producer is the event “writer,” so to speak.

Consumers are applications that “listen” for (or “subscribe” to) topics (more on these in a second), which are where events are stored. They are the event “readers.”

A simplified example of an event going end to end would be a producer sending an event to notify that “Jim paid $10 to Susan.” The consumer would read that event and deduct $10 from Jim’s account and add it to Susan’s. 
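That flow can be sketched with an in-memory stand-in for Kafka, where a plain Python list plays the role of the topic. Real code would use a Kafka client library, but the produce/consume shape is the same:

```python
# An in-memory stand-in for a Kafka topic: producers append, consumers read.
topic = []

def produce(event):
    topic.append(event)  # a real broker would persist this to disk

def consume(accounts):
    for event in topic:  # a consumer reads events in order
        accounts[event["from"]] -= event["amount"]
        accounts[event["to"]] += event["amount"]

accounts = {"Jim": 50, "Susan": 0}
produce({"from": "Jim", "to": "Susan", "amount": 10})
consume(accounts)
print(accounts)  # → {'Jim': 40, 'Susan': 10}
```

Note how the producer knows nothing about the consumer; it only knows the topic. That decoupling is the point.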

Brokers

Brokers are request handlers, responsible for the following:

  • Receiving event records from producers
  • Ensuring events are safely stored
  • Serving as a source of updated events that can be requested by consumers

Multiple brokers together make up a “cluster,” which allows for greater scaling and fault tolerance.

Topics

Events are organized into “topics.” Producers send their messages to a topic, and consumers subscribe to the topics they’re interested in. The main thing to know about topics is that they’re partitioned and can span brokers. Think of them as a high-level place to send related messages.

Partitions

Partitions are the low-level places where messages are stored, and they’re one of the main features that allow Kafka to scale and protect data in case a broker fails. A topic will have multiple partitions that are spread across brokers. 
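A common way producers choose a partition is by hashing the event key, so all events with the same key land in the same partition (and therefore stay in order relative to each other). Here’s a simplified sketch of the idea; note that Kafka’s default partitioner actually uses a murmur2 hash, and `crc32` is just a stand-in:

```python
import zlib

def choose_partition(key, num_partitions):
    """Map an event key to a partition by hashing it.
    (Kafka's default partitioner uses murmur2; crc32 is a stand-in.)"""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition...
assert choose_partition("user-42", 6) == choose_partition("user-42", 6)

# ...while different keys spread across the available partitions.
used = {choose_partition(f"user-{i}", 6) for i in range(100)}
print(sorted(used))
```

Because partitions live on different brokers, this spreading is what lets writes and reads scale horizontally.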

Extending core concepts

Certain other concepts are important, though they aren’t “core” to Kafka. Still, it’s highly likely you’ll leverage some of these concepts in an actual implementation — so let’s cover them.

Kafka Streams

“Streams” (or “event streams”) are both a concept and a tool for processing the flow of events passing through Kafka. While there’s a Java library and set of APIs named Kafka Streams, the term is also used more generally to refer to the continuous processing of events. The main thing to know is that an API and libraries exist to help you process messages with a focus on “real time.” You can read more about Kafka Streams here.
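To get a feel for what stream processing means, here’s a sketch in plain Python (not the Kafka Streams Java API): a running count of clicks per user, updated continuously as each event arrives rather than computed once over a finished batch.

```python
from collections import defaultdict

def process_stream(events):
    """Consume a stream of click events and maintain a running count
    per user -- the kind of stateful aggregation Kafka Streams
    performs over topics."""
    counts = defaultdict(int)
    for event in events:       # in a real stream, this loop never ends
        counts[event["user"]] += 1
        yield dict(counts)     # emit the updated state downstream

stream = [{"user": "jim"}, {"user": "susan"}, {"user": "jim"}]
for state in process_stream(stream):
    print(state)
# final state printed: {'jim': 2, 'susan': 1}
```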

Connectors

Connectors are integrations built to take messages from Kafka and send them to another location, such as a data store or a notification service. Connectors sit in the middle, talking to the Connect API on one side and another system on the other. For example, a connector might watch for events on a topic and send them to Amazon S3.

If you’re just starting with Kafka, you’ll likely spend much of your time working with connectors. Many vendors provide and actively maintain a Kafka connector with their service. Other connectors are open source and publicly available. If your needs aren’t met by available connectors, you can write your own.
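As a sketch of what configuring a connector looks like, here is roughly the JSON you’d POST to the Kafka Connect REST API to run an S3 sink connector. The field names follow Confluent’s S3 connector; treat the values (bucket, region, topic) as placeholders for your own:

```json
{
  "name": "s3-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "api-requests",
    "s3.bucket.name": "my-example-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```

Notice there’s no application code here at all: the connector watches the topic and writes to S3 based on configuration alone, which is why connectors are often the fastest way to get value from Kafka.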

Apache Kafka in action

Now that we have a general sense of all the moving pieces in Kafka, what might they look like in action?

Let’s say we want to track all the API requests for our site. We want to know which endpoints get used the most and combine that with some other data in our data warehouse. In this example, we could use the Kafka Log plugin for Kong to send messages to an Apache Kafka topic. Then, we would use the Snowflake Connector to watch for those messages/events and add them to our Snowflake instance.

The Kong plugin acts as our “producer,” and our Kafka broker stores messages in a partition of our configured topic. Our Snowflake Connector acts as our “consumer,” waiting for Kafka to notify it of any new messages. The connector knows where to put them in Snowflake for us to make use of the data. The components and basic flow would look like this:

Apache Kafka components and basic flow

When (and when not) to use Apache Kafka

Apache Kafka is most effective if you need to connect multiple systems or if you have a need for high throughput.

Considering the effort and expense of implementing and supporting a Kafka instance, you’ll want to make sure you have use cases that make sense. For example, if you only need a low-volume message queue for worker processes, Kafka may be overkill.

However, perhaps you need to ingest high volumes of messages, logs, audit trails, and click data. Or maybe you need to connect microservices and track database change events and propagate those to a data lake. If this sounds like your use case, Kafka may be just the tool to help you.

In addition to use cases mentioned specifically in Kafka documentation, the following high-level use cases may also warrant the use of Kafka:

  • Microservices-backed systems that are complex, with services owned by many teams
  • Change data capture, when a change needs to be propagated to many other systems for analytics, metrics, security alerts, auditing, etc.
  • Scale and/or redundancy, when uptime is critical or you have a high-traffic application
  • Data archiving, batch processing, or real-time processing, when you need to handle incoming streams of data to generate insights

You can read more about these patterns here.

Conclusion

Apache Kafka centers around the basic idea that some applications produce messages while other applications consume messages to do something with them. In the middle, you need a broker that handles those messages in a robust and reliable manner. That is the role of Apache Kafka.

When you’re ready to get started with Kafka, you can consider hosted services with free or inexpensive offerings, such as Heroku, Confluent, and Aiven. You can also spin up Kafka locally with Docker.