

The Drop's WPEs are still in “light mode” since I am 100% sure it's still summer, given that it's been a billion °F in Maine this past week (#4 even has a “heat day” — vs. a “snow day” — from high school today).
Looking back across the 357 Drops to date, I'm finding it difficult to believe I've only mentioned Kafka three times. For those even remotely familiar with Kafka: fear not! We won't be touching Java at all in this Drop (traditional Kafka is Java-based). For those who aren't super familiar with Kafka: rest assured we'll take a look at what it is, what problems it solves, and why you should be more familiar with it before we get to today's mission.
Taking A Dip In The Stream
Apache Kafka is an open-source distributed streaming platform designed to handle high volumes of data in real time. It's named after the author Franz Kafka: Jay Kreps, one of the co-creators of Apache Kafka, chose the name because the system is optimized for writing, and because he's a fan of Franz Kafka's work. It is often used for building real-time data pipelines, streaming applications, and event-driven architectures. Almost every decent-sized internet-facing service you use every day very likely uses the Kafka protocol in some way. As we'll soon see, that doesn't necessarily mean they're running Kafka-proper.
Kafka's core architectural concept is an immutable log of messages that can be organized into topics for consumption by multiple users or applications. It enables asynchronous data flow between processes, applications, and servers, much like other message broker systems. However, Kafka has very low overhead because it does not track consumer behavior and delete messages that have been read. Instead, Kafka retains all messages for a set amount of time and makes the consumer responsible for tracking which messages have been read.
Said architecture consists of a storage layer and a compute layer. The storage layer is designed to store data efficiently and is a distributed system that can be easily scaled out to accommodate growth. The compute layer consists of four core components: the producer, consumer, streams, and connector APIs, which allow Kafka to scale applications across distributed systems.
I keep using the word “scale”, and that can be fairly intimidating to folks who are platform architects or data engineers. While it is most certainly designed to scale, Kafka is a great tool to use even on a small scale, such as a single node Raspberry Pi. It lets you decouple collection and processing from a single script/program, maintain an event history for some process, or have one event be handled by multiple consumers without worrying about concurrency in a single process.
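That last point, one event feeding multiple independent consumers, is easier to see in code than in prose. Here's a minimal sketch, assuming the kafka-python package, a Kafka-compatible broker already listening on localhost:9092 (we'll get one running shortly), and a hypothetical "episode-events" topic: two consumers in different consumer groups each receive every event, with no threads or locks in your own code.
# fan_out_sketch.py (illustrative only): assumes `pip install kafka-python`
# and a Kafka-compatible broker (Kafka proper, Redpanda, etc.) on localhost:9092.
from kafka import KafkaConsumer

# Two consumers in *different* consumer groups: each group gets its own copy
# of every event on the topic, so one event can drive two independent processes.
archiver = KafkaConsumer(
    "episode-events",                    # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumption: use your broker's address
    group_id="archiver",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,            # stop iterating after 5s of silence
)
transcriber = KafkaConsumer(
    "episode-events",
    bootstrap_servers="localhost:9092",
    group_id="transcriber",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)

for msg in archiver:
    print("archiver saw offset", msg.offset)

for msg in transcriber:
    print("transcriber saw offset", msg.offset)  # same events, delivered again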
To make this a bit less abstract, let's consider the original “podcast transcriber” WPE from a few weeks ago. We'll extend the use-case a bit (note that this extension is not the WPE mission, as we're still trying to keep to short exercises).
In this extended workflow, we need to:
Check to see if there's a new episode in the podcast's RSS feed
Store the episode metadata in a database (for fast search/retrieval)
Transcribe the audio (so it becomes usable data)
Ingest the transcripts into some searchable system
You can 100% do all that in a single script/executable. However, that may not be such a great idea in the long run.
For starters, that's a great deal of complexity to manage in a single source file. It creates significant cognitive load when process bugs crop up (there are always new edge cases that need handling), or when you discover you need to swap out components (e.g., switching to postgres from sqlite for the metadata database, using a SaaS API vs. local Whisper for transcription, or migrating from, say, Solr to Quickwit for transcript indexing).
Furthermore, what if you want to consume, transcribe, and index multiple podcasts? Migrating a single-threaded script/program into a concurrent one introduces a whole new set of potential problems and bugs.
If we decouple the processing into an event-driven architecture, it could look something like this:
where we have a central message broker (“Red Panda”) receiving and broadcasting messages that various consumers are watching. Each consumer does one focused task and returns success/failure events back to the broker. One of those consumers knows all the events that need to be finished before a new podcast ingestion workflow can be called “complete”, and is responsible for finalizing the entire transaction.
Don't get me wrong, there's definitely some complexity in said architecture, but you can change up the “what/how/where” in each compartmentalized processing node, and more easily support handling more than one podcast ingestion at one time.
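To make “each consumer does one focused task” a bit more tangible, here's a hedged sketch of what the transcription node could look like in Python once a broker is running (next section). The topic names, the transcribe() stub, and the broker address are all made up for illustration; swap in whatever your setup actually uses.
# transcribe_worker.py: one focused node in the event-driven sketch above.
# Assumes `pip install kafka-python`; broker address and topic names are
# placeholders for this sketch.
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"          # assumption: use whatever your broker reports
IN_TOPIC = "NewPodcastEpisode"     # hypothetical topic names
OUT_TOPIC = "TranscriptionResult"

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
consumer = KafkaConsumer(
    IN_TOPIC,
    bootstrap_servers=BROKER,
    group_id="transcriber",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def transcribe(url: str) -> str:
    # stand-in for local Whisper or a SaaS transcription API
    return f"(transcript of {url} would go here)"

for msg in consumer:                       # blocks, handling one event at a time
    episode = msg.value
    try:
        event = {"url": episode["url"], "status": "ok",
                 "transcript": transcribe(episode["url"])}
    except Exception as err:
        event = {"url": episode.get("url"), "status": "error", "error": str(err)}
    # report success/failure back to the broker for the "finalizer" consumer
    producer.send(OUT_TOPIC, event)
    producer.flush()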
Jumping On The 🐼 Express
You may have noticed that the central message broker is named “Red Panda”. While I do like 🐼, the name is based on an actual tool, Redpanda (GH), which is a C++-based alternative to running icky Java apps on your system. Traditional Kafka has more moving parts and dependencies, but you can get started in the brave, new message queue world pretty easily (Homebrew on macOS, or the package repo on Debian/Ubuntu):
$ brew install redpanda-data/tap/redpanda && rpk container start
$ curl -1sLf \
'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' \
| sudo -E bash
$ sudo apt-get install redpanda
Docker/container-fans can also join in the fun.
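For example, something along these lines should get you a local broker without touching the host; the image name and redpanda start flags here are my recollection of the Redpanda Docker quickstart, so treat them as assumptions and double-check the current docs:
$ docker run -d --name=redpanda -p 9092:9092 \
    redpandadata/redpanda:latest \
    redpanda start --smp 1 --overprovisioned --node-id 0 --check=false \
    --kafka-addr PLAINTEXT://0.0.0.0:9092 \
    --advertise-kafka-addr PLAINTEXT://127.0.0.1:9092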
At the end of the installation dance, you should be able to execute and see something like the following:
$ rpk cluster info
CLUSTER
=======
redpanda.0ca7481f-0dcd-4334-9001-4867592b29ae
BROKERS
=======
ID HOST PORT
0* 127.0.0.1 65511
You're now ready to journey into the world of distributed message processing!
If you'd like a more guided install, hit up the first chapter of “How to Install Redpanda & Use the rpk CLI”. The Redpanda University courses used to be accessible without “registering”; they require an account now, but they're all free and come with transcripts (if you're like me and prefer text to video).
Streaming Topics
The data stored on brokers like Redpanda is organized into topics, which are named streams or channels where related events are published. For instance, you could have a “new episode” topic containing events related to new podcast episodes for a given show. Generally, a topic has a level of granularity similar to a table in a database. You can create any number of topics, each identified by a human-readable name.
Each record within a topic is represented as a key-value pair, with a timestamp and an optional set of headers. The key of each record can be used to group related events, and the value contains the payload (perhaps some JSON), with all of the details about the event. “Producers” post messages/events to a topic, and “Consumers” listen for and receive/process those messages/events.
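If that record anatomy is easier to grok in code, here's a hedged sketch using the kafka-python package; the broker address is an assumption (rpk container start prints the real port), and the topic is the one we're about to create with rpk below.
# keyed_record_sketch.py: assumes `pip install kafka-python`, a broker at
# localhost:9092 (adjust to the port your install reports), and that the
# NewPodcastEpisode topic already exists (we create it just below).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# key: groups related events (here, all events for one show)
# value: the payload with the event details
# headers: optional metadata that rides along with the record
producer.send(
    "NewPodcastEpisode",
    key="example-podcast",
    value={"title": "Example Podcast Episode",
           "url": "https://example.com/podcast/episode1.mp3"},
    headers=[("source", b"rss")],
)
producer.flush()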
Let's make this a bit more real and create a topic for the extended podcast scenario.
$ rpk topic create "NewPodcastEpisode"
TOPIC STATUS
NewPodcastEpisode OK
Now, let's send a new podcast event to that topic. We'll use some mock JSON representing one podcast RSS feed item, stored in episode.json:
{
  "title": "Example Podcast Episode",
  "description": "This is an example podcast episode from an RSS feed.",
  "pubDate": "2023-09-07T12:00:00Z",
  "duration": "01:30:00",
  "url": "https://example.com/podcast/episode1.mp3"
}
☝🏽 all should be on one line in the actual file, as rpk treats each line as a separate event/message
$ cat episode.json | rpk topic produce NewPodcastEpisode
Produced to partition 0 at offset 0 with timestamp …
Our Producer does not care a whit about what happens next with that event. It did its job, and did it well.
Now, we'll consume it:
$ rpk topic consume NewPodcastEpisode -n 1
{
  "topic": "NewPodcastEpisode",
  "value": "{\"title\": \"Example Podcast Episode\", …}",
  "timestamp": 1694167649722,
  "partition": 0,
  "offset": 0
}
We're using “-n 1” so the rpk command returns you back to the shell. Otherwise, it would just continue listening and printing new events.
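The programmatic equivalent of that rpk consume invocation is only a few lines. Here's a hedged Python sketch (kafka-python again, broker address assumed); note that it's the consumer, not the broker, that decides where in the log to start reading and which offsets its group has already handled.
# consume_sketch.py: assumes `pip install kafka-python` and a broker at
# localhost:9092 (swap in the port your Redpanda install reports).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "NewPodcastEpisode",
    bootstrap_servers="localhost:9092",
    group_id="episode-reader",        # offsets are tracked per consumer group
    auto_offset_reset="earliest",     # start at the beginning of the log
    consumer_timeout_ms=5000,         # stop waiting after 5s of silence
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    print(msg.partition, msg.offset, msg.value["title"])
    break   # roughly the same effect as rpk's `-n 1`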
We can also configure Redpanda topics to clean themselves up after a while, since we want to be smarter than Toyota's IT Operations team and not eat up all our drive space. We can do this at topic creation time or afterward (the 600,000 ms below == 10 minutes of retention):
$ rpk topic alter-config NewPodcastEpisode --set retention.ms=600000
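If we had wanted that set from the start, rpk topic create can take topic configs up front; hedging a bit here since flag spellings can shift between rpk releases, but it should look something like:
$ rpk topic create NewPodcastEpisode -c retention.ms=600000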
Left to the defaults, Kafka/Redpanda hangs on to messages for quite a while (and it can be told to keep them forever), since that history of events is super important for debugging workflows or reprocessing data.
There are many properties you can play with, but we don't need to add any more complexity to the upcoming diminutive assignment.
Your Mission
First, play around with rpk and get comfortable creating topics, deleting topics, and sending/receiving messages (split your terminal/open a new one and watch the post/consume flow). You can even go through some of the University topics as the sole mission.
Next, get familiar with the documentation. Those docs can go deep since you can do YUGE things with Redpanda/Kafka, but you can stick with getting familiar with the core concepts and settings.
For folks who want a more guided effort, work through the chat app, which has full, working project code in Go, Python, JavaScript, and (ugh) Java. You really don't want to live in the shell with rpk when creating a full workflow. Event-driven processes do mean working asynchronously, and these examples are also a great way to dive further into working with async idioms in those languages.
For extra credit: way back in Drop #296, we showed how to consume data from the Bluesky firehose (no 🦋 account required). Take that base, post each message to a Redpanda “NewPost” topic, and create multiple, separate consumers (a sketch of one possible consumer follows this list):
one that captures and stores all URLs
one that looks for a particular hashtag (or multiple ones) and saves off those messages
one that watches for things you post (or that someone you know with a Bluesky account posts) and re-posts them to Mastodon or Matrix, or even just saves them off to SQLite.
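For the hashtag watcher, a hedged starting point might look like the following. It assumes the Drop #296 firehose consumer is already producing one JSON-encoded post per message to the "NewPost" topic; the broker address and the "text" field name are assumptions that depend on how you shaped that JSON.
# hashtag_watcher.py: extra-credit sketch; assumes `pip install kafka-python`,
# a broker at localhost:9092, and a "NewPost" topic fed by the firehose
# consumer from Drop #296 (one JSON post per message).
import json
import sqlite3
from kafka import KafkaConsumer

HASHTAGS = {"#maine", "#datascience"}     # whatever you care about

db = sqlite3.connect("hashtag_posts.db")
db.execute("CREATE TABLE IF NOT EXISTS posts (raw TEXT)")

consumer = KafkaConsumer(
    "NewPost",
    bootstrap_servers="localhost:9092",   # assumption: adjust to your broker
    group_id="hashtag-watcher",
    auto_offset_reset="latest",           # only care about new posts
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    post = msg.value
    text = post.get("text", "")           # field name depends on your firehose JSON
    if any(tag in text.lower() for tag in HASHTAGS):
        db.execute("INSERT INTO posts (raw) VALUES (?)", (json.dumps(post),))
        db.commit()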
If you do go the extra credit route, make sure you set the retention period to something your system can handle. Messages from the firehose are small, but those bytes do add up.
FIN
A big reason I decided to focus on Kafka/Redpanda in this drop was this post on the RisingWave blog. In “Why Kafka Is the New Data Lake?” their founder makes a pretty convincing argument that there's nothing stopping Kafka from becoming the foundation of the modern data lake. Heck, we've already got KSQL (SQL ops on streaming data!), and one could even consider event consumers to be postgres triggers on steroids.
Drop a note with any q's you may have or issues you run into if you do decide to take on either or both parts of this WPE challenge! ☮