

Drop #320 (2023-08-18): Weekend Project Edition
Turn Podcasts Into An Information Repository (Yes, Moar 🔊 RSS Work Today)
Rust GUIs will have to wait for longer nights. Despite how cool Iced (see yesterday’s Drop) is, a “from scratch” full-on Rust project just wouldn't be something most folks could reasonably tackle in a weekend. It can wait until we have some darker & colder days in the Northern Hemisphere.
We'll continue both the “keep it simple” theme and the “RSS” theme from the previous week, just in a different context.
Rather than turn a data source into an RSS feed, we're going to turn an (audio) RSS source into usable, searchable, and even structured data.
The Plan
It should come as no surprise that my podcast selections are a tad, um, boring. Well, at least to most folks. I focus mainly on sources that are part of this whole “stop the death of democracy” thing, as well as general news and information. As such, I tend not to think of them as “ephemeral”, since people, places, and events tie together over time.
Unfortunately, audio is a time-consuming resource to loop back through, and most still do not have transcripts. There’s no way I’m taking copious notes when listening to one, either.
So, this weekend, your challenge is to build a podcast text archiver. We'll give you all the pieces you'll need, but, make no mistake, sleeves will be rolled up, if only for a little while.
Here's the gist of it:

Given a podcast RSS feed URL, you'll create a process that:

- identifies any new episodes that have been released since the last run
- for each new episode:
  - extract the metadata
  - identify the audio URL
  - convert the audio to something we can feed to a transcriber
  - transcribe the audio and have the transcriber attempt to mark different speakers

This means the first run will do the above for all of the episodes, unless you choose to only go back so far in the history.
We’ll work with one feed and show how to do the four sub-bullets under “for each new episode”.
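To make the “new episodes since the last run” bookkeeping concrete, here's a minimal bash sketch of the driver loop. The seen-guids.txt file and the loop shape are my own assumptions, not a prescribed design:

# hypothetical driver loop: only process GUIDs we haven't seen before
$ feed_url="https://podcast.posttv.com/itunes/the-7.xml"
$ touch seen-guids.txt
$ curl -s -o feeds/the-7.xml "${feed_url}"
$ for guid in $(xidel ./feeds/the-7.xml --silent --extract "//item/guid"); do
    grep -qxF "${guid}" seen-guids.txt && continue  # already processed
    # ... extract metadata, transcode, transcribe (all shown below) ...
    echo "${guid}" >> seen-guids.txt
  done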
The 7
We'll use a short podcast for this exercise, since patience can wear thin, especially for folks like me who have no fancy GPUs to run even fancier AI models on.
So, we’ll use WaPo's “The 7”. Each episode covers current, (mostly) relevant events and, most importantly, runs between five and ten minutes. Since we'll be using both ffmpeg (https://github.com/FFmpeg/FFmpeg — binaries are around for every platform) and Whisper.cpp (see below) to work with this data, small files are super important, especially for a proof-of-concept project.
The RSS URL for the podcast is: https://podcast.posttv.com/itunes/the-7.xml
Sweet Nothings
If you haven't played with Whisper.cpp yet, you will today. It's a high-performance C/C++ port of OpenAI's Whisper automatic speech recognition (ASR) model. It runs pretty much anywhere, and there are scads of “smaller” models with oddly decent accuracy and speed.
Your first step is to get it installed. The Whisper.cpp repo has tons of information on how to do that. But, for more extensive help, macOS folks can follow the instructions in this blog post. You'll need to clone the repository, install Xcode or the Xcode Command Line Tools, and then use the make command to compile the code.
For Windows, refer to this guide or this issue, and let GitHub do the work for you. You'll need to have git, curl, and Anaconda installed. After cloning the repository, you'll need to use Microsoft Visual Studio to build the project.
For Linux, you can follow similar steps as for macOS. Clone the repository and use the make command to compile the code.
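For reference, a minimal macOS/Linux build might look like the following. The model-download helper ships in the repo's models/ directory, and small.en-tdrz is the tinydiarize-capable model used later in this Drop; if your copy of the script doesn't list it, the repo README links the models directly:

# clone and build Whisper.cpp (macOS/Linux)
$ git clone https://github.com/ggerganov/whisper.cpp.git
$ cd whisper.cpp
$ make

# fetch the tinydiarize-capable small English model
$ bash ./models/download-ggml-model.sh small.en-tdrz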
There are pre-compiled binaries floating around. Just be thoughtful before downloading random things from random sites.
Ensure Whisper is working before continuing. Give a shout out if you need a hand.
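A quick way to verify the build is to transcribe the sample clip that ships with the repo (main is the binary that make produces; the model path assumes the download step above):

# smoke test against the bundled sample audio
$ ./main -m models/ggml-small.en-tdrz.bin -f samples/jfk.wav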
By Your Command [Line]
We'll work through the process in the shell, which should make it possible to translate the process to any scripting language. We'll need some help to work with XML, and for that we'll turn to some XML utils covered in a previous Drop: xmllint for a quick sanity check, and Xidel to see if we can actually get an episode's audio:
# Pull down a copy of the RSS feed
$ curl -s -o feeds/the-7.xml \
    "https://podcast.posttv.com/itunes/the-7.xml"

# Poke at it a second to make sure things are copacetic
$ cat ./feeds/the-7.xml | \
    xmllint --nocdata \
      --xpath "//item/title/text()" - | head -2
Thursday, August 17, 2023
Wednesday, August 16, 2023

# Grab the first audio URL and episode GUID
$ enclosure=$(xidel ./feeds/the-7.xml --silent --extract "//item/enclosure/@url" | head -1)
$ guid=$(xidel ./feeds/the-7.xml --silent --extract "//item/guid" | head -1)

# (printf instead of echo: plain bash echo won't expand "\n")
$ printf "%s\n%s\n" "${guid}" "${enclosure}"
64ddfc9c4cedfd000a140968
https://chrt.fm/track/7429E/podtrac.com/pts/redirect.m…[TRUNCATED]
Now we're cookin' with ga…er, induction!
Note that we got lucky that the GUID is a real GUID and not a URL. You should likely figure out a better way to deal with this uncertainty in your setup. Further note that there isn't always a filename in the enclosure URL.
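One hedged way to deal with that uncertainty: if the GUID looks like a URL, derive a stable identifier from the enclosure URL instead. (shasum is the macOS spelling; most Linuxes call it sha1sum.)

# hypothetical fallback: hash the enclosure URL when <guid> is actually a URL
$ if [[ "${guid}" == http* ]]; then
    guid=$(printf "%s" "${enclosure}" | shasum -a 1 | awk '{print $1}')
  fi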
Equivalent Exchange
Whisper.cpp requires audio files to be converted (transcoded) to a specific format, most notably a 16 kHz sampling rate, since the underlying Whisper models were trained on (and expect) audio at that rate.
We can cover both the download and transcode in one step:
# Convert the audio to an optimized format for Whisper
$ user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
$ ffmpeg -user_agent "${user_agent}" \
-i "${enclosure}" \
-acodec pcm_s16le -ac 1 -ar 16000 \
"wav/${guid}.wav"
… LOTS of ugly, verbose output, which you can silence with CLI options (a quieter variant follows below) …
# Check out our work (get file size + name)
# linux: stat -c "%s %n" wav/64ddfc9c4cedfd000a140968.wav
$ stat -f "%z %N" wav/64ddfc9c4cedfd000a140968.wav
12785500 wav/64ddfc9c4cedfd000a140968.wav
I'm feeding ffmpeg a user-agent, since some “clever” podcast hosts think they can stop curl. They're so adorable when they're young!
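As promised, here's the quieter variant of that transcode; -hide_banner and -loglevel are standard ffmpeg flags:

# same download + transcode, minus the wall of output
$ ffmpeg -hide_banner -loglevel error \
    -user_agent "${user_agent}" \
    -i "${enclosure}" \
    -acodec pcm_s16le -ac 1 -ar 16000 \
    "wav/${guid}.wav"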
More Sweet Nothings
Since we want Whisper to try to identify when different folks are speaking (which isn't usually “a thing” on The 7), we'll need to grab a specially trained model that was purpose-built for speaker diarization.
$ whisper \
--threads 8 \
--tinydiarize \
--model "/path/to/models/ggml-small.en-tdrz.bin" \
--language en \
--output-json \
--output-vtt \
--output-txt \
--output-file "transcripts/${guid}" \
--file "wav/${guid}.wav"
$ stat -f "%z %N" ./transcripts/*
11779 ./transcripts/64ddfc9c4cedfd000a140968.json
5455 ./transcripts/64ddfc9c4cedfd000a140968.txt
6393 ./transcripts/64ddfc9c4cedfd000a140968.vtt
That process took less than a minute on my MacBook Pro.
Some notes (there's a small example after this list):

- play with --threads and --processors to see what works optimally on your system
- consider using --prompt to give the model some hints at the content and what you're looking for
- if you want to skip past any usual start-of-podcast cruft, use --offset-t to specify where in the stream Whisper should start
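For example, a run that skips a thirty-second intro and primes the model with some context might look like this; the prompt text is purely illustrative, and --offset-t takes milliseconds:

$ whisper \
    --model "/path/to/models/ggml-small.en-tdrz.bin" \
    --offset-t 30000 \
    --prompt "A Washington Post news briefing covering seven stories." \
    --output-txt \
    --output-file "transcripts/${guid}" \
    --file "wav/${guid}.wav"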
What We Generated
Let's see what we've got.
The txt file is the most human-consumable:
$ head -2 transcripts/64ddfc9c4cedfd000a140968.txt
Whether you're all about fun in the sun or rest in relaxation, Volkswagen has S_U_V_s to get you revved up for summer. Have a season packed with adventure in the versatile atlas with spacious cargo room. Stroll to the sunset in the compact Taos with V_W_ digital cockpit. Or make the most of your summer energy in the all-electric I_D_ four S_U_V_. Get your summer on the road at your local Volkswagen dealer today.
Trump's possible trial date rising, cancer rates in young people and a medical breakthrough using pig organs. That sum of what we'll get to on the seven from the Washington Post. I'm Jeff Pierre. It's Thursday, August seventeenth. Let's get you caught up with today's seven stories.
The vtt file can be used in any WebVTT context:
$ head -10 transcripts/64ddfc9c4cedfd000a140968.vtt
WEBVTT
00:00:00.000 --> 00:00:29.020
Whether you're all... [TRUNCATED]
00:00:29.020 --> 00:00:50.940
Trump's possible... [TRUNCATED]
00:00:56.180 --> 00:01:08.100
First up Donald... [TRUNCATED]
And, we get structured data in the json:
$ jq '.transcription[0]' transcripts/64ddfc9c4cedfd000a140968.json
{
  "timestamps": {
    "from": "00:00:00,000",
    "to": "00:00:29,020"
  },
  "offsets": {
    "from": 0,
    "to": 29020
  },
  "text": " Whether you're... [TRUNCATED]",
  "speaker_turn_next": false
}
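Since the JSON carries the tinydiarize signal, you can slice on it; for instance, pull out just the segments after which the speaker changes (the field name is straight from the output above):

# list the text of segments that end on a speaker turn
$ jq -r '.transcription[] | select(.speaker_turn_next) | .text' \
    transcripts/64ddfc9c4cedfd000a140968.json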
Put A Bow 🎁 On It
With the plan, pointers to tools, and example runs in hand, you should have a solid foundation to build a basic “podcast-to-data” workflow in a weekend.
Virtually every scripting and programming language will support shelling out to run those commands, and we've learned (in previous WPEs) about how to machinate databases pretty well, too.
Some ideas:

- Use a better directory structure. This is nice for a hack:

  ├── feeds
  │   └── the-7.xml
  ├── transcripts
  │   ├── 64ddfc9c4cedfd000a140968.json
  │   ├── 64ddfc9c4cedfd000a140968.txt
  │   └── 64ddfc9c4cedfd000a140968.vtt
  └── wav
      └── 64ddfc9c4cedfd000a140968.wav

  but it is ill-suited for a large podcast corpus.

- Keep historical copies of the podcast feed around (content can change, and some older episodes may drop off). Git was tailor-made for this.
- Better yet, also store any new or updated metadata entries from the RSS feeds into a database (a minimal schema sketch follows this list).
- Consider making the texts even more useful with something like the Tantivy full-text index/search utility. That link has a full tutorial, but I'll be covering Tantivy in this weekend's Bonus Drop, too.
- Use any of the automation/workflow tools we've covered in previous Drops to have all this work seamlessly in the background for you.
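For the database idea above, here's a minimal SQLite sketch; the table and column names are my own invention, not anything standard:

# hypothetical episodes table for feed metadata + transcript locations
$ sqlite3 podcasts.db <<'SQL'
CREATE TABLE IF NOT EXISTS episodes (
  guid            TEXT PRIMARY KEY,
  title           TEXT,
  pub_date        TEXT,
  enclosure_url   TEXT,
  transcript_path TEXT
);
SQL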
FIN
Give a shout if you run into any issues bending Whisper.cpp or ffmpeg to your will. ☮