Drop #178 (2023-01-16): Making YouTube (Et al.) Work For You

ytgrep; yt-dlp; ffmpeg

For some folks (like me), video content is the least optimal way of consuming information. Doubly so if streaming is required, especially if on the move in rural areas. When learning something, I’d rather have text and static images any day (for most content). Unless something extremely useful is being conveyed visually (a rarity), I’d also prefer an mp3 I can listen to on a walk, doing chores, or cooking.

Today’s drop features tools to help increase the utility of video-centric content. This includes ways to check to see if target content has what you’re looking for, download said content for offline use, extract textual content from videos, and yank the audio/captions from a video.

ytgrep

Before I even consider downloading a video file from a new source, I try to make sure the content is worth the time sink. Watching videos kills a great deal of time since your eyes must be fixated on a glowing rectangle. Sure, some video content — such as watching someone perform a menial task, so you can mimic it more accurately — requires such visual attention. Yet, even in those cases, if the speakers aren’t covering content you need, then you just wasted some time that could have been spent elsewhere.

The ytgrep utility can help perform this triage for videos hosted on YouTube. Beyond mere triage, it’s also useful just for yanking target content out of user-supplied or platform-generated subtitles associated with YT videos.

For instance, let’s say you were like me and $WORK got in the way of watching all the recent, and most excellent, NormConf talks. It’d be handy to be able to sift through them to see which ones you can safely avoid. For example, here’s how I avoided all talks that mention python:

# grab the list of normconf individual talk URLs
aria2c https://rud.is/dl/normconf-urls.txt # alternatively curl or wget 

# look for all the ones that mention python; this takes a minute or so
cat normconf-urls.txt | xargs ytgrep -links 'python' > to-avoid.txt

The -links parameter will add a “view at this time offset” parameter to each found line. I’m not showing the output due to the size, but you can recreate it with the above steps and check it out on your own.

Note that the output from ytgrep will contain control characters that are used for color-highlighting the found string/pattern (one of us should PR into that utility to add an option to just output text).
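
Until such an option exists, a small filter can strip the color codes after the fact. This is a minimal sketch that assumes the control characters are standard ANSI color sequences (ESC, an opening bracket, digits and semicolons, then m). The name strip_ansi is a made-up helper, not something ytgrep ships:

```shell
# strip_ansi: hypothetical helper, not part of ytgrep.
# Assumes the highlighting uses ANSI SGR sequences: ESC [ <digits;digits> m
strip_ansi() {
  esc=$(printf '\033')   # a literal ESC byte, which works across sed flavors
  sed "s/${esc}\[[0-9;]*m//g"
}

# usage: cat normconf-urls.txt | xargs ytgrep -links 'python' | strip_ansi > to-avoid.txt
```

If ytgrep emits other kinds of escape sequences (cursor movement, for example), the pattern would need widening.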

yt-dlp

Folks may have heard of and possibly (likely?) used youtube-dl, a “command-line program to download videos from YouTube.com and a few more sites.” It captured some extra attention a couple of years ago after Microsoft messed up and killed the project’s GH repo, then was forced to save face and reinstate it. (Gosh they have good PR disaster comms writers.)

It’s been around a while.

Over the years, the rate of new feature inclusion slowed down quite a bit. This begat the — now ~defunct — youtube-dlc project, which kept new youtube-dl features updated whilst also letting the community contribute experimental/new features that the main utility could then include.

The yt-dlp project took over where youtube-dlc left off. It yanks in relevant updates from youtube-dl while also focusing on providing new features and site support. In fact, the new feature set is so robust that I have to just suggest you hit the yt-dlp repo and start reading.

Remember, too, that despite the “yt” in the name, it’s a generic media downloader for scads of sites. For instance, here’s an incantation to grab media content from a tweet:

# while this works, i'd recommend adding the https:// scheme
# i had to leave it off b/c substack is daft
# (quoted so the shell doesn't treat the ? as a glob character)

yt-dlp 'twitter.com/PineTreeWeather/status/1614912324875259906?s=20'

For me, one regular incantation is to grab a YT video with subtitles. Say, for instance, one of the NormConf talks:

# while this works, i'd recommend adding the https:// scheme
# i had to leave it off b/c substack is daft

yt-dlp --sub-lang en --write-auto-sub --sub-format srt 'www.youtube.com/watch?v=I4wkCSd7iMM'

That asks the utility to grab English subtitles in SRT format along with the video itself. It produced these two files:

Ethan Rosenthal and the M1 misadventure -  Ethan Rosenthal [I4wkCSd7iMM].en.vtt
Ethan Rosenthal and the M1 misadventure -  Ethan Rosenthal [I4wkCSd7iMM].webm

The WebVTT format was pulled instead of SRT due to the captions being auto-generated by the platform vs. inserted by the creators. User-supplied captions may also be in WebVTT format, but I’ve seen it more often for auto-generation.
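
If SRT is what you actually want on disk, one option is to have yt-dlp convert the subtitles after the download rather than relying on what the site offers. A hedged sketch: my understanding is that --sub-format is a preference yt-dlp falls back from when the site doesn't offer that format, whereas --convert-subs converts whatever was fetched (it needs ffmpeg on the PATH). The name grab_srt is a hypothetical wrapper, and --skip-download is only there to skip the video file:

```shell
# grab_srt: hypothetical wrapper; fetches English auto-captions only and
# asks yt-dlp to convert them to SRT (the conversion step relies on ffmpeg).
grab_srt() {
  yt-dlp --skip-download \
         --write-auto-subs --sub-langs en \
         --convert-subs srt \
         "$1"
}

# usage: grab_srt 'https://www.youtube.com/watch?v=I4wkCSd7iMM'
```

Drop --skip-download if you want the video alongside the captions.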

We’ll work with both of these formats in the final section.

This wiki has some great tips and one-liners to help you get the most out of yt-dlp.

ffmpeg

I’m going to make a broad assumption folks reading this newsletter know what ffmpeg is, whether it is part of your direct toolbox or not. It powers tons of other command line utilities — such as yt-dlp! — and has a rich built-in feature set of its own.

Neither the WebVTT format nor the SRT format is a good choice for just reading the text of subtitles, and WebM videos aren’t just going to slide into most audio players as-is. We’ll need to convert the two files we downloaded in the previous section into some useful formats if we are not inclined to dedicate time to watch the entire video.

For the subtitles, we can convert the vtt file to an srt file via:

ffmpeg -i 'Ethan Rosenthal and the M1 misadventure -  Ethan Rosenthal [I4wkCSd7iMM].en.vtt' out.srt

Now, out.srt will be especially ugly due to the AI-generated captioning. Open it up and you’ll see what I mean. One way to turn that SRT into something more easily readable is:

rg -v '^[[:digit:]]+$|^[[:digit:]]{2}:|^$' out.srt | tr -d '\r' | rg -N "\S" | uniq 

which:

  • filters out the SRT metadata lines

  • removes carriage returns

  • removes blank lines

  • removes consecutive, repeating lines which are due to some timestamp/display-related redundancies
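
If ripgrep isn't installed, the same four steps can be approximated with stock grep, tr, and uniq. This is a sketch rather than a drop-in replacement: srt_to_text is a made-up name, it strips carriage returns first so the patterns see clean line endings, and it will also drop any caption line that is nothing but digits:

```shell
# srt_to_text: hypothetical helper; reads SRT on stdin, writes caption text.
srt_to_text() {
  tr -d '\r' |                                    # remove carriage returns
    grep -Ev '^[[:digit:]]+$|^[[:digit:]]{2}:|^[[:space:]]*$' |
                                                  # drop cue numbers, timestamps, blank lines
    uniq                                          # collapse consecutive repeats
}

# usage: srt_to_text < out.srt > out.txt
```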

To turn the talk into something you can listen to vs. watch, it’s just another basic incantation:

ffmpeg -i 'Ethan Rosenthal and the M1 misadventure -  Ethan Rosenthal [I4wkCSd7iMM].webm' out.mp3
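
That form leaves every encoding choice to ffmpeg's defaults. A slightly more explicit sketch follows, assuming your ffmpeg build includes the libmp3lame encoder. The name to_mp3 is a hypothetical helper, and -q:a 4 is just a middle-of-the-road VBR quality pick, not a recommendation from the ffmpeg docs:

```shell
# to_mp3: hypothetical helper; writes an .mp3 next to each input file.
to_mp3() {
  for f in "$@"; do
    # -nostdin: don't read the terminal, so this is safe inside loops
    # -vn: ignore the video stream; ${f%.*} strips the last extension
    ffmpeg -nostdin -i "$f" -vn -codec:a libmp3lame -q:a 4 "${f%.*}.mp3"
  done
}

# usage: to_mp3 *.webm
```

Note that yt-dlp can also hand back audio directly via -x --audio-format mp3, which runs ffmpeg under the hood.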

The same wiki as above has some dope ffmpeg content as well.

FIN

Hopefully, today’s drop helped a few folks optimize some workflows. Drop links in the comments if you have better/additional ways of performing any of the tasks we’ve covered today. ☮
