Drop #166 (2022-12-26): What We Left Behind: Part 1

htmlq

Thanks to the 💣🌀 and generator/⚡️🔌🪫 woes at the Maine compound, we’re staying a tad longer than expected at my grandson’s abode. Which means .1 continues to compel most of my personal compute cycles. As such, we’ll also continue the “quick drops” motif (until the schedule is back to normal), but with a twist: single item drops that failed to make it into any of the previous 165 editions with some expanded commentary (for some definition of “expanded”).

htmlq


I may have mentioned this before, but I tend to prototype new coding adventures either in R or at the command line. Iterating on an idea is usually faster in interpreted languages, and both bash/zsh and R are expressive, readable, easily extended, and succinct. However, thanks to many sadly biased OS distributions, fairly antiquated CRAN rules, and a lack of implicit safety guardrails, R isn’t the best choice to birth a new CLI utility into the metaverse. Most folks won’t have R installed (unlike Python and some other scripting languages); if they do, it probably isn’t the latest version (and they may or may not have the dependent packages installed). Furthermore, you can’t easily ship an executable R script with a package that’s destined for CRAN, then auto-link said script to an accessible $PATH on package install (which you can, dangerously, do with, say, Python or npm). I’m convinced more folks would use R on the daily if said affordances were made. But, I digress.

Earlier this year, I set out to make utilities that work with the code of Observable notebooks. This meant spending some time in browser Developer Tools looking for where the Observable JavaScript code and metadata lived (it’s in an in-page JSON <script> block). Then, spending a bunch more time in R prototyping how to turn all that JSON crunchy goodness into Quarto documents with {ojs} blocks.

This year, I finally started using Observable Collections to organize the 30-Day Map Challenge and Advent of Code entries. While said collections are handy, I wanted some way to archive them to Quarto docs without having to use my Chrome extension on each one or run the CLI utils I built on each URL. I just wanted to give it the collection URL and have it do the rest. I ended up coding a tiny shell script, which wrapped the Rust version of the single-notebook archiver.

That was a long setup for today’s drop of htmlq, which is “like jq, but for HTML[, and] uses CSS selectors to extract bits of content from HTML files”.

An earlier drop noted a similar utility, but I set a rule to only use 2-letter CLI utils for that post.

Of all the HTML CLI content extraction tools I’ve played with, htmlq is the hands-down winner. It has:

  • the best CSS selector support of all the ones I’ve tried

  • a neat --remove-nodes option that can be used multiple times and which helps in customizing the output

  • support for extracting tag attributes (--attribute <attribute>)

  • pretty-printed HTML output (--pretty)

  • tag/node text-only extraction (--text)

  • the ability to ignore superfluous whitespace (--ignore-whitespace)

  • the concept of a “base URL” for links, which you can specify (--base) or have it detect from the document’s <base> tag (--detect-base)
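
A few of these options can be tried offline against a made-up HTML fragment. This is only a minimal sketch, assuming htmlq is installed and on the $PATH; the sample markup and the function names (sample_page, post_dates, post_titles) are hypothetical and exist purely for illustration. The selector is given before --remove-nodes because that option accepts more than one value.

```shell
#!/bin/sh
# Hypothetical sample markup; not taken from any real page.
sample_page() {
  cat <<'EOF'
<html><body>
  <article>
    <h2>First post <span class="badge">new</span></h2>
    <time datetime="2022-12-24T10:00:00Z">Dec 24</time>
  </article>
  <article>
    <h2>Second post</h2>
    <time datetime="2022-12-23T10:00:00Z">Dec 23</time>
  </article>
</body></html>
EOF
}

# --attribute: print only the datetime attribute of every <time> element.
post_dates() {
  sample_page | htmlq --attribute datetime time
}

# --remove-nodes drops the badge before output, --text keeps only text nodes,
# and --ignore-whitespace skips text nodes that are entirely whitespace.
post_titles() {
  sample_page | htmlq h2 --text --ignore-whitespace --remove-nodes .badge
}

# Only run the demo when htmlq is actually available.
if command -v htmlq >/dev/null 2>&1; then
  post_dates
  post_titles
fi
```

Swapping --text for --pretty in post_titles should show the remaining markup instead of just the text, which is a quick way to check what --remove-nodes actually took out.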

Here’s the output of just the htmlq portion of that coll2quarto script:

curl --silent "https://observablehq.com/collection/@hrbrmstr/2022-30-day-map-challenge" |\
  htmlq --text "script#__NEXT_DATA__" | jq | head -10
{
  "props": {
    "pageProps": {
      "collection": {
        "id": "b4e853ad10f18a38",
        "type": "public",
        "slug": "2022-30-day-map-challenge",
        "title": "2022 30-Day Map Challenge",
        "description": "Collecting any/all Observable Entries for the 2022 30-Day Map Challenge",
        "update_time": "2022-11-15T13:33:29.539Z",

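Once the JSON is out of the page, jq can narrow it down to individual fields. The sketch below assumes jq is installed; its sample payload mirrors only the fields visible in the truncated output above (everything else in the real payload is unknown here), and the function names are hypothetical.

```shell
#!/bin/sh
# Sample payload mirroring only the fields visible in the truncated output.
sample_next_data() {
  cat <<'EOF'
{"props":{"pageProps":{"collection":{
  "id":"b4e853ad10f18a38",
  "type":"public",
  "slug":"2022-30-day-map-challenge",
  "title":"2022 30-Day Map Challenge"
}}}}
EOF
}

# Read JSON on stdin and print "slug<TAB>title" for the collection.
collection_summary() {
  jq -r '.props.pageProps.collection | [.slug, .title] | @tsv'
}

# Only run the demo when jq is actually available.
if command -v jq >/dev/null 2>&1; then
  sample_next_data | collection_summary
fi
```

In a real pipeline, collection_summary would take the place of the bare jq call, reading whatever htmlq --text emits.
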
htmlq is also handy for simply reformatting HTML into something more human-usable (with --pretty), or for extracting “metadata”, such as the date of each post in the initial archive view of this newsletter:

curl --silent https://dailyfinds.hrbrmstr.dev/archive | \
  htmlq --attribute "datetime" time | \
  sed -e 's/T.*//'
2022-12-24
2022-12-23
2022-12-22
2022-12-21
2022-12-20
2022-12-19
2022-12-18
2022-12-17
2022-12-16
2022-12-15
2022-12-14
2022-12-13
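
The sed step in that pipeline just deletes everything from the first "T" onward, leaving the date part of each timestamp. Here is a small offline sketch of that step, plus one thing that might be done with the result; the timestamps are made up, and date_only and posts_per_month are hypothetical helper names.

```shell
#!/bin/sh
# Keep only the date part of ISO 8601 timestamps read from stdin by deleting
# everything from the first "T" to the end of each line.
date_only() {
  sed -e 's/T.*//'
}

# Count entries per month (YYYY-MM) from a stream of YYYY-MM-DD dates.
posts_per_month() {
  cut -c 1-7 | sort | uniq -c
}

# Made-up timestamps in the same shape as a datetime attribute.
sample_timestamps() {
  printf '%s\n' \
    '2022-12-24T12:30:00.000Z' \
    '2022-12-23T09:15:00.000Z' \
    '2022-11-30T18:00:00.000Z'
}

sample_timestamps | date_only | posts_per_month
```

Both helpers read stdin, so they can be appended to the curl and htmlq pipeline in place of the inline sed call.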

It does just a few things, but does them very well.[1]

FIN

This tool has a perma-home in my CLI toolbox. ☮

[1] Shamelessly stolen title from https://www.imdb.com/title/tt6332276/
