

I'm goofing around with some Mastodon and other feeds, so I decided to check the market in command line XML parsing tools.
It turns out, many (most?) of the “modern” ones written in Go or Node.js kind of suck. They either fail to process XML due to non-existent errors, force you to unnecessarily deal with namespaces, or do not process all target nodes. And, don’t even get me started about Rusty XPath CLI tools.
So, I turned my gaze back to three old XPath pals to [re]introduce them to y'all.
For all the ones I looked at (including four that failed to work, which I won't mention today), I had them process the feed to this Substack newsletter, and asked each to spit out the //item/title text.
All three produced this:
Drop #149 (2022-12-05): 🤖 Outsourced Edition
Drop #148 (2022-12-02): Weekend Project Edition
Drop #147 (2022-12-01): Quriosities
Drop #146 (2022-11-29): Scraper Capers
Drop #145 (2022-11-29): HTTP RIGHT NOW!
Drop #144 (2022-11-28): Next-Level JavaScript & CSS
Drop #143 (2022-11-22): 🦃 Drop!
Drop #142 (2022-11-23): Data/Vis/Git Toolbox Tidbits
Drop #141 (2022-11-22): We shall build a tower so tall, we will mine the very stars themselves!
Drop #140 (2022-11-21): Arc You Ready For A New Browser?
Drop #139 (2022-11-17): Random Time Sink Drops
Drop #138 (2022-11-16): Delightful Surprises In Small Packages
Drop #137 (2022-11-15): Quick Drop → Finding Needles in Your Twitter Archive Haystack
Drop #136 (2022-11-14): Knowledge Surfing & GTD in a Sea of 8B Humans
Drop #135 (2022-11-11): 🛠️ Weekend Project Edition
Drop #134 (2022-11-10): 🔥 Down The 🐦 House
Not-So-Daily This Week
Join my new subscriber chat!
Drop #133 (2022-11-04): Weekend Project FedIratEdition 🐘
Drop #132 (2022-11-03): War: What Is It Good For?
Let’s begin!
xmllint
The xmllint program parses one or more XML files, specified on the command line or from standard input. It prints various types of output, depending upon the options selected, and is useful for detecting errors both in XML code and in the XML parser itself.
If you do any Python or R coding you likely have access to xmllint, as it usually comes along for the ride with libxml[2]. This tool has been around a long time and it sports one of the most refined XPath trekking experiences out there.
Basic usage is very straightforward:
$ cat /tmp/feed.xml | xmllint --nocdata --xpath "//item/title/text()" -
There are scads of command line options which you can read through via xmllint --help. No, seriously, look at that help output. So. Many. Options.
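To poke at xmllint without a real feed on hand, a tiny stand-in file is enough. This is only a sketch: it assumes xmllint is on your PATH, and the file name /tmp/sample-feed.xml and its two items are made up for illustration. It uses the XPath 1.0 functions count() and string() with the same --xpath option shown above.

```shell
# Hypothetical two-item feed, made up for illustration.
cat > /tmp/sample-feed.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title><![CDATA[First post]]></title></item>
    <item><title><![CDATA[Second post]]></title></item>
  </channel>
</rss>
EOF

# count() yields a number, so xmllint prints a single value.
xmllint --xpath "count(//item/title)" /tmp/sample-feed.xml

# string() yields the text of the first node the expression selects.
xmllint --xpath "string(//item[1]/title)" /tmp/sample-feed.xml
```

On the sample file these should print 2 and First post. Asking for a number or a string rather than a node set is handy in scripts, since there is no markup to strip from the output.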
xidel
Xidel is an equally elderly command line tool that does tons more than just filter and print XPath. It can download and extract data from HTML/XML pages and handle JSON API wrangling.
It's also written in Pascal. O_o
Check out the Monroney Sticker on this thing:
Extract expressions:
CSS 3 Selectors: to extract elements unchanged
XPath 3.0: to extract values and calculate things with them.
XQuery 3.0: to create new documents from the extracted values and to build Turing-complete scripts.
Pattern matching: to extract several expressions in an easy way using an annotated version of the input page for pattern-matching.
XPath 2.0/XQuery 1.0: compatibility mode for old XPath/XQuery versions.
JSONiq: to work with JSON APIs (deprecated by XPath 3.1)
Following:
HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies.
Links: It can follow (all) links on a page, meta refreshs, or any extracted value.
HTML Forms: It can fill in arbitrary data in the input elements and submit the form.
Arbitrary HTTP requests: In any query, you can call a function to make other requests.
Output formats:
Adhoc: just prints the data in a human-readable format.
XML: encodes the data as XML.
HTML: encodes the data as HTML.
JSON: encodes the data as JSON.
bash/cmd: exports the data as shell variables.
fn:serialize: implements the W3C XQuery Serialization standard.
Connections: HTTP / HTTPS as well as local files or stdin.
It has equally straightforward usage, but — like its xmllint cousin — you can go super deep/complex when needed.
$ cat /tmp/feed.xml | xidel - --silent --extract //item/title
You really should spend some minutes with the output of xidel --help to really get a sense for what this can do, and it might help reduce tooling for your command line-oriented lambdas.
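As a small taste of the output formats on that sticker, here is a hedged sketch that assumes xidel is installed and that your version accepts --output-format=json-wrapped (check xidel --help to confirm). The file /tmp/xidel-sample.xml and its contents are made up for illustration.

```shell
# Hypothetical two-item feed, made up for illustration.
cat > /tmp/xidel-sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
EOF

# Same kind of query as the feed example, but asking for JSON-wrapped output
# (assumes this xidel build supports the json-wrapped format).
xidel /tmp/xidel-sample.xml --silent --extract '//item/title' --output-format=json-wrapped

# XPath functions work in --extract as well; this counts the items.
xidel /tmp/xidel-sample.xml --silent --extract 'count(//item)'
```

The JSON form is the one to reach for when the next stage of a pipeline expects structured input rather than one title per line.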
xpath (macOS/Perl)
This ships-with-macOS standard component is powered by Perl's XML::XPath module and takes any number of XPath pointers and tries to apply them to each XML document given on the command line or standard input.
When multiple queries are given, the result of each query is used as the context for the next one, and only the result of the final query is output. The context of the first query is always the root of the current document.
$ cat /tmp/feed.xml | xpath -q -e "//item/title/text()"
You can get it for any operating system, and the command line options are quite brief, making this the least complex of our three featured utilities.
FIN
It's remarkable that the “state of the art”, most useful tools in XML/XPathing are each ~20 years old.
BONUS DROP: Run a virtual machine inside ChatGPT. ☮