Drop #327 (2023-09-04): Back In The Saddle
Wget2 2.1; gitwatch; Collie
Apologies to all for the unplanned Drop hiatus last week. It was an especially gnarly re-uppance of long covid symptoms, leaving me with little energy or sleep for the week, and needing to devote what little there was to fam & work. The week's rest seems to have helped a tad, so the Drops resume this week! Plus, someone's gotta work on Labor Day, right?
This is an AI-generated summary of today's Drop.
Perplexity has won the “summary” contest, so for a while, it will be what I use for the TL;DR section. The prompt — “The attached file contains a blog post in Markdown format with three main sections of content. I would like a very concise three bullet summary of it. Each bullet should succinctly describe a section and include the link to the primary resource being covered.” will remain unchanged.
Please drop a note if you want me to try any other local or online LLM/GPTs for this section.
Wget2 2.1: A major update to the GNU Wget2 utility, which is designed for non-interactive downloading of files from the internet, now supports ingesting and mirroring sites from sitemaps, improved TLS/SSL and proxy options, and C++ compilation support for libwget. Wget2 2.1 Announcement
gitwatch: A simple bash script that watches a file or folder and automatically commits changes to a git repository, allowing you to focus on the retrieve-transform-save process while gitwatch takes care of history preservation. gitwatch GitHub Repository
Collie: A lightweight, cross-platform, Rust-based RSS reader app built with Tauri, allowing you to subscribe to multiple RSS/Atom feeds, receive real-time notifications for new items, and save items for later reading. Collie GitHub Repository
When it comes to machinating the retrieval of network-accessible content, the venerable curl reigns supreme. While it is extremely versatile and supports a bonkers number of protocols, it can take a minute to absorb the vast number of CLI options. In a way, curl is the “assembly language” of internet content fetchers, more so than most of the more focused or task-specific tools out there.
Wget2 — released in 2021 — is the successor to the original Wget, a longtime popular utility for non-interactive downloading of files from the internet. It was designed and written from scratch, wrapping around the libwget library, which provides the basic functions needed in this super handy web client. The core of said functionality is support for the HTTP and HTTPS protocols, as well as retrieval of content through HTTP(S) proxies. It does quite a bit more, too.
I respect y'all's ability to both tap and read lists, so I'll refrain from stealing the bullets from the README's non-exhaustive list of features section, which succinctly documents Wget2's prowess.
We're dropping it today since it recently received a pretty major update that:
enables it to ingest and mirror sites from sitemaps
has improved TLS/SSL and proxy options
adds C++ compilation support for libwget
Wget2 is also more robust when retrieving content over janky network connections (hello, fellow rural Mainers!), and supports resuming incomplete downloads.
If you need to mirror a site or robustly download a number of files, Wget2 is likely a much better option than scripting a curl solution to do the same.
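As a rough sketch of that “robustly download a number of files” case: the function name fetch_batch and the urls.txt file (one URL per line) are hypothetical, and the retry count is just an illustrative pick. The flags themselves are standard Wget2 options.

```shell
# Fetch every URL listed in a file (one per line).
# NOTE: `fetch_batch` and `urls.txt` are hypothetical names for illustration.
fetch_batch() {
  # --continue    resume partially downloaded files instead of starting over
  # --tries=5     retry each file up to five times on a flaky connection
  # --input-file  read the list of URLs to fetch from the given file
  wget2 --continue --tries=5 --input-file="${1:-urls.txt}"
}

# Usage (needs network access and wget2 on the PATH):
#   fetch_batch urls.txt
```

Doing the same with curl would mean writing the loop, the retry logic, and the resume handling yourself.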
While it pays solid homage to its predecessor, I do miss support for WARC output.
Note: GNU Wget2 is licensed under GPLv3+. Libwget is licensed under LGPLv3+.
I have (more than) a few processes where I tap a site (programmatically) for some content, perhaps transform it in some way, and then output an updated file (or three) to a directory. I both have scripts for this and use some other automation tools we've covered before to do that, and part of each dance is committing the changes to the (usually local) git repository that output is in.
While those processes are backed by some reusable templates, I'd truly like to just focus on the retrieve-transform-save bits and have something else worry about history preservation.
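For a sense of the chore in question, here is a minimal sketch of the kind of commit step such a script might carry. The function name commit_if_changed and the commit message format are illustrative assumptions, not lifted from any real template.

```shell
# Commit everything in the current repository, but only when something changed.
# NOTE: `commit_if_changed` is a hypothetical helper name.
commit_if_changed() {
  # `git status --porcelain` prints one line per changed or untracked path
  # and prints nothing at all when the working tree is clean.
  if [ -n "$(git status --porcelain)" ]; then
    git add --all &&
      git commit --quiet -m "Files updated at $(date '+%Y-%m-%d %H:%M:%S')"
  fi
}

# Usage, at the end of a retrieve-transform-save script:
#   commit_if_changed
```

It is not much code, but it has to be copied into (and kept working in) every one of those processes.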
Enter gitwatch, a spiffy bash script that watches a file or folder and automatically commits changes to a git repository, optionally pushing it to remotes.
While that is a pretty solid summary of what it does (💙 “one thing—well” utilities), we can work through a simple example of how I can preserve change history for CISA's Known Exploited Vulnerabilities Catalog JSON file.
$ sudo apt install inotify-tools
# macOS folk: brew install fswatch coreutils
$ git clone https://github.com/gitwatch/gitwatch.git
$ cd gitwatch
$ sudo install -b gitwatch.sh /usr/local/bin/gitwatch # optional
Now, we'll assume we're in a directory where files will be dropped after processing (or just retrieved). All the commands here assume I'm in that directory and that said directory is also a git repository.
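Let's start watching for changes, telling gitwatch to wait five seconds after it detects one before committing (so a batch of writes has time to finish):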
$ gitwatch -s 5 -m "Files updated at %d" .
Now, let's make a change:
$ wget2 --timestamping --page-requisites \
    "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
The --timestamping option to Wget2 tells it to not waste bandwidth if the file hasn't changed, and --page-requisites says to get everything necessary to recreate the “page”. With that last one turned on, Wget2 will also fetch robots.txt (just like y'all should too when you scrape things), so add robots.txt to the local .gitignore so we don't have to include those in the updates.
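One way to do that without piling up duplicate entries on repeat runs is sketched below. The helper name ignore_once is made up for this example. A .gitignore pattern with no slash in it matches at any depth, so a bare robots.txt entry also covers the copy Wget2 saves under the www.cisa.gov/ directory.

```shell
# Append a pattern to .gitignore only if that exact line is not already there.
# NOTE: `ignore_once` is a hypothetical helper name.
ignore_once() {
  # $1 = pattern to ignore, $2 = ignore file (defaults to ./.gitignore)
  # grep -x matches whole lines only, -F treats the pattern as a literal string.
  grep -qxF -- "$1" "${2:-.gitignore}" 2>/dev/null ||
    printf '%s\n' "$1" >> "${2:-.gitignore}"
}

ignore_once 'robots.txt'
```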
If that Wget2 call resulted in an update, we should see that in the stdout log:
[batman 5816605] Files updated at 2525-07-12 09:30:01
 1 file changed, 10887 insertions(+)
 create mode 100644 www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json
You can run gitwatch in the background with tmux, or use one of the supplied service installers.
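A minimal sketch of the tmux route follows. The session name gitwatch-kev and the wrapper function are arbitrary choices for this example; the gitwatch invocation is the same one used above.

```shell
# Start gitwatch inside a detached tmux session rooted at the given directory.
# NOTE: the function name and the "gitwatch-kev" session name are hypothetical.
start_gitwatch_session() {
  # -d  do not attach to the new session
  # -s  name the session so it is easy to find later
  # -c  start the session in the directory to be watched
  tmux new-session -d -s gitwatch-kev -c "${1:-.}" \
    'gitwatch -s 5 -m "Files updated at %d" .'
}

# Usage:
#   start_gitwatch_session /path/to/repo
#   tmux attach -t gitwatch-kev         # peek at the commit log
#   tmux kill-session -t gitwatch-kev   # stop watching
```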
'Tis a simple, focused tool that may reduce some script and process complexity for you.
The recent and ongoing en💩ification of many major internet platforms has breathed new life into RSS. While I rely heavily on many of the paid features of Inoreader for personal and professional content consumption, most folks likely just need a basic reader to keep up on a handful of feeds. We'll introduce more of these over the remainder of H2, but one relatively new, lightweight, and cross-platform GUI one is Collie. This Rust-based, minimal RSS reader app is built with Tauri and enables you to:
subscribe to multiple RSS/Atom feeds to organize your own news feed
receive a real-time notification when a new item is added to the subscribed feed
save the items to read again or later
The section header is the index view of a test run of the tool, and the following is a feed list view:
Content opens in the browser, but this is a great example of a minimal Tauri app that is super hackable. So, if you either want inline content, or you've been wanting to hack on Tauri, this may be a great opportunity.
We may even be doing that this Friday.
It's good to be (on the way to) “back”, and 🤞🏽 those were the last vestiges of the remnants of my spike protein invasion.
Hope all have an astoundingly astonishing week! ☮