

Discover more from hrbrmstr's Daily Drop
Archiving URLs
Fundamentally, my job involves studying the internet. At my soon-to-be former employer, I essentially ended up counting IP addresses based on certain classification parameters. That task relies heavily on web scraping, since the content at those IP addresses is used in many classification contexts.
There is, however, a much broader universe of web archiving. I've mentioned the ODU Web Science and Digital Libraries Research Group before (and will again!), as they are a solid bunch to stalk to keep up with the cutting edge of this scraping space. But there are other great information resources on this subject that have weathered time quite well.
One of them is Archiving URLs by Gwern Branwen, a resource about "archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources." The site has been around since 2011 (!), with a recent(-ish) update in 2019.
Gwern is a prolific writer across many subjects, but this "how to" resource on web archiving, especially when it comes to detecting and preventing link rot, is pure bookmarkable gold.
"What's link rot?", you say?
Links on the Internet last forever or a year, whichever comes first. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, and compounding.
To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.
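The two core steps of that strategy — detect dead links, and pre-emptively push live ones to the Internet Archive — can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not Gwern's actual daemon; the only external interface it uses is the Wayback Machine's public "Save Page Now" endpoint (`https://web.archive.org/save/<url>`).

```python
# Sketch of link-rot handling: (1) check whether a URL still resolves,
# (2) ask the Wayback Machine to snapshot a URL that is still alive.
# This is illustrative only, not Gwern's actual archival daemon.
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError


def is_alive(url: str, timeout: float = 10.0) -> bool:
    """True if the URL still answers with a non-error HTTP status."""
    req = Request(url, method="HEAD", headers={"User-Agent": "link-checker"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (HTTPError, URLError, OSError):
        return False


def archive(url: str, timeout: float = 30.0) -> str:
    """Request a Wayback Machine snapshot; returns the snapshot's URL."""
    req = Request(f"https://web.archive.org/save/{url}",
                  headers={"User-Agent": "link-archiver"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.url  # "Save Page Now" redirects to the archived copy
```

A periodic job would then walk your browser history or site's outbound links, calling `archive()` on anything alive and flagging anything that isn't — the same loop Gwern runs with `linkchecker` and his daemon.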
It covers quite a bit of ground:
Link rot
Detection
Prevention
Remote caching
Local caching
Caching Proxy
Batch job downloads
Daemon (setting up one)
Cryptographic timestamping local archives
Resource consumption
URL sources
Browser history
Document links
Website spidering
Reacting to broken links
It also has a great section with useful external links.
For anyone using the internet for long-term research (which likely means most readers of these newsletter editions), Gwern's definition of the problem space and tool guidance should help you ensure the integrity of referenced information resources.
CopyChar
Even for those of us who may have oddly long-term and accurate recall for a diverse array of information, it is difficult to justify dedicating little grey cells to memorizing minutiae that are quickly and easily referenced.
CopyChar (a.k.a. "⌘+C") is "a basic app that allows you to find and copy special characters to your clipboard. Click or tap on a character and it will be copied to your clipboard."
Sure, on modern macOS systems we can use ⌃+⌘+space to bring up an emoji picker, but what if you're a macOS user who finds yourself (horror!!) at a Windows system, Chromebook, or (shudder!!) Linux desktop? Those other (albeit, lesser) operating systems have their own key-based incantations to bring up similar pickers, but it's likely easier to remember copychar.cc, where you have a platform-agnostic way to do the same thing.
The interface is very utilitarian, and you can even copy the HTML Entity strings as well as the special character.
For example, "Broken Circle With Northwest Arrow" is U+238B, &#x238B;, or just ⎋.
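For the curious, the same lookup CopyChar performs is a one-liner in most languages — here's a quick Python illustration (not anything the site itself runs) converting between the code point, the HTML entity string, and the character:

```python
# Convert between a Unicode code point, its numeric HTML entity, and the
# character itself, using U+238B (Broken Circle With Northwest Arrow).
import html

ch = chr(0x238B)                    # the escape-key glyph '⎋'
codepoint = f"U+{ord(ch):04X}"      # 'U+238B'
entity = f"&#x{ord(ch):X};"         # '&#x238B;' — numeric HTML entity
assert html.unescape(entity) == ch  # the entity round-trips to the character
print(codepoint, entity, ch)
```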
I think it's a good site to remember (the URL is short and memorable), bookmark, and make a habit of using if you're not into the memory game.
Whoa[.css]
I fancy a bit of digital whimsy now and again, perhaps due to the nature of the internet back in the days of the horrendous animated GIFs you were introduced to in a previous edition.
Whoa.css is a website that lets you experiment with "animations for eccentric developers". The site is just an interactive shell for the single CSS file that makes it all work.
It even has a GitHub repository in the event you want to help create even more potentially seizure-inducing animations.
FIN
I hope these resources let you have some fun, save you some time, and help make managing information easier! Remember to stay kind in the comments. ☮
2022-05-09.01
“Links on the Internet last forever or a year, whichever comes first.” What a great opening line!
The archiving resources are super helpful—I'm good about including the date of when I accessed a resource, but that's basically like saying “worked on my machine”.