

There may not be any horsemen in this edition, but plenty of other headless wonders await your attention. I can lead you to these resources — and you may eagerly delve into each one; but, as the poem says, "What happens after is up to you."
Browserless
Even though the responsibilities and mission of my current role have little to do with reaching out and touching web-based content, I still poke at web scraping for side projects, data archival, and fun/mischief. Readers embedded in the R ecosystem may be familiar with my [now defunct due to a dependency lapse by another package author] {splashr} package, an R wrapper around ScrapingHub's standalone/self-installed 'Splash' scraping service. One of the nice features of Splash was a robust REST API that enabled quick orchestration of simple tasks — such as taking screenshots or scraping content from javascript-created page resources — along with the ability to script browser actions via a quaint Lua API.
Alas, as noted, {splashr} is no more (on CRAN), and the world-wide web is getting complex enough that the cute Qt WebKit browser embedded in Splash just doesn't cut it for some real-world tasks. Sure, you could just make do with headless Chrome and be on your way, but you don't have to settle for such a pedestrian experience.
The folks at Browserless have made one of their components — 'browserless' Chrome — freely available for all to use. This self-hosted/managed service "allows for remote clients to connect, drive, and execute headless work; all inside of docker. It offers first-class integrations for puppeteer, playwright, selenium's webdriver, and a slew of handy REST APIs for doing more common work. On top of all that, it takes care of other common issues such as missing system-fonts, missing external libraries, and performance improvements. We even handle edge-cases like downloading files, managing sessions, and have a fully-fledged documentation site."
It's everything Splash was and more, including being susceptible to all Chrome vulnerabilities, of which there are regularly legion. (I do somewhat miss the simple Lua scripting.)
Like all the cool kids (or, should I say, k8s) these days, you can get up and running quickly with Docker:
docker run -p 3000:3000 browserless/chrome
(See this section if you're on fancy Apple Silicon or super cool Linux arm64 boxes.)
Then get into the weeds with the robust documentation provided by the Browserless folks.
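For anything longer-lived than a quick test, it can help to run the container detached and cap what it is allowed to do. This is a minimal sketch, assuming the `MAX_CONCURRENT_SESSIONS` and `CONNECTION_TIMEOUT` environment variables behave as the Browserless docs describe for the `browserless/chrome` image. The helper names, the container name, and the specific values are my own arbitrary choices, not recommendations:

```shell
# Hypothetical helper: start Browserless in the background on port 3000.
# Assumed settings: at most 5 parallel browser sessions, and a 60-second
# (60000 ms) limit on each session.
start_browserless() {
  docker run --detach --rm \
    --name browserless \
    --publish 3000:3000 \
    --env MAX_CONCURRENT_SESSIONS=5 \
    --env CONNECTION_TIMEOUT=60000 \
    browserless/chrome
}

# Hypothetical helper: stop the container (it is removed on exit via --rm).
stop_browserless() {
  docker stop browserless
}
```

Nothing runs until you call `start_browserless`; `stop_browserless` tears it down again.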
They can give this tech away because anyone serious about web scraping at scale knows how hard it is and would rather some other org deal with service uptime, parallel operations, and abuse complaints.
Like Splash, you can use it interactively, as shown in the section header image, or you can hit the REST API.
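Here is a rough sketch of what hitting the REST API can look like with curl, assuming the container from the `docker run` command above is listening on `localhost:3000` and that the `/content` and `/screenshot` endpoints accept JSON bodies of the shape shown in the Browserless docs. The function names and the `BROWSERLESS_URL` variable are hypothetical conveniences, and the payload helpers do no JSON escaping, so feed them plain URLs only:

```shell
# Assumption: Browserless is reachable here; override via the environment.
BROWSERLESS_URL="${BROWSERLESS_URL:-http://localhost:3000}"

# Build the JSON body for POST /content (no escaping: plain URLs only).
content_payload() {
  printf '{"url": "%s"}' "$1"
}

# Build the JSON body for POST /screenshot, asking for a full-page PNG.
screenshot_payload() {
  printf '{"url": "%s", "options": {"fullPage": true, "type": "png"}}' "$1"
}

# Print the HTML of a page after Chrome has rendered it.
rendered_html() {
  curl --silent --fail -X POST "${BROWSERLESS_URL}/content" \
    -H 'Content-Type: application/json' \
    -d "$(content_payload "$1")"
}

# Save a full-page screenshot of a page ($1) to a file ($2).
screenshot() {
  curl --silent --fail -X POST "${BROWSERLESS_URL}/screenshot" \
    -H 'Content-Type: application/json' \
    -d "$(screenshot_payload "$1")" \
    --output "$2"
}
```

With the service running, `rendered_html https://example.com > page.html` and `screenshot https://example.com page.png` would be typical calls.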
The next time you're stuck in a javascript-rendering bind, or just need more power than basic web scraping tools afford, make sure to give Browserless a go.
NOTE: While I've focused solely on "scraping", Browserless is a fine environment to use in website testing setups, especially since it supports webdriver and Selenium out of the gate.
Headless Recorder
Browserless (mentioned in the previous section) is great! But, to use the full power of the service, it helps to know a bit about puppeteer or playwright (links above). If you're new to those ecosystems, they are "one more thing you need to learn." Those "things" can pile up quickly on new projects. Thankfully, all you really need to learn to use those frameworks is something you already know how to do: waste time browsing sites on the internets. I say that because while you go about your feline-slinging and twitter ranting, Headless Recorder will keep an event log of everything you do, and give you back something you can use with Browserless, or directly in puppeteer/playwright orchestration/testing scripts.
The documentation should get you up and running pretty quickly, and I suspect you'll become competent in puppeteer/playwright in no time after some initial forays with the extension.
shot-scraper
I'd be remiss in my crypt-keeper curating duties if I did not include a link to Simon Willison's (@simonw) shot-scraper [GH]. Originally built to be just another command line screenshot taker, shot-scraper is a simple-yet-robust tool for capturing js-enabled site content in many forms, including bitmaps.
The tool works with firefox, webkit, chrome, and chrome-beta. Yes, that means you have to roll your sleeves up a bit higher to start working with it, but not too much:
pip3 install shot-scraper # the tool itself
shot-scraper install # Chromium, via Playwright
shot-scraper https://dailyfinds.hrbrmstr.dev/archive
That last one generated this long image.
As the docs note, you can do a great deal more, including dumping the entire DOM tree and scripting page actions.
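A few of those extras, sketched as small wrappers so nothing runs until you call them. The function names are hypothetical; the `javascript` subcommand and the `--selector`, `--width`, `--javascript`, and `--output` options are the ones I am relying on from the shot-scraper docs, so check `shot-scraper --help` against your installed version:

```shell
# Dump the rendered DOM of a page ($1) by evaluating JavaScript in it.
# The result is printed as JSON, so the HTML arrives as one quoted string.
dump_dom() {
  shot-scraper javascript "$1" "document.documentElement.outerHTML"
}

# Capture only the element matching a CSS selector ($2) on a page ($1),
# using a 1280px-wide viewport, and write the image to a file ($3).
shot_element() {
  shot-scraper "$1" --selector "$2" --width 1280 --output "$3"
}

# Run some JavaScript ($2) in the page ($1) before the screenshot is taken,
# e.g. to hide a cookie banner, then write the image to a file ($3).
shot_after_js() {
  shot-scraper "$1" --javascript "$2" --output "$3"
}
```

For example, `shot_element https://dailyfinds.hrbrmstr.dev/archive "h1" heading.png` would capture just the first matching heading, assuming the page has one.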
A real-world use of it outside of Simon's own ecosystem is news-homepages, a resource by @palewire that was dropped in a previous installment.
The Twitter bot itself is fully powered by shot-scraper.
There are instructions for using shot-scraper as a GH Action, and it's a handy tool to have in the toolbox.
FIN
Just a scant few days until spooky edition themes come to a welcome end! ☮