hrbrmstr's Daily Drop

Drop #123 (2022-10-20): Decapitated

dailyfinds.hrbrmstr.dev

Browserless; Headless Recorder; shot-scraper

boB Rudis
Oct 20, 2022

There may not be any horsemen in this edition, but plenty of other headless wonders await your attention. I can lead you to these resources — and you may eagerly delve into each one; but, as the poem says, "What happens after is up to you."

Browserless

Even though the responsibilities and mission of my current role have little to do with reaching out and touching web-based content, I still poke at web scraping for side projects, data archival, and fun/mischief. Readers embedded in the R ecosystem may be familiar with my [now defunct due to a dependency lapse by another package author] {splashr} package, an R wrapper around ScrapingHub's standalone/self-installed 'Splash' scraping service. One of the nice features of Splash was a robust REST API that enabled quick orchestration of simple tasks — such as taking screenshots or scraping content from javascript-created page resources — along with the ability to script browser actions via a quaint Lua API.

Alas, as noted, {splashr} is no more (on CRAN), and the world-wide web is getting complex enough that the cute Qt WebKit browser embedded in Splash just doesn't cut it for some real-world tasks. Sure, you could just make do with headless Chrome and be on your way, but you don't have to settle for such a pedestrian experience.

The folks at Browserless have made one of their components — 'browserless' Chrome — freely available for all to use. This self-hosted/managed service "allows for remote clients to connect, drive, and execute headless work; all inside of docker. It offers first-class integrations for puppeteer, playwright, selenium's webdriver, and a slew of handy REST APIs for doing more common work. On top of all that, it takes care of other common issues such as missing system-fonts, missing external libraries, and performance improvements. We even handle edge-cases like downloading files, managing sessions, and have a fully-fledged documentation site."

It's everything Splash was and more, including being susceptible to all Chrome vulnerabilities, of which there are regularly legions. (I do somewhat miss the simple Lua scripting.)

Like all the cool kids (or, should I say, k8s) these days, you can get up and running quickly with Docker:

docker run -p 3000:3000 browserless/chrome

See this section if you're on fancy Apple Silicon or super cool Linux arm64 boxes.

Then get into the weeds with the robust documentation provided by the Browserless folks.

They can give this tech away because anyone serious about web scraping at scale knows how hard it is and would rather some other org deal with service uptime, parallel operations, and abuse complaints.

Like Splash, you can use it interactively, as shown in the section header image, or, you can hit the REST API:
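A minimal sketch of such a REST call, in Python with only the standard library. It assumes the container from the docker command above is listening on localhost:3000, and that the `/content` and `/screenshot` endpoints accept a JSON body with a `url` key (plus an optional `options` object for screenshots), which is how I remember the Browserless docs describing them. Check the docs for your version before leaning on it. The helper names `build_request` and `fetch` are made up for this example.

```python
import json
import urllib.request

# Assumption: Browserless was started with
#   docker run -p 3000:3000 browserless/chrome
BROWSERLESS = "http://localhost:3000"


def build_request(endpoint: str, target_url: str, **extra) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a Browserless REST endpoint.

    `endpoint` is e.g. "content" or "screenshot" (assumed endpoint names);
    `extra` holds any additional top-level JSON keys, such as `options`.
    """
    payload = {"url": target_url, **extra}
    return urllib.request.Request(
        f"{BROWSERLESS}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(req: urllib.request.Request, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body (HTML, PNG bytes, ...)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

With the container running, `fetch(build_request("content", "https://example.com/"))` should hand back the page's HTML after JavaScript has run, and `fetch(build_request("screenshot", "https://example.com/", options={"fullPage": True}))` should return image bytes you can write straight to a file.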

The next time you're stuck in a JavaScript-rendering bind, or just need more power than basic web scraping tools afford, make sure to give Browserless a go.

NOTE: While I've focused solely on "scraping", Browserless is a fine environment to use in website testing setups, especially since it supports webdriver and Selenium out of the gate.

Headless Recorder

Headless recorder demo

Browserless (mentioned in the previous section) is great! But, to use the full power of the service, it helps to know a bit about puppeteer or playwright (links above). If you're new to those ecosystems, they are "one more thing you need to learn." Those "things" can pile up quickly on new projects. Thankfully, all you really need to learn to use those frameworks is something you already know how to do: waste time browsing sites on the internets. I say that because while you go about your feline-slinging and twitter ranting, Headless Recorder will keep an event log of everything you do, and give you back something you can use with Browserless, or directly in puppeteer/playwright orchestration/testing scripts.

The documentation should get you up and running pretty quickly, and I suspect you'll become competent in puppeteer/playwright in no time after some initial forays with the extension.

shot-scraper

black Canon EOS camera
Photo by Sara Kurfeß on Unsplash

I'd be remiss in my crypt-keeper curating duties if I did not include a link to Simon Willison's (@simonw) shot-scraper [GH]. Originally built to be just another command line screenshot taker, shot-scraper is a simple-yet-robust tool for capturing js-enabled site content in many forms, including bitmaps.

The tool works with firefox, webkit, chrome, and chrome-beta. Yes, that means you have to roll your sleeves up a bit higher to start working with it, but not too much:

pip3 install shot-scraper # the tool itself
shot-scraper install # Chromium via Playwright
shot-scraper https://dailyfinds.hrbrmstr.dev/archive

That last one generated this long image.

As the docs note, you can do a great deal more, including dumping the entire DOM tree and scripting page actions.
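If you'd rather drive the tool from a script than from the shell, here is a rough sketch of a thin Python wrapper that assembles shot-scraper command lines. The function names (`shot_cmd`, `js_cmd`, `run`) are invented for this example. The flags used (`-o`, `-s`, `--width`, `--height`, `--wait`) and the `javascript` subcommand are the ones I recall from the shot-scraper docs, so confirm them with `shot-scraper --help` for your installed version.

```python
import shutil
import subprocess


def shot_cmd(url, output=None, selector=None, width=None, height=None, wait_ms=None):
    """Assemble a screenshot command as an argument list (nothing is executed)."""
    cmd = ["shot-scraper", url]
    if output:
        cmd += ["-o", output]            # file to write the image to
    if selector:
        cmd += ["-s", selector]          # capture only this CSS selector
    if width:
        cmd += ["--width", str(width)]   # viewport width in pixels
    if height:
        cmd += ["--height", str(height)] # viewport height in pixels
    if wait_ms:
        cmd += ["--wait", str(wait_ms)]  # milliseconds to pause before the shot
    return cmd


def js_cmd(url, expression):
    """Assemble a `shot-scraper javascript` call, which prints the result as JSON."""
    return ["shot-scraper", "javascript", url, expression]


def run(cmd):
    """Run the command if shot-scraper is on PATH; return stdout, or None if it is not."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For instance, `run(js_cmd("https://example.com/", "document.documentElement.outerHTML"))` is one way to dump the rendered DOM, and `run(shot_cmd("https://example.com/", output="h1.png", selector="h1"))` grabs just one element.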

A real-world use of it outside of Simon's own ecosystem is news-homepages, a resource by @palewire, dropped in a previous installment:

hrbrmstr's Daily Drop
2022-08-02.01

The Twitter bot itself is fully powered by shot-scraper.

There are instructions for using shot-scraper as a GH Action, and it's a handy tool to have in the toolbox.

FIN

Just a scant few days until spooky edition themes come to a welcome end! ☮

© 2023 boB Rudis