

Discover more from hrbrmstr's Daily Drop
2022-04-29.01
Scraping The Heap; Mars Garbage; The Columbo Cinematic Universe; Just One More Thing
We're riffing off of a few previous editions today and covering some fairly wide ground, so let's dig in!
Scraping The Heap
It may be, now, legal (again?) to scrape web pages, but that doesn't mean it is easy to do so. Many sites create a twisty DOM maze that even the most seasoned of scrapers could find daunting when attempting to craft the perfect CSS or XPath selector to yank the content that they are desperately seeking.
Adrian Cooney (@adrian_cooney) articulately laments the situation, but was not content to just sit back and be thwarted by overaggressive anti-web-scraping tech. Thus was born a new and clever way to access desired content without losing your mind: grab the data via Chrome JavaScript runtime heap snapshots.
Chrome's developer tools have a memory heap profiler that shows the memory distribution of a page's JavaScript objects and related DOM nodes. Here's how Adrian turned this data into a new way to scrape information out of sites:
"I plucked out a unique string from the data visible on the web page and took a heap snapshot of the browser's Javascript runtime via Chrome's Dev Tools. A heap snapshot is a raw dump of everything in the web app's memory (or heap). Using the Dev Tools' search function and my unique string, I managed to find a nice, well-structured object containing my string and, adjacently, all the data that my app needed. It was from this point that I focused my energy on automating this process to find and extract the data from the heap snapshot. puppeteer-heap-snapshot is born."
There's an example on Adrian's post where YouTube video metadata is extracted:
puppeteer-heap-snapshot query \
--url https://www.youtube.com/watch\?v\=L_o_O7v1ews \
--properties channelId,viewCount,keywords --no-headless
You can visit the site for the amazingly useful output. I cannot believe how simple this new technique is (to use).
Under the hood, it uses Puppeteer to drive Chrome through its remote debugging API, which is how it eventually gains access to the heap and the content inside it.
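To make the "find the object that holds my properties" step a bit more concrete, here is a minimal Python sketch. It is not Adrian's implementation, and it does not talk to Chrome at all. It walks a tiny, hand-built structure shaped the way I understand V8 heap snapshot files to be laid out (flat "nodes" and "edges" integer arrays plus a "strings" table). The trimmed field lists, the sample values, and the find_objects helper are all assumptions made for illustration; real snapshots carry more fields per node and far more entries.

```python
# Toy stand-in for a ".heapsnapshot" file. ASSUMPTION: the field lists are
# trimmed and every value here is made up; only the general layout is mimicked.
SNAPSHOT = {
    "snapshot": {
        "meta": {
            "node_fields": ["type", "name", "edge_count"],
            "node_types": [["object", "string"]],
            "edge_fields": ["type", "name_or_index", "to_node"],
            "edge_types": [["property", "element"]],
        }
    },
    "strings": ["Object", "channelId", "viewCount", "UC123", "42", "title", "hello"],
    "nodes": [
        0, 0, 2,   # offset 0:  object with two outgoing edges
        1, 3, 0,   # offset 3:  string "UC123"
        1, 4, 0,   # offset 6:  string "42"
        0, 0, 1,   # offset 9:  object with one outgoing edge
        1, 6, 0,   # offset 12: string "hello"
    ],
    "edges": [
        0, 1, 3,   # property "channelId" -> node at offset 3
        0, 2, 6,   # property "viewCount" -> node at offset 6
        0, 5, 12,  # property "title"     -> node at offset 12
    ],
}


def find_objects(snapshot, wanted):
    """Return the string-valued properties of every object that has all `wanted` names."""
    meta = snapshot["snapshot"]["meta"]
    node_fields, edge_fields = meta["node_fields"], meta["edge_fields"]
    n_size, e_size = len(node_fields), len(edge_fields)
    n_type = node_fields.index("type")
    n_name = node_fields.index("name")
    n_edges = node_fields.index("edge_count")
    e_type = edge_fields.index("type")
    e_name = edge_fields.index("name_or_index")
    e_to = edge_fields.index("to_node")
    node_types = meta["node_types"][0]
    edge_types = meta["edge_types"][0]
    nodes, edges, strings = snapshot["nodes"], snapshot["edges"], snapshot["strings"]

    results = []
    edge_pos = 0  # edges are stored in node order, so one running cursor is enough
    for off in range(0, len(nodes), n_size):
        count = nodes[off + n_edges]
        props = {}
        for i in range(count):
            e = edge_pos + i * e_size
            if edge_types[edges[e + e_type]] != "property":
                continue
            target = edges[e + e_to]  # offset of the target node in `nodes`
            if node_types[nodes[target + n_type]] == "string":
                props[strings[edges[e + e_name]]] = strings[nodes[target + n_name]]
        edge_pos += count * e_size
        if node_types[nodes[off + n_type]] == "object" and wanted.issubset(props):
            results.append({name: props[name] for name in sorted(wanted)})
    return results


print(find_objects(SNAPSHOT, {"channelId", "viewCount"}))
```

The sketch only follows property edges that point straight at string nodes, which is enough to show the shape of the lookup; numbers, nested objects, and arrays would need more handling.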
I'll be giving this a go over the weekend on sites that have become increasingly gnarly to pull content from and report back in a future edition.
Mars Garbage
Jupiter's moons have some newsworthy competition this week, though not in a good way.
You may remember the Perseverance mission; but, if not, here it is in NASA's own words:
The Mars Perseverance rover mission is part of NASA's Mars Exploration Program, a long-term effort of robotic exploration of the Red Planet. The Mars Perseverance mission addresses high-priority science goals for Mars exploration, including key questions about the potential for life on Mars. The mission takes the next step by not only seeking signs of habitable conditions on Mars in the ancient past, but also searching for signs of past microbial life itself. The Mars Perseverance rover introduces a drill that can collect core samples of the most promising rocks and soils and set them aside in a "cache" on the surface of Mars.
The mission also provides opportunities to gather knowledge and demonstrate technologies that address the challenges of future human expeditions to Mars. These include testing a method for producing oxygen from the Martian atmosphere, identifying other resources (such as subsurface water), improving landing techniques, and characterizing weather, dust, and other potential environmental conditions that could affect future astronauts living and working on Mars.
The landing maneuver was complex and involved the use of a parachute to have the equipment touch down safely. Well, the bits that were only necessary for said landing are now officially litter.
The image of this mess was captured by NASA's Ingenuity Mars Helicopter on its 26th flight!
NASA describes it a bit more tactfully:
"The parachute and cone-shaped backshell protected the rover during its fiery descent toward the Martian surface on Feb. 18, 2021. Engineers working on the Mars Sample Return program requested images be taken of the components from an aerial perspective because they may provide insight into the components' performance during the rover's entry, descent, and landing."
I wonder what Elon charges for interest on unpaid littering fines.
The Columbo Cinematic Universe
I am a YUGE Columbo fan and recently discovered the "Full Cast And Crew" (@fullcastandcrew) podcast, by Jason Cilo and Chris Kipiniak of Meetinghouse Productions. In each pod, they take a deep dive on a movie (which can include made-for-TV movies, like Columbo episodes), "coaxing forth meaning, dancing with its animal spirits, divining the divine fire that burns at its very heart, and reading, like runes, the unlikely connections, weird trivia, and strange quotes to determine where this film sits in the grand cave painting that is human culture."
I've burned through a few episodes and caught wind of what they call the "Columbo Cinematic Universe", where they prove (in each pod) that every movie or TV series is just one more thing separated from the rumpled and beloved Lieutenant.
They have an entire episode dedicated to Columbo, and dig into "Try and Catch Me" (a truly great one) with satisfying skill and expertise.
The pods are especially great for long walks.
Also: CCU >>> MCU
Just One More Thing
Yesterday, I mentioned the relatively new and innovative terminal app Warp, and it caught the eye of Noam Ross (@noamross), a most excellent human and fellow data scientist.
Noam posited some concerns about the cloud connectivity of Warp, and I do share a bit of his consternation, since terminal apps can provide access to systems deep inside organizations, and would be a great way to exfiltrate information. Their business model is built around making terminal use more productive for teams, and security is going to be a hurdle for them as they start to engage cautious corporations.
I'm not using Warp to access sensitive work systems, just playing with it in a personal context. But, I have all the tools necessary to monitor how often it phones home for various things and how large the payloads are, so if I catch it doing anything I think is nefarious, I'll follow up again with a cautionary tale.
One feature Noam (and I) liked was the use of OpenAI, which enables one to type in natural language and get a (mostly) working command line. Noam asked if anything else had this feature and he found zsh_codex, which looks all kinds of cool, especially if you are regularly stumped as to what CLI tool options to use.
(Noam: as promised, I'm sending you 50% of the $0.00 I made on this edition.)
FIN
We all made it to Friday! Congrats!
If you engage in the comments, remember that the only rule is kindness. ☮
I've been eagerly awaiting your Warp takes as I tinker with it myself (those workflows are so nice for someone, like me, who's not a command line pro), and missed Noam's comments. So glad you're doing this, esp as I back off from Twitter more and more.