

Interest in web scraping has grown substantially over the past decade, yet most folks needing to mine the web for data still default to using ancient Python or Java-based technology to get the job done. There are modern alternatives, some with highly specialized functionality, three of which we'll cover today.
Katana
This nascent entry from the fine folks at Project Discovery excels at acquiring a list of URLs for further scraper targeting. Billed as a "next-generation crawling and spidering framework", Katana is tailor made for the bug bounty crowd, helping folks find tasty web paths to aim their fire and fury at, but it's not just for the cyber crowd. It's a solid tool for quickly extracting URLs for further processing.
Its self-described feature list:
Fast and fully configurable web crawling
Standard and Headless mode support
JavaScript parsing / crawling
Customizable automatic form filling
Scope control - Pre-configured field / Regex
Customizable output - Pre-configured fields
Input from: stdin, URLs, and URL lists
Output to: stdout, file(s), and JSON
It's easily installable via go install
or the multi-platform GH releases, and there's a pre-built Docker image ready to use.
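A quick sketch of getting Katana onto a box and pointed at a target. The install path and flags below come from the project's README; the target URLs are placeholders, so swap in your own:

```shell
# Install via Go (requires a recent Go toolchain)
go install github.com/projectdiscovery/katana/cmd/katana@latest

# …or skip the install entirely and use the pre-built Docker image
docker run projectdiscovery/katana:latest -u https://example.com

# Basic crawl: -u sets the target, -d caps crawl depth,
# -jc enables JavaScript parsing, -o writes the URL list to a file
katana -u https://example.com -d 2 -jc -o urls.txt
```

That `-o` output file is exactly the kind of URL list you can feed into a downstream scraper.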
The defaults are sane and should not get you in much (if any) trouble. Said defaults made quick work of the primary URL lists from my namesake domain starting point.
It's super new, so check back on their GH repo for updates. The Nuclei team is constantly updating all their froody tools, and Katana is only going to keep getting better.
Colly
If you know some Golang and don't mind rolling up your sleeves a bit, Colly [GH] provides a super clean and feature rich framework for scraping content. No, seriously, this is all the Golang you need for a basic link scraper:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has an href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
Colly:
Is fast (>1K request/sec on a single core)
Manages request delays and maximum concurrency per domain
Has automatic cookie and session handling
Supports synchronous, async., parallel, distributed, and proxied scraping
Has a caching subsystem
Performs automatic encoding of non-unicode responses
Will honour robots.txt, if desired
and much more.
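A few of those features — async mode, per-domain concurrency limits, and request delays — bolt onto the basic example with only a handful of extra lines. A minimal sketch (same hackerspaces.org targets as above; the glob pattern and delay values are arbitrary choices, tune to taste):

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async collector: Visit() queues requests instead of blocking
	c := colly.NewCollector(
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
		colly.Async(true),
	)

	// Be polite: at most 2 concurrent requests to matching domains,
	// with a random delay of up to 5 seconds between them
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*hackerspaces.*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("https://hackerspaces.org/")
	c.Wait() // required in async mode: block until the queue drains
}
```

The `c.Wait()` call at the end matters: in async mode, `main` would otherwise exit before any requests fire.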
Colly's GH page also has links to other tools/projects made from Colly.
Huginn
We're going to cover Huginn in more depth in a single issue (likely a Weekend Project edition), so this section is super brief. This is their pitch:
Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn's Agents create and consume events, propagating them along a directed graph. Think of it as a hackable version of IFTTT or Zapier on your own server. You always know who has your data. You do.
Suffice it to say, with a bit of "docker run" know-how and some configuration data, you could be up and running with a bulletproof scraper-archiver in mere minutes. I made a version of my CISA KEV to RSS scraper in just a few clicks and a tiny amount of data entry.
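If you want to kick the tires before the deeper dive, the Huginn project publishes a single-container image that runs with a bundled database (per their README; fine for a trial, but mount a volume or point it at an external DB before trusting it with anything you care about):

```shell
# Trial run of the official image; the web UI comes up on port 3000
docker run -it -p 3000:3000 huginn/huginn
```

Then browse to http://localhost:3000 and log in with the default credentials from the Huginn docs.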
Give it a look to get familiar with it before our deeper dive.
FIN
It's only Wednesday, and we've got a moon mission live stream, a massive, active lava flow, seditious conspiracy convictions, and sane marriage protections (in the U.S.). Hopefully the universe will pace things a bit better in '23. ☮