Drop #287 (2023-07-03): Query Overload

SQLite Package Manager; Cage Fight: Clickhouse-local vs. DuckDB; Pharmacy Deserts

I’m fairly certain I’ve aged a month over the past week, and I appreciate the messages of support folks have sent our way. Interested folks can follow along with my spouse’s daily chronicle of what’s going on (I should note that it’s elder-healthcare related, in the event such content is uncomfortable for some folks).

Today, we have three “query” related topics that span diverse data store “ecosystems”. We’ll skip the usual intro banter and just dig into the resources.

SQLite Package Manager


SQLite Package Manager (GH) is an unofficial directory of SQLite extensions, add-ons, and tools maintained by Anton Zhiyanov. It aims to provide an easy way to install and manage SQLite extensions. SQLite is everywhere, the barrier to entry is low, and extensions can add a lot of functionality.

The registry contains only package metadata, so binaries and sources are located elsewhere. Casual users of SQLite may not be super-familiar with how to install extensions, which makes the companion sqlite-cli super handy.
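
For context, this is roughly what using an extension looks like today without any package manager: you hunt down a compiled extension yourself and load it by path. A minimal sketch from R via DBI/RSQLite (the extension path is hypothetical, and I’m assuming loadable extensions are enabled, which is the RSQLite default these days):

library(DBI)

# extension loading is on by default in current RSQLite, but be explicit
con <- dbConnect(RSQLite::SQLite(), ":memory:", loadable.extensions = TRUE)

# load a compiled extension you downloaded or built yourself (hypothetical path)
dbGetQuery(con, "SELECT load_extension('/usr/local/lib/sqlite/stats')")

dbDisconnect(con)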

Unlike the vast ecosystems of R’s CRAN, Python’s PyPI, and Rust’s Crates, this new SQLite equivalent is quite tiny, with total tracked extensions only numbering in the dozens. It’s a solid start and I appreciate the effort to bring some semblance of “package management” to this lean and ubiquitous database.

Cage Fight: Clickhouse-local vs. DuckDB


Longtime readers know I am a DuckDB fanboy. I’ve been using it in many ways, including as an Apache Drill replacement, since my primary Drill use case was turning tons of JSON into Parquet (or directly querying JSON directories). That use case hit a brick wall the other day with some work data. Without going into detail, some of our JSON response records can be huge, and that forced me to work around some single-system memory limitations that impeded speedy transformation of ~six months of data.
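
For anyone who hasn’t tried that workflow, the core of it is close to a one-liner. A rough sketch via the duckdb R package (the paths are made up; swap in your own JSON directory and output file):

library(DBI)

con <- dbConnect(duckdb::duckdb())

# slurp a directory of JSON (DuckDB auto-detects NDJSON vs. arrays) and write Parquet
dbExecute(con, "
  COPY (
    SELECT * FROM read_json_auto('data/responses/*.json')
  ) TO 'data/responses.parquet' (FORMAT PARQUET)
")

dbDisconnect(con, shutdown = TRUE)

The pain point I hit is that this all happens on one box, so a handful of giant JSON records can blow past available memory.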

As a result, I’ve poked my head up out of the sand and begun a re-survey of the current, local, data-crunching options. I knew about Clickhouse-local (a standalone binary that runs a ClickHouse server in-memory), but I’d never actually used it in a “get work done” capacity. Messing around with it and my work data was on the menu last week, but that got interrupted by the aforementioned events. So, y’all can look forward to some future sections involving Clickhouse as I progress through my investigation.

Meanwhile, I wanted to link to a “more interesting than two pasty white billionaires” cage fight involving Clickhouse and DuckDB. NOTE: you should start with this basic comparison before hitting that post.

The “cage fight” link goes to an analysis of over two billion rows of cloud infrastructure spending data. The Vantage team performed several tests comparing the speed of both systems on a 16 GB MacBook Pro.

TL;DR: Clickhouse-local “won”, but there are many caveats that should give anyone pause before switching over immediately. In fact, as I’ve poked around a bit since coming back from my travels, I can 100% see a world where I use both Clickhouse-local and DuckDB to get different jobs done. I appreciate the Vantage team highlighting that nuanced reality.

DuckDB is (for the moment, at least) still my go-to for rapid JSON transformation, and Clickhouse-local may become my go-to for larger datasets that require more raw performance. But DuckDB’s simpler setup, ease of use, and more standard SQL support mean it’ll stick around on my desktops for many use cases.

Pharmacy Deserts

WARNING: R-heavy “data journalism” section.

I’ve been catching up on missed news and learned about Walgreens closing ~150 stores in the coming quarters. Pharmacy deserts — areas with no or limited access to pharmacies — were already an issue in many parts of the U.S. and these closures threaten to exacerbate the problem.

While Walgreens has yet to provide a list of stores that will close, I wanted to see how many Walgreens were in Maine (where I live) and New Hampshire (we live in rural Maine but near NH border towns).

Overpass Turbo provides an interactive interface for issuing OQL queries. This is what I used to get “Walgreens” locations in ME/NH:

[out:json][timeout:25];
(
  area["ISO3166-2"="US-ME"]->.Maine;
  area["ISO3166-2"="US-NH"]->.NewHampshire;
  node["amenity"="pharmacy"]["name"~"Walgreens", i](area.Maine);
  node["amenity"="pharmacy"]["name"~"Walgreens", i](area.NewHampshire);
);
out body;
>;
out skel qt;
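
If you’d rather skip the web UI, the same OQL can be POSTed straight to the public Overpass API endpoint and the JSON saved locally. A quick sketch with httr (mind the usage policy if you hammer that endpoint):

library(httr)

oql <- '
[out:json][timeout:25];
(
  area["ISO3166-2"="US-ME"]->.Maine;
  area["ISO3166-2"="US-NH"]->.NewHampshire;
  node["amenity"="pharmacy"]["name"~"Walgreens", i](area.Maine);
  node["amenity"="pharmacy"]["name"~"Walgreens", i](area.NewHampshire);
);
out body;
>;
out skel qt;
'

res <- POST(
  url = "https://overpass-api.de/api/interpreter",
  body = list(data = oql),
  encode = "form"
)

writeLines(content(res, as = "text", encoding = "UTF-8"), "~/Data/osm-walgreens.json")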

The JSON for that isn’t too bad to wrangle:

# Read the JSON file I made from OSM
json_data <- jsonlite::fromJSON("~/Data/osm-walgreens.json")

# Extract the elements from the JSON data
elements <- json_data$elements

# Filter out the nodes (where type == "node")
nodes <- elements[elements$type == "node",]

Unfortunately, nrow(nodes) showed only 33 entries, and many columns were NA. Remember: OSM data is only as good as the user input behind it and how well you query it. When I also searched for ways and nodes I got back over sixty entries, but the metadata was janky enough that I didn’t trust it.

Walgreens has a store locator on its main website that uses a hidden API. Using Developer Tools, it was pretty easy to get the data right from the company itself. These are the whittled-down “copy as curl” commands for each state:

$ curl 'https://www.walgreens.com/locator/v1/stores/search' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'content-type: application/json; charset=UTF-8' \
  --data-raw '{"p":"1","r":"25","requestType":"locator","requestor":"headerui","s":"100","sameday":"true","lat":"43.1938516","lng":"-71.5723953"}' \
  --compressed > walgreens-nh.json

$ curl 'https://www.walgreens.com/locator/v1/stores/search' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'content-type: application/json; charset=UTF-8' \
  --data-raw '{"p":"1","r":"25","requestType":"locator","requestor":"headerui","s":"100","sameday":"true","lat":"45.253783","lng":"-69.4454689"}' \
  --compressed > walgreens-me.json

It looks like the API uses lat/lng and some other body parameter values to perform the statewide search. You should 100% look at the results for larger states to see if you need to change up anything to ensure you get complete results.
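
If you’d rather keep everything in one R script, the same request is easy to replicate with httr (same body values as the NH curl call above; I’m reusing the nh_stores name so it lines up with the code that follows):

library(httr)

res <- POST(
  url = "https://www.walgreens.com/locator/v1/stores/search",
  body = list(
    p = "1", r = "25",
    requestType = "locator", requestor = "headerui",
    s = "100", sameday = "true",
    lat = "43.1938516", lng = "-71.5723953"
  ),
  encode = "json",   # serializes the list to a JSON request body
  accept_json()
)

# equivalent to jsonlite::fromJSON() on the saved file
nh_stores <- jsonlite::fromJSON(content(res, as = "text", encoding = "UTF-8"))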

library(sf)          # st_as_sf()
library(ggplot2)     # plotting
library(hrbrthemes)  # ipsum_pal() & theme_ipsum_inter()

# Read in JSON from the website
nh_stores <- jsonlite::fromJSON("~/Data/walgreens-nh.json")
me_stores <- jsonlite::fromJSON("~/Data/walgreens-me.json")

# We only need lat/lng to plot them
rbind.data.frame(
  me_stores$results[, c("latitude", "longitude")],
  nh_stores$results[, c("latitude", "longitude")]
) |> 
  st_as_sf(
    coords = c("longitude", "latitude"), 
    crs = 4326
  ) -> walgreens_me_nh

Now, nrow(walgreens_me_nh) is 74, but I was suspicious of the results since there is no (obvious) “state” parameter in the hidden API.

c(
  me_stores$results$store$address$state,
  nh_stores$results$store$address$state
) |>
  table()
## MA ME NH VT 
## 27 16 30  1 

As I suspected, the query is geo-distance-based. Still, those 46 ME/NH locations beat the 33 we got back from OSM, so we’ll use this data to see where these stores are:

# We only need lat/lng to plot them
rbind.data.frame(
  me_stores$results[, c("latitude", "longitude")],
  nh_stores$results[, c("latitude", "longitude")]
) |> 
  transform(
    state = c(
      me_stores$results$store$address$state,
      nh_stores$results$store$address$state
    )
  ) |> 
  st_as_sf(
    coords = c("longitude", "latitude"), 
    crs = 4326
  ) -> walgreens_me_nh
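
Quick note: the plotting code below assumes a me_nh basemap object that I haven’t shown. One way to build it (an assumption on my part; the repo may do it differently) is from the Census cartographic boundary files via tigris:

library(tigris)

# lower-resolution cartographic boundary file for all states/territories
states_sf <- states(cb = TRUE)

# keep just Maine and New Hampshire for the basemap
me_nh <- states_sf[states_sf$STUSPS %in% c("ME", "NH"), ]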

# plot the basemap and the found locations
ggplot() +
  geom_sf(
    data = me_nh,
    fill = ipsum_pal()(8)[5],
    color = alpha("#c3c3c3", 1/2),
    linewidth = 1/2
  ) +
  geom_sf(
    data = walgreens_me_nh[walgreens_me_nh$state %in% c("ME", "NH"),],
    size = 1.5,
    shape = 21,
    color = "white",
    fill = ipsum_pal()(8)[2]
  ) +
  theme_ipsum_inter() +
  theme(
    axis.text = element_text(color = "white"),
    plot.background = element_rect(fill = "gray30")
  )

The result, as you have likely guessed by now, is in the section header.

There is a great deal of metadata in the store JSON. So, I’m thinking Friday’s WPE may be something that uses this data (e.g., combine the store info with population data, 2020 census migration data, and some economic data to predict which stores are being nuked).
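
If that WPE happens, the spatial-join piece is the easy part. A rough sketch of attaching county population to the store points, assuming tidycensus (which needs a Census API key via census_api_key()); purely illustrative, not the eventual exercise:

library(sf)
library(tidycensus)

# ACS 5-year total population by county, with geometry, for ME + NH
pop <- get_acs(
  geography = "county",
  variables = "B01003_001",   # total population
  state = c("ME", "NH"),
  geometry = TRUE
)

# spatially tag each store point with its county name + population estimate
walgreens_pop <- st_join(
  walgreens_me_nh,
  st_transform(pop[, c("NAME", "estimate")], 4326)
)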

I promised GitLab’s chief data wrangler (she’s quite brilliant!) that I’d stop using GitHub for new stuff this year, so I’m going to start making good on said promise today! You can find the code and data for the above over at this GitLab repo.

FIN

Y’all really need to check out issue 44 of Lynn’s “Things I Think Are Awesome”. The resource link to an engaging prompt injection game really took the edge off one of the more harrowing days last week.

And, all U.S.-based Drop readers should strongly consider reading and sharing today’s “Letters from an American”. ☮
