Drop #379 (2023-11-29): Data Three Ways

git commit folders; Grist; Flyscrape

Today we look at how to get easy access to old, committed data in git repos, discover my replacement for EtherCalc, and try out a new way to scrape data from the web.

TL;DR

This is an AI-generated summary of today’s Drop.

Now I’m just “the author” to Perplexity. o_O I kinda wish we had access to the git history of daily system prompt changes.

  • The first section discusses a tool called git-commit-folders created by Julia Evans. This Golang code turns a git directory into a mountable filesystem, allowing users to navigate through commits as if they were directories. The author finds this tool particularly useful for managing data in git workflows, as it provides an efficient way to access historical data without the need for expensive and time-consuming reprocessing. The tool can be built and used with a series of commands provided in the post.

  • In the second section, the author introduces Grist, a hybrid database/spreadsheet tool that he considers a good replacement for EtherCalc. Grist allows users to import data, link records across tables, and arrange data in a custom layout. It supports full Python syntax in formulas, includes many Excel functions, and offers an AI Assistant for formula generation. Grist is based on SQLite, so users have full access to raw data in a SQL setting. It also provides a REST API and can be used containerized or via an Electron app.

  • The edition concludes with a mention of Flyscrape, a tool for internet scraping written in Golang. Flyscrape uses an embedded JavaScript interpreter for configuration and processing functions. The author plans to convert his Capitol insurrection DoJ scraper to Flyscrape and will report back on the results. The tool can be installed via a command provided in the post or through pre-built binaries on the site.


git commit folders

The highly talented coder and EPIC maker of zines, Julia Evans, had an idea and recently made it a reality. That git-commit-folders repo is a bit of Golang code that turns a git directory into a mountable filesystem so you can traipse back through commits like you would with cd and ls in directories.

I’m almost positive there’s a 100% overlap between folks who read the Drop and folks who monitor everything Julia creates, but I have a take on this that may help out some data folks who use git workflows to wrangle data on a regular basis.

At $DAYJOB I have a GitHub Action (GHA) that does some work for me so that I can use some data bits in an Observable notebook. One of those actions just makes a current copy of the last thirty days of activity of our detection rules. While we maintain all the raw data from our fleet of sensors, the production processed data ages out after a while. I can reprocess it when necessary, but that’s a potentially expensive and time-consuming process. We don’t need an active history of all our tagged sensor data, but it is nice to be able to go back in time when performing more research-y things.

In the past, I would normally have used a {git2r} workflow for this, but Julia’s filesystem hack is way better. Don’t just take my word for it, though; let’s see it in action.

First, we need to build the git-commit-folders project:

$ # clone and install the git-commit-folders binary (needs a Go toolchain)
$ git clone git@github.com:jvns/git-commit-folders.git
$ cd git-commit-folders
$ go install
$ # mount a work repo's commit history as a browsable filesystem
$ cd ~/oddly-long-path/labs-viz-data
$ mkdir /tmp/data
$ git-commit-folders -type nfs -mountpoint /tmp/data
$ Rscript -e 'list.files("/tmp/data", "kev-tags-30d.json", full.names = TRUE, recursive = TRUE) |> writeLines()' | head -10
/tmp/data/branch_histories/main/00-f5b69c0df1580d5ff1ff744eb4f462aaf1498835/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/01-795c23f2620f46be5f36c24717d5a2004d16fd70/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/02-334a33750e6bf078a69444576703630b7b673e70/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/03-adefbba7baa1bdc92cf15656164591f863819aee/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/04-5f5464e9788a12ca65920406741a31754eaa6886/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/05-dff485ba565836d43ba453239d6597552d20ab52/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/06-2a3803f90503eb0295c6de7f54daeab4646c93c3/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/07-0cc68f07cef5b5b7bb9500612737c707fb2e32bb/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/08-17dfb8f74b9bf4f622f759adac832244817c3629/docs/kev-tags-30d.json
/tmp/data/branch_histories/main/09-988557a72055e303b5187c9199f7c88f3a622f73/docs/kev-tags-30d.json

Each one is jsonlite::fromJSON-able, and I get the entire history I need without having to shell out to git show from R.
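
For instance, here’s a quick sketch (assuming the committed JSON snapshots all share a schema) that slurps every historical version of that file into one R list, keyed by commit:

library(jsonlite)

# find every committed copy of the snapshot under the mounted history
paths <- list.files(
  "/tmp/data/branch_histories/main",
  pattern = "kev-tags-30d\\.json$",
  full.names = TRUE,
  recursive = TRUE
)

# read each version; label the entries by their "NN-<sha>" commit folder
history <- lapply(paths, fromJSON)
names(history) <- basename(dirname(dirname(paths)))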

This is something that will come along for the ride with each new server/workstation setup.

Grist

I’m not really a fan of “cloud” services such as Google Sheets or Microsoft’s online version of Excel. I blame some of that on having used proper local apps (back in the day), and the rest on just not wanting to shove my data into the insecure world of the cloud.

Since the pre-container era, I’ve had a home-LAN copy of EtherCalc running for spreadsheet-y data tasks. I even built a small R utility package for it to help handle some workflows that relied on it. But, it has not aged well.

Enter Grist (GH), a hybrid database/spreadsheet, where columns work like they do in databases: they are named, and they hold one kind of data. This means you can import your data, link records across tables, and arrange your data in your ideal layout. This further means that your spreadsheet is now also your custom data application.

Grist has a plethora of features, but some key ones are:

  • support for full Python syntax in formulas, including the standard library

  • many familiar functions straight from Excel

  • an AI Assistant specifically tuned for formula generation (using OpenAI’s gpt-3.5-turbo, if you like paying the OpenAI tax, or Llama via llama-cpp-python)

  • a SQLite foundation, meaning you have full access to the raw data whenever you want in a SQL setting (see the sketch after this list)

  • the ability to run containerized or via an Electron app (or a freemium plan from Grist)

  • bonkers-simple dashboarding

and a whole lot more.
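
On that SQLite point: a Grist document is just a SQLite file, so a minimal sketch of poking at one directly (the document path and table name here are hypothetical, assuming a local or containerized install) might look like:

library(DBI)

# a .grist document is a SQLite database; the path below is hypothetical
con <- dbConnect(RSQLite::SQLite(), "~/grist/docs/MyDoc.grist")
dbListTables(con)  # user tables plus Grist's internal metadata tables
dbGetQuery(con, "SELECT * FROM Table1 LIMIT 5")
dbDisconnect(con)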

Beyond raw access to the SQLite database, there’s also a full REST API that I will no doubt write an R package for at some point.
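
Until that package exists, a minimal {httr2} sketch of pulling records from a table (the endpoint shape follows Grist’s API docs; the host, port, and environment variable names are placeholders) might look like:

library(httr2)

# fetch all records from one table of a Grist doc via the REST API
grist_records <- function(doc_id, table_id,
                          host = Sys.getenv("GRIST_HOST", "http://localhost:8484"),
                          api_key = Sys.getenv("GRIST_API_KEY")) {
  request(host) |>
    req_url_path("api", "docs", doc_id, "tables", table_id, "records") |>
    req_auth_bearer_token(api_key) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE)
}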

If you’re looking to get your team off of Google Sheets (et al.), this might be a great alternative.

Flyscrape

In my previous job, we did a great deal of internet scraping. These days, I do much more listening to what packets others are flinging our way, but I still need to perform an oddly large number of scraping tasks in both my personal and professional capacities.

I still do most of said scraping in R, but I’m a big fan of single binary files that are easily and cheaply deployable anywhere. So, I’ve been poking a bit at Flyscrape (GH), and it’s a pretty great tool if you, too, have regular scraping needs.

Flyscrape is written in Golang, but you use the embedded JavaScript interpreter to write the scraping configuration and processing functions. I won’t even try to show examples here, as the site’s documentation and copious README examples do the job better than I would.

I’m going to work on converting my Capitol insurrection DoJ scraper over to it and report back on how that goes.

Give it a go via go install github.com/philippta/flyscrape/cmd/flyscrape@latest or the pre-built binaries on the site.

FIN

’Tis hard to believe it’s the penultimate day of November 2023. ☮️
