hrbrmstr's Daily Drop

2022-04-13.01 Roundup

CLI Data Wrangling; Tolkien Maps; DataVis Design Principles

boB Rudis
Apr 13, 2022
CLI Data Wrangling

This category will come up fairly frequently, as there are so many great CLI tools for working with various kinds of data. If the acronym is unfamiliar, it stands for Command Line Interface, which is a fancy way of describing that unadorned, black-background, text-only window that gets you to the terminal/shell of your favorite operating system.

For macOS users, there’s the built-in Terminal.app, but there are more modern alternatives such as iTerm and WezTerm. Linux users likely already know what’s going on CLI-wise, and Windows folks should really consider using Microsoft’s modern Terminal. As with all things “app”, there are scads of alternatives on all three operating systems. If you are new to the command line, there are also scads of learning resources, including this one.

There are many established “go-to” CLI data wrangling tools, including some that ship with most operating systems and that we may cover at a later date, but today we’re focusing on trdsql, a tool that can execute SQL queries on CSV, LTSV, JSON, and TBLN input and write the results in various formats.

“CSV” should be familiar to most folks reading this, as it’s the most common platform-agnostic way to move tabular data around. “LTSV” stands for Labeled Tab-separated Values, a variant of Tab-separated Values (“TSV”) that differs from traditional TSV in that:

Each record in a LTSV file is represented as a single line. Each field is separated by TAB and has a label and a value. The label and the value have been separated by ':'. With the LTSV format, you can parse each line by spliting with TAB (like original TSV format) easily, and extend any fields with unique labels in no particular order.
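To make that description concrete, here is a minimal Python sketch of an LTSV line parser. It is only an illustration of the format as described above, not code from trdsql or any LTSV library; the function name `parse_ltsv_line` and the sample log line are made up for this example.

```python
def parse_ltsv_line(line):
    """Parse one LTSV record (a single line) into a label -> value dict."""
    record = {}
    # Fields are separated by TAB characters.
    for field in line.rstrip("\r\n").split("\t"):
        if not field:
            continue  # tolerate stray empty fields
        # Split on the FIRST ':' only, so values may contain colons.
        label, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"field has no ':' separator: {field!r}")
        record[label] = value
    return record


# Hypothetical sample record; labels can appear in any order.
sample = "host:127.0.0.1\ttime:[13/Apr/2022:10:00:00 +0000]\tstatus:200"
print(parse_ltsv_line(sample))
```

Splitting each field on the first colon only is the important design choice here, since values such as timestamps often contain colons of their own.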

“TBLN” (for the life of me I can’t figure out what it expands to) is like a CSV file but contains more metadata and can also include comments. (FWIW, I’ve never seen a TBLN file IRL.) While the linked site has amazing documentation, I’ll drop this tiny example of turning the output of the ps command (which displays running processes) into JSON (an array with one object per input line):

$ ps -ef | trdsql -ojson -id " " "SELECT * FROM -"
[
  {
    "c1": "UID",
    "c2": "PID",
    "c3": "PPID",
    "c4": "C",
    "c5": "STIME",
    "c6": "TTY",
    "c7": "TIME",
    "c8": "CMD"
  },
  {
    "c1": "0",
    "c2": "1",
    "c3": "0",
    "c4": "0",
    "c5": "Thu03PM",
    "c6": "??",
    "c7": "70:07.66",
    "c8": "/sbin/launchd"
  },… 

(NB: I’ll likely refrain from using Substack code blocks given how poor the choices are for rendering)
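If you want a feel for the general idea of “SQL over delimited text” without installing anything, the following Python sketch loads CSV text into an in-memory SQLite table and queries it. This is a standard-library stand-in for illustration, not a description of how trdsql is implemented; the helper name `query_csv`, the table name `t`, and the sample data are all assumptions made for this example.

```python
import csv
import io
import json
import sqlite3


def query_csv(csv_text, sql, table="t"):
    """Load CSV text (first row = header) into SQLite and run a query.

    Returns a list of dicts, one per result row. Header names are assumed
    to be simple identifiers without double quotes.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

    cursor = conn.execute(sql)
    names = [col[0] for col in cursor.description]
    result = [dict(zip(names, row)) for row in cursor]
    conn.close()
    return result


# Hypothetical sample data loosely modeled on process listings.
sample = "pid,cmd\n1,/sbin/launchd\n42,/bin/zsh\n"
found = query_csv(sample, "SELECT cmd FROM t WHERE CAST(pid AS INTEGER) > 1")
print(json.dumps(found, indent=2))
```

Every value is loaded as text, which is why the query casts `pid` before comparing it numerically.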

What impresses me the most is that it’s both fast and memory efficient:

benchmarks of various data wrangling tools
Source: https://colab.research.google.com/github/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb#scrollTo=sME36iEb9Foj

While R is my 🔨, I’m looking to “play the field” a bit more this year. I may just try to crank out a {dbplyr} back-end (or at least a package wrapper) for this handy CLI tool.


Tolkien Maps

Our entire clan can quote large passages from J. R. R. Tolkien’s works, and I find one particular feature of his LoTR and Hobbit books incredibly compelling: his hand drawn maps.

The Tolkien Estate recently dropped high resolution versions of selected maps Tolkien created along with some commentary on the creation process.

One of Tolkien's maps
Source: Tolkien Estate

The above is a tiny version of one of them, as the Estate seems to have gone to great lengths to copy-protect the site (the maps page is one giant SVG!), and I prefer to honor such efforts, at least when I feel they are justified.

The caption for the map notes that it “grew over time” (by taping sheets together) as Tolkien was world-building. I can only imagine what he would have crafted with modern tooling.


DataVis Design Principles

If you thought that my mentioning of “design principles” in the previous post was a signal there’d be future installments, each diving into specific ones, you were correct!

I 💙 data visualization, and making good datavis is hard. Design principles can help by providing structure and guidance to the creative process to help ensure you’re crafting the desired narrative.

I found this “Dos and don’ts of data visualisation” resource by the European Environment Agency (EEA) to be one of the better principled design guides when it comes to datavis. Each principle, such as “Do tell the ‘why’ and ‘how’: annotations,” has:

  • a quick overview of the principle

  • a larger solid exposition with detailed, specific guidance

  • on-page good/bad chart examples

  • links to IRL published materials that embody the principle

EEA covers quite a bit of ground:

Highlight Your Message

  • Do tell the ‘why’ and ‘how’: annotations.

  • Do highlight what’s important, tell one story

  • Hierarchy of the information

Choose Your Chart

  • Tables are preferable to graphics for many small data sets

  • Exploratory/explanatory: do choose the right format (flow chart)

  • Static or interactive?

  • Do choose the chart type wisely

  • Bar chart: do use the full axis and avoid distortion

  • Pie charts: cons (and pros)

  • Small multiples

  • Stacked charts are difficult for comparing data

  • Dual axis charts, pros and cons

Make Charts Easy To Read

  • Do use clear language and avoid acronyms

  • Do remove any visual clutter (increase data-ink ratio, Tufte’s principle)

  • Do rotate bar chart when category names are too long

  • Don’t use a legend when you have only one data category

  • Do use direct labelling wherever possible, avoiding indirect look-up

  • Do sort your data for easier comparisons

  • Don't use more than (about) six colours

  • Do be aware of colour blindness (colour vision deficiency)

Make Charts Correct

  • Do use consistent intervals on axis (be transparent on data gaps)

  • Do use proper aspect ratio to minimise dramatic slope effects

  • Don't confuse correlation with causation

  • Do adjust for inflation in long-time series

  • Do be careful about how you treat ‘no-data/missing data’

  • Don't compare apples with oranges

  • Do show the level of confidence

Dashboard

  • 10 best practices for building effective dashboards

Final Checks

  • Data visualisation checklist

  • Do ask others for opinions

A personal goal for 2022 is to start using guides like this in a more regular and deliberative fashion and in a future edition I’ll reference some of the tools I use to keep resources like this handy.

FIN

That’s a wrap for this post! If you choose to interact in the comments, the only rule is to be kind to each other. ☮

© 2023 boB Rudis