

Drop #332 (2023-09-11): The AWK Programming Language 📘 @ 35 & CLI Sparkline Bars
For folks who did not 👀 the Bonus Drop (free for all!) over the weekend, but are still somewhat concerned over Google's rollout of their laughable "Privacy Sandbox" (it is anything but that), you can use this checker I built to see if your Chrome configs need some tweaking.
No TL;DR today, as we're taking a deep dive into two combined topics and there's quite a bit to go through; plus, some of the bash code examples have fun bits in them that may be useful in other contexts.
Finally, given the code-heavy nature of this edition, I've made another Quarto version you can peruse, which may be easier on the eyes than what Substack churns out.
The AWK Programming Language 📘 @ 35 & CLI Sparkline Bars
We're combining coverage of two resources in this section, since I want to use spark (GH) to spice up some of the AWK examples. Spark does one thing and does it pretty well: make sparkline bar charts at the CLI. It's 100% bash, so it's super lightweight and runs everywhere. You'll need to install it if you're going to run the examples on your own.
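If spark is new to you: it takes numbers either as arguments or on stdin and prints a tiny bar chart. A quick smoke test after installing (any numbers will do):

$ spark 1 5 22 13 53
$ echo 9 13 5 17 1 | spark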
The second edition of The AWK Programming Language comes out at the end of September. It's been thirty-five years since the first edition was released, which makes me feel less bad about the nine years since Data-Driven Security came out. AWK itself is nearly 50 years old.
Fundamentally, AWK “just” scans text input files and splits each input line into fields automatically, leaving you to process that with AWK's somewhat arcane-yet-fairly-easy-to-grok processing language.
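If you've never poked at it, the core model is tiny: each input line is split (on whitespace, by default) into fields $1, $2, …, with NF holding the field count:

$ echo "alpha beta gamma" | awk '{ print $2, NF }'
beta 3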
While the AWK ecosystem has certainly evolved over the years, a big change in the one, true awk is direct support for CSV files. Yes, AWK finally groks CSV, and each CSV column gets put into the corresponding ordered input field. The awk binary that ships with my macOS and Ubuntu systems does not have the --csv option (“yet”, I guess). If you head to the link in the first sentence of this paragraph, clone the repo, check out the “csv” branch, run make, and then mv a.out cawk, you can follow along with the upcoming examples.
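If it helps, here's that dance spelled out (a minimal sketch; I'm assuming the onetrueawk repo on GitHub is the link in question, and that you have git, make, and a C compiler handy):

$ git clone https://github.com/onetrueawk/awk.git && cd awk
$ git checkout csv   # the CSV-aware branch
$ make               # builds a.out
$ mv a.out cawk      # rename so it won't shadow the system awk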
With this change to the AWK program, the second edition has an entire chapter on exploratory data analysis (EDA) where the authors (all three original authors!) walk through some use cases.
We'll do a bit of the same, here, with a World Bank “tourism” file I grabbed in 2021. We'll stick the filename in a variable to make the examples shorter:
# https://rud.is/dl/world-bank-tourism-arrivals-2000-2020.csv
$ DATAFILE="${HOME}/Data/world-bank-tourism-arrivals-2000-2020.csv"
$ head "${DATAFILE}"
year,country,arrivals,lat,lng
2000,Albania,317000,41.3317,19.8172
2001,Albania,354000,41.3317,19.8172
2002,Albania,470000,41.3317,19.8172
2003,Albania,557000,41.3317,19.8172
2004,Albania,645000,41.3317,19.8172
2005,Albania,748000,41.3317,19.8172
2006,Albania,937000,41.3317,19.8172
2007,Albania,1127000,41.3317,19.8172
2008,Albania,1420000,41.3317,19.8172
(The dataset is a decent reminder of how bad 2020 was for humanity. #NeverForget #CovidIsNotDoneWithUs)
One suggested use of AWK by the authors (in combo with other *nix utils) is file structure validation. The authors go into more detail than I will here, but it's stupid easy to make sure all records in a CSV file have the same number of columns:
$ ./cawk --csv 'NR > 1 { print NF }' "${DATAFILE}" | uniq
5
We need to include NR > 1 since AWK knows the CSV format, but won't exclude the header (if it exists) by default.
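Drop that guard and the header row leaks right into the output (a quick illustration with the same file):

$ ./cawk --csv '{ print $2 }' "${DATAFILE}" | head -2
country
Albania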
To prove it truly groks CSV, let's see the first three countries in the file:
$ ./cawk --csv 'NR > 1 { print $2 }' "${DATAFILE}" | uniq | head -3
Albania
Algeria
American Samoa
It does! But it's fairly clear the authors don't actually do a ton of formal, reproducible EDA, since they continue to use the $NUMBER syntax for column references, which every decent data scientist knows is not great. We can do better, and this is a more readable version of the command we just ran:
$ ./cawk --csv \
-v country="2" \
'NR > 1 { print $country }' "${DATAFILE}"
The -v option lets us map variables to values, so we have a (janky) way to get column names back.
Let's pick a random country from countries that have records for 2020 (you'll get different results):
$ ./cawk --csv \
-v year="1" \
-v country="2" \
'NR > 1 && $year == "2020" { print $country }' \
"${DATAFILE}" | \
shuf -n 1
Singapore
Let's use spark to see if there was a stark drop-off in tourism arrivals for Singapore:
$ ./cawk --csv \
-v country="2" \
-v arrivals="3" \
'NR > 1 && $country == "Singapore" { print $arrivals }' \
"${DATAFILE}" | \
xargs spark
▄▄▃▄▄▄▄▄▄▅▅▅▅▅▆▇▇███▂
As expected, tourism was super hurt at the start of the pandemic.
AWK's a fully baked language, so we can do some data ops, like computing the total tourism influx to Singapore for all the years in the data file:
$ ./cawk --csv \
-v country="2" \
-v arrivals="3" \
'$country == "Singapore" \
{ arrivals_cum_sum += $arrivals } \
END \
{ print arrivals_cum_sum } \
' "${DATAFILE}"
245411500
You can write full-on programs in AWK, so it can do much of what you may be used to in Python, R, Perl, etc. I'm not sure your team would appreciate that, though, since AWK is not really the stats cruncher in any modern data science stack.
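For a tiny taste of "full-on", AWK even has user-defined functions; a throwaway sketch:

$ ./cawk 'function c2f(c) { return c * 9 / 5 + 32 } BEGIN { print c2f(100) }'
212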
Let's do one more example. First, we'll save off the list of countries that have records in 2020:
$ ./cawk --csv \
-v year="1" \
-v country="2" \
'NR > 1 && $year == "2020" { print $country }' \
"${DATAFILE}" > /tmp/2020-countries
Now, we'll use that list to work on only countries with 2020 entries.
The NR==FNR line creates an associative array from /tmp/2020-countries, which we can use to test for inclusion, and we use another associative array to keep the grouped values together (the next three lines). The last lines iterate over that array and print lines shaped like country:#### #### #### #### …, and then some core bash idioms get us our CLI dashboard:
$ ./cawk --csv \
-v year="1" \
-v country="2" \
-v arrivals="3" \
'NR==FNR \
{ countries2020[$1]++; next } \
$country in countries2020 \
{ arrivals_by[$country] = arrivals_by[$country] $arrivals " " } END \
{ for (country in arrivals_by) \
print country ":" arrivals_by[country] } \
' \
/tmp/2020-countries "${DATAFILE}" | \
head -20 | \
while IFS=: read -r country arrivals; do
  echo -e "${country}\t$(echo "${arrivals}" | xargs spark)"
done | \
column -t -s $'\t'
Cote d'Ivoire ▁▁▁▁▁▁▁▂▅▆▆▇█▂
Denmark ▃▃▅▅▅▅▅▅▅▅▅▅▆▆▇▇█▁
Belize ▁▁▂▄▆▄▃▃▃▄▄▄▄▄▅▅▆▆█▇▁
Namibia ▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇█▁
St. Lucia ▃▃▂▃▄▃▃▅▅▅▅▅▅▅▅▆▅▆▇█▁
Liechtenstein ▂▂▁▁▁▁▁▄▄▃▃▃▂▂▂▂▃▅▆█▂
Estonia ▃▃▃▃▄▄▅▆▆▇▇▇▇█▇▇▁
Mongolia ▂▂▃▂▃▄▅▅▅▅▇▇▇▆▆▅▅▆▇█▁
Spain ▃▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▇▇▇█▁
Andorra ▇▇▇▇█▇▆▆▆▅▄▄▃▃▃▃▄▄▄▄▁
Togo ▁▁▁▁▁▁▁▁▁▁▂▃▂▃▂▂▃▄▅█▄
Indonesia ▁▁▁▁▁▁▁▁▂▂▂▃▃▃▄▄▅▆▇█▁
Montenegro ▁▁▁▁▁▃▃▃▃▄▄▄▄▅▅▆▆█▁
New Zealand ▂▃▃▃▄▄▄▄▄▄▄▄▄▅▅▆▇▇▇█▁
Japan ▁▁▁▁▁▁▁▂▂▁▂▁▂▂▃▄▆▇▇█▁
El Salvador ▁▁▁▁▂▃▃▄▅▃▄▄▄▅▅▅▅▆▇█▁
Brunei Darussalam ▁▁▁▁▁▁▁▇▇█▇▁
Belgium ▅▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇█▁
Bolivia ▁▁▁▁▂▂▂▂▃▃▃▃▄▄▅▅▅▇▇█▁
Greece ▂▂▂▂▂▃▃▄▅▅▆▆▇█▁
We'll cover one more spark example (that also uses AWK's new CSV powers), but before we do that, I encourage everyone to grab the book, read the EDA chapter, and keep Appendix A (the AWK reference manual) around. It's chock-full of useful snippets that will make your CLI life easier and help you out in a pinch.
We can use AWK's CSV parsing capabilities and spark to see when it might rain over the coming hours, with a little help from our old pal Tomorrow.io. ip-api.com lets you grab your IP geolocation information sans key and in CSV format, which we can parse with AWK.
# Get what ip-api thinks is our lat/lng
# (it's very very wrong for me).
#
# And, we are also pretending ip-api doesn't have
# the field= query parameter.
#
# And, yes, we could have just used JSON and jq.
latlng=$(curl -s http://ip-api.com/csv/ | \
./cawk --csv '{ print $8 "," $9 }')
# Get the forecast for that location
fcast=$(curl --silent --header 'accept: application/json' \
"https://api.tomorrow.io/v4/weather/forecast?location=${latlng}&apikey=${TOMORROWIO_API_KEY}")
# Get a graph of precipitation % chance over the coming hours
echo "${fcast}" | jq '.timelines.hourly[].values.precipitationProbability' | xargs spark
▁█▇▂▁▁▁▁▂▁▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▄▅▇▇▇▇▇▆▅▅▄▁▁▁▁truncated
Both the new and improved AWK and spark
are fun and useful tools that make doing some CLI work a bit more engaging and speedy.
FIN
I highly doubt AWK will unseat any of the popular CLI data science tools any time soon. And, I have no idea when/if the CSV support will come baked into distros or package updates. But, it's easy to compile, comes self-contained, and requires fewer resources and dependencies than, say, R or Python. It might not be a bad idea to use it plus some other CLI tools to do some data validation before production scripts run, or you dig into a new dataset.
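As a rough sketch of that pre-flight idea (assuming cawk is on your PATH, and building on the column-count check from earlier):

# bail out before the real pipeline runs if the CSV is ragged
ncols=$(cawk --csv 'NR > 1 { print NF }' "${DATAFILE}" | sort -u | wc -l | tr -d '[:space:]')
if [ "${ncols}" -ne 1 ]; then
  echo "ragged CSV: ${DATAFILE}" >&2
  exit 1
fi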
I've put the full second edition table of contents online and am looking forward to replacing my dead tree copy of the first edition with the updated one. ☮