rs; jot; datamash
Today's edition is all about data ops at the command line, and not some new-fangled, fancy Rust-based level-ups of ancient utilities (e.g.
ripgrep). We're talking 'bout old school, super-tiny binaries with powerful features that you likely (needlessly) fire up Python and R for.
The rs utility "reads the standard input, interpreting each line as a row of blank-separated entries in an array, transforms the array according to the options, and writes it on the standard output. With no arguments it transforms stream input into a columnar format convenient for terminal viewing."
In the datamash section (below) we work with a list of IP addresses:
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
The simplest use of
rs takes the above and turns it into (apologies, again, for Substack’s less than stellar code blocks):
$ rs < ips
188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199 188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199 188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199 188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199 188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199
A surprising number of (older?) data-loggers put field values on individual lines. Say, a hostname, IP address, and count of connections from said address:
alpha
188.8.131.52
2
beta
184.108.40.206
4
delta
220.127.116.11
10
gamma
18.104.22.168
99
We can turn that into a handy CSV without using tidy R ops or ugly Python code via:
$ rs -C, -e 0 3 < sad-records | sed -e 's/,$//'
alpha,188.8.131.52,2
beta,184.108.40.206,4
delta,220.127.116.11,10
gamma,18.104.22.168,99
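If you want to see what each piece of that pipeline contributes, here is a minimal sketch that builds its own input instead of relying on a sad-records file. The hostnames and addresses are made-up sample values, and the flag descriptions are my reading of man rs, so treat them as a guide rather than gospel.

```shell
# Hypothetical sample in the same one-value-per-line shape as sad-records.
records=$(printf '%s\n' alpha 10.0.0.1 2 beta 10.0.0.2 4)

# -e treats every input line as a single array entry; "0 3" asks for
# 3 columns and lets rs work out the row count; -C, delimits output
# columns with a comma, which leaves a trailing comma on every row.
raw=$(printf '%s\n' "$records" | rs -C, -e 0 3)

# Strip the trailing comma, exactly as the sed step above does.
csv=$(printf '%s\n' "$raw" | sed -e 's/,$//')
printf '%s\n' "$csv"
```

Printing "$raw" before the sed step is a quick way to see why the cleanup is there.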
Incredibly handy functionality (hit up
man rs for full info) in ~167 kilobytes, and available (or installable) on any decent OS.
Apple has the source for
rs if you're curious as to the inner workings.
Sometimes you just need to generate a sequence of (possibly random) items in a hurry. Kids today would likely import an
npm package that has 1,300 dependencies just to kick out a list of 10 random usernames. Well, you don't have to rely on any fancy programming language package ecosystem to deal with sequences like this with
jot around. This command's sole purpose in life is to "print sequential or random data."
So, about those 10 random usernames…
$ jot -r -w "user-%02d" 10
user-06
user-14
user-30
user-76
user-14
user-17
user-63
user-63
user-03
user-04
Or, you could just repeat yourself:
$ jot -b "the daily drop is awesome" 5
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
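The "sequential" half of that job description deserves a quick look too. A minimal sketch, assuming a BSD-style jot is on your PATH (the host- prefix is just an illustrative label):

```shell
# Five consecutive integers, starting from jot's default of 1.
count=$(jot 5)
printf '%s\n' "$count"

# Same idea, but starting at 10: the second argument is the first value.
from_ten=$(jot 5 10)
printf '%s\n' "$from_ten"

# -w applies a printf-style template to each value in the sequence.
hosts=$(jot -w 'host-%02d' 3)
printf '%s\n' "$hosts"
```

The first call should print 1 through 5, the second 10 through 14, and the third host-01 through host-03, each value on its own line.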
Perhaps even combine it with
rs to make random-looking strings (despite the
user ones above being truly random):
$ jot -r -c 160 a z | rs -g 0 8
ztuodtzv
lgqnwedq
dzqqezxa
ukplsjis
eeowyyzy
ncsyqsfs
bogkqdjn
ukdowzjo
aqyjkisn
jteghyan
wsqidlnl
hxmkzmhx
qgezccer
dozzwmpe
ojahtaxc
frhzyvaj
kfhkijkc
miggbjbp
fmzgfxhv
hdhaojnc
The man page has all the deets, and this utility is similarly diminutive, and as widely available as rs.
Apple, once more, has a decent source code archive of
jot available for your perusal.
GNU datamash is "a command-line program which performs basic numeric, textual and statistical operations on input textual data files."
Usage is as such:
datamash [OPTION] op [fld] [op fld ...]
op is the operation to perform. If a primary operation is used, it must be listed first, optionally followed by other operations.
fld is the input field to use.
fld can be a number (1=first field), or a field name when using the -H or --header-in options
multiple fields can be listed with a comma (e.g. 1,6,8)
a range of fields can be listed with a dash (e.g. 2-8)
use colons for operations which require a pair of fields (e.g. 'pcov 2:6')
primary operations:
groupby, crosstab, transpose, reverse, check
per-line operations:
base64, debase64, md5, sha1, sha224, sha256, sha384, sha512
bin, strbin, round, floor, ceil, trunc, frac
dirname, basename, barename, extname, getnum, cut
numeric grouping operations:
sum, min, max, absmin, absmax, range
textual/numeric grouping operations:
count, first, last, rand, unique, collapse, countunique
statistical grouping operations:
mean, geomean, harmmean, trimmean, median,
q1, q3, iqr, perc
pstdev, sstdev, pvar, svar
ms, rms, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
scov, pcov, spearson, ppearson
It's quite a remarkable tool.
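The field-addressing rules in that usage summary are easier to absorb with something to poke at. Here is a small sketch against a made-up, tab-separated sample (tab is datamash's default separator); the labels and numbers are purely illustrative.

```shell
# Hypothetical sample: a label plus two numeric fields, tab-separated.
rows=$(printf 'a\t1\t10\nb\t3\t30\nc\t5\t50\n')

# A single field, addressed by position.
by_number=$(printf '%s\n' "$rows" | datamash sum 2)

# Several fields at once, as a comma list or as a dash range.
by_list=$(printf '%s\n' "$rows" | datamash sum 2,3)
by_range=$(printf '%s\n' "$rows" | datamash sum 2-3)

# A colon pair, for operations that compare two fields.
pair=$(printf '%s\n' "$rows" | datamash pcov 2:3)

printf '%s\n' "$by_number" "$by_list" "$by_range" "$pair"
```

The first sum comes out as 9, and the list and range forms both give 9 and 90 separated by a tab, since they address the same two fields.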
If you're familiar with the mtcars dataset in R, we can get the "five number summary" for
mpg pretty easily:
$ datamash -t, -H min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
10.4,15.425,19.2,22.8,33.9
We can do the same for each cyl:
$ datamash -t, -H --sort groupby cyl min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
GroupBy(cyl),min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
4,21.4,22.8,26,30.4,33.9
6,17.8,18.65,19.7,21,21.4
8,10.4,14.4,15.2,16.25,19.2
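No mtcars.csv lying around? The same two patterns can be tried on a throwaway CSV. This is just a sketch: the grp and val column names and their values are invented stand-ins, not anything from the R dataset.

```shell
# Hypothetical stand-in for mtcars.csv: a header row plus six data rows.
csv=$(printf '%s\n' 'grp,val' 'a,1' 'a,2' 'a,3' 'b,10' 'b,20' 'b,30')

# Whole-file summary: -t, sets the field separator, -H reads the header
# row (so fields can be named) and writes one on output.
overall=$(printf '%s\n' "$csv" | datamash -t, -H min val median val max val)
printf '%s\n' "$overall"

# Per-group summary: groupby goes first because it is a primary operation,
# and --sort orders the input by the grouping field beforehand.
grouped=$(printf '%s\n' "$csv" | datamash -t, -H --sort groupby grp min val median val max val)
printf '%s\n' "$grouped"
```

The overall run reports a minimum of 1, a median of 6.5 and a maximum of 30; the grouped run gives one row per grp value.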
It definitely punches above its weight for a 150 kilobyte binary.
It usually comes with a companion utility
decorate, which is especially handy for sorting IPv4 lists (amongst other things):
$ cat ips
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
...
$ decorate -k1,1:ipv4 < ips
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
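To see why the ipv4 key matters, compare it with what a plain byte-wise sort does to the same input. A small sketch, using three made-up addresses and assuming your datamash install ships decorate (older packages may not):

```shell
# Hypothetical addresses chosen so that text order and numeric order disagree.
addrs=$(printf '%s\n' 10.0.0.9 10.0.0.10 9.0.0.1)

# Plain sort compares character by character, so 10.0.0.10 lands before
# 10.0.0.9 and 9.0.0.1 ends up last.
plain=$(printf '%s\n' "$addrs" | LC_ALL=C sort)
printf '%s\n' "$plain"

# decorate sorts on field 1 as an IPv4 address, giving numeric order.
numeric=$(printf '%s\n' "$addrs" | decorate -k1,1:ipv4)
printf '%s\n' "$numeric"
```

The decorate run should list 9.0.0.1 first and 10.0.0.10 last.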
You should see what this produces:
$ decorate -k1,1:ipv4 < ips | rs
Installs are available for any OS, and macOS folks can
brew install datamash after reading this post.
The GNU folks continue to update the utility’s source — as recently as a few weeks ago!
Unless you're truly working with "big" data, you may not need all these super fancy modern tools as much as you think you do. ☮