Today's edition is all about data ops at the command line, and not some new-fangled, fancy Rust-based level-ups of ancient utilities (e.g., ripgrep). We're talking 'bout old-school, super-tiny binaries with powerful features that you likely (needlessly) fire up Python and R for.
rs
The rs utility "reads the standard input, interpreting each line as a row of blank-separated entries in an array, transforms the array according to the options, and writes it on the standard output. With no arguments it transforms stream input into a columnar format convenient for terminal viewing."
In the datamash section (below) we work with a list of IP addresses:
50.232.182.134
32.80.197.134
84.32.229.201
38.238.151.171
51.215.93.251
19.210.242.66
101.61.90.155
18.193.124.250
92.214.83.250
22.29.119.0
29.146.153.104
19.198.112.71
1.154.81.81
104.59.22.67
13.171.111.141
102.122.22.187
69.25.224.252
84.254.74.210
50.161.245.209
59.3.238.97
39.226.151.132
25.112.22.200
118.182.91.0
111.63.49.165
99.17.244.207
67.29.135.114
7.233.30.90
114.67.239.6
9.218.121.157
91.49.173.64
The simplest use of rs takes the above and turns it into (apologies, again, for Substack's less-than-stellar code blocks):
$ rs < ips
50.232.182.134 101.61.90.155 1.154.81.81 50.161.245.209 99.17.244.207
32.80.197.134 18.193.124.250 104.59.22.67 59.3.238.97 67.29.135.114
84.32.229.201 92.214.83.250 13.171.111.141 39.226.151.132 7.233.30.90
38.238.151.171 22.29.119.0 102.122.22.187 25.112.22.200 114.67.239.6
51.215.93.251 29.146.153.104 69.25.224.252 118.182.91.0 9.218.121.157
19.210.242.66 19.198.112.71 84.254.74.210 111.63.49.165 91.49.173.64
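You can also hand rs explicit rows and cols arguments (0 means "derive this one from the other"). With explicit dimensions it fills row-by-row rather than in the down-the-columns, ls-style order above, so something like this should shape the same list into three columns (exact padding may vary a bit by platform):
$ rs 0 3 < ips
50.232.182.134  32.80.197.134   84.32.229.201
38.238.151.171  51.215.93.251   19.210.242.66
...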
A surprising number of (older?) data-loggers put field values on individual lines. Say, a hostname, IP address, and count of connections from said address:
alpha
50.232.182.134
2
beta
32.80.197.134
4
delta
84.32.229.201
10
gamma
38.238.151.171
99
We can turn that into a handy CSV without using tidy R ops or ugly Python code via:
$ rs -C, -e 0 3 < sad-records | sed -e 's/,$//'
alpha,50.232.182.134,2
beta,32.80.197.134,4
delta,84.32.229.201,10
gamma,38.238.151.171,99
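And since rs also takes -c to set the input column separator, the whole thing round-trips; a quick sketch that rebuilds the one-field-per-line form from the CSV we just made:
$ rs -C, -e 0 3 < sad-records | sed -e 's/,$//' | rs -c, 0 1
alpha
50.232.182.134
2
beta
...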
Incredibly handy functionality (hit up man rs for full info) in ~167 kilobytes, and available (or installable) on any decent OS.
Apple has the source for rs if you're curious as to the inner workings.
jot
Sometimes you just need to generate a sequence of (possibly random) items in a hurry. Kids today would likely import an npm package that has 1,300 dependencies just to kick out a list of 10 random usernames. Well, you don't have to rely on any fancy programming language package ecosystem to deal with sequences like this with jot around. This command's sole purpose in life is to "print sequential or random data."
So, about those 10 random usernames…
$ jot -r -w "user-%02d" 10
user-06
user-14
user-30
user-76
user-14
user-17
user-63
user-63
user-03
user-04
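The sequential side is just as terse: the first argument is the number of reps, the second is the starting value, and -s joins the output with a separator string. A couple of quick sketches:
$ jot 5 10
10
11
12
13
14
$ jot -s, 5
1,2,3,4,5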
Or, you could just repeat yourself:
$ jot -b "the daily drop is awesome" 5
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
Perhaps even combine it with rs to make random-looking strings (these, like the user-NN ones above, are in fact truly random):
$ jot -r -c 160 a z | rs -g 0 8
ztuodtzv
lgqnwedq
dzqqezxa
ukplsjis
eeowyyzy
ncsyqsfs
bogkqdjn
ukdowzjo
aqyjkisn
jteghyan
wsqidlnl
hxmkzmhx
qgezccer
dozzwmpe
ojahtaxc
frhzyvaj
kfhkijkc
miggbjbp
fmzgfxhv
hdhaojnc
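And, given this issue's IP theme, -s also makes jot a serviceable fake-IPv4 generator; a sketch for test data only (your octets will, of course, differ):
$ for i in $(jot 3); do jot -r -s. 4 1 254; done
203.44.17.98
7.181.240.63
122.9.55.201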
Again, the man page has all the deets, and this utility is similarly diminutive, and as available as its rs cousin.
Apple, once more, has a decent source code archive of jot available for your perusal.
datamash
GNU datamash is "a command-line program which performs basic numeric, textual and statistical operations on input textual data files."
Usage is as follows:
datamash [OPTION] op [fld] [op fld ...]
where:
op is the operation to perform. If a primary operation is used, it must be listed first, optionally followed by other operations.
fld is the input field to use. fld can be a number (1 = first field), or a field name when using the -H or --header-in options. Multiple fields can be listed with a comma (e.g., 1,6,8), a range of fields with a dash (e.g., 2-8), and colons for operations which require a pair of fields (e.g., pcov 2:6).
primary operations:
groupby, crosstab, transpose, reverse, check
line-filtering operations:
rmdup
per-line operations:
base64, debase64, md5, sha1, sha224, sha256, sha384, sha512
bin, strbin, round, floor, ceil, trunc, frac
dirname, basename, barename, extname, getnum, cut
numeric grouping operations:
sum, min, max, absmin, absmax, range
textual/numeric grouping operations:
count, first, last, rand, unique, collapse, countunique
statistical grouping operations:
mean, geomean, harmmean, trimmean, median,
q1, q3, iqr, perc
mode, antimode
pstdev, sstdev, pvar, svar
ms, rms, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
scov, pcov, spearson, ppearson
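Before the fancier examples, the canonical smoke test from the datamash docs, summing a single column:
$ seq 10 | datamash sum 1
55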
It's quite a remarkable tool.
If you're familiar with the mtcars dataset in R, we can get the "five-number summary" for mpg pretty easily:
$ datamash -t, -H min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
10.4,15.425,19.2,22.8,33.9
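Swapping in other statistical ops is just a matter of adding more op fld pairs; a sketch against the same mtcars.csv (R pegs these at roughly 20.09 and 6.03, though datamash may print more digits):
$ datamash -t, -H mean mpg sstdev mpg < mtcars.csv
# => a mean(mpg),sstdev(mpg) header, then values near 20.09 and 6.03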
We can do the same for each cyl group:
$ datamash -t, -H --sort groupby cyl min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
GroupBy(cyl),min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
4,21.4,22.8,26,30.4,33.9
6,17.8,18.65,19.7,21,21.4
8,10.4,14.4,15.2,16.25,19.2
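Counting rows per group works the same way, and doubles as a sanity check, since mtcars famously has 11, 7, and 14 cars with 4, 6, and 8 cylinders:
$ datamash -t, -H --sort groupby cyl count mpg < mtcars.csv
GroupBy(cyl),count(mpg)
4,11
6,7
8,14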
It definitely punches above its weight for a 150-kilobyte binary.
It usually comes with a companion utility, decorate, which is especially handy at sorting IPv[46] lists (amongst other things):
$ cat ips
50.232.182.134
32.80.197.134
84.32.229.201
38.238.151.171
51.215.93.251
19.210.242.66
...
$ decorate -k1,1:ipv4 < ips
1.154.81.81
7.233.30.90
9.218.121.157
13.171.111.141
18.193.124.250
19.198.112.71
19.210.242.66
22.29.119.0
25.112.22.200
29.146.153.104
32.80.197.134
38.238.151.171
You should see what decorate -k1,1:ipv4 < ips | rs does, now!
Installs are available for any OS, and macOS folks can brew install datamash after reading this post.
The GNU folks continue to update the utility’s source — as recently as a few weeks ago!
FIN
Unless you're truly working with "big" data, you may not need all these super fancy modern tools as much as you think you do. ☮