hrbrmstr's Daily Drop

2022-09-26.01

rs; jot; datamash

boB Rudis
Sep 26, 2022

Today's edition is all about data ops at the command line, and not the new-fangled, fancy Rust-based level-ups of ancient utilities (e.g. ripgrep). We're talking 'bout old school, super-tiny binaries with powerful features, handling jobs you likely (needlessly) fire up Python or R for.

rs

The rs utility "reads the standard input, interpreting each line as a row of blank-separated entries in an array, transforms the array according to the options, and writes it on the standard output. With no arguments it transforms stream input into a columnar format convenient for terminal viewing."

In the datamash section (below) we work with a list of IP addresses:

50.232.182.134
32.80.197.134
84.32.229.201
38.238.151.171
51.215.93.251
19.210.242.66
101.61.90.155
18.193.124.250
92.214.83.250
22.29.119.0
29.146.153.104
19.198.112.71
1.154.81.81
104.59.22.67
13.171.111.141
102.122.22.187
69.25.224.252
84.254.74.210
50.161.245.209
59.3.238.97
39.226.151.132
25.112.22.200
118.182.91.0
111.63.49.165
99.17.244.207
67.29.135.114
7.233.30.90
114.67.239.6
9.218.121.157
91.49.173.64

The simplest use of rs takes the above and turns it into (apologies, again, for Substack’s less than stellar code blocks):

$ rs < ips
50.232.182.134  101.61.90.155   1.154.81.81     50.161.245.209  99.17.244.207
32.80.197.134   18.193.124.250  104.59.22.67    59.3.238.97     67.29.135.114
84.32.229.201   92.214.83.250   13.171.111.141  39.226.151.132  7.233.30.90
38.238.151.171  22.29.119.0     102.122.22.187  25.112.22.200   114.67.239.6
51.215.93.251   29.146.153.104  69.25.224.252   118.182.91.0    9.218.121.157
19.210.242.66   19.198.112.71   84.254.74.210   111.63.49.165   91.49.173.64
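
Note the fill order: with no arguments, rs runs down each column before starting the next, so the first six addresses land in the first column. If rs isn't on a box, a rough awk sketch of that column-major fill is below. The columnize helper name and the fixed column count are assumptions for illustration; real rs picks the column count itself to fit the display width and pads the columns, which this sketch skips.

```shell
# columnize COLS: read lines on stdin and print them in COLS columns,
# filling each column top to bottom (hypothetical helper, not part of rs)
columnize() {
  awk -v cols="$1" '
    { cell[NR] = $0 }                          # remember every input line
    END {
      rows = int((NR + cols - 1) / cols)       # rows needed to hold all entries
      for (r = 1; r <= rows; r++) {
        line = ""
        for (c = 0; c < cols; c++) {
          i = c * rows + r                     # column-major index
          if (i <= NR) line = line (c ? "  " : "") cell[i]
        }
        print line
      }
    }'
}

printf '%s\n' a b c d e f | columnize 3
# a  c  e
# b  d  f
```

With the 30 addresses above, `columnize 5 < ips` gives the same six-row arrangement, minus the padding.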

A surprising number of (older?) data-loggers put field values on individual lines. Say, a hostname, IP address, and count of connections from said address:

alpha
50.232.182.134
2
beta
32.80.197.134
4
delta
84.32.229.201
10
gamma
38.238.151.171
99

We can turn that into a handy CSV without using tidy R ops or ugly Python code via:

$ rs -C, -e 0 3 < sad-records | sed -e 's/,$//'
alpha,50.232.182.134,2
beta,32.80.197.134,4
delta,84.32.229.201,10
gamma,38.238.151.171,99
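
The sed at the end is there because rs -C leaves a trailing comma on every row. Where rs isn't installed, POSIX paste can do the same three-lines-to-one-row reshape. A minimal sketch, with to_csv as a made-up helper name and the inline records standing in for the sad-records file:

```shell
# to_csv: join every three input lines into one comma-separated row.
# Each "-" tells paste to read the next line from stdin.
to_csv() { paste -d, - - -; }

printf '%s\n' alpha 50.232.182.134 2 beta 32.80.197.134 4 | to_csv
# alpha,50.232.182.134,2
# beta,32.80.197.134,4
```

No trailing delimiter to clean up here, so the sed step goes away.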

Incredibly handy functionality (hit up man rs for full info) in ~167 kilobytes, and available (or installable) on any decent OS.

Apple has the source for rs if you're curious as to the inner workings.

jot

Sometimes you just need to generate a sequence of (possibly random) items in a hurry. Kids today would likely import an npm package that has 1,300 dependencies just to kick out a list of 10 random usernames. Well, you don't have to rely on any fancy programming language package ecosystem to deal with sequences like this with jot around. This command's sole purpose in life is to "print sequential or random data."

So, about those 10 random usernames…

$ jot -r -w "user-%02d" 10
user-06
user-14
user-30
user-76
user-14
user-17
user-63
user-63
user-03
user-04
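
As the repeated user-14 and user-63 show, the draws are made with replacement, so the names are not guaranteed to be unique. On systems without jot, the same idea can be sketched in awk. The random_users name is hypothetical, and the 1 to 100 range is chosen to mirror jot's default random range:

```shell
# random_users N: print N names like user-07, numbers drawn from 1..100
# with replacement (duplicates are possible, just as with jot -r)
random_users() {
  awk -v n="$1" 'BEGIN {
    srand()                                    # seed from the clock
    for (i = 0; i < n; i++)
      printf "user-%02d\n", int(rand() * 100) + 1
  }'
}

random_users 10
```

awk's srand() seeds from the time of day, so two runs within the same second can print the same list.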

Or, you could just repeat yourself:

$ jot -b "the daily drop is awesome" 5
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome
the daily drop is awesome

Perhaps even combine it with rs to make random-looking strings (the usernames above are just as random):

$ jot -r -c 160 a z | rs -g 0 8
ztuodtzv
lgqnwedq
dzqqezxa
ukplsjis
eeowyyzy
ncsyqsfs
bogkqdjn
ukdowzjo
aqyjkisn
jteghyan
wsqidlnl
hxmkzmhx
qgezccer
dozzwmpe
ojahtaxc
frhzyvaj
kfhkijkc
miggbjbp
fmzgfxhv
hdhaojnc
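
What's going on there: jot -r -c emits 160 random characters between a and z, one per line, and rs -g 0 0 8 packs them into 8 columns with a zero-width gutter, working out the 20 rows on its own. A rough awk stand-in for boxes that lack both tools is below; random_words is a made-up name, and awk's rand() is fine for test fixtures but not for anything secret.

```shell
# random_words COUNT LEN: print COUNT lines of LEN random lowercase letters
random_words() {
  awk -v count="$1" -v len="$2" 'BEGIN {
    srand()                                    # seed from the clock
    letters = "abcdefghijklmnopqrstuvwxyz"
    for (i = 0; i < count; i++) {
      word = ""
      for (j = 0; j < len; j++)
        word = word substr(letters, int(rand() * 26) + 1, 1)
      print word
    }
  }'
}

random_words 20 8
```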

Again, the man page has all the deets, and this utility is similarly diminutive and as available as its rs cousin.

Apple, once more, has a decent source code archive of jot available for your perusal.

datamash

GNU datamash is "a command-line program which performs basic numeric, textual and statistical operations on input textual data files."

Usage is as follows:

datamash [OPTION] op [fld] [op fld ...]

where:

  • op is the operation to perform. If a primary operation is used, it must be listed first, optionally followed by other operations.

  • fld is the input field to use. fld can be a number (1=first field), or a field name when using the -H or --header-in options.

  • multiple fields can be listed with a comma (e.g. 1,6,8).

  • a range of fields can be listed with a dash (e.g. 2-8).

  • use colons for operations which require a pair of fields (e.g. 'pcov 2:6')

  • primary operations:

    • groupby, crosstab, transpose, reverse, check

  • line-filtering operations:

    • rmdup

  • per-line operations:

    • base64, debase64, md5, sha1, sha224, sha256, sha384, sha512

    • bin, strbin, round, floor, ceil, trunc, frac

    • dirname, basename, barename, extname, getnum, cut

  • numeric grouping operations:

    • sum, min, max, absmin, absmax, range

  • textual/numeric grouping operations:

    • count, first, last, rand, unique, collapse, countunique

  • statistical grouping operations:

    • mean, geomean, harmmean, trimmean, median

    • q1, q3, iqr, perc

    • mode, antimode

    • pstdev, sstdev, pvar, svar

    • ms, rms, mad, madraw

    • pskew, sskew, pkurt, skurt, dpo, jarque

    • scov, pcov, spearson, ppearson

It's quite a remarkable tool.

If you're familiar with the mtcars dataset in R, we can get the "five number summary" for mpg pretty easily:

$ datamash -t, -H min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
10.4,15.425,19.2,22.8,33.9

We can do the same for each cyl group:

$ datamash -t, -H --sort groupby cyl min mpg q1 mpg median mpg q3 mpg max mpg < mtcars.csv
GroupBy(cyl),min(mpg),q1(mpg),median(mpg),q3(mpg),max(mpg)
4,21.4,22.8,26,30.4,33.9
6,17.8,18.65,19.7,21,21.4
8,10.4,14.4,15.2,16.25,19.2
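
In those commands, -t, sets the field separator to a comma, -H reads (and writes) a header row so fields can be named, and --sort is there because groupby expects its input ordered by the grouping field. To make the grouping step concrete, here is a rough awk sketch of just the min and max part. The group_min_max name is hypothetical, and the six rows are a tiny made-up sample rather than the full mtcars.csv:

```shell
# group_min_max: read "key,value" CSV rows (with a header line) on stdin,
# print key,min,max for each key, ordered by key
group_min_max() {
  awk -F, 'NR > 1 {                            # skip the header row
      v = $2 + 0                               # force a numeric comparison
      if (!($1 in min) || v < min[$1]) min[$1] = v
      if (!($1 in max) || v > max[$1]) max[$1] = v
    }
    END { for (k in min) print k "," min[k] "," max[k] }' |
  sort -t, -k1,1n                              # awk array order is unspecified
}

printf '%s\n' cyl,mpg 4,22.8 6,21 8,18.7 6,21.4 4,30.4 8,10.4 | group_min_max
# 4,22.8,30.4
# 6,21,21.4
# 8,10.4,18.7
```

Quartiles and medians take a good deal more awk than this, which is rather the point of having datamash around.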

It definitely punches above its weight for a 150 kilobyte binary.

It usually comes with a companion utility, decorate, which is especially handy for sorting IPv[46] lists (amongst other things):

$ cat ips
50.232.182.134
32.80.197.134
84.32.229.201
38.238.151.171
51.215.93.251
19.210.242.66
...
$ decorate -k1,1:ipv4 < ips
1.154.81.81
7.233.30.90
9.218.121.157
13.171.111.141
18.193.124.250
19.198.112.71
19.210.242.66
22.29.119.0
25.112.22.200
29.146.153.104
32.80.197.134
38.238.151.171
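
A plain lexical sort would put 101.61.90.155 ahead of 13.171.111.141, which is why the ipv4 conversion matters. If decorate isn't available, the IPv4 case can be approximated with sort alone by treating each octet as a numeric key. A minimal sketch, with sort_ipv4 as a made-up helper name (it does nothing for IPv6):

```shell
# sort_ipv4: sort dotted-quad addresses on stdin numerically, octet by octet
sort_ipv4() { sort -t. -k1,1n -k2,2n -k3,3n -k4,4n; }

printf '%s\n' 50.232.182.134 19.210.242.66 101.61.90.155 19.198.112.71 1.154.81.81 | sort_ipv4
# 1.154.81.81
# 19.198.112.71
# 19.210.242.66
# 50.232.182.134
# 101.61.90.155
```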

You should see what:

decorate -k1,1:ipv4 < ips | rs 

does, now!

Installs are available for any OS, and macOS folks can brew install datamash after reading this post.

The GNU folks continue to update the utility’s source — as recently as a few weeks ago!

FIN

Unless you're truly working with "big" data, you may not need all these super fancy modern tools as much as you think you do. ☮
