

GCC Rust
The GNU Project started out in 1984 with the goal of building a UNIX-like operating system. Operating systems need components to be usable, and you have likely used or heard about programs such as Make (a tool which controls the generation of executables and other non-source files of a program from the program's source files), Sed (a non-interactive command-line text editor), and Emacs (a bloated text editor, sitting on top of a lisp interpreter, with grandiose visions of universe domination). These are all part of the GNU ecosystem.
GNU's own operating system, Hurd, has yet to replace Linux as the dominant UNIX-like kernel alternative (news flash: it likely never will, either). You do need a compiler to build such things as operating systems and their components (I guess you could lovingly hand-craft binary bits on your own if you really wanted to), and GCC — the GNU Compiler Collection — was created to do just that. Hurd may be going nowhere fast, but GCC and other GNU programs are used extensively in modern computing, and GCC is used to build the Linux kernel.
Both GCC and clang+LLVM (we've mentioned that pair quite a bit in past editions) "compete" in the sense that each sports a large, growing community, supports many programming languages, and continues to evolve apace. One core difference between them is that GCC has support for (way?) more programming languages and architectures (which are things like x86, x64, etc.) than clang does.
Today, GCC finally supports Rust. It even has its own website [GH] which describes it as:
[GCC Rust] is a full alternative implementation of the Rust language on top of GCC with the goal to become fully upstream with the GNU toolchain.
As this is a front-end project, the compiler will gain full access to all of GCC's internal middle-end optimization passes which are distinct from LLVM. For example, users of this compiler can expect to use the familiar -O2 flags to tune GCC's optimizer.
Philip Herron (@the_philbert) spearheaded the development, but there are nine other named major contributors, and it's quite amazing to see their labors come to fruition.
While Rust has been all the hotness for quite some time, it hasn't always been "stable". Furthermore, Rust still lacks complete reference documentation for the entire language. Rust traits, a language feature that tells the Rust compiler about functionality a type must provide, are at the heart of Rust, and chalk was developed by the Rust developer community to help provide a set of rules for trait resolution.
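If you haven't bumped into traits before, here's a minimal sketch (the trait and type names are made up for illustration): a trait declares functionality a type must provide, and figuring out which implementation satisfies a given trait bound is the "trait resolution" that chalk models as a set of logical rules.

```rust
// A trait declares functionality a type must provide.
trait Describe {
    fn describe(&self) -> String;
}

struct Compiler {
    name: &'static str,
}

// Implementing the trait for a concrete type.
impl Describe for Compiler {
    fn describe(&self) -> String {
        format!("{} is a compiler", self.name)
    }
}

// A generic function with a trait bound; the compiler must resolve
// which `impl` applies at each call site -- that's trait resolution.
fn announce<T: Describe>(item: &T) -> String {
    item.describe()
}
```

Any alternative Rust implementation (GCC Rust included) has to reproduce this resolution behavior exactly, which is why a shared, well-specified rule set matters.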
Development/release stability, and the ability to independently verify what a compiler/toolchain produces when compiling Rust, are what made the work of Philip and his collaborators truly possible.
Competition is great, and Rust finally becoming a first-class GNU citizen will further the realization of using Rust to develop and extend the Linux kernel.
Despite the announcement, for now, I'd suggest experimenting via the provided Docker image unless you want to dig in and contribute to the project.
Calibre 6.0
Calibre dubs itself "the one-stop solution to all your e-book needs". At the core, Calibre is an e-book manager, or digital library + librarian, and I have to believe most readers have at least heard about Calibre prior to this newsletter edition. I suspect many have tried it out or are regular users of the program, especially since Calibre supports almost every single e-reader ever made.
I've had an on-again, off-again relationship with Calibre over the years, and primarily use it to convert between e-book formats. E-books are still in a weird place in our modern world, since "digital rights management" is still "a thing", and dominant ecosystems like Amazon's Kindle books and Apple Books mean you have to jump through hoops to use them in Calibre, unless you're reading PDFs and unprotected EPUBs.
Calibre is excellent at managing a large collection of digital book assets, and has a great developer and user community. This is truly evident in this 6.0 release.
For starters, full-text search is now something Calibre users can optionally enable. If you're a Kindle or Apple Books user and aren't frustrated with their text search capabilities, please drop a note in the comments for what magic incantations you use. Calibre's is pretty remarkable and fast, plus it is library-wide.
Calibre is also, now, Apple Silicon-native! It loads fast on my work and personal M1 systems.
Speaking of native, Calibre can now also use the native text-to-speech capabilities in the underlying operating system to read books aloud. (As someone who heavily relies on audiobooks and podcasts for some types of information consumption, this is a great feature.)
Finally, Calibre now has a dedicated URL scheme — calibre:// — which opens up a whole new world of scripting and automation capabilities.
If you haven't used Calibre, this new version is most certainly worth checking out. If you've also been, like me, a fickle Calibre user, perhaps it's finally time to toggle your relationship status with it from "It's Complicated" to at least "In an Open Relationship".
Substrait
Substrait is "a cross-language serialization for relational algebra". If you've used SQL before, you've used relational algebra. In the most basic case, you've had tabular data, performed some operation on it, and returned tabular data. The tabular data are relations, and the operations performed on them involve relational algebra incantations.
SQL is just one way of expressing these incantations. In the R world, we have packages like {rquery} and {dplyr} that are either complete or near-complete adaptations of this algebra, using sequential pipelines to describe intent.
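To make "relational algebra" concrete, here's an illustrative sketch (not Substrait's actual API, and the struct/field names are made up): the two most basic operations, selection (SQL's WHERE) and projection (picking columns), expressed as plain functions over in-memory rows.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Row {
    city: &'static str,
    population: u64,
}

// Selection: keep only the rows matching a predicate (SQL's WHERE clause),
// returning a new relation.
fn select<F: Fn(&Row) -> bool>(rows: &[Row], pred: F) -> Vec<Row> {
    rows.iter().cloned().filter(|r| pred(r)).collect()
}

// Projection: keep only the `city` column (SQL's SELECT city).
fn project_city(rows: &[Row]) -> Vec<&'static str> {
    rows.iter().map(|r| r.city).collect()
}
```

Chaining `select` then `project_city` is the same intent as `SELECT city FROM rows WHERE population > 1000000` — SQL, {dplyr} pipelines, and a Substrait plan are just different serializations of that intent.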
The {dplyr} ecosystem is extensible via {dbplyr}, which will translate {dplyr} chains to SQL to send to some processing engine. A few weeks ago I tweeted out (I'm totally not going to be able to find it) a benchmark of using Apache Arrow Datasets over a directory of Parquet files and using {dplyr} operations on said Dataset vs. reading the Parquet (or equivalent CSV, et al.) files into memory and performing the same operations in-memory. Arrow+{dplyr} won in that particular benchmark, and did so in part because of the vision of Substrait (though Arrow is presently a related technology vs. built on Substrait). That is, Arrow handled the "algebra" and returned the final relation. It was faster likely due to threading (I haven't dug into the 'why' fully, yet), but if we expand the {d[b]plyr} intent outside the R ecosystem, things in data science land start getting even more interesting and fun than they already are.
Substrait's ultimate vision is to "create a well-defined, cross-language specification for data compute operations. This includes a declaration of common operations, custom operations and one or more serialized representations of this specification. The spec focuses on the semantics of each operation and a consistent way to describe".
Jacques Nadeau (@intjesus) and others do a fine job explaining Substrait, and I suspect I've whetted a few appetites with this introduction, so head over to substrait.io and the corresponding repo to learn more.
FIN
This has sure been an unexpectedly diverse topic week. Hopefully, the mix has been informative/engaging! ☮
2022-07-12.01