

Other People's Dotfiles
In Unix-like operating systems, any file or folder whose name starts with a dot character (for example, /home/user/.config), commonly called a dot file or dotfile, is to be treated as "hidden" (this generally means that, by default, commands that list files or filesystem GUIs won't show them). Most shells are also set up so that wildcard placeholders (e.g. "*", "?") will not match files whose names start with . unless you deliberately specify it.
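The two default behaviours described above can be sketched in a few lines of Rust. This is only an illustration under simplifying assumptions: `is_dotfile` and `star_matches` are hypothetical helper names, and `star_matches` imitates a bare `*` pattern rather than implementing real shell globbing.

```rust
use std::path::Path;

/// True when the final path component starts with a dot (e.g. `.zshrc`).
fn is_dotfile(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| name.starts_with('.'))
        .unwrap_or(false)
}

/// Imitates a bare `*` wildcard under default shell settings: every name
/// is kept except those that start with a dot.
fn star_matches<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| !name.starts_with('.'))
        .collect()
}

fn main() {
    // Hypothetical directory listing, for illustration only.
    let names = [".profile", ".zshrc", "notes.txt", "src"];
    println!("{:?}", star_matches(&names)); // ["notes.txt", "src"]
    println!("{}", is_dotfile(Path::new("/home/user/.config"))); // true
}
```

A pattern that itself begins with a dot (such as `.*`) is the usual way to match these names deliberately; the sketch leaves that case out.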
Somewhere along the line, humans and applications developed an idiom where these dotfiles became places to store configuration information or even reference text. You've likely used or seen .profile, .zshrc, and other shell-specific dotfiles.
Dotfiles tend to be very personal things, letting you customize your environment just the way you like it. If you ever have to set up a new system, or if you use multiple systems, you likely want to ensure your customizations come along for the ride. There are many ways to do this, but the folks at GitHub created a resource [GH] to try to get you to be dependent on them even more than you already are.
The good news is that their dotfile resource nexus isn't really dependent on them. Most of the resources listed and linked to will work just fine on less skeezy social code sharing and version control hubs.
Said resources include examples of other people's dotfiles, and general dotfile organization strategies for backing up/restoring/syncing preferences and settings.
If you've not been using a git-based dotfile management strategy, you might want to take advantage of the long weekend (at least in the 🇺🇸), take inspiration from a few of the resources you like, and implement your own dotfile framework.
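One common shape for such a framework is to keep the real files in a git repository and symlink them into the home directory. The following is a minimal sketch of the planning step only, assuming that layout; `plan_links`, the paths, and the file names are all hypothetical, and nothing here comes from the GitHub resource itself.

```rust
use std::path::{Path, PathBuf};

/// For each tracked dotfile name, pair the file inside the repository
/// (the link source) with the path in the home directory where a symlink
/// would be created (the link target).
fn plan_links(repo: &Path, home: &Path, tracked: &[&str]) -> Vec<(PathBuf, PathBuf)> {
    tracked
        .iter()
        .map(|name| (repo.join(name), home.join(name)))
        .collect()
}

fn main() {
    // Hypothetical paths and file names, for illustration only.
    let repo = Path::new("/home/user/dotfiles");
    let home = Path::new("/home/user");
    for (source, target) in plan_links(repo, home, &[".zshrc", ".profile"]) {
        // A real tool would call std::os::unix::fs::symlink(&source, &target)
        // here, after backing up any existing file at `target`.
        println!("{} -> {}", target.display(), source.display());
    }
}
```

Keeping the planning separate from the filesystem calls makes it easy to print a dry run before touching anything in the home directory.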
🍩 & DocQuery
The "document understanding transformer" (Donut) is a "new method of document understanding that utilizes an [optical character recognition] (OCR)-free end-to-end Transformer model. Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performances on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing)."
The code and model in the aforelinked repository are based on a paper. Here's the pitch:
Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.
I learned about Donut after discovering a super cool utility called DocQuery, which "is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using advanced natural language processing (NLP). You simply point DocQuery at one or more documents and specify a question you want to ask."
The DocQuery creators use Donut in testing their own models, and their tool can also use OCR methods to enhance results.
DocQuery's use cases work across structured, semi-structured, or unstructured documents, and you can ask questions about invoices, contracts, forms, emails, letters, receipts, and many more. Having a somewhat narrow domain lets it excel at what it does, and it handles a robust set of questions across those document types. You can even train the model on your own domain's documents.
This isn't some toy project, either. The folks at Impira (try to forgive them for using server-side rendering) use a version of this model in their for-profit work and have provided this free and robust open-source offering back to the community (I have no skin in the Impira game, I just felt compelled to give them props for this seriously great contribution).
Dig into the papers and details, which should help you slow down the pace of September slipping away this weekend.
ExplainDev
There's a plethora of ML/AI work being done on source code, and not all of it is as sketch as Microsoft's unethical work in this area. While how ExplainDev's model was trained is opaque (so, it could be sketch as well, but I've poked a bit at the creators' bios and, unlike Microsoft, this seems to be a legit education-mission-driven initiative by them, even if it could still be using sketch-trained underlying models), and the documentation and eventual freemium model are not yet baked, the ExplainDev browser extension and online snippet-sharing app are pretty gosh darn cool.
Let's say you are learning Rust (and you should be!) and you are trying to read this bit of hieroglyphics on a tutorial site:
fn apply<F: Fn(&str)>(x: &[&str], f: F) {
    for elem in x {
        f(&elem)
    }
}

fn main() {
    let v = vec!["hello", "world"];
    apply(&v, |x| println!("{}", x));
}
With the ExplainDev browser extension loaded, you can tap in the code block to have ExplainDev's model give you some details. And, I mean details:
This program prints the words "hello" and "world" to the console.
What is the type of v?
The variable v has a vector type.
A vector is an array that can grow in size. The elements are stored contiguously, so vectors have excellent cache locality and good performance for iteration. Vectors also provide methods to insert new values at any location or remove existing ones without invalidating other references into the same vector (unlike arrays).
In this case, it's a string slice with two items: "hello" and "world". String slices are immutable strings which point to another memory address where their contents reside; they're useful when you want to borrow data from somewhere else but don't need ownership over it yourself. They're often used as function arguments because they allow functions to take either owned Strings or borrowed &strs depending on what kind of argument was passed in by the caller.
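To read that explanation against the code, it can help to write the types out. The following is a minimal sketch, not part of the tutorial or of ExplainDev's output: it annotates `v` as a `Vec<&str>` and shows `&v` being used where the slice type `&[&str]` is expected. The `collect_upper` helper is hypothetical, and `apply` is loosened from `Fn` to `FnMut` here so the closure can push into a vector.

```rust
/// Same shape as `apply` in the tutorial snippet: call `f` on every element.
/// The bound is `FnMut` (an assumption of this sketch) so that the closure
/// may mutate captured state.
fn apply<F: FnMut(&str)>(x: &[&str], mut f: F) {
    for &elem in x {
        f(elem);
    }
}

/// Hypothetical helper: gathers the uppercased elements instead of printing
/// them, so the result is easy to inspect.
fn collect_upper(x: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    apply(x, |s| out.push(s.to_uppercase()));
    out
}

fn main() {
    // `v` is a growable vector whose elements are borrowed string slices.
    let v: Vec<&str> = vec!["hello", "world"];
    // `&v` (a `&Vec<&str>`) coerces to the slice type `&[&str]`.
    let as_slice: &[&str] = &v;
    println!("{:?}", collect_upper(as_slice)); // ["HELLO", "WORLD"]
}
```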
The section header image was generated by the (kind of klunky/MVP) snippet-sharing app. If you watch the XHR requests, you can get a glimpse of what will eventually be an API we can all use to incorporate this into our learning resource content creation endeavours.
I modified the JS/Wasm Mandelbrot example over at RosettaCode by renaming a few identifiers (to strip out references to "mandel" and also change up variable names) and the base description ExplainDev gave me was "This code draws a fractal image on the screen." When I put the unmodified code into the model, it said "This code draws the Mandelbrot set on a canvas.", which may be a big indicator that at least some of the training was done with the RosettaCode corpus (if it was, I hope the freemium model is pretty generous or some FAANG snatches up the startup and provides this resource for free).
These are the stated, covered languages:
Assembly
Apache
Bash
Clojure
C++
C#
CSS
Dockerfile
Go
GraphQL
HTML
Ini
Java
Kotlin
JavaScript/TypeScript
JSON
PHP
Plaintext
PowerShell
Python
Ruby
Rust
Scala
SCSS
SQL
Swift
XML
YAML
but, when I tossed in a small R snippet, t(chol(matrix(c(10, 10, 10, 10), nrow=2, ncol=2))) (to try to throw it off), it came back with "This code calculates the Cholesky decomposition of a matrix." (further supporting my RosettaCode training posit).
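For anyone who doesn't read R: `chol` returns the upper-triangular Cholesky factor, and `t()` transposes it into the lower-triangular factor L, where L times its transpose gives back the original matrix. Here is a minimal Rust sketch of that computation for a symmetric 2x2 matrix. `cholesky_2x2` is a hypothetical helper, and because the all-tens matrix in the R snippet is singular, the sketch uses a different, positive definite input chosen purely for illustration.

```rust
/// Cholesky factor of the symmetric 2x2 matrix [[a, b], [b, d]].
/// Returns the lower-triangular L with L * L^T equal to the input, or None
/// when a pivot is not strictly positive (the matrix is not positive definite).
fn cholesky_2x2(a: f64, b: f64, d: f64) -> Option<[[f64; 2]; 2]> {
    if a <= 0.0 {
        return None;
    }
    let l11 = a.sqrt();
    let l21 = b / l11;
    let pivot = d - l21 * l21;
    if pivot <= 0.0 {
        return None;
    }
    Some([[l11, 0.0], [l21, pivot.sqrt()]])
}

fn main() {
    // Illustrative input chosen for this sketch, not taken from the R snippet.
    match cholesky_2x2(4.0, 2.0, 3.0) {
        Some(l) => println!("{:?}", l),
        None => println!("not positive definite"),
    }
}
```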
It's a clever/fun tool that could just help further democratize learning to code and use environments such as Linux/macOS command lines.
FIN
Try to take advantage of the extended three-day weekend downtime, 🇺🇸 readers! ☮
2022-09-02.01