Drop #253 (2023-05-03): Multi-threaded Edition v0.5.0
jsonformer; R.I.P. 🔒; Delta; Soupault
Programming note: I’ve had some comms that email delivery is not working consistently. Tis one more reason for me to ditch this platform, but — rest assured — I’ve filed a bug report and will let folks know what my Substack overlords have to say about the matter.
A fair chunk of time I'd normally be dedicating to pre-rendering the weekly Drops has been spent preparing for a talk I'm giving tomorrow up in Portland, Maine. As a result, I'm leaning into the “multi-threaded” idiom I set up since, despite talk prep time, I still process the internet firehose as it hits my feeds.
The proper-theme-based Drops tend to have an emergent property that inherently answers the “why did hrbrmstr link to this resource?” question some Drop readers might have. I realized these “grab bag” editions may not answer said inquiry. Thus, along with the links and summaries, I'll also include the “why” answer. Please don't hesitate to skip those (they're clearly marked) if you're just here for the freshmeat.
This is a quick one and, perhaps, only useful to folks in the extended HuggingFace ecosystem.
Chatting is great, and all, but what I really want out of these opaque programs created by complex maths and giant training data sets that were made by 21st century slave labor1 is usable, structured data. That's a big reason I dug into a short but solid example of that in a previous Drop. That toy example made it look easy. It is not. Transforming human language to proper JSON can be super painful, as many of you likely experience on-the-regular. Most methods we've been using are fragile and error-ridden; nuanced prompt engineering, painstaking fine-tuning, and laborious post-processing can still manage to produce less than ideal results.
Enter stage left: Jsonformer, a fresh take on the issue, with some seekrit sauce. As the co-founder of cohere.io puts it:
Problem: Getting models to output structured JSON is hard
Solution: Only generate the content tokens and fill in the fixed tokens
"In structured data, many tokens are fixed and predictable. Jsonformer is a wrapper around HuggingFace models that fills in the fixed tokens during the generation process, and only delegates the generation of content tokens to the language model. This makes it more efficient and bulletproof than existing approaches."
It has support for object JSON schema data types, and the example in the GH repo shows how straightforward it is to use.
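To make the "fill in the fixed tokens, delegate the content tokens" idea a bit more concrete, here is a toy Python sketch. It is not Jsonformer's actual code or API: `fill`, `fake_model`, and `CASTS` are hypothetical names invented for illustration, the "model" is a canned function so nothing needs downloading, and arrays are hard-wired to one element. The only point is that the structure (keys, nesting, types) comes from the schema, so the generator can never produce malformed JSON.

```python
import json
from typing import Any, Callable

# Hypothetical stand-in for a language model call: given a field path and the
# expected JSON type, return raw content for just that one value.
ContentFn = Callable[[str, str], Any]

# How raw model output is coerced into each primitive JSON type.
CASTS = {"string": str, "number": float, "boolean": bool}


def fill(schema: dict, generate: ContentFn, path: str = "$") -> Any:
    """Build a value matching `schema`; only leaf values come from `generate`."""
    kind = schema["type"]
    if kind == "object":
        # Keys and nesting are the "fixed tokens": we emit them ourselves.
        return {
            key: fill(sub, generate, f"{path}.{key}")
            for key, sub in schema["properties"].items()
        }
    if kind == "array":
        # Toy simplification: always produce exactly one element.
        return [fill(schema["items"], generate, f"{path}[0]")]
    if kind in CASTS:
        # The "content tokens": the only place the model gets a say.
        return CASTS[kind](generate(path, kind))
    raise ValueError(f"unsupported schema type: {kind!r}")


def fake_model(path: str, kind: str) -> Any:
    """Canned answers so the sketch runs without a real model."""
    return {"string": "Ada", "number": "36", "boolean": True}[kind]


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {"type": "array", "items": {"type": "string"}},
    },
}

person = fill(schema, fake_model)
print(json.dumps(person, indent=2))
```

Swapping `fake_model` for something that actually decodes tokens from a model is where all the real work lives; the sketch only shows why the output shape is guaranteed regardless of what the model says.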
Why: On the “data science”2 spectrum, I self-prioritize fundamental statistics, data visualization, explainable “machine learning”, “library science” and general communication components. This is neither the time nor place to discuss my feels for “AI” in general, but suffice it to say I have not really cared much about that space until it truly started to disrupt traditional NLP machinations a few years ago. Now, as the individual who is privileged to set the vision and direction for how we improve the human3 experience and utility of our platform at work via “data science”, I am literally forced to not only care about this space, but also prioritize it over my preferred “data science” areas, due to the stupid fast pace of new developments4 and disruptions.
(You will 100% see how I incorporated the “why” directly into this short 'splainer.)
I detest the “HTTPS Everywhere” movement. Yes, I want sites to use TLS so you have some solace that the integrity5 of the content you are seeing is true. But, these folks also enable[d] attackers in a way that the “good guys” haven't done in a very long time, and I really kind of hope they have to pay for that some day.
I've railed against the “browser lock icon” for years, and railed much harder when “HTTPS Everywhere” came into being. This “🔒” has been a lie almost since the inception. “Security” is a terrible word and has lost almost all meaning, at least in a “cyber” context. “Safe” is also a very relative term on the internet, and if you were ever lulled into thinking you can trust some site because of that icon, please send Apple, Google, and Microsoft a hate letter, since they helped program that concept into your noggin.
Yesterday, in a very short post, Google told the world that “🔒” is going away for good and being replaced by a “tune” icon, which will let folks have a bit more overt control over site permissions (and more). The tune icon:
does not imply “trustworthy”;
is more obviously clickable; and,
is commonly associated with settings or other controls
If you run the Canary channel, use chrome://flags#chrome-refresh-2023 to enable it now.
Why: y'all really seemed to like the git-centric issue.
Stealing the feature list from the repo should help convey the utility of it (along with the section header), since it has:
Language syntax highlighting with the same syntax-highlighting themes as bat
Word-level diff highlighting using a Levenshtein edit inference algorithm
Side-by-side view with line-wrapping
N keybindings to move between files in large diffs, and between diffs in log -p views
Improved merge conflict display
git blame display (syntax highlighting; --hyperlinks formats commits as links to hosting provider etc. Supported hosting providers are: GitHub, GitLab, SourceHut, Codeberg)
Syntax-highlights grep output from
Support for Git's
Code can be copied directly from the diff (-/+ markers are removed by default).
Commit hashes can be formatted as terminal hyperlinks to the hosting provider page (--hyperlinks).
File paths can also be formatted as hyperlinks for opening in your OS.
Stylable box/line decorations to draw attention to commit, file and hunk header sections.
Style strings (foreground color, background color, font attributes) are supported for >20 stylable elements, using the same color/style language as git
Handles traditional unified diff output in addition to git output
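For a feel of what "word-level diff highlighting" via edit distance means, here is a small Python sketch. It is a plain textbook Levenshtein alignment over whitespace-separated tokens, not delta's implementation (delta is written in Rust and its inference is more involved); `word_diff` and the `=`/`-`/`+` labels are names made up for this illustration.

```python
def word_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Align two lines word by word; return (op, word) pairs with op in '=', '-', '+'."""
    a, b = old.split(), new.split()

    # dist[i][j] = edit distance between the first i words of a and first j of b.
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word from old
                dist[i][j - 1] + 1,         # insert a word from new
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back from the bottom-right corner to recover the edit script.
    ops: list[tuple[str, str]] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("=", a[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            # Substitution, reported as a removal followed by an addition.
            ops.append(("+", b[j - 1]))
            ops.append(("-", a[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("-", a[i - 1]))
            i -= 1
        else:
            ops.append(("+", b[j - 1]))
            j -= 1
    ops.reverse()
    return ops


changes = word_diff("the quick brown fox", "the quick red fox")
print(changes)
```

A pager would then color only the `-` and `+` words inside an otherwise unchanged line, which is what makes a one-word change pop instead of lighting up the whole line.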
It's easy to set up (this is the minimum config I suggest using, and the repo and doc-site have lots of other cool entries you can add):
[core]
    pager = delta

[interactive]
    diffFilter = delta --color-only
If you spend any amount of time poring over git diffs, you likely already use this (or another, preferred, tool). If not, this may actually make you want to review that 3,000-line PR from your teammate.
Why: setting this up for a long-form cover later this month, but it's too cool not to share now in case you’re in the market for a new content generator.
First, it's written in OCaml, a functional programming language I've slated for a personal deep dive in ~September. Second, it does away with daft YAML “frontmatter” and uses the semantic nature of properly crafted HTML to extract metadata (to, say, make a blog post index). Finally, the way it handles “templates”, content injection, and extensibility is pretty novel.
You can keep tabs on my personal riffing with Soupault over here (I’ll get a git link up soon), and you will 100% see a full teardown in the near future.
If any of you just happen to be in Portland, Maine tomorrow, please let me know (or, better yet, drop by the MTUG event)! ☮
2. despite seeing lots of freshly minted hate towards this catch-all discipline moniker, I will brazenly continue to use it
3. I really dislike the term “user”
4. note that I did not say “improvements”
5. certificate authority safety practices are kind of a joke; your workplace likely breaks encryption if you proxy through their environment; and, nation states have “god” certs. so, please don't think any comms you make over the internet are at all confidential if someone really wants to get at them. also, don’t pay for TLS certs; use Let’s Encrypt, it’s the attacker’s certificate authority of choice!