

hrbrmstr's Daily Drop
Today, we cover a fav Drop meta-topic (JSON) that also includes three of my most favorite [data processing] things: XPath, two-letter CLI utilities, and DuckDB.
Let's dig in!
JSONata
JSONata (GH | NPM) is (yet another) lightweight, open-source query and transformation language for JSON data, inspired by the 'location path' semantics of XPath 3.1. It lets us express fairly sophisticated queries in a compact and intuitive syntax, making it much easier to work with complex JSON data structures, such as those used in pretty much every modern web service and API endpoint.
It has some well-considered “batteries included”:

- a grokable declarative syntax that is pretty easy to read and write, which lets us focus on the desired output rather than the procedural steps required to achieve it (something I get hung up on in jq-land on occasion)
- built-in operators and functions for manipulating and combining data, making it easier to perform complex transformations without writing custom code in a traditional programming language like Python or JavaScript
- user-defined functions that let us extend JSONata's capabilities and tailor it to our specific needs
- a flexible output structure that lets us format query results into pretty much any output shape
I'm a big XPath fan, and really like the fact that JSONata is structured around the same basic concepts. This single feature is a big selling point (for me) over jq. While jq can perform similar transformations, I think JSONata's methods for doing so are more well-considered. An example of this is JSONata's fairly unique approach to handling arrays: it treats single values and arrays containing a single value as equivalent, which can drastically simplify expressions in certain scenarios.
JSONata also doesn't throw errors for non-existing data in the input document. If a field along the location path isn't found, the expression simply returns nothing. This can be beneficial when the structure of the input JSON varies and doesn't always contain the same fields.
Both JSONata and jq can work in the browser (ref: JSONata example), but jq has a slight speed edge thanks to WASM. However, said edge comes at the cost of a slow first start, and there is something to be said for using pure JS code in-browser.
You can try JSONata out right in your browser to see if it's more to your liking than jq.
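For a taste of the syntax, here's a small, illustrative sketch (the input document and field names are invented for this example):

```
Input:

{
  "orders": [
    { "product": "widget",   "price": 10, "qty": 2  },
    { "product": "sprocket", "price": 5,  "qty": 10 }
  ]
}

Expressions:

orders[price > 6].product     => "widget" (a single match comes back as a value, not a one-element array)
$sum(orders.(price * qty))    => 70
orders[0].discount            => nothing (missing fields don't throw errors)
```

Note how the filter, map, and aggregate steps read like a location path rather than a pipeline of procedural steps.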
jj
Both jq and JSONata are “Swiss Army knives” in nature, letting us slice, dice, extract, and transform JSON data in many ways. But, what if you just need to get or tweak some JSON data — especially deeply nested JSON data?
jj is a Golang command line utility (CLI) that uses GJSON path syntax — yes, YET ANOTHER JSON query path thingy.
It's hard to argue with the simplicity of the examples in the GH repo. For example:
echo '{"name":{"first":"Tom","last":"Smith"}}' | jj name.middle
gets the middle name from the JSON structure (which is null in this case — no errors tossed!).
This:
echo '{"name":{"first":"Tom","last":"Smith"}}' | jj -r name.last
will return “Smith”.
It has full support for ndjson as well as setting/updating/deleting values. Plus it lets you perform similar pretty/ugly printing that jq does.
This is a far more special-purpose utility than either jq or JSONata, but it's superfast and works everywhere. Definitely hit up the repo for tons of examples.
DuckDB JSON
I realize most Drop readers know DuckDB has robust JSON support, but I wanted to include it in this showcase of fancy new JSON query/transform tools since it's just so gosh darn fast and good at said JSON manipulation.
All DuckDB types can be cast to JSON, and JSON can be cast back to any DuckDB type. This is an incredibly powerful feature. It has full support for plain-ol' JSON and ndjson, too, so you're not limited to one type of structure.
It's also a solid replacement for JSON operations I used to do in Apache Drill (at least in a single-node cluster).
A recent example that has caused me to wax poetic today is my need to get CWE data out of CVE records. I recall recently whining about NVD's forthcoming shuttering of bulk CVE record database downloads in favor of their fancy API. While I do have a local caching proxy for said API, sometimes you need to work on the entire CVE dataset directly on storage. The best way I've found to do this is keeping a clone of the modern CVE list regularly updated and processing each JSON file.
Doing the above with jq was painful. “Slow” doesn't even begin to describe the process. If you want to see just how fast DuckDB is at this, hit up this source (img in section header), clone the CVE list repo, modify the directory paths, and run it. It burns through nearly 220K hideous JSON files in mere seconds, even with some complex JSON operations.
Plus, you can choose to use jq's familiar JSONPath or good ol' RFC 6901 JSON Pointer syntax when addressing values.
Before you reach for jq, jj, or JSONata, give DuckDB a try the next time you need to process some JSON.
FIN
These four-day weeks are brutal. I could really use today to be Tuesday. I hope your week is much less hectic! ☮